Data Engineering: Introducing Delta Lake

How we introduced the new technology into our data infrastructure, explained by Data Engineer Laura Lehoczki.

07.19.2022

The brief but exciting history of data engineering at Choco started a little over two years ago when Dávid Siklósi, then Senior Data Engineer and now Engineering Manager of the Data Engineering and Data Science teams, came on board. At the time, Choco was a fast-growing company whose data was still small in volume but growing just as fast, Laura Lehoczki explains.

Defining our need

The task of the then two-person Data Engineering team was to build an infrastructure that would scale as our users, employees, and data volume grew.

Planning for scalability and following current industry best practices for big data, they decided to implement a data lake, as opposed to a data warehouse, hosted on S3. S3 is a great way to store huge amounts of transactional data, as it offers simple file storage that is queryable with AWS's query service, Athena. Files stored in the Parquet format, which is also what we use at Choco, are optimized for querying, so reading the data is fast and efficient.
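To make that concrete, here is a minimal PySpark sketch of landing data as partitioned Parquet on S3; the bucket names, source format, and columns are made up, but the pattern is what keeps Athena scans small and cheap.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders-etl").getOrCreate()

# Hypothetical bucket names and columns; the real paths and schemas differ.
orders = spark.read.json("s3://example-raw-bucket/orders/")

# Partitioning by date means queries that filter on order_date only read the
# matching S3 prefixes, and Parquet's columnar layout means only the selected
# columns are read at all.
(orders.write
       .mode("append")
       .partitionBy("order_date")
       .parquet("s3://example-lake-bucket/orders/"))
```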

One arguable drawback of this approach is that data is difficult to update. Coming from an RDBMS background, mostly Postgres, I always took updates and upserts for granted. But with object storage, you can't update a record without opening and overwriting entire files. If you need to update, you're left having to delete and rewrite your data, or at least some partitions of it. This is obviously expensive and not scalable.
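A hedged sketch of that delete-and-rewrite pattern, again with hypothetical paths and columns: even a handful of changed records forces whole partitions to be recomputed and rewritten.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders-rewrite").getOrCreate()

# With plain Parquet there is no in-place update: the partitions containing the
# changed records are recomputed from the source and rewritten in full.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

affected_days = ["2022-07-01", "2022-07-02"]  # days whose files must be rewritten

recomputed = (spark.read.json("s3://example-raw-bucket/orders/")
              .where(F.col("order_date").isin(affected_days)))

# mode("overwrite") with dynamic partition overwrite replaces only the touched
# partitions, but every file inside them is rewritten, however small the change.
(recomputed.write
           .mode("overwrite")
           .partitionBy("order_date")
           .parquet("s3://example-lake-bucket/orders/"))
```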

Introducing Delta Lake as a solution

There is another solution, though, that combines object-based storage with the ability to make updates easily: Delta Lake.

According to its official website, Delta is a storage layer that can be updated. Delta Lake provides ACID transactions through a transaction log associated with each Delta table created in your data lake. This log records the history of everything that was ever done to that table or data set.
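As a rough illustration of that log, the open-source PySpark API lets you inspect a table's commit history. The path below is hypothetical and `spark` is assumed to be a Delta-enabled session (see the configuration sketch further down).

```python
from delta.tables import DeltaTable

# "spark" is assumed to be a Delta-enabled SparkSession; the path is made up.
orders_delta = DeltaTable.forPath(spark, "s3://example-lake-bucket/orders_delta/")

# The transaction log (the _delta_log/ directory next to the Parquet files)
# records one entry per commit: writes, merges, deletes, vacuums, and so on.
(orders_delta.history()
             .select("version", "timestamp", "operation", "operationParameters")
             .show(truncate=False))
```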

Delta Lake is an open-source project, driven to a great extent by, and often used in combination with, the commercial Databricks platform. We implemented our own solution, though, because the free, open-source version was sufficient to try it out in our PySpark applications.
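For reference, a minimal sketch of enabling open-source Delta in a PySpark application, based on the delta-spark quickstart; the application name and paths are placeholders.

```python
# Open-source Delta Lake without Databricks, via the delta-spark package
# (pip install delta-spark).
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (SparkSession.builder
           .appName("delta-poc")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))

# configure_spark_with_delta_pip adds the Delta jars matching the installed package.
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# From here on, "delta" behaves like any other data source format.
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-poc")
```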

How we integrated Delta Lake

The main objective of our team was to provide a single source of truth for analytical purposes, and with the introduction of more complex order logic in our application, moving towards Delta seemed the obvious choice. To mention just a few use cases: orders now have delivery checks and status updates, and they can be modified by the operations team.

We'd already been using Spark, a distributed framework for processing large amounts of data, for our ETL jobs. Besides describing the physical and logical layout of the hardware a job runs on, a Spark configuration also specifies where the data catalog lives; in our case that is Glue, another AWS service that provides metadata storage for Athena tables.
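A hedged sketch of that setup, assuming the AWS Glue Data Catalog client is available on the cluster (it ships with EMR); the exact configuration at Choco may differ.

```python
from pyspark.sql import SparkSession

# Point Spark's Hive catalog at the AWS Glue Data Catalog. On EMR the factory
# class below is available out of the box; elsewhere the Glue catalog client
# has to be on the classpath.
spark = (SparkSession.builder
         .appName("etl-with-glue-catalog")
         .config("spark.hadoop.hive.metastore.client.factory.class",
                 "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
         .enableHiveSupport()
         .getOrCreate())

# Tables registered in Glue now resolve like Hive tables, and the same
# metadata is what Athena uses when it queries the data lake.
spark.sql("SHOW DATABASES").show()
```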

Introducing Delta into our ETL pipelines mainly meant adding a new class that builds on our existing FactUpdater but writes out the results and creates Glue tables differently. The new class also deduplicates the source data and creates the new partitions in Glue.
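The class itself is internal, but a rough sketch of the idea, with hypothetical names and columns, might look like this: deduplicate the incoming batch, then merge it into the Delta table by order ID. Glue partition registration is handled separately, as shown in the next sketch.

```python
from delta.tables import DeltaTable
from pyspark.sql import DataFrame, functions as F
from pyspark.sql.window import Window

# Hypothetical sketch, not Choco's actual class: deduplicate the incoming
# batch, then upsert it into the Delta table keyed by order id.
def upsert_orders(spark, updates: DataFrame, table_path: str) -> None:
    # Keep only the latest version of each order within the incoming batch.
    latest = (updates
              .withColumn("rn", F.row_number().over(
                  Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())))
              .where("rn = 1")
              .drop("rn"))

    if DeltaTable.isDeltaTable(spark, table_path):
        (DeltaTable.forPath(spark, table_path).alias("t")
                   .merge(latest.alias("s"), "t.order_id = s.order_id")
                   .whenMatchedUpdateAll()
                   .whenNotMatchedInsertAll()
                   .execute())
    else:
        # First run: create the table, partitioned so queries can prune by date.
        latest.write.format("delta").partitionBy("order_date").save(table_path)
```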

During development we faced two main issues: we exceeded AWS API limits when creating too many Glue partitions, and queries slowed down significantly on Delta tables because of the many small files Delta generates in the same partition during an update. The first issue was solved simply by using batch updates instead of single API calls. To compact the data, we decided to repartition it and run Delta's vacuum function on the tables once a day, which removes obsolete files from S3.
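Neither fix is Choco-specific code, but hedged sketches of both look roughly like this: batching the Glue partition calls with boto3, and a daily compaction plus vacuum job following the pattern from the Delta documentation. Names, region, and retention values are assumptions.

```python
import boto3
from delta.tables import DeltaTable

glue = boto3.client("glue", region_name="eu-west-1")  # region is an assumption

# Register new partitions in batches (up to 100 per call) instead of one
# CreatePartition call each, which is what ran into the Glue API limits.
def register_partitions(database, table, dates, s3_prefix):
    storage = glue.get_table(DatabaseName=database, Name=table)["Table"]["StorageDescriptor"]
    inputs = [{"Values": [d],
               "StorageDescriptor": {**storage, "Location": f"{s3_prefix}/order_date={d}"}}
              for d in dates]
    for i in range(0, len(inputs), 100):
        glue.batch_create_partition(DatabaseName=database, TableName=table,
                                    PartitionInputList=inputs[i:i + 100])

# Daily maintenance: rewrite the table into fewer, larger files (a compaction
# pattern from the Delta docs), then vacuum the files no longer referenced
# by the transaction log.
def compact_and_vacuum(spark, table_path, num_files=16, retention_hours=168):
    (spark.read.format("delta").load(table_path)
          .repartition(num_files)
          .write.format("delta")
          .mode("overwrite")
          .option("dataChange", "false")  # only rearranges existing data
          .save(table_path))
    DeltaTable.forPath(spark, table_path).vacuum(retention_hours)
```

The 168-hour retention matches Delta's default; vacuuming more aggressively than that requires explicitly disabling Delta's retention duration check.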

Seeing the impact

So far we have migrated four tables of orders data to the Delta format. We can now update data based on unique IDs instead of deleting and rewriting files for an arbitrary time frame. The reason we introduced Delta in these order tables first is that faster updates and easier querying bring the most business value there.

As an example, orders were previously spread across three different tables based on what updates happened to them. An order was first put in the order_created table, optionally added to the order_tested table when it was marked as a test order, and lastly added to the order_confirmed table once it was confirmed by the supplier. Therefore, when querying order data, users had to join three tables to get the most up-to-date status of an order.
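For illustration, the query users needed just to see an order's current state looked something like this; the table names are from above, the column names are made up, and `spark` is assumed to be an already-configured session.

```python
# Pre-Delta pattern: assembling an order's latest status meant joining
# all three append-only tables.
latest_status = spark.sql("""
    SELECT c.order_id,
           c.created_at,
           t.order_id IS NOT NULL      AS is_test,
           cf.confirmed_at IS NOT NULL AS is_confirmed
    FROM   order_created      c
    LEFT JOIN order_tested    t  ON t.order_id  = c.order_id
    LEFT JOIN order_confirmed cf ON cf.order_id = c.order_id
""")
latest_status.show()
```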

With Delta, we introduced the order_modified table, which now contains all of this information in one place, and each update can be made on the individual record. This saves us a lot of redundant reads and writes, data duplication, and joins, especially since updates to orders are becoming more and more frequent with the introduction of the delivery check and group ordering features.

Tying it all together

In conclusion, the introduction of the Delta format allowed us to simplify and speed up our pipelines in just a few weeks. It's also made querying easier for our data lake users, including the business and market intelligence analysts who create insights for our Operations and Leadership teams, and the backend engineers who track down issues with order transmissions.

As is often the case, the upfront migration effort was definitely worth it. I can recommend making the switch to anyone currently using S3 to store a growing amount of data.

Are you a developer looking for a new role that challenges your skills and inspires you to grow? You might be exactly what we’re looking for. Check out our open positions here.