Over a decade ago, the founders of Shopify launched a store to sell snowboards online. The existing e-commerce solutions at the time didn't give them the tools they needed to be successful, so they built their own, and Shopify has since grown from five people to over a thousand. Today, Shopify powers e-commerce sites for everyone from entrepreneurs just getting started to some of the world's biggest brands.
Within the Production Engineering department at Shopify, there is a Lakehouse team whose mission is to own the storage lifecycle of our datasets. After a period of intense scaling, we noticed that data silos had formed, characterized by a lack of interoperability between different tools. It became apparent that one of the big causes of this friction was the different table formats at play. We had Hive tables, proprietary table formats stored in tools like BigQuery, and, worst of all, table formats invented by in-house tools themselves.
If you wanted to spin up a new data tool, you had to write adapters to read and write datasets stored in tool-specific formats. Perfectly reasonable requests from our data scientists required complex, time-consuming work on our part to fulfill. This led to slow development cycles and end-user frustration. We needed a change, so we adopted a single table format to simplify interoperability within our data platform. We anchored on the Apache Iceberg table format, which kicked off a long journey of migrating petabytes of data, and we chose Trino as the compute engine for the migration.
Migrating PBs of Data to Iceberg with Trino
Data migration strategies to Iceberg will look different depending on where you're starting from. In broad strokes, if you need to reorganize your data on disk (for example, to change the partitioning strategy), or if the data isn't in one of the three supported file formats (Parquet, ORC, or Avro), then you have to rewrite all of your data. If your data doesn't need to be reorganized and is already in a supported file format, you can instead generate metadata files and register your dataset as an Iceberg table without rewriting any data. The first class of datasets we aimed to migrate were Hive tables stored as JSON files, so we had to rewrite all the data.
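To make the two paths concrete, here is a minimal sketch of what each can look like in Trino SQL. The catalog, schema, table, and column names are illustrative, not our actual datasets, and the register_table procedure usually has to be enabled in the Iceberg catalog configuration before you can call it.

```sql
-- Path 1: full rewrite. Read the existing Hive/JSON table and write a new
-- Iceberg table in Parquet, choosing the partitioning we want going forward.
CREATE TABLE iceberg.analytics.events
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['day(created_at)']
)
AS
SELECT *
FROM hive.raw.events;

-- Path 2: no rewrite. If the files are already in a supported format and laid
-- out the way you want, generate Iceberg metadata for them and register the
-- existing location as an Iceberg table.
CALL iceberg.system.register_table(
    schema_name => 'analytics',
    table_name => 'orders',
    table_location => 's3://example-bucket/warehouse/orders'
);
```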
Trino added its Iceberg connector in 2020, and our project grew and evolved alongside Trino's Iceberg integration. To some extent we helped each other: as we hit blockers along the way, we'd raise the issue with the community, and its members were basically solving our issues in real time. For instance, the raw event data we were streaming in carried time zone information, and there was a problem with how timestamps and time zones were handled. Starburst co-founder and Trino contributor Piotr Findeisen helped solve the issue within a few weeks, and the patch was released in Trino version 364.
There were a few other temporary blockers, including a missing-file problem we encountered when scaling rewrites, but each one was patched in a subsequent release. Check out the video recording of my talk at the 2022 Trino Summit in San Francisco if you're interested in the details.
Results: Benchmarking Breakthroughs and Happy Users
When we looked at querying a few sample tables, planning and execution were anywhere from 15X to 25X faster once we'd moved from Hive and JSON to Iceberg and Parquet. The results with our biggest table were staggering.
Shopify has a public API open to developers, and request details are fed into a table of more than 1.5 PB. Once we moved this table to Iceberg, we saw huge improvements in planning time and much more efficient use of cumulative user memory. But the biggest difference was in execution time: our average execution time dropped from 3 hours to 1.78 minutes. Let that sink in.
It's now roughly 100X faster.
Chart: shopping.api query, 1.5 PB table, 80 workers
That change in execution time means analysts and data scientists can dramatically shorten their iteration cycles, stay in a flow state longer, and hopefully uncover the insights that deliver value to Shopify.
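If you want to run this kind of before-and-after comparison on your own tables, EXPLAIN ANALYZE is a quick way to get planning and execution statistics out of Trino without extra tooling. The query, table, and columns below are stand-ins, not the actual shopping.api query.

```sql
-- EXPLAIN ANALYZE executes the statement and reports per-stage planning and
-- execution statistics; run the same query against the old Hive table and
-- the new Iceberg table and compare the numbers.
EXPLAIN ANALYZE
SELECT shop_id, count(*) AS requests
FROM iceberg.analytics.api_requests
WHERE request_date >= DATE '2022-01-01'
GROUP BY shop_id;
```

Keep in mind that EXPLAIN ANALYZE actually runs the query, so pick a representative workload rather than a full scan of your largest table.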
Reflections and Outlook
Our results are clear evidence of the power of modern table and file formats. My advice to anyone who has data stored in JSON files: migrate it now. Your users will greatly appreciate it. It can be really helpful to create a priority list of tables to migrate, ranking datasets with poor performance that put unnecessary stress on the engine near the top of your list, as sketched below.
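One way to build that priority list is to rank tables by how much engine time they consume. The sketch below assumes you already collect query-completion events into a warehouse table, for example via a Trino event listener; the ops.query_events table and its columns are hypothetical, so adapt them to whatever your own logging pipeline produces.

```sql
-- Hypothetical query-event log: one row per finished query, with the list of
-- tables it touched and its wall-clock time. Column names are illustrative.
SELECT
    t.table_name,
    count(*) AS query_count,
    sum(e.wall_time_ms) / 1e3 / 3600 AS total_wall_time_hours
FROM ops.query_events AS e
CROSS JOIN UNNEST(e.tables_accessed) AS t(table_name)
WHERE e.query_date >= current_date - INTERVAL '30' DAY
GROUP BY t.table_name
ORDER BY total_wall_time_hours DESC
LIMIT 50;
```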
The turnaround from reporting Trino + Iceberg bugs to deployed fixes was very fast, and it demonstrated Trino's commitment to being a leader in Iceberg adoption among the various compute engines. If we'd run into issues with another compute engine, I'm not sure they would have been resolved as quickly. The power of the open source Trino community is hard to quantify, but I was impressed with how quickly our problems were addressed. At one point, we found a blocker that would have been quite challenging for us to resolve, so I suggested we step away for a couple of weeks and come back with a fresh perspective. When we did, the problem had already been fixed and released.
There are a few things I’d do differently if I were to do this project over, but Trino ended up working great for migrating petabytes of data to Apache Iceberg.
The results, 3 hours down to 1.78 minutes, speak for themselves.
What are some next steps you can take?
Below are a few ways you can continue your journey to accelerate data access at your company:

- Automate the Icehouse: our fully managed open lakehouse platform
- Follow us on YouTube, LinkedIn, and X (Twitter)