What’s an Icehouse: Icehouse 101
Let’s start with the most pressing question the title presents: what’s an Icehouse? An Australian rock band, a brand of beer, a building where you store ice (usually an “ice house,” with a space), or a New Zealand business development center? Well, yes, yes, yes, and yes, but for the purposes of this blog post, we’re talking about a term coined by Starburst CEO Justin Borgman to describe a data lakehouse built on top of Iceberg and Trino. For those who are new to the data space, this raises a few other questions: what’s a data lakehouse, is Starburst the same as the candy company, and what’s Iceberg? Let’s take a step back, and build our way to an understanding of what these things are. Then we can discuss why the Icehouse may be a good solution for your data problems.
The data lakehouse
The data lakehouse is the amalgamation of the best parts of a data lake and a data warehouse. …and we’re going to need to break this down further.
The data warehouse
One of the oldest concepts in big data storage and analytics, the data warehouse should hopefully be familiar to most readers who’ve stumbled their way into this blog. A data warehouse is a home for structured, analytics-ready data optimized for queries and business intelligence. A well-maintained, organized, centralized data warehouse that stores most of an organization’s data has long been the north star for a large organization’s data engineering team. The struggle is that structuring all of your data to fit within a warehouse is a massive and messy task. And because it generally requires your data to go through an ETL process, it can lead to duplication of data, delay when new data becomes accessible, and limit flexibility. Perfectly maintaining a data warehouse is a never-ending, expensive, and time-intensive challenge; not maintaining a data warehouse well enough can reduce access to data or leave it entirely useless. There is still a time and place for data warehouses, but the flaws involving cost, scalability, and maintenance have been a pain point for as long as data warehouses have existed.
The data lake
A reaction to the headache that comes from maintaining a rigorous, structured data warehouse, the data lake takes the opposite approach: huck all the data into a lake. It’s what it sounds like. By storing data in its native format, you lose the headache and cost of massive ETL workloads and greatly simplify your data stack. The downside is that your data becomes a bit of a mess. In order to query a data lake lacking structure, your queries get more sophisticated and complicated, requiring more advanced data science skills and tools to manicure the data from unstructured storage to meaningful analytics and insights. You’re not getting rid of the task of reshaping the data – you’re pushing it downstream. If the shape and format of your unstructured data meanders or drifts over time, supporting and handling all the edge and legacy cases can become a headache or borderline impossible, leaving you with more of a data bog, swamp, or quagmire. If you grew up on the Chesapeake Bay, you might say it gives you a giant data algae bloom. You don’t want that.
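To make “pushing it downstream” a little more concrete, here is a minimal Python sketch. The records and field names (`user`, `user_id`, `ts`, `timestamp`) are invented for illustration; the only point is that every reader of raw, drifting data ends up carrying its own reshaping logic.

```python
import json
from datetime import datetime, timezone

# Hypothetical raw events landed in a lake over time. The shape drifted
# between writers, and nothing enforced a schema at write time.
RAW_EVENTS = [
    '{"user": "ana", "ts": 1700000000}',
    '{"user_id": "ben", "timestamp": "2023-11-15T00:00:00+00:00"}',
    '{"user_id": "cy"}',
]

def normalize(line):
    """Reshape one raw record at read time, covering each legacy layout."""
    record = json.loads(line)
    user = record.get("user_id", record.get("user"))
    if "ts" in record:
        # Oldest layout: epoch seconds under "ts".
        when = datetime.fromtimestamp(record["ts"], tz=timezone.utc)
    elif "timestamp" in record:
        # Newer layout: ISO 8601 string under "timestamp".
        when = datetime.fromisoformat(record["timestamp"])
    else:
        # Some records carry no time at all; the reader has to cope.
        when = None
    return {"user_id": user, "event_time": when}

events = [normalize(line) for line in RAW_EVENTS]
```

Every new layout adds another branch to `normalize`, and every team that reads the raw files needs an equivalent of it.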
Enter the data lakehouse
What if we took the benefits of both the data warehouse and the data lake? What if we maintained the flexibility of being able to store unstructured data when it makes sense, but were equally willing to apply some structure and rigor to the data that needs some extra attention? Like a data lake, a data lakehouse is intended to capture all of your data in a single, low-cost cloud object store, while the “house” part enables transactions, transformations, and restructuring of data with ACID (atomicity, consistency, isolation, and durability) properties to glean many of the benefits of a traditional data warehouse. There’s no concern of data duplication, and with some active maintenance, old data shouldn’t become unintelligible or require massively complicated queries to understand. Data lakehouses store a lot more metadata, enabling record-keeping, records of all transactions, and the power to roll back or look at snapshots of the data as it was in the past. This does introduce complexity, especially if you’re trying to build a lakehouse from scratch, which is why many companies are trying to sell lakehouse solutions to save data teams the headache.
For now, we can hopefully say you understand the key concepts of a data lakehouse. There’s more to be said on exactly what a data lakehouse is if you’re looking for more details, but for now, you can consider yourself briefed.
Iceberg
So what’s Iceberg? A floating chunk of ice known for sinking ocean liners, a luxury fashion house, or a Dutch video game publisher? Yes, yes, and yes, but we’re talking about Apache Iceberg, a data lakehouse table format. Iceberg is one of the three main lakehouse table formats (the other two are Apache Hudi and Delta Lake), and its story is built on top of the progression from data warehouses to lakes to lakehouses outlined above. Originally built at Netflix and designed from the ground up to pair with Trino (known as Presto at the time, but we’ll get back to that) as its compute engine, it was an answer to a Hive data lake where transactions were not atomic, correctness was not guaranteed, and users were afraid to change data for risk of breaking something. Even when they did change data, because Hive necessitated rewriting entire folders, writes were inefficient and painfully slow. When you can’t modify your data, change your schemas, or write over existing data, you quickly begin to realize all those downsides, and the data algae bloom rears its ugly head. So… enter manifest files, more metadata, and badabing badaboom – problem solved. Yes, that’s a gross oversimplification, but the reality is that Iceberg’s introduction proved that transactions in a lakehouse could be safe, atomicity could be guaranteed, and snapshots and table history were bonuses that came along for the ride.
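For a rough feel of how “more metadata” can buy atomicity, here is a toy Python sketch of an optimistic commit: a writer prepares a complete new version of the table metadata off to the side, then publishes it by swapping a single pointer. `ToyCatalog` and `append_files` are hypothetical names for illustration, and this is a simplified model of the general idea, not Iceberg’s actual implementation.

```python
import threading

class ToyCatalog:
    """Holds a single pointer to the current table metadata (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.current = {"version": 0, "files": ()}

    def compare_and_swap(self, expected, new):
        # The only mutable step: swap the pointer if nobody committed first.
        with self._lock:
            if self.current is not expected:
                return False
            self.current = new
            return True

def append_files(catalog, new_files, max_retries=5):
    """Build new metadata off to the side, then publish it in one step."""
    for _ in range(max_retries):
        base = catalog.current
        candidate = {
            "version": base["version"] + 1,
            "files": base["files"] + tuple(new_files),
        }
        if catalog.compare_and_swap(base, candidate):
            return candidate
        # Someone else committed in the meantime: retry on the new base.
    raise RuntimeError("too many concurrent commits")

catalog = ToyCatalog()
reader_view = catalog.current  # a reader pins the version it started with
append_files(catalog, ["a.parquet"])
append_files(catalog, ["b.parquet", "c.parquet"])
```

A reader either sees the old pointer or the new one, never a half-written table, and `reader_view` above still describes the empty table it started with.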
Why Iceberg?
On the features front, partition evolution is a big upside, because as your data evolves, your partitions may need to, too. If you don’t know what partitions are, they’re a means of grouping similar clumps of data together so it can be read faster down the line, and they’ve been around for a while. Being able to change how data is partitioned on the fly is new, though, and allows you to adjust and improve your partitioning as your data evolves or changes. Iceberg also hides partitioning and doesn’t require users to maintain it, helping eliminate some of the complexity that would traditionally come from a data lake. You can check out the Iceberg docs for more information on that.
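Here is a toy Python model of those two ideas, hidden partitioning and partition evolution. `ToyPartitionedTable` and its `month` and `day` transforms are made up for this sketch and only loosely mirror what a real table format does: writers hand over plain rows, the table derives the partition value itself, and files written under an old layout stay readable after the layout changes.

```python
from datetime import datetime

# Toy transforms that derive a partition value from a timestamp column.
TRANSFORMS = {
    "month": lambda ts: ts.strftime("%Y-%m"),
    "day": lambda ts: ts.strftime("%Y-%m-%d"),
}

class ToyPartitionedTable:
    def __init__(self, transform):
        self.transform = transform
        self.files = []  # each entry: (transform used, partition value, rows)

    def write(self, rows):
        # Writers pass plain rows; the partition value is derived for them.
        groups = {}
        for row in rows:
            key = TRANSFORMS[self.transform](row["ts"])
            groups.setdefault(key, []).append(row)
        for key, group in groups.items():
            self.files.append((self.transform, key, group))

    def evolve(self, transform):
        # Only future writes use the new layout; existing files are untouched.
        self.transform = transform

    def scan_day(self, day):
        """Filter on the ts column, pruning files by whichever layout wrote them."""
        hits = []
        for transform, key, rows in self.files:
            if TRANSFORMS[transform](day) != key:
                continue  # whole file skipped without reading its rows
            hits += [r for r in rows if r["ts"].date() == day.date()]
        return hits

table = ToyPartitionedTable("month")
table.write([{"id": 1, "ts": datetime(2024, 1, 5)},
             {"id": 2, "ts": datetime(2024, 1, 20)}])
table.evolve("day")  # data got denser, so switch to finer partitions
table.write([{"id": 3, "ts": datetime(2024, 2, 1)},
             {"id": 4, "ts": datetime(2024, 2, 2)}])
result = table.scan_day(datetime(2024, 2, 1))
```

Note that `scan_day` only ever filters on `ts`. The caller never has to know which layout a given file was written with.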
On top of all of that, Iceberg has a lot of momentum behind it. It’s an open source project with diverse vendor support, many major companies are deploying and using it, it has an extremely active community, and it seems likely to last and continue to receive updates and maintenance into the distant future.
How does Iceberg work?
Metadata and manifest files. A lot of metadata and manifest files.
Metadata files keep track of table state. Iceberg tracks individual data files in a table rather than directories, listing them in manifest files, and those manifest files are in turn tracked in a manifest list that stores metadata about each manifest. Earlier, this post mentioned snapshots and table history. Iceberg supports this “time travel” by giving each snapshot its own manifest list, which points to the manifest files that represent that version of the table. On top of that, the format is smart and reuses manifest files when it can for files that remain constant across multiple snapshots. Otherwise, every single transaction is stored, tracked, and able to be accessed as part of a given snapshot.
There’s a ton of complexity to Iceberg working as well as it does. You can read the Iceberg spec or our blog explaining Iceberg architecture for more detailed information.
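If that’s hard to picture, here is a deliberately tiny Python model of snapshots, manifest lists, and manifest reuse. The class names (`Manifest`, `Snapshot`, `ToySnapshotTable`) are invented for this sketch, and real Iceberg metadata carries far more than file paths, so treat it as an illustration of the shape rather than the format itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Manifest:
    """Tracks a group of data files (toy version: just their paths)."""
    data_files: tuple

@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    manifest_list: tuple  # the manifests that make up the table at this point

class ToySnapshotTable:
    def __init__(self):
        self.snapshots = []

    def append(self, *paths):
        # Reuse every manifest from the previous snapshot and add one new one.
        previous = self.snapshots[-1].manifest_list if self.snapshots else ()
        snapshot = Snapshot(len(self.snapshots) + 1,
                            previous + (Manifest(paths),))
        self.snapshots.append(snapshot)
        return snapshot

    def files_as_of(self, snapshot_id=None):
        """Read the current table, or time travel to an older snapshot."""
        if snapshot_id is None:
            snapshot = self.snapshots[-1]
        else:
            snapshot = next(s for s in self.snapshots
                            if s.snapshot_id == snapshot_id)
        return [f for m in snapshot.manifest_list for f in m.data_files]

table = ToySnapshotTable()
first = table.append("a.parquet", "b.parquet")
second = table.append("c.parquet")
```

Calling `table.files_as_of(1)` reads the table as it was after the first write, and the second snapshot points at the very same `Manifest` object for the unchanged files instead of copying it.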
Trino
Remember how I mentioned that Iceberg was built to pair with Trino as its compute engine, and said we’d get back to that? We’re getting back to that.
Trino history
Trino was originally created under the name Presto inside Facebook. Facebook’s problem was that they had a massive data lake built on top of Hive, but querying and analyzing their data lake with MapReduce jobs was not performant, especially at scale. Trino was built as a query engine which could handle the scale of the data lake and enable analysts and data scientists, writing complicated data lake queries in SQL, to run those queries and get results back at interactive, seconds-to-minutes speeds, a vast improvement over submitting a job and waiting to see the results the next day. It was open sourced on launch, and it saw major uptake in the data community for its ability to process and power analytics at rapid speeds. The connector-based architecture meant that other companies and vendors who wanted to deploy Trino could hook it up to various other data sources beyond what was in use at Facebook, and companies like Netflix, Apple, Uber, LinkedIn, and Amazon did so, contributing to it as well as using it for their own data needs. Starburst, a data startup, entered the picture as a company built on selling a managed version of Presto, and it became one of the main contributors to the project.
Presto eventually forked into two versions – originally Presto and PrestoSQL, but PrestoSQL was renamed to Trino a couple of years later. The Trino website has a great blog detailing why this happened if you’re curious. Trino has amassed a myriad of features and performance improvements not in Presto that make it the engine of choice these days, though because rebrands are hard and renames are confusing (shoutout to everyone still using the term “Tweets”), you’ll still see it referred to as Presto in some places. Amazon EMR docs are trying their best to clear up the confusion.
What is Trino?
We can start with the big, lengthy definition: Trino is a ludicrously fast, open source, distributed, massively parallel processing, SQL query engine designed to query large data sets from many disparate data sources. The important thing in the center there is that it’s a SQL query engine. You have data, you want to query it with SQL, and Trino allows you to do that. It can do this at a massive scale, making it useful for organizations large and small who are hoping to glean insights from their data.
Trino’s architecture involves deploying a cluster with a single coordinator node and many worker nodes. The coordinator interprets SQL statements into a query, then breaks that query down into stages, breaks the stages down into tasks, and assigns those tasks to worker nodes, which in turn break tasks down into splits so they can be run in parallel. A small Trino cluster may involve a single coordinator and worker node running on the same machine, while a large Trino cluster may involve hundreds of servers each operating as a worker node. Large organizations can run many Trino clusters, using the Trino Gateway as a load balancer and proxy to make the many clusters behave like one massive cluster.
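As a very loose illustration of that coordinator-and-workers flow, here is a toy Python sketch of something like a `sum` over a table. Threads stand in for worker nodes, `SPLITS` is made-up data, and the function names are hypothetical. Real Trino scheduling is far more involved, so read this as the general scatter-and-combine idea rather than how the engine is implemented.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table: each "split" is a chunk of rows a worker can scan alone.
SPLITS = [[3, 1, 4], [1, 5, 9], [2, 6, 5], [3, 5, 8]]

def assign(splits, worker_count):
    """Coordinator side: hand splits out round-robin, one task per worker."""
    tasks = [[] for _ in range(worker_count)]
    for index, split in enumerate(splits):
        tasks[index % worker_count].append(split)
    return tasks

def run_task(task):
    """Worker side: scan every split in the task and return a partial sum."""
    return sum(sum(split) for split in task)

def run_query(splits, worker_count=2):
    tasks = assign(splits, worker_count)
    # Each task runs on its own thread, standing in for a worker node.
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        partials = list(pool.map(run_task, tasks))
    # Final stage: combine the partial results into one answer.
    return sum(partials)

total = run_query(SPLITS)
```

Adding workers changes how the splits are spread out, not the answer: `run_query(SPLITS, 3)` returns the same total as the two-worker run.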
Why Trino?
While the data lakehouse storage format war has three horses in the race (and Iceberg is winning), Trino has been around long enough and is so good at what it does that it more or less stands on its own. If you have already solved the problem of where and how to store your data and now you’re looking for a query engine that can power your analytics, Trino is the answer. Apache Spark is a powerful compute tool for batch processing of data, and many organizations still use it for that, though Trino’s fault-tolerant execution mode can help compete with it on that front, while Spark cannot compete with Trino’s performance for analytics. Other established, performant analytics solutions will require you to migrate or ingest all of your data into a specific system so that you can run analytics on it, which is, to put it mildly, an expensive headache.
Back to the Icehouse
Believe it or not, “Back to the Icehouse” was the working title for Back to the Future (you shouldn’t believe it, that’s a lie). But with all of this context, you’re ready to understand what the Icehouse is: it’s Iceberg as your storage format, paired with Trino as your compute engine. Yep, that’s it. But in addition to the technologies that compose it being the best at what they do, there are a few additional key reasons why it’s a great stack.
Close integration
It’s worth explicitly pointing out that Iceberg was built for Trino, and Trino’s ongoing development has included many specific features and improvements for Iceberg. A lot of organizations are using the Icehouse stack already, and this means that there’s a lot of demand for improvements to it. While there are other table formats and other analytics engines, it’s hard for them to compete with the traction that this specific stack has already picked up. Trino and Iceberg go together like peanut butter and jelly.
Open source
Trino and Iceberg are both open source, and they both have thriving user and developer communities that are improving them all the time, sharing knowledge, and working to make them better than ever. They’re both supported by several vendors and deployed at a number of massive tech companies, meaning that contributions come in from all over the world, from wildly different use cases, and will continue to do so for the foreseeable future. The benefits of open source are numerous and varied, but it’s hard to overstate the value that comes from the widespread adoption and continued development. Other tools (such as those in the data visualization space) see the adoption and get a lot of value out of adding integrations, so they do, growing the ecosystem. Bugs are encountered, figured out, and fixed before you even bump into them.
Optionality
With separate storage and compute solutions that are standalone, independent technologies, it’s nearly impossible to get locked into an increasingly expensive cage where you’re dependent on a specific vendor to keep your data stack afloat. Historically, vendors in the data space have tried using proprietary systems and proprietary data models that make it difficult, costly, or even impossible to get your data out of their system. When there’s no alternative to bail out to, you lose optionality and prices go up. With a stack built on free, open, and available technologies, you can’t get stuck, and you’ll always have easy alternatives. If something new comes along and supplants Trino, you can swap to it without hassle. If Hudi ends up being a better solution for your needs down the line, you’ll need to migrate to it from Iceberg, but your compute and analytics can stay exactly the same.
In addition to that, because everything is open and interchangeable, vendor prices are forced to remain competitive, which benefits the buyer. It encourages additional innovation as vendors try to differentiate themselves. If you don’t want to pay a vendor to manage your Icehouse for you, you can deploy it on your own hardware yourself. Optionality is the opposite of lock-in, and it means that you don’t get stuck with a bad contract or expensive crutch that you can’t escape.
Starburst makes the Icehouse easy
The expertise and institutional knowledge built up from years of experience and development on Presto and then Trino have made Starburst the best-equipped vendor to help manage and deploy this stack for you. If you have a professional, grizzled data engineering team and the means of spinning up your own servers, point them at the Iceberg docs and the Trino docs, and they’ll likely be happy to get cracking. But as noted when discussing the downsides of a lakehouse, there’s a lot of complexity involved. Getting set up isn’t easy, and managing and maintaining it over time stays difficult and complex, which is why Starburst Galaxy manages it all for you.