The unbundling of cloud data warehouses
More deployment options
I was listening to the excellent “Data Engineering Podcast” yesterday, and the episode was “Surveying The Market Of Database Products” with Tanya Bragin, who is the VP of Product at ClickHouse. If you haven’t heard this episode or even this podcast, you are really missing out. Tanya does a wonderful job summarizing the state of databases, mostly focusing on where the analytical side is going.
One phrase Tanya said a few times in this episode was the “unbundling of cloud data warehouses”. This simply means separating storage from compute, which data lakes have provided since day one. What’s changed is that more and more use cases can be served from an open data lake by numerous engines, with features such as ACID transactions, higher performance (SSD caching and indexing) and enterprise-grade security (RBAC, ABAC, masking, etc.), thanks to table formats like Iceberg, Delta Lake and Hudi. This means you aren’t stuck with a single vendor for all of your analytics, and that includes storage lock-in.
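To make the idea concrete, here is a toy sketch of storage separated from compute. It is my own illustration, not anything from the episode: a temporary local directory stands in for lake storage, a CSV file stands in for an open table format, and the two "engines" (plain Python and an in-memory SQLite database) are hypothetical stand-ins for real lake engines. The point is only that both read the same file and neither one owns it.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def write_lake_file(lake_dir: Path) -> Path:
    """Land a small dataset in 'lake storage' (here, just a local directory)."""
    path = lake_dir / "orders.csv"
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["region", "amount"])
        writer.writerows([["us", 10], ["eu", 20], ["us", 5]])
    return path


def engine_a_total_by_region(path: Path) -> dict:
    """'Engine A': plain Python scanning the file directly."""
    totals = {}
    with path.open(newline="") as f:
        for row in csv.DictReader(f):
            totals[row["region"]] = totals.get(row["region"], 0) + int(row["amount"])
    return totals


def engine_b_total_by_region(path: Path) -> dict:
    """'Engine B': a SQL engine (in-memory SQLite) reading the same file."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
    with path.open(newline="") as f:
        rows = [(r["region"], int(r["amount"])) for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    result = dict(conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region"))
    conn.close()
    return result


with tempfile.TemporaryDirectory() as tmp:
    lake_file = write_lake_file(Path(tmp))
    totals_a = engine_a_total_by_region(lake_file)
    totals_b = engine_b_total_by_region(lake_file)
    print(totals_a, totals_b)
```

Swapping out either engine leaves the data where it is, which is the whole appeal. A real lake would use object storage and a table format such as Iceberg, Delta Lake or Hudi rather than a CSV file.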
Tanya was clear, though, that you should start by landing your data in a lake/object store, and that there will be times when you need to choose another technology to meet certain use cases. Examples of this are OLAP databases, which have been modified to handle the constant ingestion of data while providing sub-second analytical queries to power applications and real-time ad-hoc queries. Other use cases are high-performance search and industry-specific solutions. In the diagram below, I attempt to illustrate where things appear to be heading into 2024.
Some points:
- Previously closed architectures like Snowflake, BigQuery, etc. are starting to get on board and support external customer storage.
- Data sources can be batch or real-time with more and more systems hooking right into the source systems directly to basically “mirror” their data into lake storage.
- In addition to OLAP, we have other technologies such as search and purpose-built applications like Datadog, etc.
- There are many other lake engines, but they need to support the table formats I listed above in order to be considered as supporting an “open” architecture.
- One gotcha is who is going to “own” your table metadata. I outlined this in a recent post which can be found here. This can be considered lock-in as well. We’ll call it “metadata lock-in”…
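The metadata point is easier to see in a small sketch. With a format like Iceberg, the catalog holds the pointer to each table's current metadata file, and a commit swaps that pointer. Everything below is a hypothetical toy of my own (the `ToyCatalog` class, the `vendor-x` owner and the bucket paths are made up for illustration), not any vendor's API.

```python
class ToyCatalog:
    """Hypothetical catalog: maps a table name to its current metadata file."""

    def __init__(self, owner: str):
        self.owner = owner
        self._current = {}  # table name -> path of the current metadata file

    def current_metadata(self, table: str) -> str:
        return self._current[table]

    def commit(self, table: str, expected, new_metadata: str) -> None:
        # Compare-and-swap: succeed only if nobody else moved the pointer first.
        if self._current.get(table) != expected:
            raise RuntimeError("table changed since it was read; retry the commit")
        self._current[table] = new_metadata


# The data and metadata files live in *your* bucket...
v1 = "s3://my-bucket/orders/metadata/v1.json"
v2 = "s3://my-bucket/orders/metadata/v2.json"

# ...but the pointer to the current version lives in whoever runs the catalog.
catalog = ToyCatalog(owner="vendor-x")
catalog.commit("sales.orders", None, v1)
catalog.commit("sales.orders", v1, v2)

latest = catalog.current_metadata("sales.orders")
print(catalog.owner, latest)
```

Even though every file sits in storage you control, any engine that wants the latest version of the table has to ask the catalog first. That is the sense in which the catalog's owner can become a point of lock-in.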
To sum it up, 2024 is going to be an exciting time for companies that can finally build and take advantage of more open analytical environments without the fear of lock-in. If you get some time, I highly recommend the Data Engineering Podcast and especially this episode. Tanya is a wealth of information and knowledge on database technologies, and I agree with her assessment of the future of database engines.
And remember: data lakes aren’t any risk to your company… just silliness.