“I got, I got, I got, I got options” – Pitbull
It’s the 1980s and computers are revolutionizing industries of all shapes and sizes. A database company shows you the promise of relational database management systems for organizing and analyzing all of your data that used to be stored in flat files, or even worse, on paper! You can deploy an RDBMS in each department of your company and start to digitize your processes. “I’d like to buy a dozen of those please, one for each department,” you say.
A few years later, a different vendor comes along and shows you the promise of an enterprise data warehouse. You can now take all of those databases and consolidate them in one place – a single source of truth. Even better, it comes with its own hardware so you can just plug it in. “Sold!” you yell.
Then it’s the late 1990s and your team starts deploying open source databases to support web applications. “It’s free, this is great!”
The mid 2000s arrive and you have a new problem: You’re now collecting weblog data from your website and the volume is getting so large that you are struggling to analyze it on one physical machine. Fortunately, vendors introduce you to a new class of database systems driven by massively parallel processing (MPP). You immediately buy one for your rapidly scaling needs. “We’re doing ‘Big Data,’” you explain to your colleagues.
It’s now the 2010s and you have a new problem. Those MPP databases could scale to some degree, but the costs are scaling as fast as the data and have now grown out of control. Furthermore, those databases lack the flexibility required for modern analytics and data science. So you turn to yet another new technology: Hadoop. You buy 300 servers and introduce the data lake concept to your enterprise.
That brings us to today…
It’s 2023 and you’re now moving to a cloud model, where the separation of storage and compute provides tremendous flexibility for controlling both costs and performance. Store your data in the cloud in an object storage data lake, and just spin up and down your compute resources as needed. Is Nirvana finally here?
It certainly might be, but there is an obstacle. You still have database systems from every era of database history. Your data center might as well be a museum.
How are you going to get all of your data into your new cloud data lake without disrupting your analysts and your important business applications?
Furthermore, having been burned every few years by some new technology trend that forces an expensive and time-consuming data migration effort, how do you know there won’t be some new architecture en vogue a few years from now? How can you break this vicious cycle of being locked into platforms that become obsolete so quickly?
I want to say one word to you. Just one word.
Optionality
op·tion·al·i·ty
noun
The potential for options.
How do we create optionality in enterprise data architectures?
Don’t Let History Repeat Itself
3 keys to a data architecture that can stand the test of time
- Embrace Storage and Compute Separation
- Use Open Data Formats
- Future-Proof Your Architecture with Abstraction
1. Embrace Storage and Compute Separation
In the old days, when you purchased servers to deploy a new database system in your data center, you were more than likely working with a “shared-nothing” architecture. Each server came with its own hard disk, memory, and CPU, and data was stored on the disk local to the memory and CPU within that server. This has some performance advantages, because you minimize the latency of moving data over a network, but it has a huge drawback: you have to buy hardware large enough to accommodate the maximum peak usage you anticipate. Therefore, you ALWAYS have to overbuy hardware and you end up with suboptimal utilization of your expensive resources.
By contrast, the cloud model introduces storage and compute separation and the promise of a new level of cost and performance efficiency.
Now, you can store all of your data in an object storage data lake, and simply scale up and down your compute as needed. This means you:
- Only pay for what you use
- Have complete control over turning the dial on cost/performance
- Minimize data duplication
- Eliminate data loading
- Use multiple platforms to access the same data
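To make the pattern concrete, here is a minimal sketch in Python using pyarrow: the data sits in Parquet files in an S3 data lake, and “compute” is just whatever process runs this code, spun up when needed and gone when it finishes. The bucket, paths, and column names are hypothetical placeholders, not a prescription.

```python
# Minimal sketch of storage/compute separation: compute is an ephemeral
# process reading shared object storage on demand.
# Bucket, prefix, and column names are hypothetical.
import pyarrow.dataset as ds

# Point at Parquet files sitting in an S3 data lake -- no loading step,
# no copy into a database's proprietary storage.
weblogs = ds.dataset("s3://my-data-lake/weblogs/", format="parquet")

# Pull only the columns and rows this job needs; everything else stays
# in object storage and costs nothing to ignore.
table = weblogs.to_table(
    columns=["user_id", "url", "response_ms"],
    filter=ds.field("status_code") == 500,
)
print(table.num_rows)
```

The same files can be scanned by a second (or tenth) engine at the same time, which is what makes the “use multiple platforms to access the same data” point possible.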
2. Use Open Data Formats
Hadoop may be on its way out, but one of its lasting contributions is certainly the development of open data formats optimized for high performance analytics. In particular, I’m talking about:
- ORCFile
- Parquet
- Avro
These three file formats have changed the world forever. ORC and Parquet are columnar and read-optimized by design, while Avro is a row-oriented format well suited to writing and exchanging data; all three let you store data in a file system (like HDFS) or object storage (like S3) and still benefit from incredibly fast performance when performing analytics via SQL.
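As a small, hedged illustration of what “open” buys you (the column names and file paths below are made up): the same in-memory table can be written to Parquet or ORC in a few lines of Python, and any engine that speaks those formats, such as Trino, Spark, or Hive, can read it back, including reading only the columns a query actually touches.

```python
# Sketch: one table, two open formats, no proprietary lock-in.
# Column names and file paths are illustrative only.
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.orc as orc

table = pa.table({
    "user_id": [1, 2, 3],
    "url": ["/home", "/pricing", "/docs"],
    "response_ms": [12, 48, 7],
})

pq.write_table(table, "weblogs.parquet")   # readable by Trino, Spark, Hive, ...
orc.write_table(table, "weblogs.orc")      # same data, ORC layout

# The columnar payoff: read back only the columns a query needs.
urls_only = pq.read_table("weblogs.parquet", columns=["url"])
print(urls_only.column_names)  # ['url']
```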
Traditional RDBMSs use proprietary storage formats, which means it’s a heavy lift to get your data out of them. This is the whole reason your cloud transformation strategy is “complicated.” Don’t repeat this mistake. This is your chance to start fresh and never fall victim to vendor lock-in again. After all, data lock-in is the worst kind of lock-in.
Open source file formats set you free.
3. Future-Proof Your Architecture with Abstraction
If you are a cloud-native startup or an early pioneer in the shift to the cloud (read more in our FINRA case study), steps one and two might be enough to put you in a great position for the future. However, most enterprises are not in that nirvana state just yet.
More likely, you still have one of everything and you’re trying to figure out how you are going to pull off this enormous shift to the cloud without disrupting your business analyst community that depends on uninterrupted access to their data. The truth is, your user community doesn’t care where the data lives. In fact, they’d be happier if they didn’t even have to know.
Step three provides you with a bridge from the current state to the desired state by deploying some kind of an abstraction layer between your users and the data. There are a variety of terms for this: query federation, data virtualization, semantic layer, query fabric, or my personal favorite, the consumption layer.
Regardless of what you call it, the idea is to create a layer that takes a SQL query as input (from a BI tool, CLI, ODBC/JDBC, etc.) and handles the execution of that query as fast as possible. This layer should be highly scalable and MPP in design. It should be able to push down predicates and filters, and bring into memory only what’s needed to perform the join.
With this design, your analysts can access data anywhere, without any ETL or data movement required. Even better, they don’t even have to know if the data is in Teradata, Hadoop, or S3. They just get results, which is all they really care about.
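To make that decoupling concrete, here is a hedged sketch of what it can look like from the analyst’s side, using the open source Trino client for Python as one example of such a consumption layer. The host, catalogs, schemas, and table names are all hypothetical; the point is that a single SQL statement joins data living in two entirely different systems.

```python
# Sketch: one federated query through an abstraction/consumption layer
# (Trino in this example). Host, catalog, schema, and table names are
# hypothetical placeholders.
import trino

conn = trino.dbapi.connect(
    host="query-layer.example.com",  # the abstraction layer, not a database
    port=8080,
    user="analyst",
)
cur = conn.cursor()

# Join weblogs in the S3 data lake with customers in an on-prem RDBMS.
# Where the connectors support it, predicates are pushed down to each
# source so only the matching rows move across the network.
cur.execute("""
    SELECT c.segment, count(*) AS visits
    FROM hive.web.clickstream AS w
    JOIN postgresql.crm.customers AS c
      ON w.user_id = c.user_id
    WHERE w.status_code = 500
    GROUP BY c.segment
""")
for segment, visits in cur.fetchall():
    print(segment, visits)
```

The analyst writes SQL against logical tables; whether those tables resolve to a data lake, a legacy warehouse, or an operational database is the layer’s problem, not theirs.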
With your users happy, you now have the freedom to migrate data from on-prem to cloud or proprietary database to data lake, at your own pace. You’ve successfully decoupled the user experience from where the data lives.
“If you ‘have optionality,’ you don’t have much need for what is commonly called intelligence, knowledge, insight, skills, and these complicated things that take place in our brain cells. For you don’t have to be right that often.”
– Nassim Taleb
I hope you enjoyed this blog post. If you have any questions or comments please contact us. Or if you want to learn more about an amazing technology that accomplishes all of the above, check out Starburst Enterprise here.
What are some next steps you can take?
Below are a few ways you can continue your journey to accelerate data access at your company:
- Automate the Icehouse: Our fully-managed open lakehouse platform
- Follow us on YouTube, LinkedIn, and X (Twitter).