
Data Lake Analytics for Smart, Modern Data Management

Tom Nats
Director of Customer Solutions
Starburst
Almost every company I speak to has some sort of data lake. One thing we can thank Hadoop for is that companies turned their landing, staging, and raw zones into data lakes; from there, some of the data moves on to a data warehouse.
In reading the new report on data lakes and lakehouses from GigaOm, I was excited to see companies like ours and others continuing to innovate around providing value from a lake vs. copying the data to a data warehouse.
Much of my job lately has been educating companies on the hidden value their data lake can provide while serving most of their analytical use cases. There is still a lot of work to do and, in my experience, there are 4 things that make companies hesitant to use their data lakes for a majority of their analytics:
As we see from the report, the features and abilities of the different data lake vendors are converging and have become almost equal to those of a data warehouse. Data lakes are transforming from swamps into well-structured architectures that can handle a wide range of use cases, from large multi-terabyte queries to queries that return in seconds or even under a second.
The diagram below illustrates what most companies perceive as a common data flow and use cases between a data lake and a warehouse:
This report shows that more and more companies are choosing to serve their analytics from data stored in a single location because of the simplicity and cost savings:
I presented these same ideas in a dbt Coalesce presentation, which can be found here, with slides here. The talk covers the misconceptions about data lakes and an open, multi-engine approach, which differs from the approach of some of the vendors in this report.
Building an open data lake architecture gives you the greatest flexibility and future-proofs your company against being locked into one engine or even one storage format. It lets data users access data in their data lake directly via SQL, reduces complexity, and makes life easier for data teams.
Top 4 things to consider when building or repurposing your data lake to handle more use cases:
A modern open data lakehouse, with data stored in vendor-agnostic formats, is the architecture that best enables data democratization, both today and for years to come.
As data lakes and lakehouses become more mainstream, the waters will quickly become muddied with new vendors jumping in. This further validates the case for enhancing your data lake to handle more use cases vs. continuing to duplicate data in a data warehouse and paying the additional expenses that come with it.