Data lakes enable the implementation of a wide range of solutions, including raw data collection, flexible data access for users, and fast, efficient data warehouses and data lakehouses. From a data and analytics perspective, a data lake can act as a staging ground where raw data is transformed into a format suited to analysis and reporting, and it can operate as something closer to a data warehouse with a built-in query engine.
In a classic enterprise data warehouse architecture, these two roles correspond to the Extract, Transform, Load (ETL) system and the analytics query engine. Implemented in a data lake, both ETL and analytics can be faster, in query response time and in time-to-insight, and more cost-effective than with traditional technologies.
Data lake architecture cost savings
In a data lake architecture, data teams can store 10x more data, and the cost of entry is often a fraction of the price of data warehouse solutions, which tend to start at a seven-figure price tag. As adoption of the data lake architecture gains momentum and more business units migrate workloads to take advantage of it, costs will rise to meet the growing need for compute and storage.
Data lake analytics significantly cuts down on ETL costs
It’s possible to replace legacy ETL solutions at a lower cost and with higher performance by using data lake analytics for raw data collection and transformation. ETL is a serialized activity that starts with modeling the target data in the data warehouse catalog, followed by designing data transformations that read the source data and make it conform to the target system schema.
Rideshare example: ETL vs. data lake-based solution
For example, a classic ETL process for collecting telemetry data from a rideshare system would normalize and validate all of the location fields, then look up the driver and rider IDs before storing valid records in the data warehouse. Invalid records might be rejected or, in some cases, could cause the entire process to fail.
In a data lake-based solution, the rideshare data is first collected and stored in raw format. That data can be queried directly without any transformation, or it can be transformed after being stored. Data engineers can address exceptions and resume the transformation instead of restarting it. Users get faster access to data, and any data transformation is more efficient than when done via ETL.
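The difference can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the field names (`ride_id`, `driver_id`, `rider_id`, `lat`, `lon`), the sample events, and the helper functions are all hypothetical and not taken from any particular rideshare system or data lake product. Raw lines are kept as they arrived, invalid records are set aside with their offset instead of failing the run, and the transformation can resume from a given offset.

```python
import json
from typing import Any, Dict, List, Tuple

# Hypothetical raw telemetry lines, stored exactly as they arrived.
RAW_EVENTS = [
    '{"ride_id": 1, "driver_id": "d-17", "rider_id": "r-42", "lat": 40.71, "lon": -74.00}',
    '{"ride_id": 2, "driver_id": "d-03", "rider_id": "r-08", "lat": 123.4, "lon": -74.00}',
    'not valid json',
    '{"ride_id": 4, "driver_id": "d-17", "rider_id": "r-11", "lat": 34.05, "lon": -118.24}',
]


def parse_and_validate(line: str) -> Dict[str, Any]:
    """Parse one raw line and check that the location fields are plausible."""
    record = json.loads(line)  # raises json.JSONDecodeError, a ValueError subclass
    if not -90 <= record["lat"] <= 90 or not -180 <= record["lon"] <= 180:
        raise ValueError("coordinates out of range")
    return record


def transform(
    raw_lines: List[str], start: int = 0
) -> Tuple[List[Dict[str, Any]], List[Tuple[int, str]]]:
    """Transform raw lines from offset `start` onward.

    Bad records are quarantined with their offset rather than aborting the
    run, so they can be inspected later while the raw data stays untouched.
    """
    valid: List[Dict[str, Any]] = []
    quarantined: List[Tuple[int, str]] = []
    for offset, line in enumerate(raw_lines[start:], start=start):
        try:
            valid.append(parse_and_validate(line))
        except (ValueError, KeyError, TypeError) as exc:
            quarantined.append((offset, str(exc)))
    return valid, quarantined


if __name__ == "__main__":
    valid, quarantined = transform(RAW_EVENTS)
    print("valid ride ids:", [r["ride_id"] for r in valid])
    print("quarantined offsets:", [offset for offset, _ in quarantined])
```

Because the raw lines are never discarded, a later run can call `transform(RAW_EVENTS, start=3)` to pick up from a known offset, which mirrors the "resume instead of restart" behavior described above.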
Supporting high-performance analytics directly on the data lake
Once data is stored in a data lake, it can be transformed and exported to a traditional data warehouse or analytics engine. Moving data to a data warehouse was a common implementation when data lakes first emerged and were slow to query. However, modern data lakes can now support high-performance query engines, giving users direct access to both raw and transformed data in the data lake itself.
A common solution today involves loading raw data directly into a data lake, then transforming it into an optimized format that users can query directly. The latest high-performance query engines used in data lakes rely on technologies such as columnar data storage, compression, data partitioning, and big data indexing.
In the same rideshare example, a data lake-based solution could feed a real-time machine learning system for ride assignment and rate calculation, a real-time reporting system for showing updated user reviews, and an interactive analytics system for business analysts to query. All of these can be served from the same data lake.
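Of the techniques listed, data partitioning is the easiest to show in a few lines. The following Python sketch is a simplified, in-memory stand-in, not how any specific query engine is implemented: the record fields (`pickup_time`, `fare`) and function names are hypothetical, and the `event_date=YYYY-MM-DD` key only imitates the key=value directory naming commonly used for partitioned files in a data lake.

```python
from collections import defaultdict
from typing import Any, Dict, List

Record = Dict[str, Any]


def partition_key(record: Record) -> str:
    """Build a key=value partition name from the date part of an ISO timestamp."""
    return "event_date=" + record["pickup_time"][:10]


def partition_records(records: List[Record]) -> Dict[str, List[Record]]:
    """Group records by partition, as a writer would when laying out files."""
    partitions: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        partitions[partition_key(record)].append(record)
    return dict(partitions)


def total_fare_for_day(partitions: Dict[str, List[Record]], day: str) -> float:
    """Partition pruning: scan only the partition matching the requested day."""
    rows = partitions.get("event_date=" + day, [])
    return sum(row["fare"] for row in rows)


if __name__ == "__main__":
    # Hypothetical transformed ride records.
    rides = [
        {"ride_id": 1, "pickup_time": "2024-01-15T08:30:00", "fare": 12.5},
        {"ride_id": 2, "pickup_time": "2024-01-15T18:05:00", "fare": 7.25},
        {"ride_id": 3, "pickup_time": "2024-01-16T09:10:00", "fare": 20.0},
    ]
    partitions = partition_records(rides)
    print(sorted(partitions))
    print(total_fare_for_day(partitions, "2024-01-15"))
```

A query filtered on a single day touches one partition and skips the rest, which is the main reason partitioning cuts scan time on large datasets.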
Online insurance platform accelerated time-to-insight
As a real-world example, Assurance, an online insurance platform, ran TV commercials that led to spikes in demand. Assurance needed to respond in real time, reach out to its agents, and bring them online to assist shoppers. During one peak period, Assurance’s real-time logging tool went down. The Director of Engineering suggested moving streaming events to its data lake, and the Data Science team got to work. Within three hours, all the key data streamed into its data lake, which was connected to Tableau via Starburst.
“It was not that hard at all, and what resulted was real-time analysis powered by Starburst,” says Mitchell Posluns, Senior Data Scientist of Assurance. “The data sources were abstracted away. We didn’t have to worry about sources and formats. We just plugged in to Starburst, built this beautiful and insightful dashboard, shipped it to stakeholders, and got back to business as usual.”
Becoming data-driven with data lake solutions
Becoming a data-driven organization is no doubt a challenge for many organizations. The solution is to provide fast, real-time access to data from different parts of the business process so that it drives measurable results.