6 Considerations for Choosing the Right Cloud Data Lake Solution

Strategy

Kamil Bajda-Pawlikowski, Co-Founder and CTO, Starburst

Data lakes have amazing attributes. For one, they enable us to handle vast, complex datasets. Data lakes offer an up-to-date stream of data that is available for analysis at any time and for any business purpose. In fact, the data can remain in its native, granular format.

Modern data lakes can also support high-performance query engines, giving users access to both raw and transformed data directly in the data lake.

I would say their biggest advantage is flexibility, because a data lake is not just storage: it is a strategic technology stack that, at its most basic, consists of three essential layers (a minimal code sketch of the stack in action follows the list):

    1. scalable object storage, such as S3, ADLS, or GCS (petabyte to exabyte scale), to hold the data;
    2. a distributed query engine with a data virtualization layer, such as Trino, to query the data across many sources and formats; and
    3. a big data catalog, such as Hive Metastore or AWS Glue, to find that data and define its access policies.
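
To make these layers concrete, here is a minimal sketch, assuming a hypothetical Trino cluster whose hive catalog is backed by a metastore and whose table data lives as files in object storage. It uses the open-source trino Python client; the host, schema, and table names are illustrative assumptions, not part of this article.

```python
# A minimal sketch of the three-layer stack in action: object storage holds
# the table's files, the catalog (Hive Metastore or AWS Glue) maps them to
# a table, and Trino runs the SQL. Host, schema, and table are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",  # assumed Trino coordinator
    port=8080,
    user="analyst",
    catalog="hive",            # catalog layer (metastore-backed)
    schema="weblogs",          # assumed schema
)

cur = conn.cursor()
# Trino reads the files directly from object storage (S3/ADLS/GCS),
# so the data never leaves the lake.
cur.execute("""
    SELECT status_code, count(*) AS hits
    FROM page_views            -- assumed table
    GROUP BY status_code
    ORDER BY hits DESC
""")
for row in cur.fetchall():
    print(row)
```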

Moreover, cloud data lakes are transforming the way organizations think about their infrastructure and data. Enterprises today understand that cloud migration is critical to their long-term success.

According to a 2021 Accenture blog on cloud trends, worldwide end-user spending on public cloud services is forecast to grow 18.4% in 2021, demonstrating that commitment to the public cloud is only growing. But more importantly, the report states that “the cloud is more than an efficient storage solution — it’s a unique platform for generating data and innovative solutions to leverage that data.”

Across industries, more companies are moving to the cloud, and the on-premises data lake is no exception. If you don’t have a data lake yet, the cloud should definitely be a top priority.

Cloud-based solutions offer elastic scalability, agility, a typically lower total cost of ownership, increased operational efficiency, and the ability to innovate rapidly.

But to get the most out of your cloud data lake solution, you’ll need to deploy an analytics-ready data lake stack that enables you to turn your data into a strategic competitive advantage and achieve data lake ROI.

The analytics-ready stack requires the addition of two critical capabilities:

    1. workload observability and optimization
    2. query acceleration

These tools sit on top of your cloud data lake and query engine, enabling you to operationalize your data and serve as many use cases as possible with instantly responsive interactive queries, full cost and performance control, and minimal DataOps overhead.

6 Ways to Pick the Right Cloud Data Lake Solution

Here are six critical questions to consider when choosing the best cloud data lake stack for your business.

1. Which public cloud platform: AWS, Azure, or Google Cloud Platform?

Choosing the right cloud platform provider can be a daunting task, but you can’t go wrong with the big three: AWS, Azure, and Google Cloud Platform. Each offers its own massively scalable object storage, a data lake orchestration solution, and managed Spark, Trino, and Hadoop services.

Let’s take a look at each of the big three:

AWS Lake Formation & S3.

AWS Lake Formation provides a wizard-style interface over various pieces of the AWS ecosystem that allows organizations to easily build a data lake. The primary backend storage of an AWS data lake is S3. S3 is highly scalable and available, and can be made redundant across a number of availability zones. S3 offers several storage classes (such as Standard, Standard-IA, and Glacier), trading lower storage costs for higher retrieval costs on infrequently accessed data. S3 also supports object versioning (enabled per bucket), where each version is addressable so it can be retrieved at any time. S3 offers rich functionality, it has been around the longest, and many applications have been developed to run on it.
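
As a minimal sketch of those two features, assuming the boto3 SDK and a hypothetical bucket name and key prefix, the snippet below enables per-bucket versioning and adds a lifecycle rule that tiers colder objects down toward Glacier.

```python
# A hedged sketch (bucket name and prefix are hypothetical) of the S3
# features above: per-bucket object versioning and a lifecycle rule that
# transitions aging objects to cheaper storage classes.
import boto3

s3 = boto3.client("s3")
bucket = "example-data-lake"  # assumed bucket name

# Object versioning is opt-in per bucket.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Tier rarely-read data down: Standard -> Standard-IA -> Glacier.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-cold-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},  # assumed key prefix
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```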

Azure Data Lake & Blob Storage.

Azure Data Lake is centered around its storage capacity, with Azure Blob Storage being the rough equivalent of Amazon S3. It offers three access tiers (Hot, Cool, and Archive) that differ mainly in price, with lower storage costs but additional read and write costs for data that is infrequently or rarely accessed. Azure Data Lake builds heavily on the Hadoop architecture. Additionally, Azure Blob Storage can be integrated with Azure Search, allowing users to search the contents of stored documents, including PDF, Word, PowerPoint, and Excel files. Although Azure provides some level of versioning by letting users snapshot blobs, unlike S3 versioning (which, once enabled, captures every overwrite) snapshots must be created explicitly.
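
A comparable sketch for Azure, assuming the azure-storage-blob SDK and hypothetical connection string, container, and blob names: it moves a blob to the Cool tier and takes an explicit snapshot.

```python
# A hedged sketch of the Azure features above via the azure-storage-blob
# SDK. Connection string, container, and blob names are hypothetical.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="datalake", blob="raw/events.parquet")

# Move the blob to the Cool tier: cheaper storage, costlier reads.
blob.set_standard_blob_tier("Cool")

# Versioning is explicit: snapshots are taken on demand, not on overwrite.
snapshot = blob.create_snapshot()
print("snapshot id:", snapshot["snapshot"])
```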

Google Cloud Storage.

Google Cloud Storage is the backend storage mechanism driving data lakes built on Google Cloud Platform. As with the other cloud vendors, Google Cloud Storage is divided into classes (Standard, Nearline, Coldline, and Archive) by access frequency, with less frequently accessed storage being much cheaper to hold but more expensive to read. Like S3, Google Cloud Storage supports object versioning, enabled per bucket.
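
And the same sketch for GCS, assuming the google-cloud-storage client library and hypothetical bucket and object names: it enables per-bucket versioning and rewrites an object to the cheaper Nearline class.

```python
# A hedged sketch of the GCS features above via the google-cloud-storage
# client library. Bucket and object names are hypothetical.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("example-data-lake")  # assumed bucket

# As with S3, versioning is enabled per bucket.
bucket.versioning_enabled = True
bucket.patch()

# Rewrite a cold object from Standard to Nearline to cut storage cost.
blob = bucket.blob("raw/2020/events.parquet")    # assumed object
blob.update_storage_class("NEARLINE")
```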

2. Which open-source query engine and data virtualization technology?

There are many open source and commercial tools available to choose from. The most popular ones are:

Trino (formerly known as PrestoSQL).

Originally built at Facebook as Presto, Trino is a distributed ANSI SQL query engine that works with many BI tools and is capable of querying petabytes of data.

While Presto was built to solve for speed and cost-efficiency of data access at a massive scale, Trino has been expanded by Presto’s founders to accommodate a much broader variety of customers and analytics use cases.

Trino is user-friendly, with good performance, high interoperability, and a strong community. It can access and combine data from multiple sources within a single query, supports many data stores and data formats, and ships with many connectors, including Hive/Delta/Iceberg, Elasticsearch, PostgreSQL, MySQL, Kafka, and more.
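
Federation is easiest to see in a query. Here is a minimal sketch, assuming a hypothetical Trino cluster configured with a hive catalog (lake tables) and a postgres catalog (an operational database); the catalog, schema, and table names are illustrative.

```python
# A hedged sketch of a federated Trino query: one SQL statement joining a
# lake table (hive catalog) with an operational database (postgres catalog).
# All catalog, schema, and table names are hypothetical.
import trino

conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="analyst")
cur = conn.cursor()
cur.execute("""
    SELECT c.name, sum(o.amount) AS total
    FROM hive.sales.orders AS o          -- Parquet files on the lake
    JOIN postgres.public.customers AS c  -- live relational database
      ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total DESC
    LIMIT 10
""")
print(cur.fetchall())
```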

Apache Drill.

Drill is an open-source distributed query engine that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is an open-source implementation of the ideas behind Google’s Dremel system, which Google also offers as an infrastructure service called BigQuery. Drill uses a columnar in-memory representation for computation (its value vectors were the basis for Apache Arrow) and Apache Calcite for query parsing and optimization. Drill shares some features with Trino, including support for many data stores, nested data, and rapidly evolving structures — but it has never enjoyed wide adoption, mainly because of its inherent performance and concurrency limitations.
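
For a flavor of Drill in practice, here is a hedged sketch that submits SQL over Drill’s REST API (POST /query.json); the host and file path are hypothetical, and the query reads a raw JSON file through Drill’s built-in dfs storage plugin.

```python
# A hedged sketch of querying Apache Drill over its REST API
# (POST /query.json). Host and file path are hypothetical; the query reads
# a raw JSON file through Drill's built-in dfs storage plugin.
import requests

resp = requests.post(
    "http://drill.example.com:8047/query.json",  # assumed Drill host
    json={
        "queryType": "SQL",
        "query": "SELECT * FROM dfs.`/data/raw/events.json` LIMIT 5",
    },
    timeout=60,
)
resp.raise_for_status()
for row in resp.json().get("rows", []):
    print(row)
```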

Spark.

Apache Spark is an open-source unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. SQL is supported through Spark SQL, with queries executed by a distributed in-memory computation engine on top of structured and semi-structured datasets. Spark works with multiple data formats but is more general in its application, supporting a wide range of workloads such as data transformation, ML, batch queries, iterative algorithms, and streaming. Spark has seen less adoption for interactive queries than Trino or Drill.
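
As a minimal PySpark sketch, assuming a cluster already configured with an object-store connector (e.g., hadoop-aws) and a hypothetical S3 path and column name, this reads Parquet from the lake and queries it with Spark SQL.

```python
# A hedged sketch of Spark SQL on the lake: read Parquet from object
# storage, register a temporary view, and query it with SQL. The s3a://
# path and the event_time column are hypothetical assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-sql-sketch").getOrCreate()

events = spark.read.parquet("s3a://example-data-lake/raw/events/")
events.createOrReplaceTempView("events")

daily = spark.sql("""
    SELECT date(event_time) AS day, count(*) AS n   -- assumed column
    FROM events
    GROUP BY date(event_time)
    ORDER BY day
""")
daily.show()
```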

3. Your cloud data lake, or a managed solution?

With managed analytics services, enterprises can start using data lake analytics quickly — and let their third-party provider handle storing and managing the data. A managed solution also enables users throughout the company to quickly run unlimited queries without having to wait on the DevOps team to allocate resources. However, as adoption and query volume grow, spending balloons dramatically.

At the same time, every managed analytics solution becomes another data silo to manage. Unified access control, audit trails, data lineage, discovery, and data lake governance become complex, requiring custom integrations and deepening vendor lock-in. That’s the double-edged sword of “quick-start” managed solutions that CIOs need to be aware of, so they can prepare to shift to more economical in-house DataOps programs for the cost and control advantages they offer in the long term.

4. Which use cases are served best by the cloud data lake analytics stack?

The cloud data lake analytics stack can be used for a wide range of analytics use cases:

    • Ad hoc querying directly on the data lake without the need for transformation

    • Reporting and dashboarding for self-service BI
    • A/B testing and marketing analytics with results in hours instead of days
    • Customer-facing reports, dashboards, and interactive analytics in your own applications with low latency and hundreds of highly available concurrent queries
    • Federated querying across multiple data sources, including databases, data lakes, and lakehouses, on-premises or in the cloud
    • Advanced analytics support for data scientists and analysts to provision and experiment with data
    • Near real-time analysis of streaming and IoT data
    • A security data lake as a modern, agile alternative to traditional SIEM/SOC platforms, enabling rapid anomaly and threat detection over massive amounts of data

The cloud data lake analytics stack dramatically improves speed for ad hoc queries, dashboards and reports. It enables you to operationalize all your data and run existing BI tools on lower-cost data lakes without compromising performance or data quality, while avoiding costly delays when adding new data sources and reports.

5. How to serve various use cases on the cloud data lake and avoid data silos?

In order to serve as many use cases as possible and shift your workloads to the cloud data lake, you need to avoid data silos. You’ll also need to make sure your stack is analytics-ready with workload observability and acceleration capabilities, so it can be easily integrated with niche analytics technologies such as text analytics for folder and log analysis.

A solution with integrated text analytics lets data teams run text search at petabyte scale directly on the data lake for marketing, IT, and cybersecurity use cases (and more). Traditional text analytics platforms were simply not designed for these “needle in a haystack” searches at petabyte scale.
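
To illustrate the shape of such a search, here is a hedged sketch of a “needle in a haystack” query run directly on the lake through Trino; the catalog, schema, table, and column names are hypothetical, and an acceleration layer is what keeps scans like this interactive at petabyte scale.

```python
# A hedged sketch of a "needle in a haystack" text search run directly on
# the lake through Trino. Catalog, schema, table, and columns are
# hypothetical assumptions.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com", port=8080, user="secops",
    catalog="hive", schema="security",
)
cur = conn.cursor()
cur.execute("""
    SELECT ts, host, message
    FROM access_logs                      -- assumed table
    WHERE ds >= '2021-06-01'              -- prune partitions first
      AND regexp_like(message, 'failed login|invalid token')
    LIMIT 100
""")
print(cur.fetchall())
```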

6. How to achieve the optimal query performance & price balance?

The agility and flexibility benefits of the cloud data lake are clear, but performance and cost are the critical driving forces behind the massive adoption of data lakes. As analytics use cases grow across every business unit, data teams will continue to struggle to balance the two.

Manual query prioritization and performance optimization are time-consuming, hard to scale, and often result in heavy DataOps overhead. To expand the open data lake concept across the entire organization, data teams should seek a smart, dynamic solution that autonomously accelerates queries using advanced techniques such as micro-partitioning or dynamic indexing.
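
The intuition behind micro-partitioning is ordinary partition pruning taken further. As a hedged sketch, assuming Trino’s Hive connector and hypothetical catalog, schema, and column names, partitioning a table by day lets the engine skip every file outside the requested date.

```python
# A hedged sketch of the partition-pruning idea behind such acceleration,
# using Trino's Hive connector (partitioned_by is one of its table
# properties). Catalog, schema, and column names are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com", port=8080, user="dataeng",
    catalog="hive", schema="sales",
)
cur = conn.cursor()

# Partition columns must come last; each distinct ds value gets its own
# directory of files in object storage.
cur.execute("""
    CREATE TABLE orders_by_day (
        order_id bigint,
        amount   double,
        ds       varchar
    )
    WITH (format = 'PARQUET', partitioned_by = ARRAY['ds'])
""")
cur.fetchall()  # drain the cursor so the DDL completes

# This filter is answered from partition metadata: files for other days
# are never opened, which is the essence of pruning.
cur.execute("SELECT count(*) FROM orders_by_day WHERE ds = '2021-06-01'")
print(cur.fetchone())
```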

Get your cloud data lake analytics-ready with Starburst

Starburst is an autonomous query acceleration platform that gives users control over the performance and cost of their cloud data lake analytics. Starburst delivers high ROI by leveraging dynamic and adaptive indexing and caching, an efficient scan and predicate pushdown implementation, and optimized dynamic filtering to accelerate SQL queries by an order of magnitude versus other data lake query engines. Starburst autonomously and continuously learns and adapts to the users, the queries they run, and the data being used. Observability gives DataOps teams an open view of how data is used across the entire organization, so they can focus DataOps resources on business priorities.

With Starburst, data teams and users no longer need to compromise on performance in order to achieve agility and cost effectiveness. Now is the time to migrate your analytics workloads to your cloud data lake. Chances are, your competition is already there.