Data Lakes without Hadoop

  • Shaun Bruno

    Marketing

    Starburst

Cloud migration has dominated the news, and many companies are shuttering their data centers and letting cloud providers run their infrastructure for them. Elasticity, simplicity, and infrastructure agility are all good reasons to move, but many companies continue to host their own infrastructure. Their reasons may be security, or a belief that the cloud doesn’t provide cost benefits in their scenario.

For these companies, building a data lake usually means setting up a Hadoop cluster and choosing a vendor to support it (although this is less necessary than it used to be). Organizations like the idea of a company-wide object store that can hold a variety of data, both structured and unstructured. A number of companies offer S3-compatible object storage software that can be installed anywhere.

One advantage of deploying your own object store is that you get to use your own storage. That could be storage you already own, or a new cluster of commodity servers combined into one large storage pool. Since most of these storage engines support Amazon’s S3 protocol, they work seamlessly with Starburst and allow you to query data directly out of your data lake.

In this blog post, we’ll aim to understand foundational data lake solutions and how Starburst can help.

How the rise of cloud computing disrupted Hadoop’s dominance with object storage

Hadoop made it possible to process large amounts of raw data using distributed systems. At an architectural level, these early systems used the Hadoop Distributed File System (HDFS) to store their data in large, on-premises installations.

Over time, the rise of cloud computing disrupted Hadoop’s dominance, and object storage began to replace HDFS. Cloud object storage allowed compute and storage to be separated at a scale that was not possible before. At the same time, costs for cloud object storage were much lower.

This began a shift in data lakes from exclusive use of HDFS toward the predominant use of distributed object storage. The shift also sparked developments in adjoining technologies that make better use of object storage. This is particularly true of query engines: object storage only stores data, so a separate query engine is needed to run queries against it. Starburst is designed to use both object storage and HDFS as needed.

Currently, the three largest providers of cloud data lake storage services are Amazon S3 (AWS), Microsoft Azure Blob Storage/Azure Data Lake, and Google Cloud Storage.

The Hadoop framework brought the ability to distribute large computing jobs using parallel processing. With the advent of cloud-based object storage, a technological revolution was under way.  

The emergence of Hive

But there was a problem. Hadoop was complex, especially for analytical tasks. Creating MapReduce jobs required an intricate knowledge of Java that many users lacked. This gap gave rise to a new technology, Hive, which enabled users to interact with Hadoop by controlling MapReduce using SQL syntax. This was a game-changing step, as it opened up data lake analytics to a new audience and helped drive its adoption.
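To make the gap concrete, here is a minimal sketch of the map, shuffle, and reduce steps behind a simple per-key total, next to the one-line SQL that expresses the same intent. It is written in Python for brevity rather than the Java that real MapReduce jobs used, and the sample rows, function names, and the `sales` table are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sample records: (region, amount) pairs standing in for raw rows.
ROWS = [("east", 10), ("west", 5), ("east", 7), ("west", 1), ("north", 4)]


def map_phase(rows):
    """Emit one (key, value) pair per input row."""
    for region, amount in rows:
        yield region, amount


def shuffle_phase(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


def total_by_region(rows):
    """Run the three phases in order on an in-memory list of rows."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


# The same intent as a SQL user would write it (table name is hypothetical).
EQUIVALENT_SQL = "SELECT region, SUM(amount) FROM sales GROUP BY region"
```

Calling `total_by_region(ROWS)` performs by hand what the `GROUP BY` states declaratively, which is roughly the translation work Hive took off the user's plate.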

Historically, most data lakes were built on Hadoop, a framework whose distributed file system (HDFS) can store vast amounts of data. Hadoop is designed to be scalable and fault-tolerant, meaning it can keep working even if some of the system’s servers fail, which made it a natural platform for data lakes.

When you build a data lake on Hadoop, you can use any number of technologies to access the data. You can use SQL-based tools like open source Trino, Hive, or Impala to run queries against the data. Or you can use Hadoop’s MapReduce framework to process and analyze the data.

An alternative to HiveQL

Hive was built on top of HDFS to provide SQL-like query functionality. This approach had many limitations owing to the compilation process needed to turn HiveQL into MapReduce. Starburst presents an alternative approach to HiveQL. 

The Starburst query engine conforms to the ANSI SQL standard. It provides a platform-independent, single point of access to data from any data source. Data can be housed in data lakes, data warehouses, or databases. Queries can be federated across multiple sources, providing a best-of-all-worlds approach.

For example, transactional data is often best served in a database, as those systems are designed to act as systems of record. At the same time, structured analytical data may still be processed in a data warehouse. Data lakes excel at semi-structured and unstructured data analytics. With Starburst, all of these systems can work together in a single query engine.
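The idea of one query spanning several systems can be illustrated with a toy stand-in. The sketch below does not use Starburst: it uses Python's built-in `sqlite3` module and `ATTACH DATABASE` to join a table in one database with a table in a second, separate database. The table names, the `lake` alias, and the sample rows are hypothetical. In Starburst the separate sources would be catalogs, addressed with `catalog.schema.table` names instead of the `main.` and `lake.` prefixes used here.

```python
import sqlite3


def build_demo_connection():
    """Create two separate in-memory databases reachable from one connection."""
    conn = sqlite3.connect(":memory:")  # "main" plays the transactional database
    conn.execute("ATTACH DATABASE ':memory:' AS lake")  # "lake" plays the data lake
    conn.execute("CREATE TABLE main.customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE lake.click_events (customer_id INTEGER, page TEXT)")
    conn.executemany(
        "INSERT INTO main.customers VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace")],
    )
    conn.executemany(
        "INSERT INTO lake.click_events VALUES (?, ?)",
        [(1, "/home"), (1, "/pricing"), (2, "/home")],
    )
    return conn


# One SQL statement that joins data living in two different databases.
FEDERATED_QUERY = """
SELECT c.name, COUNT(*) AS clicks
FROM main.customers AS c
JOIN lake.click_events AS e ON e.customer_id = c.id
GROUP BY c.name
ORDER BY c.name
"""


def clicks_per_customer(conn):
    """Run the cross-database join and return (name, clicks) rows."""
    return conn.execute(FEDERATED_QUERY).fetchall()
```

The point of the sketch is only the shape of the query: the person writing it names each source by prefix and lets the engine work out where the rows live.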

Starburst also offers superior performance when compared to other technologies. It achieves this with a Massively Parallel Processing (MPP) architecture, which splits the work of a query across the nodes of a large cluster and draws on their combined processing power.
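The split-and-combine pattern behind MPP can be sketched in a few lines. This is a simplified illustration, not Starburst's implementation: the function names are hypothetical, and a thread pool on one machine stands in for the workers of a cluster, so it shows the structure of the computation rather than any real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split(values, parts):
    """Divide the input into contiguous chunks, at most one per worker."""
    size = max(1, -(-len(values) // parts))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_aggregate(chunk):
    """Work done independently by each worker: a partial sum and a row count."""
    return sum(chunk), len(chunk)


def parallel_average(values, workers=4):
    """Average a list by combining the partial results from several workers."""
    if not values:
        raise ValueError("values must not be empty")
    chunks = split(values, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_aggregate, chunks))
    # Final step, done once: merge the partial sums and counts.
    total = sum(part_sum for part_sum, _ in partials)
    count = sum(part_count for _, part_count in partials)
    return total / count
```

Each worker returns a sum and a count rather than an average, because averages of chunks with different sizes cannot simply be averaged again. Partial results have to be chosen so that they merge correctly.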

Finally, because each dataset can live in the storage option best suited to its use case, costs can be reduced when compared to other techniques. Highly structured data can be retained in data warehouses, while unstructured data can be held in a less expensive data lake without sacrificing access. At the same time, the ability to scale compute resources to meet a number of different needs helps save costs in another way.

Don’t just take our word for it. Here’s what Comcast, a Starburst customer, had to say: “When end users are going into on-prem or cloud environments, they will be presented with all the data sets they have access to, irrespective of where the data is located. This offered huge value to our end users.”