
Data Lakes without Hadoop

Jacqueline Vail
Brand Marketing Manager
Starburst
Hadoop architecture creates significant challenges for enterprises owing to problems with performance, maintenance, and scale. This problem is driving the need for modernization of Hadoop data stacks, leading directly to the adoption of data lakehouses as a replacement.
Using a SQL engine to power Iceberg is a highly effective solution. It provides data engineers and developers with immediate performance enhancements and cost savings without requiring a full system overhaul. This guide outlines how to upgrade your SQL engine to Starburst, harnessing the capabilities of Trino and Apache Iceberg, to transform your data architecture.
Why should you modernize? Traditional Hadoop infrastructures, while revolutionary in their time, now face serious challenges around performance, maintenance, scalability, and cost.
These challenges make it imperative for organizations to seek modern, scalable, and cost-effective solutions. For example, the Hadoop Distributed File System (HDFS), used by Hadoop as a storage layer, often has performance challenges with large-scale datasets. This can often lead to increased costs. Upgrading your SQL engine from Hive to Trino represents a significant leap forward in terms of performance, scalability, and efficiency. To understand why this upgrade is so impactful, it’s essential to first explore what HDFS is and how it contrasts with modern cloud storage solutions.
Hadoop is a unified storage and compute solution. HDFS, Hadoop’s file storage system, lets you access data from multiple sources and is designed to store large data volumes with high fault tolerance and scalability. In contrast, cloud object storage solutions like AWS S3 let you scale workloads up and down without those limitations. HDFS uses a block architecture that can handle large datasets but often leads to data duplication.
HDFS doesn’t store a file in a single location; instead, it divides the file into blocks that are distributed across the cluster’s drives, servers, and racks, typically on-prem hardware with directly connected racks and servers. A Hadoop cluster consists of a coordinator server and multiple worker servers: the coordinator runs the NameNode and JobTracker, while each worker runs a DataNode and TaskTracker. This architecture lets Hadoop process data as a distributed system, splitting large workloads into smaller processing operations and running them in parallel. MapReduce is the framework for writing these massively parallel data processing applications.
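The split-and-distribute step can be sketched in a few lines of Python. The block size, node names, and round-robin placement here are simplifications for illustration; a real NameNode uses 128 MB blocks and rack-aware placement.

```python
# Simplified sketch of HDFS-style block splitting and distribution.
# BLOCK_SIZE is bytes here for illustration; HDFS defaults to 128 MB.
BLOCK_SIZE = 128

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Divide a file's contents into fixed-size blocks (last may be partial)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_blocks(blocks: list[bytes], nodes: list[str]) -> dict[str, list[int]]:
    """Assign each block index to a worker node, round-robin."""
    placement: dict[str, list[int]] = {node: [] for node in nodes}
    for idx in range(len(blocks)):
        placement[nodes[idx % len(nodes)]].append(idx)
    return placement

file_contents = b"x" * 1000  # a 1000-byte "file"
blocks = split_into_blocks(file_contents)
placement = assign_blocks(blocks, ["node-1", "node-2", "node-3"])
print(len(blocks))            # 8 blocks: seven full, one partial
print(placement["node-1"])    # [0, 3, 6]
```

No single node holds the whole file; reads and processing can proceed in parallel across the nodes that hold the blocks.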
To guard against hardware failure, HDFS replicates data. It stores one replica on the local rack, a second on a node in a different rack, and a third on another node in that second rack, tripling the storage footprint and leading to unnecessary data duplication.
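The default placement policy described above can be sketched as follows. The rack and node names are invented, and a real NameNode also weighs node load and health when choosing targets.

```python
# Sketch of HDFS's default 3-replica placement: one replica on the
# writer's local rack, two on distinct nodes of one remote rack.
import random

def place_replicas(local_rack: str, racks: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Return (rack, node) pairs for the three replicas of one block."""
    rng = random.Random(0)  # seeded only so the sketch is reproducible
    first = (local_rack, rng.choice(racks[local_rack]))
    remote_rack = rng.choice([r for r in racks if r != local_rack])
    node_a, node_b = rng.sample(racks[remote_rack], 2)  # two distinct nodes
    return [first, (remote_rack, node_a), (remote_rack, node_b)]

racks = {
    "rack-a": ["a1", "a2", "a3"],
    "rack-b": ["b1", "b2", "b3"],
}
replicas = place_replicas("rack-a", racks)
print(replicas)  # one replica on rack-a, two on different rack-b nodes
```

Losing any single node, or even a whole rack, leaves at least one replica readable, which is the point of the policy and the source of the 3x storage cost.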
Unlike cloud object storage, Hadoop cannot handle the rising demands of more complex architectures. When organizations began to adopt cloud data lakes, their storage costs rose and their data teams had to manage architectures in which data warehouses sourced from data lakes. Because Hadoop does not fully separate compute and storage, data teams could not scale these components independently. Although HDFS could eventually run in the cloud, Hadoop could not easily scale with cloud-native companies.
Cloud object storage stores data as objects rather than as files in a filesystem like HDFS, which significantly reduces storage costs. Data lakes built on cloud storage can scale resources to meet demand as needed. Because compute and storage are separated, users can scale each independently and keep resources appropriately sized.
Hadoop was created to address the challenges of the big data era. Before Hadoop, data teams struggled to query the large volumes of data they were collecting. Hadoop allowed these teams to run queries and transformations on big data, using MapReduce to execute workloads. However, users had to write these complex jobs in Java and compile them into a machine-readable form before they could run. Although Hadoop made it easier for teams to process these datasets, data volumes continued to grow, and these queries quickly became slower and more complex, especially for analytical tasks.
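To make the contrast concrete, here is the classic word count expressed in the map/reduce style, with plain Python functions standing in for the Java map and reduce tasks Hadoop users had to write and compile. In Hive, the same question is a single SQL statement along the lines of `SELECT word, count(*) FROM words GROUP BY word`.

```python
# Word count in the MapReduce style: map emits (key, 1) pairs,
# a shuffle groups equal keys, and reduce sums each group.
from itertools import groupby

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle + reduce: group pairs by word, then sum each group's counts."""
    shuffled = sorted(pairs)  # sorting brings equal keys together
    return {word: sum(count for _, count in group)
            for word, group in groupby(shuffled, key=lambda kv: kv[0])}

counts = reduce_phase(map_phase(["big data", "big queries", "data"]))
print(counts)  # {'big': 2, 'data': 2, 'queries': 1}
```

Even this trivial aggregation takes two hand-written phases in the MapReduce model; anything resembling a join or a multi-step analytical query multiplies that effort, which is the gap Hive's SQL interface was built to close.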
In 2008, the Facebook data team created Hive to query data in Hadoop. Because it uses a SQL-like interface, most engineers could pick up Hive with limited training, running new queries and gaining insights faster from their Hadoop architecture. With this SQL interface, Hive masks MapReduce’s complexity. Despite these improvements, Hive was still too slow, and consumers could only run a limited number of queries.
While HDFS is used for storing and processing data, Hive runs over this framework to provide a SQL-like interface for querying that data in a data lake. Data is queried in Hive but stored in HDFS. Although Hive was created to make it easier for all data consumers to use SQL to run queries, Hive’s configuration options were still complex for many users.
Hive does not query a data lake as quickly as newer technologies because it must compile each SQL query into Java-based MapReduce jobs. This translation step creates processing overhead and limits the ability to run ad hoc queries.
Because Hadoop was originally designed for on-premises deployments, it struggles to adjust to the rise of cloud computing. Scalability is one of the main challenges with an HDFS architecture in a modern data landscape: HDFS cannot scale up or down as efficiently as a data lakehouse or other cloud-native architectures.
A modern lakehouse architecture is a great alternative to a Hadoop environment running Hive. By combining the benefits of a data warehouse with a data lake, this architecture offers a solution to the Hadoop and Hive performance, cost, and scalability problems. This modern open data lakehouse combines a Trino query engine with Iceberg’s open table format to achieve a data lake’s affordability with a data warehouse’s performance.
Compared to HDFS architectures running Hive, an open data lakehouse enables richer datasets and more accessibility. A data lakehouse architecture supports a variety of data types: structured, unstructured, and semi-structured. This means that analysts have access to more data. Building an open data lakehouse can support cross-cloud analytics at scale while democratizing secure data sharing with a single point of access and governance.
Starburst, built on the open-source Trino, offers a robust solution to these challenges. Trino is designed to process large-scale data at high speeds, making it ideal for enterprises dealing with extensive data workloads. By optimizing resource utilization and reducing the need for extensive hardware, Starburst can significantly lower operational costs. Additionally, Starburst supports a wide range of data sources and formats, enabling seamless integration with existing data ecosystems.
Starburst’s runtime replaces Hive’s runtime by utilizing the existing metastore metadata and files residing in storage. With Hadoop, scaling up requires paying for both storage and compute, leading to higher infrastructure costs. Starburst, on the other hand, connects to the storage and handles the compute process for reading files, allowing you to add more vCPU if needed. This separation of compute and storage ensures more efficient and cost-effective scaling compared to Hadoop.
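As a sketch of what that swap involves, a single catalog properties file is typically enough to point the Trino runtime at the existing metastore and storage. The file name and metastore host below are placeholders, not values from any real deployment.

```properties
# etc/catalog/lake.properties -- a hypothetical catalog definition.
# 'iceberg' reads Iceberg tables; connector.name=hive would instead
# query the existing Hive tables in place.
connector.name=iceberg
# Reuse the Hive Metastore already running in the Hadoop environment.
hive.metastore.uri=thrift://metastore.example.com:9083
```

No data is moved: the metastore keeps describing the same files, and only the engine that reads them changes, which is why compute can then be sized independently of storage.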
Optum faced challenges with its data warehouse, which couldn’t support growing analytics demands, leading to poor user experience and restricted new workloads. By replacing its Hadoop environment with Starburst, Optum achieved 10x faster queries, a 30% reduction in infrastructure costs, and projected savings of $8 million.
Upgrading your SQL engine to Starburst represents a pivotal step in modernizing your data infrastructure. It provides immediate ROI through improved performance, scalability, and cost-efficiency, setting the stage for further advancements in your data strategy. For data engineers and developers, this transition not only addresses current pain points but also opens up new possibilities for innovation and growth.
Embrace the future of data management with Starburst and experience the transformation firsthand. For more detailed information on upgrading to Starburst’s SQL engine and Hadoop modernization, visit our solution page, or explore our modern data lakehouse resources.