Computing history has steadily progressed from expensive, proprietary technologies to more affordable, scalable open platforms. Enterprise reliance on mainframe vendors to process large amounts of data began to fade in 2006 with the release of two competing storage systems.
The Apache Hadoop Distributed File System (HDFS) lets companies process large datasets on commodity hardware running in their on-premises data centers.
Cloud-based object storage, pioneered by the Amazon Simple Storage Service (Amazon S3), lets companies leverage the scalability and capacity of third-party storage services.
This article compares these two approaches to big data management, explains how enterprise analytics works with each, and describes how companies are modernizing their HDFS architectures by migrating to cloud-based open data lakehouses.
What is the Hadoop Distributed File System (HDFS)?
HDFS is a distributed file system and data store that lets companies manage terabytes of data on the commodity servers running in their data centers. One of the Hadoop framework’s four core modules, HDFS enables fault-tolerant, high-throughput data access for various use cases, including data warehouses.
What is Amazon Simple Storage Service (Amazon S3)?
Amazon S3 is a cloud-based data storage service that saves data as objects in an efficient and scalable flat structure. One of the pillars of Amazon Web Services (AWS) along with Elastic Compute Cloud (EC2), S3’s object storage lets companies build entire data architectures in the cloud without the expense of an on-premises data center.
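For illustration, a minimal sketch of working with S3 objects through the boto3 SDK might look like the following; the bucket and key names are hypothetical, and it assumes AWS credentials are already configured.

```python
# Minimal boto3 sketch; the bucket and key names are hypothetical and AWS
# credentials are assumed to be configured in the environment.
import boto3

s3 = boto3.client("s3")

# Store a record as an object. The key lives in a flat namespace, even though
# the "/" separators make it look like a directory path.
s3.put_object(
    Bucket="example-analytics-bucket",
    Key="raw/events/2024/01/events.json",
    Body=b'{"event": "page_view", "user_id": 42}',
)

# Retrieve the object by its key and read the payload back.
response = s3.get_object(
    Bucket="example-analytics-bucket",
    Key="raw/events/2024/01/events.json",
)
print(response["Body"].read())
```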
HDFS and S3 architecture
HDFS and S3 represent fundamentally different approaches to storing and managing data at scale. Understanding their origins and architectures will help explain why companies are re-evaluating their big data infrastructures to meet data’s ever-increasing volume, velocity, and variety.
HDFS architecture
Based on research published by Google and formalized by engineers at Yahoo!, Hadoop and HDFS addressed critical issues facing these pioneering web search engines. Websites had proliferated exponentially since the mid-1990s, forcing search companies to manage data at a scale that made conventional technology providers prohibitively expensive. As a result, companies like Yahoo! built their own data centers with cheaper but more failure-prone commodity servers.
Although Google kept its big data processing systems to itself, Yahoo! handed Hadoop and HDFS to the Apache Software Foundation, letting any company implement more affordable and scalable data architectures. Several design principles guided Hadoop’s development, including:
- Prioritization of read throughput over latency.
- Fault tolerance through data replication and rapid fault detection.
- Data locality, which processes data where it lives to improve performance and avoid network congestion.
Hadoop modules
In addition to HDFS, the Hadoop framework consists of three other modules for storing, managing, and processing massive amounts of data:
- Hadoop Common is a set of shared utilities and libraries.
- Hadoop Yet Another Resource Negotiator (YARN) provides resource and task management.
- Hadoop MapReduce is a Java-based framework for parallelized data processing.
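MapReduce jobs are typically written in Java, but Hadoop Streaming lets any executable act as the mapper and reducer. Purely as an illustrative sketch (the script and job are hypothetical), a word-count job in Python might look like this:

```python
#!/usr/bin/env python3
# wordcount.py -- a hypothetical word-count job for Hadoop Streaming,
# run as either the mapper or the reducer depending on the first argument.
import sys

def mapper():
    # Emit "word<TAB>1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Hadoop Streaming delivers keys in sorted order, so counts for the
    # same word arrive consecutively and can be summed in a single pass.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

The two functions would be wired into a job with the Hadoop Streaming JAR, which pipes input splits to the mapper and sorted, grouped keys to the reducer.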
HDFS cluster architecture
An HDFS cluster consists of a NameNode, which manages the file system’s namespace, and multiple DataNodes, which read and write the data. Rather than storing data as distinct files, HDFS splits it into blocks distributed across DataNodes on different servers and racks. Replicating these blocks ensures that a single server’s failure does not result in data loss.
Although HDFS distributes data blocks across multiple machines, the HDFS namespace presents users and applications with a hierarchical directory structure, much like a desktop operating system.
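As a rough sketch of how a client interacts with that namespace, the following uses PyArrow's HDFS bindings; the NameNode host, port, and paths are hypothetical, and it assumes a local Hadoop client (libhdfs) installation.

```python
# Minimal sketch using PyArrow's HDFS bindings (requires libhdfs and a Hadoop
# client installation); the NameNode host, port, and paths are hypothetical.
from pyarrow import fs

# The client talks to the NameNode, which tracks the namespace and knows which
# DataNodes hold each block; reads and writes then stream to the DataNodes.
hdfs = fs.HadoopFileSystem(host="namenode.example.internal", port=8020, replication=3)

# The namespace looks like an ordinary directory tree, even though each file
# is physically split into replicated blocks spread across DataNodes.
hdfs.create_dir("/data/clickstream/2024")
with hdfs.open_output_stream("/data/clickstream/2024/part-0000.csv") as out:
    out.write(b"user_id,page,event_time\n42,/home,2024-01-15T10:00:00\n")

# List everything under /data, as you would browse a local file system.
for info in hdfs.get_file_info(fs.FileSelector("/data", recursive=True)):
    print(info.path, info.size)
```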
S3 architecture
Amazon S3, like competitors Azure Blob Storage and Google Cloud Storage, is an object storage service that lets customers manage large amounts of data on a scalable, accessible cloud platform. An object consists of the data itself, a globally unique identifier (its key), and object-level metadata.
S3 buckets are private object collections that authorized users and applications may access. Each S3 bucket is created in a specific AWS Region, allowing companies to optimize for latency, cost, and regulatory compliance. Although a bucket stores its objects in a flat, non-hierarchical structure, key prefixes let S3 present data to users in folder-like groupings.
Amazon S3 offers multiple storage options, allowing companies to store data more cost-effectively. High-performance, low-latency options cost more but deliver the best performance for mission-critical applications. At the other extreme, inexpensive archival options for infrequently accessed data come with retrieval times measured in minutes or hours.
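As a hedged sketch of how these tiers might be used from boto3, the bucket name, prefixes, and retention periods below are hypothetical:

```python
# Minimal boto3 sketch of S3 storage classes and lifecycle rules; the bucket
# name, prefixes, and retention periods are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")

# Mission-critical data can go straight to the default low-latency tier...
s3.put_object(
    Bucket="example-analytics-bucket",
    Key="hot/orders/latest.parquet",
    Body=b"...",
    StorageClass="STANDARD",
)

# ...while rarely accessed data can be written to an archival class, where
# retrieval takes minutes to hours but storage costs far less.
s3.put_object(
    Bucket="example-analytics-bucket",
    Key="archive/orders/2019.parquet",
    Body=b"...",
    StorageClass="GLACIER",
)

# A lifecycle rule can also age objects through cheaper tiers automatically.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-analytics-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-orders",
                "Status": "Enabled",
                "Filter": {"Prefix": "hot/orders/"},
                "Transitions": [
                    {"Days": 90, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```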
Data processing & querying capabilities: Spark & Trino
HDFS and S3 cannot meet modern data processing and analytics demands by themselves. Hadoop’s legacy MapReduce technology is complex, relatively inaccessible, and imposes significant latency penalties. The Apache Hive project masks MapReduce complexity by adding a SQL-like layer to make Hadoop more accessible at the expense of even more latency. On the other hand, object stores like S3 cannot match a traditional data warehouse’s rich metadata and analytics capabilities.
Open-source analytics engines like Apache Spark and Trino can access either storage system to power large-scale analytics use cases.
Apache Spark
University of California, Berkeley researchers frustrated with Hadoop’s limitations developed Spark to support their data science and machine learning projects. The Spark framework consists of four elements:
- Spark Core provides scheduling, transformation, and optimization services to the other elements.
- Spark SQL gives users an ANSI-standard SQL language for querying datasets.
- Machine Learning library (MLlib) comprises optimized algorithms for machine learning projects.
- Structured Streaming lets Spark process real-time data streams.
Spark can use HDFS for storage and YARN for resource management while bypassing MapReduce entirely, so it processes data significantly faster than a standard Hadoop implementation. Spark can also read from and write to Amazon S3, although without YARN it needs a separate cluster manager such as Kubernetes.
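A minimal PySpark sketch of this might look like the following; the paths, column names, and bucket are hypothetical, and reading s3a:// paths assumes the hadoop-aws connector and credentials are configured on the cluster.

```python
# Minimal PySpark sketch; paths, column names, and the S3 bucket are
# hypothetical, and s3a:// access assumes the hadoop-aws connector plus
# AWS credentials are configured on the cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clickstream-report").getOrCreate()

# The same DataFrame API works whether the data lives in HDFS...
events = spark.read.parquet("hdfs://namenode.example.internal:8020/data/clickstream/2024")

# ...or in an S3 bucket.
# events = spark.read.parquet("s3a://example-analytics-bucket/raw/clickstream/2024")

# Register the DataFrame and aggregate it with Spark SQL instead of MapReduce.
events.createOrReplaceTempView("events")
daily_views = spark.sql("""
    SELECT date_trunc('day', event_time) AS day, count(*) AS page_views
    FROM events
    GROUP BY 1
    ORDER BY 1
""")
daily_views.show()
```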
Trino
Trino began as Presto, an initiative Facebook launched when its data teams needed a faster way to run interactive queries against their Hadoop data warehouse; it is now a community-governed open-source project.
Trino’s massively parallel processing architecture decouples compute from storage, letting data teams scale performance independently of storage capacity.
Support for ANSI-standard SQL lets data consumers at any skill level access large datasets through the business intelligence applications they already know or by integrating SQL commands in their own code.
Rather than remaining confined to the Hadoop ecosystem, Trino connectors let users query data from any enterprise data source, including Amazon S3. This federated approach unifies a company’s data architecture within a single point of access to support multiple use cases beyond analytics. For example, data engineering teams can use Trino to run batch ETL pipelines more efficiently.
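As an illustrative sketch, a single federated query through the open-source Trino Python client might look like this; the host, catalogs, schemas, and table names are hypothetical and assume a cluster already configured with an S3-backed catalog and a PostgreSQL catalog.

```python
# Minimal sketch using the open-source trino Python client; the host,
# catalogs, schemas, and table names are hypothetical assumptions about
# an already-configured cluster.
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",
    port=8080,
    user="analyst",
    catalog="lake",     # hypothetical S3-backed catalog
    schema="web",
)
cur = conn.cursor()

# One ANSI SQL statement joins S3-resident data with an operational database.
cur.execute("""
    SELECT c.region, count(*) AS page_views
    FROM lake.web.clickstream AS e
    JOIN postgres.public.customers AS c ON e.user_id = c.user_id
    GROUP BY c.region
    ORDER BY page_views DESC
""")
for region, page_views in cur.fetchall():
    print(region, page_views)
```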
Amazon EMR: Hadoop in the cloud
Hadoop’s origins in the data center make implementations relatively inflexible. For example, expanding compute and storage capacity requires infrastructure investments that take time and money to deploy. Moreover, building infrastructure to meet peak demand requires an over-investment in capacity that will go unused.
In 2009, Amazon introduced Elastic MapReduce (now simply Amazon EMR) to let customers build Hadoop infrastructure in the cloud. The EMR File System (EMRFS) implements the HDFS interface on top of Amazon S3, letting EMR clusters store data as S3 objects rather than HDFS blocks.
EMR clusters consist of EC2 instances, called nodes, that correspond to Hadoop’s NameNode and DataNodes. Latency-insensitive processing jobs can use MapReduce and Hive. However, Amazon EMR works with Spark and Trino to deliver more performant, low-latency results.
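For example, a small cluster could be launched with boto3's EMR client along these lines; the release label, instance types, IAM roles, and log bucket are hypothetical placeholders.

```python
# Minimal sketch using boto3's EMR client; the release label, instance types,
# IAM roles, and log bucket are hypothetical placeholders.
import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}, {"Name": "Trino"}],
    Instances={
        "InstanceGroups": [
            # The primary node fills the NameNode-like coordination role...
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            # ...while core nodes are analogous to DataNodes.
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 3},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://example-analytics-bucket/emr-logs/",
)
print(response["JobFlowId"])
```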
While Amazon EMR lets companies enjoy the scalability benefits of a cloud-based Hadoop platform, it also introduces several challenges:
- Complexity: Managing Hadoop and, by extension, Amazon EMR requires highly specialized skills and resources.
- Cost: Moving Hadoop to the cloud does not change how it couples compute and storage, making Amazon EMR a resource-intensive way to manage data.
- Optionality: Moving from a Hadoop implementation in the data center to one in AWS prevents companies from designing flexible multi-cloud architectures.
Modernizing HDFS to S3 data lakehouses
The rise of cloud-based object storage solutions led many companies to adopt the modern open data lakehouse architecture. These platforms store raw data using open file formats like Parquet and open table formats like Iceberg to create richer metadata than object storage systems provide natively.
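As a sketch of that pattern, the following PySpark session registers raw Parquet files on S3 as an Iceberg table; the Iceberg runtime version, catalog name, and warehouse bucket are hypothetical assumptions.

```python
# Minimal sketch of writing an Iceberg table on S3 with PySpark; the Iceberg
# runtime version, catalog name, and warehouse bucket are hypothetical, and
# s3a:// access assumes the hadoop-aws connector plus credentials.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-demo")
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://example-analytics-bucket/warehouse")
    .getOrCreate()
)

# Raw Parquet files become a managed Iceberg table with schema and snapshot
# metadata that any compatible engine (Spark, Trino, and others) can query.
raw = spark.read.parquet("s3a://example-analytics-bucket/raw/clickstream/2024")
raw.writeTo("lake.web.clickstream").using("iceberg").createOrReplace()

spark.sql("SELECT count(*) FROM lake.web.clickstream").show()
```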
Trino’s massively parallel processing query engine is the open data lakehouse’s final component, delivering capabilities such as:
- Dynamic scalability: Decoupling compute and storage lets open data lakehouses scale dynamically with analytics workloads without impacting storage capacity.
- Streamlined data management: By federating enterprise data architectures, Trino reduces the need for data movement and duplication. Less frequently accessed data can remain at the source, allowing the lakehouse to store higher-value data.
- Enterprise-wide accessibility: Trino's SQL compatibility democratizes data access and powers data-driven decision-making.
- Optionality: Parquet, Iceberg, and Trino make an open data lakehouse platform-agnostic, so companies are not locked into Amazon S3 storage.
What do you do after you move data to Amazon S3?
Starburst offers an enterprise-ready analytics solution for building open data lakehouses based on Trino's open-source technology. Advanced features like materialized views, smart indexing, and universal search and discovery make data more accessible. Security and governance features let data teams define role-based data access controls to create fine-grained permissions at the row and column levels.
A Starburst-based open data lakehouse makes Amazon S3 data easier to manage. For example, after landing data on S3, Starburst will automatically create and update Iceberg reference tables. Business intelligence applications and ad hoc queries get rapid, up-to-the-minute data access without incurring expensive data migration costs.