Hadoop Distributed File System (HDFS)
HDFS is one element of the Apache Software Foundation’s Hadoop distributed data processing framework. Its fault-tolerant design delivers high-throughput access to large datasets, supporting a range of enterprise big data applications. This introduction to HDFS explains how the file system works and why modern open data lakehouse architectures have surpassed it.
HDFS vs S3
HDFS and Amazon S3 are examples of two different data storage architectures. With files organized in hierarchical directories, HDFS stores data as blocks distributed across the HDFS cluster’s servers and racks. Amazon S3, Azure Blob Storage, and Google Cloud Storage store data as objects in a flat structure, relying on each object’s metadata for discovery and access.
HDFS vs Hadoop
HDFS is one of the Apache Hadoop data processing framework’s four core modules. Another module is Hadoop YARN, a framework for managing Hadoop clusters and scheduling jobs. Hadoop MapReduce is Hadoop’s data processing engine. Finally, Hadoop Common is a set of shared utilities for the other modules.
HDFS vs Hive
Apache Hive is a query layer that gives Hadoop platforms data warehouse-like capabilities. Hive operates on top of MapReduce to streamline access to data stored in HDFS.
From Hadoop and HDFS to Hive, Trino, Iceberg, and the data lakehouse
In the early 2000s, search companies needed better ways to handle the sheer volume of pages they indexed and queries they served. Yahoo developers used research published by Google as the starting point for what would become Hadoop and its distributed file system.
This Java-based framework streamlined big data analytics and data science processing while making it possible to manage large files on more affordable commodity computer hardware. Over time, the Hadoop ecosystem expanded to include other open-source projects like the HBase non-relational database, the Spark analytics engine, and the Kafka real-time streaming data processing platform.
Hadoop Distributed File System
The HDFS architecture differs from the local file systems people are most familiar with. Desktop operating systems were designed to store files on a single system with a single storage device. HDFS originated in data centers that can distribute files across racks with multiple servers and servers with multiple drives.
Like a desktop file system, HDFS uses a hierarchical namespace in which directories organize the files stored in an HDFS cluster. Unlike a desktop, HDFS does not store a file in one place. Instead, it breaks each file into a series of blocks stored across the cluster’s drives, servers, and racks.
Each cluster consists of a single NameNode and at least one DataNode. The NameNode manages the cluster’s namespace, controls access to files, and tracks where each data block is stored. A DataNode runs on each worker server, serving read and write requests and handling block creation, deletion, and replication as directed by the NameNode.
Data replication is how HDFS provides fault tolerance in the event of hardware failure. By default, it stores one replica on the local rack, a second on a node in a different rack, and a third on another node in that second rack. This rack awareness lets HDFS prevent data loss should a server or even an entire rack fail.
Hive
When first released, Hadoop queries relied on MapReduce, a Java-based programming model with a steep learning curve. Despite Hadoop’s widespread use, MapReduce is beyond most data users’ abilities, a significant challenge for organizations putting data at the center of decision-making.
Facebook developers wanted to make Hadoop platforms more accessible by creating an SQL-like interface to MapReduce, now the Apache Hive project. Users write queries in the Hive Query Language (HiveQL), which Hive compiles into a MapReduce execution plan.
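As a rough illustration, a HiveQL statement like the one below, with hypothetical table and column names, is not executed directly; Hive compiles it into one or more MapReduce jobs that read the underlying HDFS blocks.

-- Hypothetical "orders" table; Hive turns this query into MapReduce jobs.
SELECT region, COUNT(*) AS order_count
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY region
ORDER BY order_count DESC;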
While Hive makes Hadoop data more accessible, the added complexity increases latency and consumes more computing resources.
Modernizing queries with Trino (formerly Presto)
Presto was another Facebook project that sought to overcome Hive and Hadoop limitations by creating a better method for running interactive queries on large-scale datasets.
The project would later fork to become the popular Trino distributed query engine, delivering multiple benefits for enterprise data analytics:
Query speed: Trino’s in-memory processing is significantly faster than MapReduce’s disk-based processing of data on HDFS.
Data access: Trino’s ANSI-standard SQL makes it accessible to a much wider range of users.
Federation: A system of connectors lets a single query return data from sources across the enterprise, from Hadoop to IBM DB2 to Amazon Redshift (see the sample query after this list).
Scalability: Massively parallelized queries let Trino leverage the affordable scalability of modern cloud computing services.
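To make federation concrete, here is a minimal sketch of a Trino query that joins a table in a Hive catalog backed by HDFS with a table in a PostgreSQL catalog. The catalog, schema, table, and column names are assumptions for illustration; the real names depend on how your connectors are configured.

-- "hdfs_hive" and "crm" are hypothetical catalog names defined by your
-- connector configuration; the tables and columns are examples only.
SELECT c.customer_name, SUM(o.total_price) AS lifetime_value
FROM hdfs_hive.sales.orders AS o
JOIN crm.public.customers AS c ON o.customer_id = c.customer_id
GROUP BY c.customer_name
ORDER BY lifetime_value DESC
LIMIT 20;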
Iceberg’s modern table format
Although companies can implement HDFS in the cloud, this approach requires more expensive cloud storage services and does not fully leverage object storage’s efficiency and scalability. On their own, however, object stores like Amazon S3 do not provide enough metadata to support big data analytics.
That’s where modern table formats like Apache Iceberg come in. Iceberg layers rich metadata over files in object storage to speed queries, and it adds features like schema evolution, snapshot isolation, and ACID transactions that make analytics platforms more efficient and reliable.
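As a minimal sketch of what this looks like through Trino’s Iceberg connector, the statements below create a partitioned Iceberg table and then evolve its schema in place. The catalog, schema, table, and column names are hypothetical.

-- Create a partitioned Iceberg table; "lakehouse" is a hypothetical catalog.
CREATE TABLE lakehouse.analytics.page_views (
    view_date date,
    user_id bigint,
    url varchar
)
WITH (format = 'PARQUET', partitioning = ARRAY['view_date']);

-- Schema evolution: add a column without rewriting existing data files.
ALTER TABLE lakehouse.analytics.page_views ADD COLUMN referrer varchar;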
Open data lakehouses
Combining massively parallel query engines like Trino and open table formats like Iceberg on an object storage service creates a modern open data lakehouse. This new analytics architecture combines a data lake’s affordability and scalability with a data warehouse’s performant analytics.
Open data lakehouses are particularly useful for analyzing petabytes of varied data to feed machine learning algorithms, and they offer several other advantages:
Richer datasets: Analysts can generate more sophisticated insights using data lakehouses because, unlike data warehouses, this architecture can store structured, unstructured, and semi-structured data.
Accessibility: Trino makes data lakehouses more accessible to anyone with SQL experience — or SQL-compatible applications. Users can conduct complex analyses without requesting help from data engineers, shortening time to insight while freeing data teams for higher-priority tasks.
Streamlined batch processing: Engineers can develop faster, more efficient batch processing pipelines on a data lakehouse, as in the sketch after this list.
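For instance, a single SQL statement can serve as one step of a scheduled batch pipeline, rolling raw events up into a summary table. The table names below are hypothetical and assume the Iceberg tables sketched earlier.

-- One step of a hypothetical batch pipeline: aggregate yesterday’s raw
-- page views into a daily summary table.
INSERT INTO lakehouse.analytics.daily_page_view_counts
SELECT view_date, url, COUNT(*) AS views
FROM lakehouse.analytics.page_views
WHERE view_date = current_date - INTERVAL '1' DAY
GROUP BY view_date, url;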
The need for Hadoop modernization
Broad community support has allowed Hadoop and HDFS to change with the times, moving from the data center to the cloud and supporting new use cases. But can that progress continue in the face of modern enterprise data trends?
The volume, velocity, and variety of data companies must analyze to stay competitive continue to increase.
At the same time, data democratization has become essential to data-driven decision-making cultures.
However, data teams still face staffing and resource constraints that demand more efficient architectures.
Challenges with Hadoop
The core assumptions underpinning Hadoop’s framework create too much friction to deal with these trends effectively. The challenges companies face with their Hadoop infrastructure include:
Scalability issues: Compared to cloud-native architectures like data lakehouses, HDFS cannot scale up or down as responsively.
Administrative complexity: Managing Hadoop requires expensive expertise and significant resources compared to more modern alternatives.
Cost-effectiveness: Hadoop’s origins in on-premises data centers make it costly to run in the cloud since it can’t take full advantage of cloud object storage.
From the HDFS architecture to cloud data lakes
Hadoop’s weaknesses have been apparent for some time. The first attempts to move beyond the HDFS architecture migrated data architectures to cloud data lakes. Built on commodity object storage services, data lakes could store any type of data, and they decoupled storage from compute, allowing each to scale independently. However, cloud data lakes lacked the metadata and analytics capabilities of a Hadoop-based data warehouse, leaving organizations with expensive, complex two-tiered architectures in which data warehouses sourced data from data lake repositories.
Modern data lakehouse architecture
Modern data lakehouse architectures deliver the benefits of both data lakes and data warehouses, providing a true alternative to legacy Hadoop clusters. A lakehouse has the affordability, dynamic scalability, and data diversity of a data lake and the analytics performance of a data warehouse.
Starburst and Trino: A gateway to modernization
Founded by the creators of Trino, Starburst is the cornerstone of an open lakehouse’s end-to-end analytics capabilities. With Starburst, your data architecture can take advantage of:
Federation: Connectors for more than fifty enterprise data sources eliminate silos and provide a more holistic view of business data.
Accelerated SQL analytics: Indexes, caching, and other performance features provide as much as a seven-fold boost to query performance while reducing compute costs by forty percent.
Security: Role-based and attribute-based access controls let you create granular rules to enforce security, privacy, and other governance policies.
Accessibility: Starburst becomes the company’s single point of access for users at all skill levels, empowering data-driven business cultures.
These and other features streamline your migration from HDFS to an open data lakehouse. For example, federation lets you create an abstraction layer through which applications reach your legacy data sources. Once the data has migrated to your data lake, your data team simply switches the Starburst connectors for a seamless transition that is transparent to your users.
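As a rough sketch of what one migration step could look like in SQL, the statements below copy a table from a legacy Hive catalog backed by HDFS into an Iceberg table on object storage, after which downstream queries target the new catalog. All catalog, schema, and table names are assumptions for illustration.

-- Copy a legacy Hive/HDFS table into an Iceberg table on object storage.
-- "hdfs_hive" and "lakehouse" are hypothetical catalog names.
CREATE TABLE lakehouse.sales.orders
WITH (format = 'PARQUET')
AS SELECT * FROM hdfs_hive.sales.orders;

-- Downstream queries then reference the new catalog; the SQL is otherwise
-- unchanged.
SELECT COUNT(*) FROM lakehouse.sales.orders;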