Apache Hadoop (HDFS)

Apache Hadoop, often discussed alongside its Hadoop Distributed File System (HDFS), is an open-source framework for distributed storage and processing of large datasets.

Its development was critical to the emergence of data lakes, and its widespread adoption helped drive the rise of big data as we know it today. Although Hadoop-only installations have now been superseded by newer technologies, many of those technologies were built on top of Hadoop’s underlying framework. For instance, Hive is built entirely on top of Hadoop, while other technologies make use of Hadoop in other ways. For this reason, understanding Hadoop is important to understanding data lakes.

This article will introduce Apache Hadoop and explain how its capabilities were critical to what would become today’s modern data lakes. We will also consider how an open data lakehouse analytics platform can address some of Hadoop’s limitations.

What is Apache Hadoop or Hadoop Distributed File System (HDFS)?

The Apache Hadoop framework is an Apache Software Foundation open-source software project that brings big data processing and storage with high availability to commodity hardware. By creating a cost-effective yet high-performance solution for big data workloads, Hadoop led to today’s data lake architecture.

History of Hadoop

The rise of Hadoop made modern data lakes possible. As the need to store and process extremely large datasets increased, existing platforms and tools were unable to meet demand. Data was changing, and the data lake, built on Hadoop, was about to play a central role in that evolution.

Hadoop features

At its core, Hadoop allows complex queries to be executed in parallel using multiple processing units. This helps to increase efficiency by radically expanding the scale of available processing power.

Hadoop includes a number of key features, allowing users to:

  • Collect large amounts of raw data in varying structures.
  • Store this data in a single storage repository, creating a data lake for the first time.
  • Easily scale storage technology at a low cost.
  • Query data using a processing engine, completing complex operations using multiple nodes.

Apache Hadoop architecture

Parallel processing

One of the key features of Hadoop, and what makes it ideal for data lake usage, is its ability to store and process datasets in large, distributed systems. To achieve this, the system makes use of a processing technique known as parallel processing. In essence, parallel processing takes large workloads, splits them into smaller processing operations, and processes the workload in parallel.
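
As a loose illustration of this split-process-combine pattern, the sketch below uses plain Java concurrency rather than Hadoop code; the numbers are only stand-ins for records in a real dataset. Hadoop applies the same idea across many machines instead of many threads.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

public class ParallelSketch {
    public static void main(String[] args) {
        // A large "workload": a list of numbers standing in for records.
        List<Long> records = LongStream.rangeClosed(1, 10_000_000)
                                       .boxed()
                                       .collect(Collectors.toList());

        // Split the workload into smaller operations, process them in parallel,
        // and combine the partial results into a single answer.
        long total = records.parallelStream()
                            .mapToLong(Long::longValue)
                            .sum();

        System.out.println("Combined result: " + total);
    }
}
```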

Related reading: Foundations of parallel processing

Apache Hadoop structure

Importantly, Hadoop offers both a storage and compute solution. To achieve this, the system is composed of several parts. The two most important of these are:

1. Hadoop Distributed File System (HDFS)

Files are stored in Hadoop using a file system known as HDFS. It is designed to provide efficient, reliable, and resilient access to large volumes of data distributed across many machines.
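
For a sense of what working with HDFS looks like, here is a minimal sketch that reads the first line of a file through the standard HDFS client API. The NameNode address and file path are placeholders, not values from this article.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode address -- replace with your cluster's values.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());

        // HDFS exposes a familiar file-system API even though the file's blocks
        // are spread across many DataNodes behind the scenes.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/data/raw/events.log"))))) {
            System.out.println("First line: " + reader.readLine());
        }
    }
}
```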

2. MapReduce

Analysis is performed in Hadoop using a computational system known as MapReduce. This Java-based framework is capable of executing workloads on data stored in HDFS. You can think of HDFS as the storage and MapReduce as the compute. To execute queries, developers had to write MapReduce jobs in Java and compile them before submitting them to the cluster.
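
To make that concrete, here is a minimal sketch of the kind of job developers had to write: the canonical word-count program, closely following the example in Hadoop's own MapReduce documentation. Input and output paths are supplied as command-line arguments, and in a real deployment the compiled classes would be packaged as a JAR and submitted to the cluster.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configure and submit the job to the cluster.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```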

What are the main components of Apache Hadoop ecosystem?

Big data analytics architectures depend on the optimization of data storage and query performance. Vendors of monolithic data warehouses and relational databases couple storage and compute together in expensive pricing structures.

Hadoop’s developers hoped to create a framework for managing large datasets more affordably. Originally designed to run on commodity hardware in a data center, Hadoop can also run on virtual machines and store data in cloud services like Azure Blob Storage or Amazon S3. Reviewing the project’s design assumptions helps explain the software framework’s structure and ecosystem.

Apache Hadoop design assumptions

Six assumptions guided Hadoop’s development and enabled its widespread adoption as a foundation for enterprise analytics architectures.

Hardware failure is the norm: At scale, Hadoop will run on hundreds or thousands of servers, any of which could fail at any time, requiring rapid fault detection and recovery.

Applications need streaming access to datasets: Hadoop’s batch processing emphasizes high throughput over low latency, unlike general-purpose file systems designed for interactive use.

Applications have large datasets: Hadoop can handle millions of files that can scale into the petabyte range.

Applications need a write-once-read-many file access model: Hadoop only appends or truncates files to prioritize data coherency and high throughput data access.

Cheaper to migrate compute than data: Hadoop avoids network congestion by letting applications process data closer to where it lives.

Cross-platform portability: Hadoop is not tied to specific hardware or software, making it a good choice for distributed computing applications.

Apache Hadoop modules

Four Hadoop modules comprise the core structure of an implementation:

  • Hadoop Common’s Java-based utilities and libraries are shared by the other three modules.
  • Hadoop YARN (Yet Another Resource Negotiator) provides resource management and schedules tasks and jobs.
  • Hadoop Distributed File System (HDFS) is a high-throughput distributed file system designed to run on commodity hardware.
  • Hadoop MapReduce creates massively parallel data processing applications in a Hadoop infrastructure.

Hadoop ecosystem

Hadoop’s success has spawned a diverse ecosystem of related projects that lets organizations develop distributed computing applications. Examples include:

  • Databases (Apache HBase and Cassandra)
  • Data warehouses (Apache Hive)
  • Compute engines (Apache Spark)
  • Utilities (Apache Oozie, Zookeeper, Sqoop, etc.)
  • Programming tools (Apache Pig and Tez)

Given Hadoop’s prominent role within enterprise data architectures, many integrations are available for other platforms, such as Apache Kafka, the distributed event streaming platform.

4 Strengths of Apache Hadoop

Hadoop represented a major change to industry standards and it came with many advantages. These include:

1. Scalability: Hadoop deployments can range from single-node instances to clusters of hundreds or thousands of nodes managing data at petabyte scale.

2. Cost: Replacing proprietary storage systems with free, open-source software and commodity storage made big data analytics more accessible to smaller organizations.

3. Versatility: Data warehouses hold structured data, but Hadoop data lakes support structured, semi-structured, and unstructured data.

4. Adaptability: Hadoop’s growing ecosystem allows organizations to use the storage framework in many scenarios and use cases.

Many of these advantages are due to how Hadoop handles big data on distributed storage. A Hadoop cluster consists of a coordinator server and one or more worker servers that spread data across multiple physical or virtual storage machines.

A simple controller node will include a NameNode, a DataNode, a JobTracker, and a TaskTracker. A Secondary NameNode checkpoints file system metadata to guard against corruption and data loss in a large cluster. Worker nodes consist of a DataNode and a TaskTracker.

NameNodes track files, their locations, and their metadata. They also perform file system management, including a block count, data locations within DataNodes, and replication locations. DataNodes hold the data in block storage.

The JobTracker fields MapReduce jobs from client applications and queries the NameNode for data locations and metadata. TaskTrackers receive tasks from the JobTracker and execute them against the blocks held on their local DataNode.
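
As a concrete illustration of the metadata the NameNode maintains, a client can ask where a file’s blocks physically live. This is a minimal sketch using the standard HDFS client API; the NameNode address and file path are placeholders.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode address and file path.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/raw/events.log"));

        // The NameNode answers this from its metadata: which blocks make up the
        // file, and which DataNodes hold each block's replicas.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
        }
    }
}
```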

4 Limitations of Apache Hadoop

Despite its many benefits, Hadoop also has certain limitations. These include:

Cost

Historically, Hadoop itself was considered inexpensive because it used commodity-class servers and storage rather than enterprise-class hardware. However, these servers still combined compute and storage to achieve performance gains, which often meant that some compute or storage capacity sat idle at any given time.

Today, cloud vendors allow businesses to separate compute and storage. This means that you only pay for the storage or compute resources being consumed at a given moment. As a result, data lakes are often considered to be less expensive than legacy Hadoop clusters.

Rigidity

Hadoop was developed primarily for on-premises installations, with cloud installations added as an afterthought. As a result, it struggled to adjust to the rise of cloud computing, being unable to match the rapid scaling and elasticity provided by those systems.

Related reading: See how cloud computing disrupted Hadoop as a technology and led to further innovation on top of it. 

Complexity

Hadoop is a powerful system, but it is also extremely complex and demands significant expertise from its users. For instance, MapReduce jobs originally had to be written by hand in Java. Significant user training is required to optimize deployment, management, and operations, which often imposes substantial costs on the business.

MapReduce

MapReduce struggled with certain analytic computing tasks, offering poor concurrency in many scenarios.

Hadoop is a foundational data lake technology

Where does all of this leave us? Hadoop should be understood as the foundational data lake technology. It created the distributed infrastructure necessary to store large amounts of raw data and analyze that data in new ways, and it established storage (HDFS) and compute (MapReduce) as distinct layers, setting a direction the industry would follow.

Some subsequent technologies did not replace Hadoop, but built on top of it, often making use of the same underlying processes and frameworks.

Related reading: See how Hive was developed to sit on top of Hadoop as a query engine and interact with it in a more user-friendly way.

How does Apache Hadoop ensure data redundancy and fault tolerance?

Hadoop’s assumption of commodity hardware’s unreliability led to a design that minimizes the risk of data loss due to storage device failures. The file system divides data files into blocks and writes them in sequence to optimize drive capacity. At the same time, it stores replicas of each block in different locations. This redundancy reduces exposure to hardware failures or network outages.

HDFS’s standard case applies a 3X replication factor, storing one replica in a node on a local rack, the second on a remote rack, and the third on the remote rack but in a different node. When reading data, HDFS will use the replica closest to the reader to minimize latency and bandwidth consumption.
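
In practice, replication is a setting rather than something applications manage by hand. The sketch below assumes a placeholder NameNode address and file path: the cluster default typically comes from hdfs-site.xml (dfs.replication, 3 by default), a client can override it for the files it writes, and the replication of an existing file can be changed through the HDFS API.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Override the default replication factor for files this client writes.
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Raise the replication factor of an existing (placeholder) file to 4,
        // for example a hot dataset that many jobs read.
        fs.setReplication(new Path("/data/raw/events.log"), (short) 4);
    }
}
```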

Hadoop uses heartbeat monitoring to provide fault tolerance. DataNodes regularly send a heartbeat message to their NameNode. If the delay between a DataNode’s messages exceeds a threshold, the NameNode marks it as dead and re-replicates the blocks it held on healthy DataNodes.

Parallel processing

Hadoop’s performance advantages come in part from MapReduce’s parallel processing capabilities.

A single MapReduce job is a two-step process. First, the map procedure filters and sorts the data. For example, when querying motor vehicle registry data, the MapReduce program filters and sorts the data by brand. Then, the reduce method summarizes the data to produce a count of the number of registrations by vehicle brand.
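
Sketched in Java, the map and reduce functions for this hypothetical registry dataset might look like the following, assuming comma-separated records with the brand in the first column. The driver boilerplate is omitted; it mirrors the word-count job shown earlier.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BrandCount {

    // Map: parse a registration record and emit (brand, 1).
    public static class BrandMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text brand = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                brand.set(fields[0].trim());
                context.write(brand, ONE);
            }
        }
    }

    // Reduce: sum the counts for each brand to get registrations per brand.
    public static class BrandReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```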

More coordination occurs in the multi-threaded, multi-processor environment of a compute cluster:

  • The MapReduce engine sends code to the various worker nodes in a cluster.
  • Within a worker node, the mapping procedure finds the local data that matches the query keys and writes it to temporary storage.
  • Worker nodes then redistribute the intermediate data so that everything matching a given output key lands on the same worker node.
  • Finally, the worker nodes perform the reduce stage in parallel and return the results through the MapReduce engine to the client.

Apache Hadoop migration

Despite its many benefits, Hadoop’s limitations give some organizations reason to look at cloud-native approaches.

Running Hadoop in a data center does not decouple compute from storage entirely. As a result, they cannot scale independently, raising infrastructure costs.

Hadoop’s origins in the data center have consequences, because cloud environments were built on different assumptions. Hadoop can’t match the rapid scaling and elasticity the cloud offers.

Managing Hadoop requires data engineers with specialized skills in writing MapReduce Java code, limiting accessibility. While Hive provides a SQL-like interface to MapReduce, the extra layer impacts query performance.

Starburst Galaxy can streamline your migration from Hadoop to the cloud

Powered by Trino’s open-source massively parallel SQL query engine, Starburst Galaxy’s performance optimizations can accelerate interactive queries by more than forty percent. Galaxy’s universal discovery, governance, and sharing layer — Gravity — provides an easy-to-use interface for managing and accessing data as you prepare for a successful Hadoop migration.

Evaluate the current environment: Add Trino as a compute engine for your Hadoop system and use Gravity’s discoverability tools to analyze your datasets.

Select a cloud platform: Starburst works with the major cloud storage providers, so that you can strike the right balance of features, performance, and cost.

Design your cloud architecture: Choose the table and file formats to use on your object store. Gravity’s access control and governance features let you create granular security and data privacy policies.

Plan your migration: By creating a single point of access to your data infrastructure, Starburst Galaxy makes data migrations transparent to data users.

Using Apache Hadoop and Starburst together with Optum

Starburst also enhances Hadoop data lakes by making data more accessible, integrating other data sources, and significantly improving query performance. UnitedHealth Group’s information services subsidiary Optum had data siloed in a Hadoop data lake and several warehouses and databases. Analyzing this data involved data copies and complex ETL pipelines, which increased data privacy risks. Starburst offered several capabilities to address these challenges.

With connectors for over fifty enterprise data sources, Starburst federates an organization’s data architecture within a single interface. Data consumers get access to every source, letting them generate deeper insights to drive decision-making. Similarly, data teams use the Starburst interface to streamline data source management.

Starburst further enhances accessibility by letting data consumers use ANSI-standard SQL to write their queries. Data scientists developing machine learning algorithms can explore and discover data without help from over-subscribed engineers. Business intelligence analysts can use the SQL-compatible tools they already know.
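
As a rough illustration of what that looks like in practice, a data consumer could run a single ANSI SQL query that joins lake and database tables through the Trino JDBC driver. The cluster endpoint, catalog, schema, and table names below are hypothetical, and connection details depend on your deployment.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class FederatedQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical cluster endpoint and user; requires the Trino JDBC driver on the classpath.
        Properties props = new Properties();
        props.setProperty("user", "analyst");

        try (Connection conn = DriverManager.getConnection("jdbc:trino://localhost:8080", props);
             Statement stmt = conn.createStatement();
             // Hypothetical catalogs: one mapped to the Hadoop/Hive data lake,
             // one to an operational database -- joined in one ANSI SQL query.
             ResultSet rs = stmt.executeQuery(
                 "SELECT c.region, count(*) AS claims " +
                 "FROM hive.claims.events e " +
                 "JOIN postgresql.crm.customers c ON e.customer_id = c.id " +
                 "GROUP BY c.region")) {
            while (rs.next()) {
                System.out.println(rs.getString("region") + ": " + rs.getLong("claims"));
            }
        }
    }
}
```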

Starburst Gravity’s governance features let you create role-based and attribute-based access control policies at the file, table, row, and column levels and apply them in multi-cluster environments to ensure compliance with data privacy and sovereignty regulations.

After making Starburst the analytics interface to their on-premises and cloud storage systems, Optum saw a doubling and tripling of query performance over Spark and a tenfold improvement over Hive. And even though utilization rates increased dramatically, Starburst’s centralization of permissions and policy management made global compliance more straightforward.
