The Apache Software Foundation’s Hadoop project lets companies implement affordable computer clusters in their on-premises data centers. Yet, this legacy technology is poorly suited for today’s cloud-based business models and increasingly diverse data sources.
Let’s examine how Hadoop clusters work, their challenges, and how Hadoop modernization is the path companies must take toward a data-centric future.
Hadoop Cluster Architecture
The Hadoop Distributed File System (HDFS) became an affordable alternative to commercial on-premises data warehouses. Running on commodity hardware in the data center rather than expensive third-party systems, HDFS significantly reduced data processing costs when internet-driven data volumes skyrocketed. Companies could distribute clusters across commodity servers and scale on their terms without being locked into a vendor’s proprietary solution.
Cluster overview
The Hadoop cluster architecture consists of a primary node, the NameNode, and one or more worker nodes, the DataNodes. Technically, a single node could handle both roles, but real-world use cases see NameNodes coordinating thousands of DataNodes.
A dedicated physical or virtual machine runs the NameNode, which manages the namespace of the Hadoop cluster’s file system, maps data blocks to the DataNodes, and coordinates client file access.
Each DataNode runs on its own machine, where it manages that machine’s storage. The DataNode creates, deletes, and replicates blocks as instructed by the NameNode and serves clients’ read and write requests.
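To make that division of labor concrete, here is a minimal sketch of a Java client using the HDFS FileSystem API. The NameNode address and paths are placeholders, not a recommended setup: the client asks the NameNode where things live, then streams the bytes from the DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; the client contacts the NameNode for metadata only.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            // A directory listing is served entirely from the NameNode's namespace.
            for (FileStatus status : fs.listStatus(new Path("/data/events"))) {
                System.out.println(status.getPath() + " " + status.getLen() + " bytes");
            }

            // Reading file contents streams blocks directly from the DataNodes.
            try (FSDataInputStream in = fs.open(new Path("/data/events/part-00000"))) {
                byte[] buffer = new byte[4096];
                int read = in.read(buffer);
                System.out.println("Read " + read + " bytes from the first block");
            }
        }
    }
}
```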
Hadoop Yet Another Resource Negotiator (Hadoop YARN) manages a cluster’s system resources through its ResourceManager, a master daemon that typically runs on the same machine as the NameNode or on its own master node. Each DataNode’s machine also runs a YARN NodeManager, which launches containers and monitors their CPU, memory, and disk usage. The ResourceManager coordinates the NodeManagers much as the NameNode coordinates the DataNodes.
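As a rough illustration of that pairing, the ResourceManager and NodeManagers are wired together through standard YARN configuration properties; the host name and resource values below are assumptions made for the sketch, not tuning advice.

```java
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnSettingsSketch {
    public static void main(String[] args) {
        YarnConfiguration conf = new YarnConfiguration();

        // Every NodeManager reports to this ResourceManager (placeholder host name).
        conf.set("yarn.resourcemanager.hostname", "resourcemanager.example.com");

        // Each NodeManager advertises how much of its machine YARN may schedule.
        conf.setInt("yarn.nodemanager.resource.memory-mb", 64 * 1024); // 64 GB of RAM (assumption)
        conf.setInt("yarn.nodemanager.resource.cpu-vcores", 16);       // 16 virtual cores (assumption)

        System.out.println("ResourceManager: " + conf.get("yarn.resourcemanager.hostname"));
    }
}
```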
Hadoop MapReduce, the Java-based programming framework for distributing applications, also ran its own daemons in early versions of Hadoop. A JobTracker ran on a master node, accepting MapReduce jobs, distributing tasks to the DataNodes, and compiling the results. Each DataNode ran a MapReduce TaskTracker, which reported task progress back to the JobTracker. From Hadoop 2 onward, YARN’s ResourceManager and NodeManagers handle this process.
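For context, a MapReduce application boils down to a map function and a reduce function that the framework distributes across the nodes holding the data. The canonical word-count sketch below shows the shape of such a job; class names and input/output paths are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // The map phase runs in parallel on the nodes holding each input block.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // The reduce phase aggregates the intermediate results shuffled across the cluster.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```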
Data blocks, replication, and fault tolerance
Rather than keeping a file in one contiguous section of storage, HDFS divides it sequentially into blocks and distributes those blocks across the nodes in a cluster. In on-premises systems, this distribution spans multiple servers and racks.
The NameNode manages the HDFS hierarchical directories. When saving a new file, the NameNode maps each block to its respective DataNodes.
However, the NameNode never touches the data itself. It stores only the file system namespace and the block map; the DataNodes alone handle the data.
A client creating a new file first writes data to a local cache. Once the cached data reaches the cluster’s block size, the client contacts the NameNode, which records the file’s name and directory and points the client to a DataNode and a data block. The client then streams the data to that DataNode. The process repeats until the file is complete, and the NameNode commits the file creation.
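Here is a minimal, hedged sketch of that write path from the client’s point of view; the FileSystem API hides the NameNode handshake and DataNode streaming, and the host and path below are placeholders.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder NameNode

        try (FileSystem fs = FileSystem.get(conf);
             // create() asks the NameNode to add the file to the namespace...
             FSDataOutputStream out = fs.create(new Path("/data/events/new-file.txt"))) {
            // ...while the bytes themselves are buffered and streamed, block by block,
            // to the DataNodes the NameNode assigns.
            out.write("example record\n".getBytes(StandardCharsets.UTF_8));
        }
        // Closing the stream flushes the final block, and the NameNode commits the file.
    }
}
```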
Replicating a file’s data blocks makes Hadoop clusters more tolerant of machine failures, network outages, and other disruptions that could cause data loss. The NameNode maps copies of each block to other nodes so the file can be recovered should one or more DataNodes fail. As each DataNode receives a block’s data, it writes it locally and forwards it to the next DataNode in the replication pipeline.
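A short sketch, again with a placeholder host and path, of how a client can adjust a file’s replication factor and ask the NameNode which DataNodes hold each block replica:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder NameNode

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/events/part-00000"); // illustrative path

            // Raise the file's replication factor to three copies of every block.
            fs.setReplication(file, (short) 3);

            // Ask the NameNode which DataNodes hold each block replica.
            FileStatus status = fs.getFileStatus(file);
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("Block at offset " + block.getOffset()
                        + " replicated on " + String.join(", ", block.getHosts()));
            }
        }
    }
}
```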
High availability configurations
Although a Hadoop cluster has only one active NameNode, engineers can configure a high-availability cluster with two or more. Should the active NameNode fail, a standby NameNode takes its place.
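A minimal sketch of the client-side settings involved, using standard HDFS high-availability property names; the nameservice and host names are illustrative.

```java
import org.apache.hadoop.conf.Configuration;

public class HaNameNodeSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Logical name for the HA pair; clients address the nameservice, not a single host.
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");

        // RPC addresses of the two NameNodes (placeholder hosts).
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");

        // Clients use this provider to discover which NameNode is currently active.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // Automatic failover relies on ZooKeeper-based health monitoring.
        conf.setBoolean("dfs.ha.automatic-failover.enabled", true);

        System.out.println("Clients connect to: hdfs://" + conf.get("dfs.nameservices"));
    }
}
```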
What is cluster size in Hadoop?
Cluster size describes the number of nodes in a cluster and the CPU cores, memory, and storage each node provides. Engineers must size a cluster to match the expected workloads. Under-provisioning the cluster will degrade performance, while over-provisioning will drive up compute and storage costs.
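As a back-of-the-envelope illustration only, with every figure an assumption rather than a recommendation, storage-driven sizing might look like this:

```java
public class ClusterSizingSketch {
    public static void main(String[] args) {
        double rawDataTb = 1_000;        // expected data volume: 1 PB (assumption)
        int replicationFactor = 3;       // default HDFS replication
        double overheadFactor = 1.25;    // headroom for temporary and intermediate data (assumption)
        double usableTbPerNode = 48;     // disk capacity available to HDFS per DataNode (assumption)

        double requiredTb = rawDataTb * replicationFactor * overheadFactor;
        int dataNodes = (int) Math.ceil(requiredTb / usableTbPerNode);

        System.out.printf("Need ~%.0f TB of raw capacity, i.e. about %d DataNodes%n",
                requiredTb, dataNodes);
        // CPU and memory must be sized separately against the expected workloads.
    }
}
```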
Large clusters with thousands of nodes are not uncommon. LinkedIn, for example, has reported running its largest cluster at around 10,000 nodes, which pushed YARN’s scalability to its limit.
Challenges of a Hadoop cluster
Hadoop’s initial release came at a time when the cloud was just emerging as a viable alternative to on-premises computing. Since then, internet-driven computing has experienced exponential increases in the volume, velocity, and variety of enterprise data. Companies regularly process datasets at petabyte scales. Unstructured data from social media, Internet of Things devices, and other sources are essential to data analytics and machine learning algorithms.
Hadoop-based data warehouses simply can’t handle today’s large data sets and varied use cases. The challenges data engineering teams face in managing Hadoop clusters include:
Resource management
Managing Hadoop clusters involves a complex dance between competing priorities. For example, scaling to meet the ever-increasing demand for data requires adding more nodes; however, that will boost network traffic and increase latency.
Data engineering teams must carefully manage storage, CPU, and memory to avoid bottlenecks while meeting performance SLAs.
File size limits
Hadoop and HDFS are designed to process large data files, so clusters typically use 128 MB or even 256 MB block sizes. That design works for the large datasets batch-processed through ingestion pipelines. However, many modern enterprise data sources generate streams of very small files. Each small file still occupies its own block and its own entries in the NameNode’s in-memory metadata, so millions of small files bloat the NameNode’s heap and multiply task-scheduling overhead.
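Using the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file, directory, or block), a quick, assumption-laden estimate of that overhead looks like this:

```java
public class SmallFilesSketch {
    public static void main(String[] args) {
        long files = 100_000_000L;   // 100 million small files (assumption)
        long objectsPerFile = 2;     // roughly one file entry plus one block entry
        long bytesPerObject = 150;   // commonly cited NameNode heap rule of thumb

        long namenodeHeapBytes = files * objectsPerFile * bytesPerObject;
        System.out.printf("~%.1f GB of NameNode heap just for metadata%n",
                namenodeHeapBytes / 1e9);
        // The same data packed into large files would need a small fraction of that.
    }
}
```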
Performance limits
Large clusters get bogged down by the way Hadoop processes data. When executing a series of tasks, Hadoop writes each task’s results to storage and must reread those results to complete the next task. This read/write intensive approach consumes storage and compute resources and increases latency. These limits become particularly challenging when processing iterative data science workloads.
Complex technology stack
Getting Hadoop clusters to work efficiently at scale requires layering on other technologies that the Hadoop ecosystem developed to address the framework’s limits and support new use cases. For example, Hadoop MapReduce has always been a specialized programming model that relatively few developers master, with relatively high latency. Apache Hive provided a SQL-like overlay to make Hadoop more accessible, at the cost of even higher latency. Apache Spark bypassed MapReduce entirely to support the needs of data scientists.
Workforce limits
Recruiting and retention are ongoing challenges for Hadoop team leaders. Managing Hadoop clusters requires a special set of skills to thoroughly understand the framework and its company-specific implementation. Despite Hadoop’s widespread presence in the enterprise space, the number of engineers with those skills does not meet demand.
Costs
Open source notwithstanding, Hadoop isn’t cheap. The framework’s origins in the data center result in an architecture that couples storage and compute. Scaling requires an expensive, time-consuming process of expanding the company’s on-premises infrastructure, from new servers and racks to additional power and cooling systems.
Signs you need to modernize
These and other challenges of Hadoop’s aging architecture place organizations under significant pressure to keep pace with the demands of data-driven decision-making. Some signs that your organization needs to modernize from Hadoop include:
Sluggish time to insight — Data products, automation, and streaming data bring decision-making closer to real-time. Hadoop’s technical limits mean decision-makers throughout the organization can’t get the insights they need fast enough to compete in dynamic markets.
Data inaccessibility — Data democratization makes companies more effective by giving people direct access to data-driven insights. Hadoop’s complexity increasingly forces data consumers to rely on already overburdened data engineers for even simple requests.
Declining ROI — Hadoop requires expensive infrastructure and staffing, but its challenges limit the return on those investments. Shifting IT budgets to more modern data analytics approaches would produce better returns from more efficient, faster access to business insights.
Starburst and Trino: A gateway to modernization
Starburst customers have used our data analytics solution to streamline their Hadoop modernization programs and migrate to a modern data lakehouse architecture. An open data lakehouse combines Starburst’s enhanced Trino SQL query engine with the Apache Iceberg open table format on object storage services like Azure Blob Storage and Amazon S3.
Starburst has helped companies realize ten-fold increases in query performance over their previous Hadoop platforms along with a significant reduction in storage cost. Unlike Hadoop, the lakehouse architecture decouples storage from compute so data teams can optimize the two sides independently. In addition, Trino queries process data in memory, eliminating the latency-inducing bottlenecks of Hadoop clusters.
Modernization also expands the data an organization’s users can access. An assumption central to Hadoop’s model is that a single repository can host all the data users need to manage the business. This was never true. It’s even less true today, considering the many modern data sources Hadoop struggles to integrate.
A modern data lakehouse does not repeat this model in the cloud. Instead, Starburst virtualizes entire data architectures, creating a unified interface that provides a single point of access to any data users need.
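To give a flavor of that single point of access, here is a hedged sketch of a Java client using the Trino JDBC driver (trino-jdbc on the classpath) to join tables that live in two different catalogs. The coordinator URL, user, catalog, schema, and table names are all illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TrinoFederationSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder coordinator URL; in Starburst/Trino, catalogs map to underlying data sources.
        String url = "jdbc:trino://trino.example.com:8080/hive/default";

        try (Connection conn = DriverManager.getConnection(url, "analyst", null);
             Statement stmt = conn.createStatement();
             // One SQL statement joins an Iceberg table on object storage
             // with a table in a relational catalog, without moving data up front.
             ResultSet rs = stmt.executeQuery(
                     "SELECT o.region, sum(o.total) AS revenue " +
                     "FROM iceberg.sales.orders o " +
                     "JOIN postgresql.crm.accounts a ON o.account_id = a.id " +
                     "GROUP BY o.region")) {
            while (rs.next()) {
                System.out.println(rs.getString("region") + ": " + rs.getDouble("revenue"));
            }
        }
    }
}
```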
Using Starburst to modernize Hadoop and adopt a modern open data lakehouse architecture makes organizations more agile and efficient while laying the groundwork for the future of big data analytics.