
When it first launched, Apache Hadoop was a tectonic shift in the big data landscape. Today, it’s still a popular choice for processing terabytes of data, powering everything from data analytics to AI. It’s easy to see why: its long history, stability, active ecosystem, and raw power make it suitable for numerous use cases across data engineering, data analytics, data science, data storage, and Machine Learning (ML).

On the flip side, Hadoop is now nearly two decades old, and it is no longer the cutting edge it once was. More modern technologies take advantage of 20 years of progress in cloud storage and compute. Using Hadoop indiscriminately means being saddled with its limitations, which can increase the time and cost of deploying new data-driven solutions to production.

Of course, in these situations, taking a “migrate everything yesterday” approach is never a reasonable strategy. Data teams need a flexible architecture that enables them to leverage new technologies where appropriate while also continuing to use what works.

In this article, we’ll dig into how to migrate select workloads from Hadoop to a more modern solution – an open data lakehouse – while still getting maximum value from your existing Hadoop clusters.

Hadoop’s role in the data ecosystem

Unlike modern data transformation and warehousing approaches, which separate compute from storage, Apache Hadoop uses a fused compute/storage model to distribute data across a cluster of machines, transforming and analyzing it via parallel workloads. It consists of several components collectively known as the Hadoop ecosystem.

Let’s look at the Hadoop cluster architecture in more detail. 

Hadoop Distributed File System (HDFS): A fault-tolerant distributed filesystem based on the “write once, read many” principle and optimized for heavy read activity. HDFS clusters use a master node, known as the NameNode, to monitor and maintain cluster health. Worker nodes complete data processing using a shared-nothing architecture that employs parallel computation.

Hadoop YARN: In the proud tradition of “Yet Another” tech tools like Yacc and YAML, YARN stands for Yet Another Resource Negotiator. As the name says, this component is a resource manager that organizes job scheduling and resource monitoring across the cluster.

Hadoop MapReduce: The original programming framework on Hadoop for processing terabyte-scale datasets in parallel on clusters made up of a master node and multiple worker nodes. It processes chunks of input data in parallel in a map phase and then performs more complex activities, such as aggregations and sorting, in reduce tasks. The framework coordinates the worker nodes as it processes the job’s data, using the JobTracker and TaskTrackers in classic MapReduce or YARN’s ResourceManager and NodeManagers in later Hadoop versions.
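
To make the map and reduce phases concrete, here is a minimal word-count sketch written in the Hadoop Streaming style, which lets you supply map and reduce logic as ordinary scripts instead of Java classes. The file name, input/output paths, and the job-submission command in the comments are illustrative assumptions, not prescriptions for your cluster.

```python
#!/usr/bin/env python3
"""wordcount.py: a minimal Hadoop Streaming-style word count (illustrative sketch)."""
import sys


def map_phase():
    # Map phase: emit one (word, 1) pair per word read from stdin.
    # Hadoop shuffles and sorts these pairs by key before the reduce phase.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")


def reduce_phase():
    # Reduce phase: input arrives grouped and sorted by key,
    # so a single pass can total the counts for each word.
    current_word, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{total}")
            current_word, total = word, 0
        total += int(count)
    if current_word is not None:
        print(f"{current_word}\t{total}")


if __name__ == "__main__":
    # Submitted with something like (paths and jar location are assumptions):
    #   hadoop jar hadoop-streaming.jar \
    #     -input /data/books -output /data/wordcounts \
    #     -mapper "wordcount.py map" -reducer "wordcount.py reduce"
    map_phase() if sys.argv[1:] == ["map"] else reduce_phase()
```

Even for a job this small, you still have to reason about the map and reduce boundaries and the shuffle between them yourself, which is part of the complexity we come back to below.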

For more detailed technical information on Hadoop, check out our free Starburst Academy course on Hadoop and data lakes.

The limitations of a Hadoop cluster

Even today, Hadoop has many advantages when it comes to large-scale data processing. It was designed to run on commodity hardware, and it scales out well by adding more nodes. It is highly performant, resilient, and scalable for long-running batch analytics use cases where the data must reside on-premises. As the first widely accepted data lake for large datasets, it integrates data from dozens of sources.

However, Hadoop also has some limitations for today’s big data analytics workflows. These include: 

  • Complexity
  • Coupling of compute and storage
  • Limitations on scalability 
  • Use case limitations
  • Security

Let’s look at each of these in detail. 

Hadoop’s complexity

The complexity of Hadoop is by far its biggest challenge, particularly when compared to data lakehouse solutions. 

There are a few reasons for this:

  • A Hadoop cluster is hard to set up and maintain. Cloud services like Amazon EMR ease much of the provisioning pain, but not configuration and maintenance. The cluster usually requires detailed fine-tuning to get the performance you need, and a managed service like EMR ties you to a single cloud provider.
  • Once set up, Hadoop can be hard to manage. The native processing framework, Java MapReduce, has a complex programming model that requires significant programming expertise to use well. That barrier to entry is why alternatives such as Hive and Spark are typically used instead of raw MapReduce.
  • Due to the complexity of MapReduce, the Hadoop open-source ecosystem contains more than a dozen different development options, which can be confusing to navigate.

Because of all the specialized knowledge involved, Hadoop installations typically require a large team to deploy and maintain. Compare this to data lakehouses (or even some data warehouses and data lakes), which, once configured, can be accessed by anyone who knows SQL.

Coupling of compute and storage 

Alternatives to Hadoop decouple their compute and storage, allowing them to scale independently. Data warehouses such as Snowflake and Amazon Redshift began this shift in architecture. By storing data at rest in object storage systems like Amazon S3 and only launching VMs to access data as needed, companies can scale compute and storage quickly and independently, without interfering with running queries or redistributing data across storage nodes. 

Unfortunately, Hadoop was born in a pre-cloud world. In Hadoop, you can’t swap in more storage on demand because storage and compute are tied together. That means you either need to over-provision storage or perform tricky resizing operations that can take hours to complete. Depending on what type of storage you use, you may not be able to resize at all.

Because Hadoop expects storage and compute to be tethered together by default, you can’t scale them independently without involving additional technology to mitigate this behavior. For example, if you max out your storage capacity, you need to add more compute capacity with additional mounted volumes – whether you need the accompanying processing power or not. That raises data processing costs compared to more modern solutions.

Interesting side note: When Hadoop was created, this coupling of storage and compute was a benefit! Co-locating them allowed Hadoop to process large volumes of data quickly.

In the cloud era, however, this benefit has become a liability. Hive (which we discuss more below) has sought to remedy this by making it easy to locate data in object storage.

Hadoop use case limitations

Hadoop works great for many use cases. However, you’ll run into a few obstacles – particularly surrounding cost – if you try to make it your all-purpose data utility.

As originally conceived, MapReduce was on-prem-focused and oriented towards batch processing. However, a lot of additional technology has been built on top of it to make it suitable for a wider range of use cases. That’s led to an increasingly complex architecture that can be hard for developers to grasp and master.

Security 

Hadoop security has improved greatly over the last 20 years. However, most admins still wrestle with the base authentication for cluster processes, which uses Kerberos – a complicated protocol that’s hard to understand and manage.

Getting the most out of your Hadoop cluster

Due to its proven history and legacy, many companies are still onboarding workloads onto Hadoop. Eventually, however, they run into the challenges listed above and start looking at alternatives. (That’s what brought you here today, right?)

Fortunately, there are a few ways to extend the life of your Hadoop investment while transitioning to a more modern and versatile data architecture at your own pace. 

Using Apache Hive

Most teams using Hadoop will be using Apache Hive – and if you’re not, you should be. Hive is a SQL abstraction built on top of Hadoop that overcomes some of its original limitations. With Hive, you get a data warehouse-style implementation that supports a SQL-like interface. You can also store your data with object storage providers such as Amazon S3.
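
As a rough illustration (not a recipe for your environment), here is how a team might define and query an S3-backed external table through HiveServer2 using the PyHive client. The hostname, bucket, schema, and table names are placeholders.

```python
# Illustrative only: assumes a reachable HiveServer2 endpoint and an S3 bucket you control.
from pyhive import hive

conn = hive.connect(host="hive.example.com", port=10000, username="analyst")
cur = conn.cursor()

# External table whose data files live in object storage rather than on HDFS.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS web_events (
        user_id BIGINT,
        event   STRING,
        ts      TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 's3a://example-bucket/warehouse/web_events/'
""")

# Plain SQL instead of a hand-written MapReduce job.
cur.execute("SELECT event, COUNT(*) AS events FROM web_events GROUP BY event")
print(cur.fetchall())
```

Because the external table simply points Hive at files that already sit in S3, no data movement is required to start querying.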

However, while Hive’s an improvement, it has its own limitations:

  • File list operations are slow when using an object store (they’re admittedly MUCH faster on HDFS, since the NameNode keeps its entire metadata catalog in memory)
  • Updates and deletes are slow, as Hive requires replacing the entire file (more modern deployments use Hive ACID to work around this very problem)
  • Schema changes are costly, as they require rewriting the entire data set

Creating a data lakehouse

A more flexible next step is building out a data lakehouse, an architecture that combines the benefits of a data lake with a data warehouse. 

Data warehouses provide high performance and reliability, all accessible via ANSI SQL queries. Data lakes provide access to structured, semi-structured, and unstructured data stored in cheap object storage. The data lakehouse merges these benefits, giving you the performance and usability of a data warehouse along with the low-cost storage and data diversity of the data lake.

Data lakehouses support high performance for both read and write scenarios, along with ACID transactions. Since they can be queried with SQL, they’re also more accessible to business users primarily interested in BI. This means you can support write-heavy scenarios, streaming data, and other use cases not easily handled by Hadoop.

Depending on the data lakehouse solution you use, you may also be able to bring in your data currently hosted in Hive without migrating it. With a data lakehouse solution like Starburst, you can connect and access your Hive data via the Hive connector, which can deliver results up to 20 times faster and at up to 75% lower cost. 
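
For instance, with the open-source trino Python client (which also works against Starburst clusters), a query routed through the Hive connector looks like any other SQL query. The coordinator host, catalog, schema, and table names below are assumptions for illustration.

```python
# Illustrative sketch: query existing Hive tables in place through the Hive connector.
import trino

conn = trino.dbapi.connect(
    host="starburst.example.com",   # your Trino/Starburst coordinator
    port=443,
    user="analyst",
    catalog="hive",                 # Hive connector catalog
    schema="default",
    http_scheme="https",
)
cur = conn.cursor()
cur.execute("SELECT event, COUNT(*) AS events FROM web_events GROUP BY event")
for row in cur.fetchall():
    print(row)
```

Because the connector reads the Hive metastore and the underlying files in place, no data is copied to run this query.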

Migrating to an Icehouse architecture

Moving to a data lakehouse and accessing your Hive workloads from there is another positive step forward. Over time, though, you’ll gain the biggest benefits from bringing those workloads into the lakehouse and storing them in the Iceberg table format.

Apache Iceberg is an open table format built for the cloud that addresses many of the shortcomings in the Hive format. In particular, it adds full and performant support for updates/deletes, in-place schema changes, table versioning, time travel, and partition evolution. 

The Starburst data lakehouse uses the Iceberg format, combining it with the Trino SQL query engine to create what we call the Icehouse. Icehouse architecture can support a large range of data-driven use cases – BI, ELT, data-driven apps, and analytics – giving it more versatility than either native Hadoop or Hive.
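
Here is a hedged sketch of how those Iceberg capabilities look through Trino’s SQL interface. The coordinator host, catalog, schema, and table names are invented for illustration, and the time-travel query assumes an older snapshot of the table exists.

```python
# Illustrative sketch of Iceberg's row-level updates, schema evolution, and time travel.
import trino

conn = trino.dbapi.connect(
    host="starburst.example.com", port=443, user="analyst",
    catalog="iceberg", schema="analytics", http_scheme="https",
)
cur = conn.cursor()

# Row-level update without manually rewriting whole files.
cur.execute("UPDATE web_events SET event = 'sign_in' WHERE event = 'login'")
cur.fetchall()

# In-place schema evolution: add a column without rewriting the data set.
cur.execute("ALTER TABLE web_events ADD COLUMN country VARCHAR")
cur.fetchall()

# Time travel: read the table as of an earlier snapshot.
cur.execute("""
    SELECT COUNT(*)
    FROM web_events FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC'
""")
print(cur.fetchall())
```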

The Icehouse isn’t an all-or-nothing proposition, either. Thanks to the Hive connector, you can keep your Hive workloads where they are for now, accessing their data with better cost and performance than Hive alone. Over time, as you build out your data lakehouse, you can port some of these workloads to Icehouse for even greater performance and cost savings. 
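
When you’re ready to port a workload, the move can be as small as a single CREATE TABLE AS SELECT across catalogs. The catalog, schema, and table names below are again placeholders.

```python
# Illustrative sketch: copy a Hive-connector table into an Iceberg table.
import trino

conn = trino.dbapi.connect(
    host="starburst.example.com", port=443, user="analyst",
    catalog="iceberg", schema="analytics", http_scheme="https",
)
cur = conn.cursor()

# The original Hive data stays untouched until you choose to retire it.
cur.execute("""
    CREATE TABLE IF NOT EXISTS iceberg.analytics.web_events
    AS SELECT * FROM hive.default.web_events
""")
print(cur.fetchall())  # CTAS reports the number of rows written
```

From there, downstream queries that use fully qualified names only need their catalog changed from hive to iceberg.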

Conclusion

Hadoop still has life left in it. However, its limitations are apparent in the cloud era. Companies need to think hard about building a bridge to the future. 

Using a data lakehouse like Starburst means you can take full advantage of a modern data architecture that brings data from Hive, lakehouses, cloud storage, on-premises systems, and other sources together in one location. The data lakehouse takes technology out of the driver’s seat so that you can run your data business as a business.