Why we need Hadoop alternatives

Strategy
  • Cindy Ng, Sr. Manager, Content, Starburst


Apache Hadoop’s distributed storage and processing framework solved turn-of-the-century challenges by letting companies manage the large-scale data of the day on commodity hardware in their own data centers. Hadoop’s limitations spawned an ecosystem of complementary solutions. However, the scale of data today demands a more modern approach to data storage and analytics.

This article reviews Hadoop, its limitations, and its alternatives before explaining how Trino’s distributed SQL query engine forms the basis of a modern, federated data architecture.

Why we need Hadoop alternatives

Engineers at Yahoo! created Hadoop to solve a problem facing their search operations. Proprietary systems were too expensive and inflexible to manage a flood of indexing data. Enabling their on-premises data centers to do the job provided an easier, more scalable way for Yahoo! to index the world wide web.

Nearly two decades later, Hadoop’s framework imposes limitations of its own as companies contend with the volume, velocity, and variety of modern enterprise data.

What is Hadoop?

The Hadoop framework distributes data across multiple machines, using parallel processing workflows to analyze and process that data. The framework’s core capabilities derive from four modules:

  • Hadoop Common is the data platform’s shared library.
  • Hadoop Distributed File System (HDFS) manages data files across the data center’s racks and servers.
  • Hadoop MapReduce is a Java-based programming model for running large-scale distributed workflows.
  • Hadoop Yet Another Resource Negotiator (YARN) manages resources for MapReduce jobs.

A Hadoop cluster consists of many servers, each running its own HDFS and YARN processes. Fault-tolerant replication ensures that the failure of any given server, or even an entire rack, does not result in data loss.
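To make the MapReduce programming model concrete, here is a minimal word-count sketch written for Hadoop Streaming, which lets scripts in languages other than Java act as the mapper and reducer. The file paths, jar name, and invocation shown in the comments are illustrative assumptions, not a prescription.

```python
#!/usr/bin/env python3
# Minimal Hadoop Streaming sketch of the MapReduce model (word count).
# Hypothetical invocation (paths and jar name are illustrative):
#   hadoop jar hadoop-streaming.jar \
#       -files wordcount.py \
#       -input /data/docs -output /data/wordcount \
#       -mapper "python3 wordcount.py map" \
#       -reducer "python3 wordcount.py reduce"
import sys


def map_phase():
    # Emit "word<TAB>1" for every word; Hadoop shuffles and sorts by key.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")


def reduce_phase():
    # Input arrives sorted by key, so counts accumulate per word.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current and current is not None:
            print(f"{current}\t{count}")
            count = 0
        current = word
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")


if __name__ == "__main__":
    map_phase() if sys.argv[1:] == ["map"] else reduce_phase()
```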

Hadoop ecosystem

Hadoop helped prove that open source could provide performant, reliable solutions to enterprise-scale requirements. However, the framework only addressed one aspect of an enterprise’s data architecture. To fill the gaps, the Apache Software Foundation came to support an entire Hadoop-based ecosystem of complementary projects.

For example, Apache HBase is a non-relational distributed database running on top of Hadoop that can process large volumes of sparse data. Similarly, Apache Ambari was developed as a Hadoop cluster management tool.

Hadoop limitations

Hadoop’s limitations when addressing modern data challenges stem from its early-2000s origins as a data center solution. For example, HDFS uses large block sizes (64MB in early releases, 128MB by default in later versions), which suit the big, sequentially processed datasets Hadoop was designed for. But e-commerce clickstreams and other modern data sources generate large volumes of much smaller files. HDFS stores each file as at least one block that the NameNode must track in memory, a mismatch known as Hadoop’s small file problem. The resulting flood of tiny blocks adds to MapReduce’s already lengthy processing times.
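As a rough illustration of why small files hurt, the back-of-the-envelope sketch below compares the NameNode bookkeeping for a terabyte stored as tiny clickstream files versus large batch files. The roughly 150 bytes of NameNode heap per tracked object is a commonly cited approximation, not an exact figure, and the file sizes are hypothetical.

```python
# Back-of-the-envelope sketch of HDFS's small file problem.
# ~150 bytes of NameNode heap per file/block object is a commonly cited
# approximation; real overhead varies by Hadoop version and configuration.
OBJECT_OVERHEAD_BYTES = 150
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size in Hadoop 2+


def namenode_objects(total_bytes: int, file_size: int) -> int:
    """Approximate file + block objects the NameNode tracks for a dataset."""
    files = total_bytes // file_size
    blocks_per_file = max(1, -(-file_size // BLOCK_SIZE))  # ceiling division
    return files * (1 + blocks_per_file)


ONE_TB = 1024 ** 4
small = namenode_objects(ONE_TB, 4 * 1024)   # 1 TB of 4 KB clickstream files
large = namenode_objects(ONE_TB, 1024 ** 3)  # 1 TB of 1 GB batch files

print(f"4 KB files: ~{small * OBJECT_OVERHEAD_BYTES / 1024 ** 3:.0f} GB of NameNode heap")
print(f"1 GB files: ~{large * OBJECT_OVERHEAD_BYTES / 1024 ** 2:.1f} MB of NameNode heap")
```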

MapReduce’s complexity also limits Hadoop’s role in data-driven organizations. Outside the data engineering community, few people understand MapReduce’s specialized Java programming model. Finding and keeping this talent is expensive. Moreover, data teams become analytics gatekeepers when data consumers have nowhere else to turn.

Another Hadoop limitation, high-latency processing, comes from Hadoop’s original purpose as a batch processor for Yahoo!’s large datasets, a design that prioritized reliability over speed. MapReduce writes each interim result to physical storage to ensure resiliency and recoverability. These read/write-intensive workloads limit Hadoop’s utility for ad hoc and real-time analytics.

Hadoop alternatives: big data services in AWS and other HDFS replacements

The limitations inherent to Hadoop’s core technology led to the development of varied Hadoop alternatives. These options range from replacing one or more Hadoop core modules to providing Hadoop-like alternatives running in the cloud.

SQL-on-Hadoop solutions

MapReduce’s accessibility issues became apparent shortly after Hadoop’s release as companies applied the framework beyond Yahoo!’s original use case. Recognizing that the population with SQL experience dwarfs the population of MapReduce developers, the Apache Hive and Apache Impala projects developed SQL-based solutions running on Hadoop.

Apache Hive acts as a translation layer between users familiar with SQL and Hadoop’s MapReduce functionality. Although simpler to use, Hive makes the latency issue worse by adding additional steps to any processing job.
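For illustration, the sketch below submits a HiveQL query through the open source PyHive client. The host, port, credentials, and table name are hypothetical; the point is that Hive compiles the statement into distributed jobs rather than answering it directly.

```python
# Minimal sketch: running HiveQL through the open source PyHive client.
# Host, credentials, and table name are hypothetical.
from pyhive import hive

conn = hive.connect(host="hive.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# Hive translates this SQL into distributed jobs behind the scenes,
# which is where the extra latency comes from.
cursor.execute("""
    SELECT page, COUNT(*) AS views
    FROM clickstream
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""")

for page, views in cursor.fetchall():
    print(page, views)
```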

Apache Impala bypasses MapReduce to generate low-latency query results using Hive’s SQL syntax, metadata, and other features. Impala gives a Hadoop data platform the same ad hoc SQL query performance as a proprietary data warehouse.

Hadoop in the cloud

Hadoop’s capabilities are limited by the data center’s physical infrastructure. Expanding capacity requires careful planning and budgeting, and it cannot match the dynamic scalability available in the cloud. To bridge that gap, cloud service providers like Amazon Web Services (AWS) offered ways to deploy Hadoop without an on-premises data center.

Amazon EMR (originally Elastic Map Reduce) extends HDFS so companies can write data to Amazon S3’s object storage service. Microsoft Azure’s HDInsight and Google Cloud’s Dataproc provide similar Hadoop-in-the-cloud services. However, moving Hadoop from the data center to the cloud does not address the framework’s inherent limitations.

Alternative data processing frameworks

While the Hadoop ecosystem lets companies build sophisticated data architectures on top of HDFS, few companies have Hadoop-exclusive infrastructures. The need for more platform-agnostic data processing technologies led to the rise of projects like Apache Spark and Apache Flink.

Data scientists at the University of California, Berkeley developed Spark to support machine learning projects. Spark’s in-memory data processing reduces latency and makes ad hoc, iterative queries on HDFS more viable.
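A short PySpark sketch illustrates this in-memory, iterative style; the HDFS path and column names are hypothetical.

```python
# Minimal PySpark sketch: iterative, in-memory analysis over HDFS data.
# The HDFS path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clickstream-exploration").getOrCreate()

# Read once from HDFS, then cache so repeated queries avoid the disk.
events = spark.read.parquet("hdfs:///data/clickstream/2024/").cache()

# Ad hoc, iterative questions against the cached DataFrame:
events.groupBy("page").count().orderBy(F.desc("count")).show(10)
events.filter(F.col("country") == "US").groupBy("device").count().show()

spark.stop()
```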

German researchers developed what would become Flink, a distributed analysis engine that uses Java and Scala APIs to process streaming and batch data. Flink works with sources besides HDFS, including the open source NoSQL database Cassandra.

Object storage in cloud data lakes

The data center’s cost and inflexibility led organizations to seek alternatives from cloud object storage services like Amazon S3, Microsoft Azure Blob Storage, and Google Cloud Storage. Object storage offers a more cost-effective and scalable way to create large data repositories with all the benefits of being in the cloud.

Initial attempts to create these data lakes suffered from object storage’s limited metadata and underdeveloped management tools. Data became difficult to find and the data lakes turned into data swamps.

Cloud data warehouses

Google Cloud’s BigQuery and AWS’ Redshift represent a new generation of cloud-native data warehouses that provide another alternative to Hadoop. These services offer a middle ground between traditional proprietary warehouse vendors and a roll-your-own open source solution like Hadoop. Cloud data warehouses are more affordable and scalable than traditional solutions. At the same time, these platforms create a soft form of vendor lock-in that complicates multi-cloud architectures.

Trino (formerly Presto), SQL-on-Anything

Trino is a distributed SQL query engine that runs efficient, high-performance queries on exabyte-scale data sources. Combined with Iceberg’s open table format, Trino forms the basis of a modern open data lakehouse, a data architecture that blends the scalability of object storage with the big data analytics of a warehouse.

Trino architecture and features

A coordinator and one or more worker nodes comprise Trino’s architecture. The coordinator parses and plans queries and manages the worker nodes, enabling massively parallel processing. Workers execute the resulting tasks, processing the data they retrieve from the data sources. Trino processes data in memory to minimize latency. Connectors link a Trino cluster to its data sources, and each connector’s catalog describes the source’s schema.
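As a concrete sketch of this flow, the snippet below uses the open source trino Python client to submit a query to a coordinator, which plans it and distributes tasks to workers. The host, catalog, schema, and table names are hypothetical.

```python
# Minimal sketch using the open source "trino" Python client.
# Host, catalog, schema, and table names are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",  # the coordinator
    port=8080,
    user="analyst",
    catalog="iceberg",         # a catalog exposed by a connector
    schema="web",
)
cursor = conn.cursor()

# The coordinator parses and plans the query, then fans tasks out to workers,
# which read from the underlying source and process the data in memory.
cursor.execute("""
    SELECT page, count(*) AS views
    FROM clickstream
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""")

for page, views in cursor.fetchall():
    print(page, views)
```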

Trino’s modern design enables advanced features for big data analytics, including:

Massively-parallel processing: Trino leverages the scalable compute of modern cloud service providers to deliver fast, cost-effective results.

ANSI-standard SQL: Using ANSI-standard SQL minimizes learning curves and ensures compatibility with business intelligence tools like Power BI.

In-place analysis: Trino can query data where it lives rather than waste resources copying data into a central staging area.

Deploy anywhere: Trino can run on-premises or in your preferred cloud environment.

Federation: Trino connectors let a single query access data from multiple sources, federating data sources in a single virtual access layer.
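For example, one federated statement can join a table in an object-storage catalog with a table in a PostgreSQL catalog. The sketch below reuses the trino Python client from the earlier example; the catalog, schema, and table names are purely illustrative.

```python
# Hypothetical federated query: joins data-lake tables (hive catalog) with an
# operational database (postgresql catalog) in a single SQL statement.
import trino

conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="analyst")
cursor = conn.cursor()

cursor.execute("""
    SELECT c.customer_id, c.segment, sum(o.total) AS lifetime_value
    FROM postgresql.crm.customers AS c
    JOIN hive.sales.orders AS o
      ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.segment
    ORDER BY lifetime_value DESC
    LIMIT 20
""")

for row in cursor.fetchall():
    print(row)
```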

SQL-on-anything with a Starburst open data lakehouse

Starburst, the enterprise-ready implementation of Trino, builds upon the open source technology to unify enterprise data architectures behind a single point of access. Federated connectors link data lakes, warehouses, operational databases, and other sources to Starburst and make them accessible through the SQL your analysts and applications already use.

This SQL-on-anything capability frees data teams from being the sole source of expertise. Authorized users can access data directly, accelerating speed to insight across various use cases, including:

Data science: Machine learning projects depend on an iterative exploration and development process. Starburst and Trino let data scientists run and refine their own queries.

Business data analysis: Direct access to data sources through SQL-compatible applications like Tableau lets analysts generate ad hoc reports for decision-makers operating on tight deadlines.

Data management: Besides eliminating the request backlog, Starburst streamlines data management. Many complex ETL data pipelines can be replaced with Starburst workflows that pull data from multiple sources in a single SQL query.

Integration with Trino and Starburst

Using Starburst and Trino to integrate your data architecture is the best Hadoop alternative. Starburst’s Hive connector lets you run SQL queries against legacy Hadoop sources with up to twenty-fold increases in performance and 75% lower costs compared to Hive or Impala.

But with connectors to more than fifty enterprise data sources, Starburst and Trino create an open data lakehouse that goes beyond being merely a Hadoop alternative. Starburst brings Redshift warehouses, Kafka topics, PostgreSQL databases, and more together behind a single interface that can help turn your organization into a truly data-driven business.