Enhancing Apache Hadoop Data Management with Trino and Starburst

Share

For almost two decades, companies have built big data processing architectures based on the Hadoop ecosystem. To extend the Hadoop project beyond its core design, they use Apache Software Foundation projects like the HBase relational database or Oozie’s workflow management resources. 

Yet, Hadoop cannot address the full scope of enterprise data requirements.

Apache Hadoop and extended data management is a way for enterprises to store and process large amounts of data using distributed computing on commodity hardware. The open-source framework popularized big data analytics by letting companies affordably manage large datasets. Revolutionary for its time, the Hadoop framework poses complexity, performance, and latency challenges for how organizations use data today.

This article will discuss how the Trino massively parallel processing SQL query engine, enhanced by Starburst, significantly improves Hadoop data management.

Data Ingestion

Ingestion, the landing, and staging of raw data from a source, are the traditional first steps in Hadoop data integration. However, batch ingestion is no longer the only process for integrating enterprise data.

Batch Ingestion

The relative obscurity of Hadoop’s programming model is challenging. Hadoop MapReduce uses a unique variant of Java that few people within, much less beyond, the data management community understand. Building basic batch processing jobs requires input from a data team’s overstretched Hadoop experts. Projects like Apache Sqoop, now archived, let data teams extract data from relational databases using more widely understood SQL queries.

How Starburst Helps

Besides MapReduce’s obscurity, ETL ingestion pipelines impose storage and network penalties as large volumes of data flow from a relational database to Hadoop’s staging area. Trino’s SQL engine can pushdown queries to process data directly from the source. Rather than transferring huge datasets into interim storage, Trino lets pipelines load the final results into the destination.

Real-Time Ingestion

Hadoop wasn’t designed for the constant flow of small files generated by e-commerce, social media, and other real-time sources. Stream processing frameworks like Apache Kafka and Flink collect data from these sources for ingestion into Hadoop repositories.

However, streaming sources are sparse, and most of the data they generate is not useful. Given how the Hadoop Distributed File System (HDFS) handles small files, ingesting real-time data severely affects storage capacity and cost.

How Starburst Helps

Trino’s Kafka connector lets data managers query streaming sources directly without transferring full data streams. Starburst administrators can add schema metadata to a Kafka topic, allowing users to run real-time SQL queries from their preferred application. Starburst’s JSON functions help queries apply schema-on-read to semi-structured data streams. As with batch ingestion, Starburst lets engineers pushdown query processing to reduce storage and compute costs.

Data Processing

Data processing of large datasets enforces data warehouse schemas during integration and supports data analysis. Hadoop made processing more cost-effective than the proprietary systems of twenty years ago, but inherent performance limitations pose challenges for today’s large workloads.

Hadoop: MapReduce

MapReduce is one of Hadoop’s core modules, along with HDFS, Yet Another Resource Negotiator (YARN), and Hadoop Common. MapReduce and YARN work with the NameNode and DataNodes in Hadoop clusters to create parallelized workflows that reliably process data. However, Hadoop prioritizes high availability and fault tolerance over performance, which imposes significant latency penalties on high throughput jobs.

How Starburst Helps

Trino bypasses MapReduce and YARN entirely to query Hadoop data directly. Starburst further improves Hadoop data processing by enhancing Trino with cost-based optimizations, dynamic filtering, cached views, and more. Creating Trino queries through Starburst accelerates processing jobs and reduces network traffic to deliver results faster.

Data Storage

Hadoop’s innovations let companies build analytics infrastructure on the commodity data storage systems in their data centers rather than expensive proprietary data warehouses. However, Hadoop’s file system struggles to keep pace with today’s data storage demands.

Hadoop: HDFS (Hadoop Distributed File System)

Hadoop’s design assumptions prioritize reliability and availability, spreading multiple copies of each file across servers and racks through replication. However, this means that every bit of data stored in HDFS requires three bits of storage capacity in the data center. The realities of IT budgeting mean scaling physical infrastructure cannot happen on demand. Companies must either accept storage limitations or over-invest in storage capacity.

How Starburst helps

Starburst offers connectors to over fifty enterprise data sources, both on-premises and in the cloud, to support a federated data architecture. Data remains where it lives and is just as accessible as data stored in Hadoop. Federation minimizes data transfers since Hadoop’s repository doesn’t have to hold data from every source. Similarly, companies can store more data in scalable, cost-effective cloud services like Amazon S3 or Azure Blob Storage.

Query Engine

Many projects in the Hadoop ecosystem try to address the framework’s inherent performance and accessibility limits.

Hadoop: Hive, Spark SQL

Besides being relatively obscure, MapReduce suffers from performance issues due to its read/write-intensive on-disk processing model. Alternatives try to address these accessibility and performance issues. 

Apache Hive overlays a SQL-like interface that converts SQL statements into MapReduce tasks, making Hadoop more accessible but adding even more latency. Apache Spark’s SQL engine replaces MapReduce to improve query performance, especially for data science projects and training machine learning algorithms.

How Starburst helps

Ad hoc and interactive queries for big data analytics were Trino’s original use cases. Over time, Trino’s capabilities grew to support batch processing and Starburst has extended it further to support stream processing. With an accessible SQL interface, Starburst lets engineers, data scientists, and analysts run complex petabyte-scale queries against multiple sources. Starburst expands data accessibility while reducing data engineering workloads.

Metadata Catalog

Efficient data analytics and management depend upon rich metadata to find, query, and process jobs. HDFS metadata only supports file operations, so analytics needs supplemental metadata sources.

Hadoop: Hive Metastore

Another Hive development priority was providing rich metadata to support analytics. Hive does not store this metadata with the actual data. Instead, it uses a separate relational database called the Hive Metastore to store a table’s schema, data types, and other information. The Metastore gives Hive’s query engine and other elements of the Hadoop ecosystem a central location for accessing metadata.

How Starburst helps

Starburst’s Hive connector replaces the Hive runtime with Trino while using the Hive Metastore and the HDFS storage system. Companies using Amazon EMR, a Hadoop-in-the-cloud service, can do the same with Starburst’s AWS Glue connector.

Open table format

Hadoop is a data processing framework, but HDFS only manages files, not the data within those files. Additional systems create the tables that structure data to support analytics and other use cases.

Hadoop: Hive

Hive’s Metastore and SQL-like query layer turn Hadoop into a data warehouse for analyzing massive datasets. The metadata defines column names, data types, and other aspects of a table’s schema the query layer can use to find and process data. In this respect, Hive is also a table format.

However, as an analytics platform, Hive is difficult to keep up-to-date because of how HDFS stores data, Metastore synchronization, and other issues.

How Starburst helps

As enterprises migrate to the cloud, they take advantage of modern open table formats. Starburst supports Delta Lake and Hudi. However, pairing Starburst with Iceberg goes beyond a data lake to create a new open data lakehouse architecture called the Icehouse, which leverages scalable object storage, Iceberg’s rich metadata, and Trino’s performant federated query engine. Data stored in Hadoop is just as accessible as data stored anywhere else, making it easier to migrate from legacy on-premises systems.

Data security

Hadoop’s developers assumed companies would run their systems in data centers protected by secure perimeters, so security was not central to their designs. Later releases added encryption, authentication, and other security features, but protecting data in Hadoop remains challenging.

Hadoop: Ranger

Enterprise Hadoop vendors like Cloudera extend and enhance the framework’s security capabilities, but that only helps companies that go the commercial route. Companies dealing with older versions of Hadoop turn to security projects like Apache Ranger to ensure consistency across Hadoop clusters and the Hadoop ecosystem.

How Starburst helps

Enterprise data architectures extend beyond the Hadoop platform, so centralizing Hadoop security requires parallel administration with non-Hadoop systems. Starburst’s federated architecture centralizes data access control so administrators can apply fine-grained role-based and attribute-based policies across all data sources, Hadoop or otherwise. Starburst integrates with Ranger and its commercial implementation, Privacera, as well as with Immuta and AWS Lake Formation.

Data governance

As with security, governance was not a key design factor during Hadoop’s development, forcing companies to find alternate means to enforce data standards. This is particularly important in healthcare and other regulated industries, but governance concerns any company dealing with data privacy laws.

Hadoop: Atlas

Apache Atlas is a framework for enforcing governance policies within the Hadoop ecosystem. Metadata and classifications let governance teams define data and identify sensitive or regulated information. Lineage tracking, search, and discovery tools make it easier for Hadoop users to understand data quality and provenance.

How Starburst helps

The managed analytics platform Starburst Galaxy has a universal data discovery, governance, and security service called Gravity. From its unified interface, Gravity lets administrators define role-based and attribute-based access controls, manage metadata, and streamline user management across all connected data sources. SSO and OAuth APIs integrate Starburst with a company’s existing security stack, including Okta, Collibra, and Alation.