Apache Hudi
What is Apache Hudi?
Apache Hudi (pronounced “hoodie”) is a transactional data lake platform first developed by Uber to bring data warehouse-like analytics capabilities to data lakes built on the Hadoop Distributed File System (HDFS). Hudi’s incremental processing framework gives data lakehouses near real-time analytics capabilities.
What are the core features of Hudi?
There’s more to this open-source project than the Hudi table format. Hudi’s use cases include stream processing of logs and events, near real-time analytics, machine learning, and data lakehouse management. Some of Hudi’s core features include:
Incremental processing
ETL data pipelines are the traditional ingestion method for feeding new data into a data lake. These pipelines run periodically, with batches separated by an hour, a day, or more, depending on the data source. Hudi supports this traditional approach by storing data in the open Parquet file format. With each batch update, Hudi versions and rewrites the affected files, a pattern known as Copy On Write (COW).
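As a minimal PySpark sketch of this pattern, the options below perform a batch upsert into a COW table. The DataFrame `df`, the `id` and `ts` columns, and the S3 path are assumptions for illustration, not part of any particular deployment:

```python
# Illustrative batch upsert into a Copy On Write table.
# `df` is an assumed Spark DataFrame with `id` and `ts` columns;
# the table name and S3 path are placeholders.
cow_options = {
    "hoodie.table.name": "orders_cow",
    "hoodie.datasource.write.recordkey.field": "id",      # unique record key
    "hoodie.datasource.write.precombine.field": "ts",     # ordering field for upserts
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
}

(df.write
   .format("hudi")
   .options(**cow_options)
   .mode("append")
   .save("s3a://my-bucket/hudi/orders_cow"))
```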
A downside to this approach is that batch processing makes data lakes unsuitable for analyzing real-time data, which remains the realm of expensive, often proprietary, real-time analytics platforms that can deliver query results with sub-second latencies.
However, another class of analytics requirements falls between the capabilities of these high-performance systems and conventional data lakes. Consider an e-commerce system that monitors user website activity to tailor the customer experience. It needs data quickly to influence purchasing behavior, but that data doesn’t have to arrive instantly. Getting updated information within a few minutes may be sufficient.
Hudi supports this kind of near real-time use case because it isn’t limited to batch data processing. In a Merge On Read (MOR) pattern, Hudi stores data using both columnar and row-based file formats, typically Parquet and Avro, respectively. Avro files store updates, which are periodically compacted to create new Parquet files. As a result, the MOR pipelines can add new data at a higher frequency than the COW pipelines.
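Switching the same write to MOR is largely a configuration change. The sketch below reuses the assumed `df`, keys, and placeholder paths from the previous example and adds illustrative inline compaction settings:

```python
# Same write reconfigured as Merge On Read, with illustrative
# inline compaction every five delta commits.
mor_options = {
    **cow_options,
    "hoodie.table.name": "orders_mor",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
}

(df.write
   .format("hudi")
   .options(**mor_options)
   .mode("append")
   .save("s3a://my-bucket/hudi/orders_mor"))
```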
Hudi can ingest data streams from real-time sources like Kafka to deliver query results with relatively low latency. Companies can use Hudi’s batch and streaming capabilities to consolidate more data in the data lake while reducing reliance on expensive real-time systems.
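A hedged sketch of this streaming path using Spark Structured Streaming follows; it assumes an existing SparkSession `spark`, and the broker address, topic name, event schema, and paths are all placeholders:

```python
# Illustrative Structured Streaming ingestion from Kafka into a MOR table.
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Assumed schema of the JSON events on the topic.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("event_ts", TimestampType()),
])

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "clickstream")                 # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*"))

(events.writeStream
    .format("hudi")
    .option("hoodie.table.name", "clickstream_events")
    .option("hoodie.datasource.write.recordkey.field", "event_id")
    .option("hoodie.datasource.write.precombine.field", "event_ts")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/clickstream")
    .outputMode("append")
    .start("s3a://my-bucket/hudi/clickstream_events"))
```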
ACID compliance and time travel
Hudi data lakes can process ACID transactions, making them suitable for applications that must preserve data validity. Tables have dedicated timelines where every action gets logged with millisecond granularity, guaranteeing an atomic record of all table changes. Hudi provides snapshot isolation between writer, reader, and table services processes, as well as granular concurrency controls to ensure consistency.
Table snapshots allow users to perform time travel queries supporting historical analysis of data changes, rollbacks, and debugging.
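As a sketch, a time travel read in PySpark might look like the following, assuming a SparkSession `spark` and the placeholder table path from the earlier examples; the instant shown is purely illustrative:

```python
# Time travel read: query the table as of an earlier instant.
snapshot = (spark.read
    .format("hudi")
    .option("as.of.instant", "2024-01-15 09:00:00")   # illustrative instant
    .load("s3a://my-bucket/hudi/orders_cow"))

snapshot.createOrReplaceTempView("orders_asof")
```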
Data lake management
Hudi offers a suite of services to better manage a data lake’s object storage, including cleaning, clustering, compaction, file sizing, and indexing. Schema-on-write enforcement, combined with support for schema evolution, keeps pipelines resilient to the inevitable changes in table design.
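These services are largely configuration-driven. The options below are an illustrative, non-exhaustive sample passed as write options; the retention count, clustering cadence, and file-size thresholds are assumptions to tune per workload:

```python
# Illustrative table service settings; all values are assumptions.
maintenance_options = {
    "hoodie.cleaner.commits.retained": "10",          # cleaning: keep 10 commits of history
    "hoodie.clustering.inline": "true",               # clustering: run after writes
    "hoodie.clustering.inline.max.commits": "4",
    "hoodie.parquet.small.file.limit": "104857600",   # file sizing: ~100 MB small-file floor
    "hoodie.parquet.max.file.size": "125829120",      # ~120 MB target file size
}
```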
Query optimization
Hudi’s metadata table is an internal serverless MOR table that stores table metadata to improve query read and write performance. This metadata table eliminates file list operations and improves query planning by providing rapid access to column statistics.
Multi-modal indexing enhances the metadata table by adding new index types, such as column statistics and bloom filters, as additional partitions of that table. Exposing this rich metadata to query engines like Trino reduces compute workloads and delivers results faster.
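As a sketch, the column statistics and bloom filter index partitions can be enabled through write options like the following; defaults vary across Hudi releases, so treat these as illustrative:

```python
# Illustrative options enabling metadata-table indexes at write time.
metadata_index_options = {
    "hoodie.metadata.enable": "true",                     # metadata table (on by default in recent releases)
    "hoodie.metadata.index.column.stats.enable": "true",  # column statistics index
    "hoodie.metadata.index.bloom.filter.enable": "true",  # bloom filter index
}
```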
| Feature | Apache Hudi |
| --- | --- |
| Transaction support (ACID) | Yes |
| File format | Parquet, ORC, Avro |
| Schema evolution | Full |
| Partition evolution | No |
| Data versioning | Yes |
| Time travel queries | Yes |
| Concurrency control | Optimistic locking |
| Object store cost optimization | Yes |
| Community and ecosystem | Growing |
Using Apache Hudi for near real-time analytics
Hudi can ingest data within minutes, allowing for the rapid, simultaneous analysis of multiple datasets. We’ve already mentioned how near real-time analysis helps e-commerce companies personalize customer experiences. The same is true in healthcare, where a 360-degree view of patients based on the freshest data can deliver positive medical outcomes.
Near real-time analytics also plays an essential role in governance automation, allowing machine learning models to quickly detect anomalies that could signal cyber security threats or financial fraud.
Furthermore, making governance systems more responsive significantly improves compliance with privacy regulations like the European Union’s General Data Protection Regulation (GDPR). Batch processing introduces an hours-long delay between when a company collects personally identifiable information (PII) and when PII is accessible to governance systems. Hudi’s incremental processing lets governance systems recognize PII within minutes for rapid policy enforcement.
How does Apache Hudi handle data partitioning and bucketing?
Partitioning and bucketing improve query efficiency by letting queries scan part of the data rather than the entire table. Partitioning splits large datasets into smaller pieces: when Hudi saves data, it assigns each record a unique key made up of a record key and a partition path. Partition paths may be based on timestamps or on specified columns in the table.
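A minimal sketch of the relevant write options, assuming an orders table with `id`, `ts`, and `order_date` columns (the column names are illustrative):

```python
# Illustrative key and partitioning options; column names are assumptions.
partition_options = {
    "hoodie.datasource.write.recordkey.field": "id",               # record key
    "hoodie.datasource.write.partitionpath.field": "order_date",   # partition path column
    "hoodie.datasource.write.precombine.field": "ts",              # ordering field for upserts
}
```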
Hudi’s bucketing capabilities let you group records within each partition based on a hash of a chosen field. A Bucket Index defines a fixed number of buckets within a partition, which can be an issue with skewed data because some buckets grow far larger than others. A more recent Hudi feature, the Bucket Index with Consistent Hashing, allows buckets to split and merge independently based on the workload.
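The bucket index is also driven by write options. A hedged sketch, with an illustrative bucket count and the hash field assumed from the earlier example:

```python
# Illustrative bucket index settings; bucket count and hash field are assumptions.
bucket_options = {
    "hoodie.index.type": "BUCKET",
    "hoodie.bucket.index.num.buckets": "32",
    "hoodie.bucket.index.hash.field": "id",
    # Consistent hashing allows buckets to resize dynamically (MOR tables).
    "hoodie.index.bucket.engine": "CONSISTENT_HASHING",
}
```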
How does Apache Hudi support incremental and change data capture (CDC) in analytics?
Change data capture (CDC) brings row-based data from transactional databases into the column-based tables in a data lake. Hudi includes a Debezium connector that streamlines CDC integration.
The connector monitors the transactional database and sends change records as Avro messages to a table’s dedicated Kafka topic.
DeltaStreamer is a Hudi utility that pulls changes from DFS, Kafka, or other upstream sources and ingests them into Hudi tables. When pointed at the table’s Kafka topic, DeltaStreamer pulls the change records and writes them to the appropriate Hudi table.
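Downstream consumers can then read those changes incrementally rather than rescanning the whole table. A sketch of an incremental query in PySpark, assuming a SparkSession `spark`, the earlier placeholder table path, and an illustrative begin instant:

```python
# Incremental read: fetch only records committed after the given instant.
incremental = (spark.read
    .format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240115090000")
    .load("s3a://my-bucket/hudi/orders_mor"))
```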
Apache Hudi scalability and performance characteristics
Hudi delivers fast query performance that scales as data volumes grow. Its storage management services automatically optimize file layout in cloud object storage. Because queries become less efficient as files shrink and file counts grow, Hudi distributes inserts across existing file groups rather than creating new file groups for every batch of incoming data, keeping file sizes healthy.
Multi-modal indexing is another performance-enhancing feature. For example, Hudi creates a separate metadata table that enables data skipping. Rather than checking footers in thousands of files to see if they contain relevant data, queries use this metadata index to find the most relevant files and pass over the rest.
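On the read side, data skipping is enabled through query options. A minimal sketch, assuming the metadata table and column statistics index were enabled at write time and reusing the earlier placeholder path and columns:

```python
# Illustrative read that skips files using metadata-table column statistics.
skipping_read = (spark.read
    .format("hudi")
    .option("hoodie.metadata.enable", "true")
    .option("hoodie.enable.data.skipping", "true")
    .load("s3a://my-bucket/hudi/orders_cow")
    .where("order_date = '2024-01-15'"))
```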
How can I integrate Hudi with Amazon S3 for data lake storage?
Hudi offers a straightforward path using Spark for integration with Amazon S3, the AWS object-based cloud storage service. Configuring a SparkSession or SparkContext with your S3 credentials makes them accessible to Hudi. Another way would be to let Hudi fetch the configuration from your core-site.xml.
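A minimal sketch of the SparkSession approach; the credential placeholders and application name are assumptions, and in practice instance profiles or a secrets manager are preferable to hard-coded keys:

```python
# Illustrative SparkSession configuration for s3a:// access with Hudi.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("hudi-on-s3")
    .config("spark.hadoop.fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")      # placeholder
    .config("spark.hadoop.fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")  # placeholder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate())
```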
Starburst offers an even simpler option through Great Lakes, the connectivity process within Starburst Galaxy. Just create a table with a Hudi type declaration. Great Lakes takes care of the rest.
How Hudi complements Spark or Trino
Combining Hudi with distributed SQL query engines like Apache Spark or Trino (formerly PrestoSQL) reduces analytical workloads and lets users get quick results from complex queries. Hudi builds an efficient, scalable data repository on affordable object-based cloud storage services. Trino and Spark provide computationally efficient, massively parallel query capabilities to analyze multiple petabyte-scale datasets.
Before the Trino project developed its Hudi connector, data teams were limited to the Hive connector, which didn’t provide access to Hudi’s metadata table and performance optimizations. The Trino integration is read-only, so teams often pair it with Spark, which handles writes to Hudi tables while Trino serves the queries.
What are the benefits of using Apache Hudi with Trino for a data lake architecture?
Hudi and Trino combine to create a modern data lake architecture for startups and enterprises alike.
Cost-efficient: Rather than using expensive, proprietary solutions, Hudi+Trino data lakes run on commodity compute and storage infrastructure from providers like AWS, Azure, or Google Cloud.
Open source: Open file and table formats prevent vendor lock-in by guaranteeing data portability.
Performant: Trino’s massively parallel query engine and Hudi’s query optimizations speed discovery while minimizing compute resource consumption.
Simplified: Enabling near real-time analytics reduces reliance on proprietary solutions and further consolidates enterprise data within the data lake.
By using Starburst Galaxy, a modern data lakehouse analytics platform, you can build on these Apache Hudi benefits even further.
Unified data: All enterprise data sources, not just the data lake, are accessible through Galaxy’s single point of access.
Performance: Smart indexing and caching can quadruple performance at a fraction of the compute cost.
Governance: Galaxy’s unified governance layer centralizes role-based and attribute-based access controls and policy creation.