Data Ingestion
The data ingestion process lays the foundation for big data analytics. This guide introduces the ingestion process and explains how ingestion workflows fit within an open data lakehouse architecture.
Data ingestion vs data integration
Ingestion comes before data integration, bringing data from different sources into a central repository such as a data lakehouse. The data lands in its original source format, whether that is clickstream events from an e-commerce platform or sales data from operational apps.
Integration transforms this landed data to make it more suitable for use. It may address data quality issues, apply consistent data formats across data from various sources, or aggregate data to meet business requirements.
Data ingestion vs ETL
Extract, transform, and load (ETL) and extract, load, and transform (ELT) pipelines are data transformation methods used in the integration stage. Left in its raw state, data ingested from different source systems is too inconsistent for end users to analyze without significant help from data engineers. ETL and ELT pipelines introduce order to make datasets more usable. For example, a pipeline may apply a consistent schema to unstructured data. These pipelines may also transform data to improve data quality, enrich metadata, and more.
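To make this concrete, here is a minimal Python sketch of the kind of schema-applying transformation an ELT step might run after ingestion. The sample events, field names, and mapping are hypothetical; the point is that records arriving in inconsistent shapes leave the step under one consistent schema.

```python
import json
from datetime import datetime, timezone

# Hypothetical raw clickstream events landed as newline-delimited JSON.
RAW_EVENTS = [
    '{"ts": "2024-05-01T12:00:00Z", "user": "u123", "page": "/home"}',
    '{"timestamp": 1714564800, "user_id": "u456", "url": "/checkout"}',
]

def normalize(record: dict) -> dict:
    """Map inconsistent source fields onto one consistent schema."""
    ts = record.get("ts") or record.get("timestamp")
    if isinstance(ts, (int, float)):  # one source sends epoch seconds
        ts = datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
    return {
        "event_time": ts,
        "user_id": record.get("user") or record.get("user_id"),
        "page": record.get("page") or record.get("url"),
    }

normalized = [normalize(json.loads(line)) for line in RAW_EVENTS]
print(normalized)
```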
Two main types of data ingestion with big data
Data management teams can choose between two types of data ingestion: streaming (real-time) data processing and batch data processing.
What is batch ingestion?
Batch ingestion is the traditional approach to bringing large data volumes into a repository, such as a cloud data warehouse or a data lake. Scheduled or event-driven pipelines import data for integration. This process can take hours, usually in the form of overnight runs. Batch latency can get even worse should something go wrong. Any failure interrupting the run could require a fresh attempt that must compete with other scheduled workloads.
Batch ingestion’s sensitivity to failure can delay access by days as the data steadily ages. Modern batch processing approaches introduce fault-tolerant execution, allowing pipelines to resume from where they left off in the ingestion process.
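A minimal sketch of that idea in Python uses a simple file-based checkpoint: the batch run records each file it has loaded, so a rerun after a failure skips completed work instead of starting over. The paths, file pattern, and load step are placeholders for a real pipeline.

```python
import json
import pathlib

# Illustrative checkpoint file and landing directory.
CHECKPOINT = pathlib.Path("ingest_checkpoint.json")
BATCH_DIR = pathlib.Path("landing_zone")

def load_checkpoint() -> set:
    """Return the set of files already ingested in earlier runs."""
    if CHECKPOINT.exists():
        return set(json.loads(CHECKPOINT.read_text()))
    return set()

def save_checkpoint(done: set) -> None:
    CHECKPOINT.write_text(json.dumps(sorted(done)))

def ingest(path: pathlib.Path) -> None:
    # Placeholder for the real load step into the repository.
    print(f"ingesting {path}")

def run_batch() -> None:
    done = load_checkpoint()
    for path in sorted(BATCH_DIR.glob("*.csv")):
        if path.name in done:
            continue              # skip work finished before a failure
        ingest(path)
        done.add(path.name)
        save_checkpoint(done)     # persist progress after every file

if __name__ == "__main__":
    run_batch()
```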
What is real-time ingestion or streaming ingestion?
Change data capture (CDC) systems, industrial Internet of Things (IoT) devices, and other enterprise systems constantly generate data. Tapping into these sources as the data appears can make companies more responsive to near-term changes.
Data ingestion tools must be able to collect this source data with sufficiently low latency to meet the particular business need. However, delays are inevitable. Stream ingestion methods quickly bundle real-time data into microbatches, possibly taking seconds or minutes to make data available. That’s enough time for automated marketing systems to present promotions to online customers. On the other hand, financial trading and other time-critical applications require real-time ingestion tools that reduce this latency to fractions of a second.
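The following sketch shows one way micro-batching can look in practice, using the kafka-python client: events are buffered for a few seconds, then flushed as one unit before offsets are committed. The topic name, broker address, and flush destination are assumptions, and the window length would be tuned to the business need.

```python
import time
from kafka import KafkaConsumer  # kafka-python client

consumer = KafkaConsumer(
    "clickstream-events",                     # hypothetical topic
    bootstrap_servers="localhost:9092",       # placeholder broker
    auto_offset_reset="earliest",
    enable_auto_commit=False,
)

WINDOW_SECONDS = 5                            # micro-batch window

def flush(batch: list) -> None:
    # Placeholder for writing the micro-batch to object storage or a table.
    print(f"flushing {len(batch)} events")

batch, window_start = [], time.monotonic()
while True:                                   # run until the process is stopped
    records = consumer.poll(timeout_ms=1000)  # {TopicPartition: [records]}
    for partition_records in records.values():
        batch.extend(r.value for r in partition_records)
    if time.monotonic() - window_start >= WINDOW_SECONDS and batch:
        flush(batch)                          # make the micro-batch available
        consumer.commit()                     # commit offsets only after a flush
        batch, window_start = [], time.monotonic()
```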
Data ingestion process for near real-time data analysis
Conventional streaming data ingestion pipelines start with a distributed event streaming platform like Apache Kafka that collects events generated by the real-time data source. A connector framework, such as Kafka Connect, reads the Kafka events and writes them to the data lakehouse’s object storage. From there, a processing engine like Apache Spark must convert the JSON or Avro-formatted event data files into a columnar format like Apache Parquet while cataloging the files’ metadata.
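As a rough sketch of that conversion step, the PySpark snippet below reads landed JSON event files from object storage and rewrites them as Parquet. The S3 paths and the event_time field are illustrative assumptions; a production pipeline would also register the resulting files with the lakehouse catalog.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Paths and field names are illustrative, not a fixed convention.
spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Read the raw JSON event files that the connector landed in object storage.
raw = spark.read.json("s3a://my-lakehouse/landing/clickstream/")

# Rewrite them as columnar Parquet, adding a date column to partition by.
(raw.withColumn("event_date", F.to_date("event_time"))
    .write
    .mode("append")
    .partitionBy("event_date")
    .parquet("s3a://my-lakehouse/curated/clickstream/"))
```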
What are the challenges of real-time data ingestion?
Making streaming data available in close to real time makes organizations more responsive. Supporting this agility, however, requires significant resources for development and maintenance.
In addition to resource commitment, these complex pipelines create opportunities for data duplication and other errors that undermine quality and delay usability.
Data quality at ingestion
A particularly challenging ingestion issue occurs when the arriving data is not formatted correctly or does not meet quality expectations. Transforming the invalid data is the only way to correct these issues. Otherwise, the dataset becomes too difficult for end users to access. However, that transformation may sacrifice the raw data’s potential value for future analytics.
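One common pattern, sketched below in plain Python, is to validate records at ingestion and quarantine the ones that fail, keeping the original bytes untouched so the raw data’s future value is not lost. The record shapes and validation rules here are purely illustrative.

```python
import json

# Hypothetical raw records landed from a source system.
RAW = [
    '{"order_id": "1001", "amount": "49.99", "currency": "USD"}',
    '{"order_id": "1002", "amount": "not-a-number", "currency": "USD"}',
    'not valid json at all',
]

def validate(line: str):
    """Return (clean_record, None) or (None, reason) without altering the raw line."""
    try:
        record = json.loads(line)
        record["amount"] = float(record["amount"])
        return record, None
    except (json.JSONDecodeError, KeyError, ValueError) as exc:
        return None, str(exc)

clean, quarantined = [], []
for line in RAW:
    record, reason = validate(line)
    if record is not None:
        clean.append(record)
    else:
        quarantined.append({"raw": line, "reason": reason})  # keep the original bytes

print(f"{len(clean)} clean rows, {len(quarantined)} quarantined rows")
```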
Data ingestion framework with Trino
Starburst Galaxy leverages Trino’s massively parallel query engine and Iceberg’s open table format to provide a managed Kafka ingestion solution. This framework addresses the challenges of ingesting data from real-time sources to give analysts, data scientists, and decision-makers access to data in near real-time.
For example, Galaxy places landed data, including invalid records, and its subsequent transformations into Iceberg tables. The raw data remains in object storage in its original condition, preserving its potential. Thanks to Trino’s support for ANSI-standard SQL, every transformation is discoverable and queryable.
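Because both the landed data and its transformations live in Iceberg tables, they can be queried with ordinary SQL. The sketch below uses the open-source Trino Python client; the host, catalog, schema, and table names are placeholders, and authentication is omitted for brevity.

```python
import trino  # open-source Trino Python client

conn = trino.dbapi.connect(
    host="your-trino-or-galaxy-host",  # placeholder endpoint
    port=443,
    user="analyst",
    catalog="lakehouse",               # hypothetical Iceberg catalog
    schema="clickstream",              # hypothetical schema
    http_scheme="https",
)
cur = conn.cursor()

# The raw landing table and the transformed table are both ordinary Iceberg
# tables, so both stay discoverable with standard ANSI SQL.
cur.execute("""
    SELECT count(*) AS raw_rows
    FROM raw_events                  -- hypothetical landed table
""")
print(cur.fetchall())

cur.execute("""
    SELECT event_date, count(*) AS events
    FROM events_cleaned              -- hypothetical transformed table
    GROUP BY event_date
    ORDER BY event_date DESC
    LIMIT 7
""")
print(cur.fetchall())
```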
Data ingestion tools and architectural overview: Kafka, AWS, Iceberg, Galaxy
Starburst’s data ingestion framework runs on Amazon’s AWS ecosystem. Amazon Kinesis Data Analytics hosts Apache Flink, which collects events from a Kafka data source. Flink then writes the data to an Iceberg-formatted AWS Glue table in an Amazon S3-based data lakehouse.
Flink’s exactly-once guarantees provide fault tolerance and prevent data duplication. Using Iceberg tables avoids the complexity of storing and synchronizing evolving schemas in disparate systems. Iceberg’s support for transactional operations also improves data access performance. For example, Iceberg can compact the small files that Flink sends without disrupting the lakehouse’s read workloads.
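One way to trigger such compaction is through Trino’s Iceberg connector, which exposes table maintenance as an ALTER TABLE ... EXECUTE statement. The minimal sketch below assumes the same placeholder connection details as the earlier query example; the table name and size threshold are illustrative.

```python
import trino  # connection details are placeholders, as in the earlier sketch

conn = trino.dbapi.connect(
    host="your-trino-or-galaxy-host", port=443, user="data_engineer",
    catalog="lakehouse", schema="clickstream", http_scheme="https",
)
cur = conn.cursor()

# Compaction is a transactional rewrite of small data files into larger ones,
# so readers can keep querying the table while it runs.
cur.execute(
    "ALTER TABLE events_cleaned EXECUTE optimize(file_size_threshold => '128MB')"
)
cur.fetchall()  # drive the statement to completion
```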
Starburst Galaxy enhances the Trino data lakehouse analytics engine, allowing near real-time queries of the streaming data.
A fully managed, end-to-end lakehouse platform built on open-source Trino and Iceberg
Companies like Apple, Netflix, and Shopify combine Trino and Iceberg to create a class of open data lakehouse called the Icehouse. This architecture is optimized to meet the demands of machine learning, artificial intelligence, and other data science applications. At the same time, Icehouses offer the accessibility and data variety that business intelligence analysts need to support agile decision-making.
Starburst Galaxy’s implementation of the Icehouse integrates Trino and Iceberg more deeply. Price and performance optimizations, user accessibility, and governance become more straightforward to implement and automate. Galaxy streamlines your end-to-end data analytics architecture, making ingestion more reliable and empowering users to make more informed data-driven decisions.