It’s time to talk about data pipelines, specifically data ingestion best practices. Typically, a data engineering pipeline has three stages:
- Data Ingestion
- Data Transformation
- Data Consumption
Each stage serves a specific purpose. As the first step in any data pipeline, data ingestion is arguably one of the most important, establishing the initial flow of data into the pipeline.
So what is the best way to optimize data ingestion in your data pipeline?
This article explains what data ingestion is and the best practices surrounding it. Throughout, I’ll use an Icehouse architecture as the framework for putting those practices into action.
What is data ingestion?
Data ingestion is a process that collects data from source systems and lands that data in target systems either in batches or through a streaming process in near real-time.
To achieve this, a data pipeline needs to copy data from a source database (typically an OLTP system or other RDBMS) into either a data lake or cloud data lakehouse. The Icehouse is a particular type of lakehouse that uses Trino and Apache Iceberg (more on that later). This initial landing area is known as the bronze zone and marks the first stage of the data pipeline. From there, data moves through transformation into the silver and gold stages.
Although data ingestion might sound as simple as copying the data from one system to another, in reality, it is a critical stage in the data pipeline that establishes the structure and validity of data needed for downstream data transformations. Without it, data can’t enter your data lakehouse from its designated source system.
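To make that concrete, here is a minimal sketch of what landing data in the bronze zone can look like using the Trino Python client (`trino` on PyPI). The host, catalog, schema, and table names are hypothetical placeholders, not a prescription.

```python
import trino

# Connect to a Trino coordinator (hypothetical host and credentials).
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=8080,
    user="ingest",
    catalog="iceberg",
    schema="bronze",
)
cur = conn.cursor()

# Federated copy: read from a source RDBMS catalog (here, Postgres) and
# land the rows in an Iceberg bronze table in a single statement.
cur.execute("""
    INSERT INTO iceberg.bronze.orders
    SELECT * FROM postgres.public.orders
""")
cur.fetchall()  # drain the result so the statement runs to completion
```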
Types of data ingestion: batch ingestion vs. streaming ingestion vs. change data capture
Data ingestion comes in three main forms: batch ingestion, streaming ingestion, and change data capture. In production, each method is often used for a different purpose.
Batch ingestion
Batch ingestion is the process of collecting and transferring data at scheduled intervals. Batch ingestion has been the mainstay of data ingestion for many years due to its simplicity and efficiency in handling large volumes of data.
In batch ingestion, data is typically gathered, processed, and loaded in bulk. This can simplify the management of data workflows and reduce the load on systems during off-peak hours.
Compared to streaming ingestion, batch ingestion is the simpler solution, making it ideal for scenarios where real-time data processing is not necessary. It is particularly useful for:
- Periodic reporting
- Data backups
- Historical data analysis
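In practice, a batch job can simply ingest a fixed time window on a schedule. The sketch below loads only the previous day’s rows through the Trino Python client; the table and column names are hypothetical, and in production the call would be triggered by cron or an orchestrator.

```python
from datetime import date, timedelta

import trino

def ingest_daily_batch(conn, day: date) -> None:
    """Load one day's worth of source rows into the bronze zone."""
    cur = conn.cursor()
    # A real job would use proper parameter binding; a literal keeps the sketch short.
    cur.execute(f"""
        INSERT INTO iceberg.bronze.orders
        SELECT * FROM postgres.public.orders
        WHERE order_date = DATE '{day.isoformat()}'
    """)
    cur.fetchall()

conn = trino.dbapi.connect(host="trino.example.com", port=8080,
                           user="ingest", catalog="iceberg", schema="bronze")
ingest_daily_batch(conn, date.today() - timedelta(days=1))
```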
Streaming ingestion
Streaming ingestion involves the continuous collection and transfer of data as it is generated. Instead of waiting for data to build up to be transferred as a batch, data is sent almost as soon as it is ready.
Because this method is more complex to implement, it is typically employed only when necessary: in scenarios where capturing fast changes in the dataset is important to downstream data consumers.
This includes:
- Near real-time monitoring and alerting
- Live data feeds
- Event-driven applications
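Here is a minimal streaming sketch using the kafka-python client: events are consumed as they arrive and flushed to the bronze zone in small micro-batches. The topic name, broker address, and the `write_to_bronze()` helper are hypothetical.

```python
import json
import time

from kafka import KafkaConsumer  # kafka-python

def write_to_bronze(rows):
    """Hypothetical sink that appends rows to an Iceberg bronze table."""
    ...

consumer = KafkaConsumer(
    "orders",                                      # hypothetical topic
    bootstrap_servers=["kafka.example.com:9092"],  # hypothetical broker
    value_deserializer=lambda v: json.loads(v),
)

buffer, last_flush = [], time.time()
for message in consumer:
    buffer.append(message.value)
    # Flush on size or age so data lands within seconds rather than hours.
    if len(buffer) >= 500 or time.time() - last_flush > 10:
        write_to_bronze(buffer)
        buffer, last_flush = [], time.time()
```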
Change data capture
A third option is change data capture (CDC). This approach monitors the source dataset and records inserts, updates, and deletes as they occur. When a certain threshold of change has been reached, these updates are sent from the source system to the analytical system for ingestion, in our case the Icehouse.
Change data capture helps ensure that modifications to the source dataset are reflected downstream through an efficient method of data movement. Like streaming, this approach is particularly important for high-velocity data.
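Production CDC tools are often log-based, but a simplified polling sketch illustrates the core idea described above: track a watermark, collect changed rows, and ship them once the threshold is reached. All names and helpers below are hypothetical.

```python
import time
from datetime import datetime

def fetch_changes_since(watermark: datetime) -> list[dict]:
    """Hypothetical: return source rows modified after the watermark."""
    ...

def apply_to_bronze(changes: list[dict]) -> None:
    """Hypothetical: merge the changed rows into the Iceberg bronze table."""
    ...

CHANGE_THRESHOLD = 1000  # ship changes once enough have accumulated

watermark = datetime(2024, 1, 1)
pending: list[dict] = []

while True:
    changes = fetch_changes_since(watermark) or []
    if changes:
        pending.extend(changes)
        watermark = max(row["updated_at"] for row in changes)
    if len(pending) >= CHANGE_THRESHOLD:
        apply_to_bronze(pending)
        pending = []
    time.sleep(30)  # poll the source periodically
```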
3 data ingestion best practices
What’s the best way to put all of this into production? To ensure efficient data ingestion to your own data lakehouse, try implementing these three best practices for success.
1. Use Apache Iceberg
Apache Iceberg is an open data lakehouse table format that has recently been adopted as the de facto standard in the lakehouse industry, following the dual announcements of Snowflake’s Polaris catalog and Databricks’ acquisition of Tabular.
Although Iceberg competes directly with Delta Lake and Hudi (the other data lakehouse table formats) and with the older Hive data lake table format, it has significant advantages over each of these alternatives, as well as over traditional data warehouses.
Overall, Apache Iceberg includes two key differences that allow for significant benefits:
| Apache Iceberg difference | Benefit |
| --- | --- |
| Operates best on inexpensive cloud object storage of your choice | Significant cost savings and improved flexibility compared to data warehouses |
| Collects enhanced metadata compared to Hive | Features and functionality more akin to a data warehouse, including time travel, schema evolution, and ACID compliance |
For many users, Apache Iceberg offers the combination of inexpensive, flexible storage with feature enhancements, making it a natural fit for most datasets. Ingesting into Iceberg also allows you to take advantage of these features all the way through the data pipeline.
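As one example of those warehouse-like features, the sketch below queries an Iceberg table as it existed at an earlier point in time (time travel) through the Trino Python client. It assumes a Trino version whose Iceberg connector supports the `FOR TIMESTAMP AS OF` syntax; the table name is hypothetical.

```python
import trino

conn = trino.dbapi.connect(host="trino.example.com", port=8080,
                           user="analyst", catalog="iceberg", schema="bronze")
cur = conn.cursor()

# Read the table as of an earlier point in time (Iceberg time travel).
cur.execute("""
    SELECT count(*)
    FROM orders FOR TIMESTAMP AS OF TIMESTAMP '2024-06-01 00:00:00 UTC'
""")
print(cur.fetchone())
```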
2. Use a versatile workload configuration
A good data ingestion strategy allows for multiple ingestion pathways, including:
- Batch
- Streaming
- Change data capture
By approaching data ingestion in a way that allows for all three forms of ingestion, you leave your pipeline open to adaptation and growth. Low-velocity data can be ingested through batch processes, while high-velocity data can be streamed or captured as changes occur.
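One way to keep all three pathways open is to make the pathway part of each source’s configuration and dispatch accordingly. The layout below is a hypothetical sketch; the source names and handler functions stand in for the batch, streaming, and CDC jobs sketched earlier.

```python
# Placeholder handlers standing in for the batch, streaming, and CDC jobs above.
def run_batch_ingestion(source: dict) -> None: ...
def run_streaming_ingestion(source: dict) -> None: ...
def run_cdc_ingestion(source: dict) -> None: ...

# Each source declares the pathway that matches its velocity.
INGESTION_CONFIG = {
    "monthly_finance_extract": {"pathway": "batch", "schedule": "0 2 1 * *"},
    "clickstream_events":      {"pathway": "streaming", "topic": "clicks"},
    "customer_accounts":       {"pathway": "cdc", "watermark_column": "updated_at"},
}

HANDLERS = {
    "batch": run_batch_ingestion,
    "streaming": run_streaming_ingestion,
    "cdc": run_cdc_ingestion,
}

def ingest(source_name: str) -> None:
    source = INGESTION_CONFIG[source_name]
    HANDLERS[source["pathway"]](source)

ingest("clickstream_events")
```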
3. Perform data quality checks
Data quality checks help ensure that the data being ingested is accurate, consistent, and reliable. This is an important part of the ingestion process, where data will be validated against predefined criteria for completeness, accuracy, timeliness, and consistency.
This process helps detect and correct errors such as missing values, duplicates, and anomalies before they impact downstream analytics. Implementing robust data quality checks during ingestion ensures that only high-quality data is stored and processed, enhancing the overall integrity of your data pipeline and enabling better decision-making and insights.
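As a concrete illustration, here is a lightweight sketch of such checks run against a freshly ingested batch before it is promoted downstream. The rules and column names are hypothetical; the same checks could equally be expressed as SQL against the bronze table.

```python
def check_batch(rows: list[dict]) -> list[str]:
    """Return a list of data quality problems found in an ingested batch."""
    errors = []
    # Completeness: required keys must be present and non-null.
    for i, row in enumerate(rows):
        if row.get("order_id") is None:
            errors.append(f"row {i}: missing order_id")
    # Uniqueness: no duplicate primary keys.
    ids = [row["order_id"] for row in rows if row.get("order_id") is not None]
    if len(ids) != len(set(ids)):
        errors.append("duplicate order_id values found")
    # Validity: amounts should never be negative.
    for i, row in enumerate(rows):
        if row.get("amount", 0) < 0:
            errors.append(f"row {i}: negative amount")
    return errors

errors = check_batch([{"order_id": 1, "amount": 10.0},
                      {"order_id": 1, "amount": -5.0}])
if errors:
    print("Rejecting batch:", errors)
```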
How a Starburst Icehouse architecture supports data ingestion
The best way to achieve all three of the best practices above is to use a Starburst Icehouse architecture, which lets you combine Apache Iceberg, versatile ingestion workloads, and data quality checks.
Starburst Galaxy makes all of this possible, providing the easiest and fastest way to use Trino to discover, govern, share, and secure your data lakehouse. This approach democratizes data for all, building an open data architecture that prevents vendor lock-in.
The following video shows how Starburst handles data ingestion as part of a data Icehouse architecture.