The scale and speed of business today depend on automated decision-making in e-commerce, compliance, cybersecurity, and other functions. Change data capture (CDC) is how these systems get the real-time transactional data they need. This guide introduces CDC and explains how to integrate it into a modern data analytics platform.
How CDC helps data engineers
Data engineers use CDC to manage their data integration processes more efficiently and support more responsive decision-making. Some examples of CDC’s benefits include:
Eliminating analytics friction
Advanced analytics once played a primarily strategic role, using insights teased from large historical datasets to guide a company’s long-term growth. Today’s data-driven decision-making cultures apply analytics in near real-time.
Traditional data management practices can’t keep up. Overnight data pipeline batches take hours to complete. CDC makes data accessible instantly without compromising data quality, consistency, or accuracy.
Optimizing data resources
Batch data replication is resource-intensive and expensive. For example, copying datasets from on-premises database systems to cloud data stores congests on-prem networks and generates costly data transfer fees. Processing and storage costs also go up. Yet most of the source data hasn’t changed from one batch run to the next.
CDC limits data processing and motion to changes in the source database. Ingestion pipelines only handle the new data, which significantly reduces resource utilization and costs.
Minimizing operational impacts
Transactional and analytics data platforms have distinct performance criteria requiring different optimizations. Engineers must minimize the burden analytics systems place on operational systems. For instance, batch processes can lock source tables, making them unavailable for new transactions until the batch run ends.
Change data capture’s incremental processing transfers changes as they happen with minimal impact on data sources.
Streamlining near real-time data analytics
Machine learning algorithms and other automated decision-making systems depend on having the right data in the right place at the right time. Supporting these time-sensitive applications requires ingesting data from real-time sources, which simply isn’t possible with traditional batch processing.
CDC lets engineers tap into the real-time data streams of IoT devices, e-commerce systems, and other operational sources.
CDC use cases
The speed of business today requires responsive decision-making that increasingly happens in real time. In the past, this would have required armies of analysts and considerable expense. Now, advanced analytics lets companies make routine decisions automatically, giving decision-makers the bandwidth to focus on higher priorities.
Change data capture makes this new mode of decision-making possible by supporting near real-time use cases such as:
Internet of Things (IoT) devices
Enterprise IoT devices generate vast quantities of streaming data. To avoid processing redundant data, CDC workflows filter IoT streams so that only the changes pass to analytics and decision-making systems.
Compliance and fraud detection
Change data capture lets companies quickly collect and evaluate financial, network, and other transaction log files. By spotting changes quickly, administrators can enforce internal governance policies and ensure regulatory compliance.
Artificial intelligence and machine learning
Too much data flows through the modern enterprise for people to handle without help from artificial intelligence and machine learning algorithms. These systems can instantly react to nuanced data patterns — but only when data arrives in real time through low-latency CDC workflows.
Customer 360
Developing a 360-degree view of consumers lets today’s omnichannel retailers fully understand customer behavior to maximize sales. Change data capture lets marketing departments deliver individually tailored promotions in near real-time to increase conversion rates and improve the customer experience.
ETL data pipelines and CDC with data warehouses
Data teams have traditionally relied upon extract, transform, load (ETL) data pipelines to update data warehouses. These processes extract data from the source database into a staging area, transform the data to meet warehouse schema and other requirements, and finally load the transformed data into the warehouse.
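As a rough sketch of that pattern, a nightly batch might look something like the SQL below. Every schema, table, and column name here is hypothetical, and the extract step is reduced to a simple copy for brevity:

```sql
-- Hypothetical nightly ETL batch: reload the full source table into a
-- staging area, then transform it to fit the warehouse schema.
TRUNCATE TABLE staging.orders;

INSERT INTO staging.orders
SELECT * FROM source.orders;          -- extract: full copy, changed or not

INSERT INTO warehouse.fact_orders (order_id, customer_id, order_total, order_date)
SELECT order_id,
       customer_id,
       CAST(total_amount AS DECIMAL(12, 2)),   -- transform: conform types
       CAST(created_at AS DATE)
FROM staging.orders;                           -- load into the warehouse
```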
Ingestion pipelines are time-consuming and resource-intensive, so they tend to run during periods of low demand as overnight and weekend batches to minimize impacts on other applications.
Traditional ETL batch processing is too high-latency to support the real-time demand for analytics, so data teams must choose from a number of CDC design patterns to keep data warehouses current around the clock.
Trigger-based CDC
Inserted, updated, or deleted records in a source database trigger procedures that copy changes to the database’s change table. Target systems then extract data from the change table. Trigger-based CDC captures any kind of change, but adds overhead to the source database since each database table needs its own triggers. In addition, running database triggers increases the number of writes, which may undermine operational performance.
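For illustration, a trigger-based setup in PostgreSQL might look roughly like the sketch below. The customers table, its id column, and the change-table layout are all hypothetical:

```sql
-- Hypothetical change table that downstream systems poll for new rows.
CREATE TABLE customers_changes (
    change_id   BIGSERIAL PRIMARY KEY,
    operation   CHAR(1)     NOT NULL,              -- 'I', 'U', or 'D'
    customer_id BIGINT      NOT NULL,
    changed_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Trigger function that records each change made to the source table.
CREATE OR REPLACE FUNCTION record_customer_change() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO customers_changes (operation, customer_id) VALUES ('D', OLD.id);
    ELSIF TG_OP = 'UPDATE' THEN
        INSERT INTO customers_changes (operation, customer_id) VALUES ('U', NEW.id);
    ELSE
        INSERT INTO customers_changes (operation, customer_id) VALUES ('I', NEW.id);
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customers_cdc
AFTER INSERT OR UPDATE OR DELETE ON customers
FOR EACH ROW EXECUTE FUNCTION record_customer_change();
```

Note that the extra insert into customers_changes happens inside every transaction that touches customers, which is exactly the write overhead described above.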
Log-based CDC
Databases record insertions, updates, and deletions as they happen in the transaction logs they keep for crash recovery. Log-based CDC reads those log files to capture the same changes as trigger-based CDC without impacting database performance. However, database vendors use proprietary log formats that add development overhead.
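As an example of the underlying mechanism, PostgreSQL exposes its write-ahead log through logical decoding. The slot name below is hypothetical, and the server must be configured with wal_level = logical:

```sql
-- Create a logical replication slot that tracks changes in the
-- write-ahead log (uses the built-in test_decoding output plugin).
SELECT pg_create_logical_replication_slot('cdc_slot', 'test_decoding');

-- Each call returns the inserts, updates, and deletes recorded since
-- the last read, without touching the source tables themselves.
SELECT lsn, xid, data
FROM pg_logical_slot_get_changes('cdc_slot', NULL, NULL);
```

Tools such as Debezium (discussed later) wrap this kind of log reading so teams don’t have to parse vendor-specific formats themselves.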
Timestamp-based CDC
With timestamp-based CDC, the source database uses table schemas that include a column to record the timestamp of the last change. This column is queryable, letting the CDC system discover and extract changes made since the last query. However, the source system takes a performance hit since the CDC query must scan every row. In addition, timestamp-based CDC may not catch all deletes without additional error-checking routines.
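A minimal sketch of the pattern, with hypothetical table and column names, looks like this; the watermark value would normally come from the previous run’s bookkeeping rather than a literal:

```sql
-- Pull every row modified since the last successful extraction.
SELECT *
FROM orders
WHERE last_updated_at > TIMESTAMP '2024-01-15 02:00:00'  -- previous watermark
ORDER BY last_updated_at;
```

Rows deleted since the last run simply never match this filter, which is why deletes need separate handling.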
CDC challenges with data lakes
Among the many advantages data lakes have over warehouses is their ability to store large quantities of raw data regardless of type. As a result, ingestion is much simpler: there’s no need for transformations to conform data to a preset schema. An extract, load, transform (ELT) framework leaves data as-is until it’s needed, applying only the transformations a specific use case requires.
However, data lakes run on immutable cloud object storage services like Microsoft Azure Blob Storage and Amazon S3. Files can’t be updated in place, only replaced or appended. Feeding in a constant stream of changes would force the data lake to rewrite its tables continually.
CDC with open data lakehouses
The growing popularity of open data lakehouse architectures makes ingesting the real-time changes captured by CDC workflows much easier. Open data lakehouses combine the affordability of cloud object storage with Trino’s open SQL query engine and Iceberg’s open table format.
Open-source Trino query engine
Trino is a massively parallel SQL query engine designed for running interactive queries at scale. Unlike the walled garden of a proprietary data warehouse, Trino’s connectors federate multiple data sources, from data lakes to relational databases, to create a distributed analytics platform. Trino has evolved beyond ad hoc analytics with features like fault-tolerant execution that make the query engine a powerful means of running ETL/ELT data pipelines.
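As an illustration of that federation, the query below joins an Iceberg table in the lake with a live PostgreSQL dimension table in a single statement. The catalog, schema, and table names are placeholders:

```sql
-- Federated Trino query across two catalogs: a data lake table (iceberg)
-- and an operational database table (postgresql).
SELECT c.customer_name,
       sum(o.order_total) AS lifetime_value
FROM iceberg.sales.orders AS o
JOIN postgresql.crm.customers AS c
  ON o.customer_id = c.customer_id
GROUP BY c.customer_name
ORDER BY lifetime_value DESC
LIMIT 10;
```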
Apache Iceberg open table format
Apache Iceberg tables consist of metadata files that describe where their data is stored and the table’s organization. Any change to an Iceberg table results in a new snapshot of the table’s metadata files. This metadata-based approach avoids the processing impact of object storage immutability. A table could have terabytes of data, but only the much smaller metadata files get rewritten. Iceberg’s snapshots also allow for powerful data lakehouse management features like schema evolution and ACID transactions.
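Trino’s Iceberg connector exposes this snapshot history directly through a metadata table, which makes the mechanism easy to see. The catalog, schema, and table names below are illustrative:

```sql
-- List a table's snapshots, newest first. Each row corresponds to one
-- committed change to the table's metadata.
SELECT snapshot_id, committed_at, operation
FROM iceberg.sales."orders$snapshots"
ORDER BY committed_at DESC;
```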
Apache Iceberg, Trino, and CDC
Companies can build CDC workflows that use Iceberg and Trino to bring streams of transaction logs into their data lakehouse.
In one example workflow, change data from Amazon DMS lands in S3 as staged CSV files. A dbt model incrementally processes the new CDC records from DMS into CSV update files, which are then merged into a target Iceberg table using Trino’s SQL MERGE capability, as sketched below.
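A sketch of that final merge step follows. The catalog, schema, table, and column names are hypothetical, as is the convention of an op column carrying 'I', 'U', or 'D' codes from the staged DMS files:

```sql
-- Apply staged change records to the target Iceberg table in one pass.
MERGE INTO iceberg.lake.customers AS t
USING hive.staging.customer_changes AS c
  ON t.customer_id = c.customer_id
WHEN MATCHED AND c.op = 'D' THEN
    DELETE
WHEN MATCHED THEN
    UPDATE SET customer_name = c.customer_name,
               email         = c.email
WHEN NOT MATCHED AND c.op <> 'D' THEN
    INSERT (customer_id, customer_name, email)
    VALUES (c.customer_id, c.customer_name, c.email);
```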
This workflow is made possible by Trino’s Iceberg connector, which enables full data manipulation language (DML) support for updates, deletes, and merges, significantly simplifying CDC ingestion pipelines. In addition, Iceberg’s optimization capabilities can compact the many small files CDC workflows generate into larger files for more efficient querying.
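For example, the connector’s optimize procedure can be run periodically to compact those small files; the table name and threshold shown are just illustrations:

```sql
-- Rewrite files smaller than the threshold into larger ones so queries
-- read fewer, bigger files.
ALTER TABLE iceberg.lake.customers
EXECUTE optimize(file_size_threshold => '128MB');
```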
CDC into data lakehouses with Starburst
Another option is to use Debezium, Apache Kafka, and Starburst to bring CDC data into the data lakehouse from an operational source like PostgreSQL. Debezium provides real-time detection and capture of changes logged by the PostgreSQL database, transmitting those changes to a Kafka topic. In turn, Kafka lands a series of change files in S3’s object storage. Starburst’s schema discovery creates an Iceberg table based on these files, allowing the database changes to be merged into the data lakehouse.
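Before merging, CDC pipelines typically collapse the stream down to the latest event per key so the target table sees one row per record. A sketch over a hypothetical schema-discovered table of Debezium events (whose op field uses codes such as 'c' for create, 'u' for update, and 'd' for delete) might look like this:

```sql
-- Keep only the most recent change event for each customer before the
-- results are merged into the target Iceberg table.
SELECT customer_id, op, customer_name, email
FROM (
    SELECT *,
           row_number() OVER (
               PARTITION BY customer_id
               ORDER BY event_ts DESC
           ) AS rn
    FROM iceberg.landing.customer_change_events
) AS latest
WHERE rn = 1;
```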
Starburst streaming ingest
Starburst streamlines CDC even further with Starburst Galaxy’s fully managed streaming ingestion solution. Rather than writing custom code to handle orchestration and schema evolution, engineers can set up a CDC ingestion workflow in three steps through the Starburst user interface.
Streaming ingestion continuously reads the Kafka topic’s messages and writes the data to an Iceberg table in the data lakehouse with full ACID compliance. This latest enhancement to our data lakehouse analytics platform alleviates the challenges of maintaining CDC pipelines while making CDC data available for analytics in near real-time.