ACID Transactions
This introduction to ACID transactions explains how ACID properties protect data integrity and how ACID compliance has become a feature of modern data lakehouse platforms.
ACID transaction properties
The acronym ACID stands for Atomicity, Consistency, Isolation, and Durability. These four ACID properties define how database transactions should behave to ensure data remains in a consistent state even in the event of a system failure.
Atomicity
Atomicity treats an entire transaction as a single unit of work. Database transactions typically involve multiple database operations. If any of them fails, partly or completely, then the transaction’s final result may be incorrect.
Atomic transactions require the successful completion of each operational step. If part of the transaction fails, the entire transaction fails and must run again.
For example, transferring funds from one bank account to another involves subtracting the transfer amount from the first and adding it to the second. If either step fails, then at least one account balance will be wrong. An atomic transaction facing a similar operational failure would not go through at all.
Consistency
Consistency is the enforcement of business rules and data integrity constraints that govern how transactions change a database’s state. The database would become unreliable if transactions were allowed to violate these restrictions. There would be no way to tell whether any two data values were comparable.
For example, a withdrawal amount exceeding an account balance would violate a bank’s rule banning overdrafts. Consistency prevents the transaction from completing, rolling the system back to its previous state.
Isolation
Concurrent transactions, those interacting with the same data at the same time, could undermine data integrity. The ANSI/ISO SQL standard describes three situations that occur depending on the timing of two transactions’ operations:
Dirty reads: Transaction A has updated a row but has not yet committed the change when transaction B retrieves the row.
Non-repeatable reads: Transaction B retrieves a row, transaction A commits updates to the row, and transaction B retrieves the row again.
Phantom reads: Transaction B retrieves a set of rows, transaction A inserts or removes rows from that set, and transaction B retrieves the set of rows again.
ACID-compliant systems use concurrency controls to isolate transactions from each other. Lock-based controls force a new transaction to wait for the current transaction to complete. Multiversion controls use snapshot isolation, letting the new transaction act on the current state while the other transaction works. The control commits the new transaction’s changes provided there’s no conflict; otherwise, it rejects the transaction.
Database systems offer different isolation levels that balance isolation and performance. These levels dictate how read and write operations occur and whether to allow phantom, non-repeatable, or dirty reads.
Durability
Durability simply means that once a transaction commits its changes, those changes become part of the database’s permanent record, even in the event of a power outage or other system failures. Database systems usually achieve durability by moving in-memory data to non-volatile storage.
How ACID transactions work in analytical databases
ACID compliance is a feature of online transactional processing (OLTP) systems as well as database management systems (DBMS) like MongoDB, MySQL, or Oracle Database. However, these systems store data in row-based schemas that are not conducive to query efficiency. This performance issue was one of the drivers of analytical databases like data warehouses. By using columnar schemas, these systems could read data quickly.
Since data warehouses evolved from on-premises databases, they inherited ACID properties. Even with new, cloud-based methods for managing enterprise data, data warehouse vendors could promote the advantages of ACID compliance.
Data lake vs data lakehouse
Data warehouses have not kept up with the expanding diversity and volumes of data enterprises need to power increasingly sophisticated analytics. Open-source technologies and commodity-priced cloud storage services like AWS let data lakes provide cost-effective alternatives to proprietary warehouses.
Still, data lakes by themselves cannot match a warehouse’s data management and analytics capabilities, leading to complex and expensive workarounds that often involve layering one or more data warehouses between the data lake and users.
Data lakehouses solve this problem by combining the full capabilities of a data warehouse with a data lake’s cost-effective scalability.
Related reading: Data warehouse architecture vs lake architecture vs lakehouse architecture
Open table formats: Why are ACID transactions a good thing to have?
Open source file formats like Apache Parquet and table formats like Apache Iceberg let companies build performant, scalable data lakehouse analytics platforms with data portability not found in proprietary warehouse solutions.
Modern open table formats support ACID properties, allowing query engines to add, remove, or change tables without compromising data integrity.
Transactional support and ACID transcation compliance
ACID compliance lets a data lakehouse support use cases beyond pure analytics. The lakehouse won’t replace OLTP systems. However, they can replace database systems used in decision-making processes subject to audits where data consistency, accuracy, and integrity are essential.
Icehouse architecture
The Icehouse architecture is one reason open data lakehouses are on the upswing. It combines two open technologies, Iceberg and Trino, to create ACID-compliant data lakehouses that go beyond analytics to support mission-critical data management and decision-making processes.
Trino
Trino is a massively parallel processing query engine that can access petabyte-scale datasets from distributed sources. Originally designed for petabyte-scale interactive data analytics, Trino’s support for ANSI-standard SQL and dozens of connectors to enterprise data sources have made the query engine an essential data management resource.
Iceberg
Apache Iceberg’s metadata layer turns a data lake into an ACID-compliant data lakehouse. For example, Iceberg creates snapshots of data tables that isolate concurrent users. This open table format also provides schema evolution, time travel, and performance features usually associated with a traditional DBMS.