What’s the difference between Apache Parquet and Avro?
Cindy Ng
Sr. Manager, Content
Starburst
The modern data lakehouse combines Apache Iceberg’s open table format, Trino’s open-source SQL query engine, and commodity object storage. Open file formats also influence the performance of big data processing systems. This guide will introduce two open file formats, Apache Avro and Apache Parquet, and explain their roles in petabyte-scale data processing, particularly with respect to their integration with Trino and Iceberg.
What is Avro?
Apache Avro is a data serialization format that efficiently stores record data generated by transactional and streaming data platforms.
Avro file and file format
With Hadoop’s release in 2006, enterprises had an open-source alternative to expensive, proprietary data platforms. Companies soon tried to extend the Hadoop ecosystem to handle transactional data, but doing so required a better file format than Hadoop’s Row-Columnar File (RCFile) or its replacement, Optimized Row Columnar (ORC). Released by the Hadoop project in 2009, Avro has since become an Apache Software Foundation top-level project.
Row-oriented file formats
Transactional systems process large volumes of individual records and must write and read entire records in real time. Row-oriented file formats write all of a record’s data elements to the same location, allowing sequential reads for rapid retrieval. Row-based storage also makes writes more efficient, since each new record is appended as a single contiguous unit.
Apache Avro framework
Avro files have two components: a file header and at least one data block. Although some use cases benefit from JSON encoding, Avro data blocks typically use binary encoding for performance and efficiency. Binary encoding strips away redundant information already recorded in the file header, such as field names and record separators.
Headers use JSON to define the Avro file’s metadata, including its schema definition. Embedding the schema makes each Avro file self-describing and offers several benefits. For example, it streamlines schema evolution. Since the old schema is part of the Avro file, its field names are always available for resolving differences with the new schema.
Avro schemas define both primitive and complex data types. Because this type information lives in the header, it does not have to be repeated in every record, significantly reducing the amount of data stored in the blocks.
In addition, because schemas are defined in JSON rather than a language-specific interface definition, Avro libraries exist for many programming languages, including Java, Python, and Ruby, and reading or writing Avro files does not require code generation.
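To make this concrete, the short Python sketch below (assuming the fastavro package) writes and reads an Avro file. The JSON schema is embedded in the file header, so the reader needs no schema at all; the Purchase record and its fields are made up for illustration.

```python
from fastavro import parse_schema, reader, writer

# JSON schema that will be embedded in the Avro file header.
schema = parse_schema({
    "type": "record",
    "name": "Purchase",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "region", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
})

records = [
    {"id": 1, "region": "us-east", "amount": 19.99},
    {"id": 2, "region": "eu-west", "amount": 5.00},
]

# Write: the schema goes into the header, and the records are binary-encoded
# into data blocks without repeating field names or separators.
with open("purchases.avro", "wb") as out:
    writer(out, schema, records)

# Read: the file is self-describing, so no schema has to be supplied.
with open("purchases.avro", "rb") as f:
    for record in reader(f):
        print(record)
```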
Avro use cases
As a row-based format, Avro is the best choice for transactional record-processing applications as well as streaming systems like Kafka. It stores data more efficiently than text-based files like JSON or CSV and is easier to integrate into high-throughput transactional workflows.
What is Apache Parquet?
Apache Parquet is a column-oriented format that accelerates query performance and uses column-specific data compression and encoding to reduce storage consumption.
Parquet file and file format
RCFile, and later ORC, offered several optimizations for Hive-based data processing platforms. In 2013, developers at Twitter and Cloudera released Parquet as a more efficient, platform-agnostic column-oriented file format. It has since become an Apache top-level project.
Column-oriented file formats
Analytics systems differ from transactional systems in several ways. For instance, write operations matter less than read operations: a data warehouse writes data once at ingestion, and queries then read that data many times, whether every month, every day, or every minute. Furthermore, analytical queries typically process one column at a time and often never touch the other columns. Since fields matter more than whole records, analytics systems need a better-suited file format.
Rather than storing the data elements in each record in one place, columnar storage formats write all data elements from a given column together. Queries can quickly retrieve the data they need without touching other columns.
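As a rough, library-free illustration of the difference, the Python sketch below lays the same three records out row-wise and column-wise; the field names and values are invented for the example.

```python
# Three records, as a transactional system would produce them.
records = [
    {"id": 1, "region": "us-east", "amount": 19.99},
    {"id": 2, "region": "eu-west", "amount": 5.00},
    {"id": 3, "region": "us-east", "amount": 42.50},
]

# Row-oriented layout (Avro-style): each record's fields stay together.
row_layout = records

# Column-oriented layout (Parquet-style): each column's values stay together.
column_layout = {
    "id": [r["id"] for r in records],
    "region": [r["region"] for r in records],
    "amount": [r["amount"] for r in records],
}

# An aggregation over one field touches only that column here ...
total = sum(column_layout["amount"])

# ... while the row layout forces a scan across every whole record.
total_from_rows = sum(r["amount"] for r in records)
```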
Apache Parquet framework
By storing columnar data together, Parquet files can apply type-specific encoding and compression formats to each column. As a result, Parquet files make the most efficient use of a storage system, whether on-premises or in the cloud.
Parquet files support a number of features to streamline data analytics. For example, queries can read Parquet’s metadata and skip irrelevant data using predicate pushdown. Parquet also supports schema evolution: columns can be added or modified without rewriting existing data.
The Parquet format supports many big data processing frameworks and programming languages. Besides Hive, Apache Impala, Apache Spark, and Trino are popular processing frameworks that handle Parquet files. Developers can process Parquet files using Python, Java, C++, and other languages.
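The hedged sketch below uses the pyarrow library (one common choice, not the only one) to write a small Parquet file and read it back with column projection and a pushdown-style filter; the table and column names are illustrative.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "id": [1, 2, 3],
    "region": ["us-east", "eu-west", "us-east"],
    "amount": [19.99, 5.00, 42.50],
})

# Write a Parquet file; pyarrow compresses columns with Snappy by default.
pq.write_table(table, "purchases.parquet")

# Column projection: read only the column the query actually needs.
amounts = pq.read_table("purchases.parquet", columns=["amount"])

# Predicate pushdown: row-group statistics in the file metadata let the
# reader skip data that cannot match the filter.
us_east = pq.read_table(
    "purchases.parquet",
    filters=[("region", "=", "us-east")],
)

# The footer metadata exposes the schema and per-row-group statistics.
print(pq.ParquetFile("purchases.parquet").metadata)
```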
Parquet use cases
Parquet is widely adopted by organizations needing to store and process large datasets. Data warehouses, data lakes, and other central repositories are ideal candidates for Parquet. Parquet also makes data management more efficient by optimizing query workloads in ETL data pipelines.
Apache Parquet, Avro, and big data
The need for formats like Parquet and Avro arose from the inexorable increase in the volume, velocity, and variety of data generated in the Big Data era. Data engineering teams require file formats that are easy to process to minimize compute costs yet also make the most efficient use of their organizations’ data storage infrastructure.
These choices become more important as companies seek to modernize their data architectures and migrate away from their legacy Hadoop systems.
Avro vs Apache Parquet
Choosing between Avro and Parquet depends on which format better suits a given application. Row-based and columnar formats have distinct strengths, so the decision comes down to whether record processing or data aggregation matters more.
Shared benefits
Despite their differences, these modern file formats have a few similarities. For example, Avro and Parquet support schema evolution and work with multiple programming languages and data processing architectures.
While both compress their data using algorithms like Snappy, Avro’s rows can contain many different data types, which limits the potential gains. Since Parquet applies algorithms specific to the data type in each column, it can squeeze more data into the same storage space.
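As a small sketch of that idea, pyarrow (assumed here) lets a writer choose a codec and dictionary encoding per column, so each column can be stored in the way that suits its data; the column names and codec choices are only examples.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "event_id": [1, 2, 3, 4],
    "payload": ["alpha", "beta", "alpha", "beta"],
})

# Per-column codecs, plus dictionary encoding for the repetitive column.
pq.write_table(
    table,
    "events.parquet",
    compression={"event_id": "snappy", "payload": "zstd"},
    use_dictionary=["payload"],
)
```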
When to choose Avro
Avro’s row-oriented files are the best choice for online transactional processing (OLTP) systems that process entire records and need balanced read and write performance.
The self-describing file format excels in these dynamic environments where schema changes are more common.
When to choose Parquet
Parquet excels as a file format for data analytics platforms. Its column-based structure lets queries return results quickly without having to scan through irrelevant data. The file format’s efficient approach to data compression makes Parquet particularly useful for on-premises data warehouses where storage infrastructure cannot scale affordably.
Integration with Trino and Iceberg
Avro and Parquet also integrate with Trino and Iceberg to play a role in the open data lakehouse. This relatively recent architecture brings the analytics performance of a data warehouse to the cloud scalability of a data lake. Unlike these legacy architectures, the data lakehouse does not attempt to be a central repository for all enterprise data. Instead, it consolidates just the critical, frequently accessed data in object storage. Trino’s connectors abstract all other data sources to create a virtual access layer and make data architectures accessible from a single interface.
Faster SQL queries
Trino queries use ANSI-standard SQL, promoting accessibility so anyone can analyze data from the command line, in code, or through their preferred business intelligence application.
Trino leverages the metadata stored within data files to accelerate queries further. Dynamic filtering, for example, reduces the amount of irrelevant data a pushdown query returns, thereby improving query performance, reducing network traffic, and easing the query workloads on the data source.
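A minimal sketch using the trino Python client; the host, user, catalog, schema, and table names are placeholders rather than a recommended setup.

```python
import trino

# Connect to a Trino cluster (connection details are illustrative).
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=8080,
    user="analyst",
    catalog="iceberg",
    schema="sales",
)
cur = conn.cursor()

# The same ANSI-standard SQL works from the CLI, from code, or from a
# BI tool pointed at the same cluster.
cur.execute(
    """
    SELECT region, sum(amount) AS total
    FROM purchases
    WHERE order_date >= DATE '2024-01-01'
    GROUP BY region
    """
)
for region, total in cur.fetchall():
    print(region, total)
```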
Streamlined data pipelines
Engineers can use Trino to build their data pipelines and ingest datasets in row or column formats.
When building ingestion pipelines from streaming sources, the incoming datasets are often in row-oriented formats like Avro. Trino’s Kafka connector applies an Avro schema to serialize and deserialize these row-based messages.
When ingesting data from columnar data sources, Trino’s connectors to Delta Lake, Hudi, and Hive let queries access data written in the Parquet format.
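A hedged sketch of one such pipeline step, again through the trino Python client: a single CREATE TABLE AS SELECT materializes Avro-decoded rows exposed by the Kafka connector into a Parquet-backed Hive table. The catalog, schema, table, and column names are assumptions for illustration.

```python
import trino

conn = trino.dbapi.connect(
    host="trino.example.com", port=8080, user="etl", catalog="hive", schema="lake"
)
cur = conn.cursor()

# Read row-oriented records from a Kafka-connector table and land them
# as columnar Parquet files in a Hive-connector table.
cur.execute(
    """
    CREATE TABLE hive.lake.clickstream
    WITH (format = 'PARQUET')
    AS SELECT user_id, page, event_time
    FROM kafka.default.clickstream_topic
    """
)
cur.fetchall()  # drain the result so the statement runs to completion
```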
Integration with Apache Iceberg
Iceberg is more flexible than other table formats, supporting Avro, ORC, and Parquet. As engineers design their open data lakehouses, they can choose the most appropriate format for each dataset.
As mentioned earlier, data ingested from streaming sources often arrives as row-oriented Avro files. An Iceberg table can store this raw dataset directly in the lakehouse without transforming it into a more structured format, and Avro’s schemas remain directly accessible through Trino.
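For instance, here is a hedged sketch of keeping that raw streaming data in Avro inside an Iceberg table created through the trino Python client; the names and column types are illustrative.

```python
import trino

conn = trino.dbapi.connect(
    host="trino.example.com", port=8080, user="etl", catalog="iceberg", schema="lake"
)
cur = conn.cursor()

# Iceberg's format table property accepts AVRO as well as PARQUET and ORC,
# so raw row-oriented data can be stored as-is.
cur.execute(
    """
    CREATE TABLE iceberg.lake.raw_events (
        user_id BIGINT,
        payload VARCHAR,
        event_time TIMESTAMP(6)
    )
    WITH (format = 'AVRO')
    """
)
cur.fetchall()
```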
For most other use cases, Parquet is the preferred file format. Starburst’s Trino-based open data lakehouse analytics platform has an enhanced Parquet reader that provides a 20% boost to read performance over Trino. We also extend Trino’s dynamic filtering to the row level to reduce data processing workloads further.