Parquet and ORC’s many shortfalls for machine learning (ML) workloads, and what should be done about them
At the turn of the century (around a quarter of a century ago), over 99% of the data management industry used row-oriented storage for all workloads involving structured data, including transactional and analytical workloads. These systems store data from tables row-by-row, so that all data found in the same row is located together. Since each row corresponds to a real-world entity (or a relationship between entities), such as a user, employee, product, order, or other real-world "thing" or "object", storing data row-by-row keeps all attributes about that entity together (see figure below).
The transition from row-based to columnar data storage
Storing data row-by-row isn't the only option. Data can also be stored column-by-column (e.g., all names together, all addresses together, etc.). There has always been a niche segment of the database industry that sold column-oriented database systems (see Chapter 2 of one of my publications from 2013, which goes through the long history starting from the 1970s). However, these early columnar solutions were slower than row-oriented systems for the vast majority of real-world workloads.
This all changed in the early 2000s with the introduction of new approaches to column-storage via projects such as C-Store (my PhD thesis), X100 (from CWI), and other efforts in academia and industry, including:
- Vertica, which commercialized C-Store
- VectorWise, which commercialized X100
- ParAccel, which eventually turned into Redshift
Collectively, these efforts launched a major paradigm shift in which column-stores became an order of magnitude faster and thus far more useful in practice. They achieved this performance improvement by leveraging the reduced entropy of column data (which all comes from the same attribute domain) to get better compression ratios. To compound this benefit, they introduced techniques that enable direct operation on compressed data, so that not only are the compression ratios better, but the data also never has to be decompressed during query processing. Furthermore, they leveraged SIMD instructions to operate on multiple column values in a single CPU step. Beyond this, there were many other important performance contributions, perhaps the most important of which was the use of shared-nothing architectures to partition column data across a cluster of machines, thereby improving the parallel processing of data at query time.
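To make the idea of operating directly on compressed data concrete, here is a minimal illustrative sketch in Python. It is not how any of the systems above actually implement the technique; it simply shows a low-entropy column being run-length encoded and an aggregation being answered by scanning the (value, run length) pairs without ever decompressing them.

```python
from itertools import groupby

def rle_encode(column):
    """Collapse consecutive repeated values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

def count_matching(encoded, predicate):
    """Count qualifying rows by scanning runs rather than individual values."""
    return sum(run_length for value, run_length in encoded if predicate(value))

# A sorted, low-entropy "country" column collapses into a handful of runs.
country_column = ["DE"] * 4 + ["FR"] * 3 + ["US"] * 5
encoded = rle_encode(country_column)                 # [("DE", 4), ("FR", 3), ("US", 5)]
print(count_matching(encoded, lambda v: v == "US"))  # 5, computed on compressed data
```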
The rise of Apache Parquet and Apache ORC
The movement towards column-oriented data has led to an enormous shift in the data management industry over the past two decades. We have reached a point where every major database vendor now includes some form of columnar storage at least as an option (and in some cases, the only option), and although a full analysis of the columnar-database market share is beyond the scope of this piece, anecdotally, it appears that the majority of workloads that analyze large amounts of data are backed by column-oriented storage and use column-oriented processing techniques.
In 2013, two major open source Apache projects, Parquet and ORC, were released, making many of the technologies used by proprietary column-store database systems freely available to use and build upon. These projects are, at their essence, open file formats that arrange data into columns and facilitate column-by-column access, along with implementations of these formats (e.g. parquet-java). Other open source data processing systems, including many from the Hadoop, Spark, and Presto (now Trino) ecosystems, were designed to work directly with these open file formats, enabling high-performance data analysis that rivals the leading proprietary systems.
Today, it is almost impossible to find a data lake that does not contain large amounts of data stored using these open source columnar formats. Traditional SQL analysis performed over these datasets benefits from the column-oriented processing that these formats enable. In short, these Apache projects have been hugely successful and will likely continue to make a significant impact moving forward.
Areas where Apache ORC and Apache Parquet encounter shortfalls
Despite all this success, Parquet and ORC remain based on their academic forerunners, such as C-Store. All of these technologies were designed for traditional SQL data analysis, in which workloads are dominated by scanning through large amounts of data, joining it with other data, and performing sorting, grouping, and (usually simple) aggregations such as sums, averages, and counts. Meanwhile, analytical data processing workloads have become far more advanced, incorporating machine learning (ML), data classification, and feature engineering tasks that these data formats were not originally designed to target. Furthermore, new laws, such as GDPR, CCPA, and CPRA, have imposed data compliance demands (especially around data privacy) that were not envisioned when the original column-store standards were developed. We discuss below some issues that Parquet and ORC run into for these workloads and requirements.
The problem of wide and sparse columns in column-oriented file formats
Existing open column-store formats are still used heavily in the ML space, and they still provide much value. However, they often fall significantly short of the optimal performance that column-oriented storage should be able to attain, due to some of the extreme characteristics commonly found in ML contexts. For example, tables used as input to machine learning algorithms are often both extremely wide and sparse, and they often contain a broad spectrum of features in beta, experimental, active, and deprecated stages, leading to feature counts that approach 20,000, each represented by a different column in the input table. In theory, column-stores are the ideal solution for such wide tables, and indeed, this is why Parquet and ORC are used heavily for machine learning datasets. Many training tasks access 10% or fewer of the features, and column-stores are able to focus data-read time on only the necessary features.
However, there are also performance pitfalls. The leading implementations of Parquet and ORC both need to deserialize and parse metadata to find the specified columns. For extremely wide and sparse tables, this metadata read time can actually dominate query time. The figure below displays the results of an experiment we ran for a recently published CIDR paper, measuring the time it takes Parquet (parquet-java) to extract a single column from a dataset consisting of a variable number of features. Parquet's performance depends significantly on the number of feature columns, with retrieval time increasing linearly with the number of features. As tables get wider and sparser, this metadata read time becomes more significant and more problematic. These results are consistent with similar findings from other groups, and although it may be possible to alleviate these performance issues with some engineering tricks, ultimately the problem is best solved by an architectural redesign.
For example, the experiment also shows the performance of an alternative design (Bullion) that enables direct metadata access from file footers (a technique often used in other contexts), which eliminates this overhead and allows retrieval time to stay flat at less than 2ms.
These results indicate that while column-store formats such as Parquet are certainly good fits for machine learning training workloads, they would have been designed differently if they had been initially targeted to handle the extremely wide and sparse tables commonly found in such workloads.
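The effect is easy to observe even without the paper's benchmark harness. The rough sketch below, which assumes pyarrow is installed and uses file names, row counts, and column counts chosen purely for illustration, writes increasingly wide Parquet files and times how long it takes to read back a single column; the footer metadata that must be parsed grows with the total number of columns.

```python
import time
import pyarrow as pa
import pyarrow.parquet as pq

def time_single_column_read(num_features, num_rows=1_000):
    """Write a wide Parquet file, then time reading back just one column."""
    path = f"wide_{num_features}.parquet"
    table = pa.table({f"feature_{i}": [0.0] * num_rows for i in range(num_features)})
    pq.write_table(table, path)

    start = time.perf_counter()
    # Only one column is requested, but the reader must still parse footer
    # metadata whose size grows with the total number of columns in the file.
    pq.read_table(path, columns=["feature_0"])
    return time.perf_counter() - start

for width in (1_000, 5_000, 10_000, 20_000):
    print(width, round(time_single_column_read(width), 4))
```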
The need for native handling of vectors
As another example, many machine learning algorithms are designed to take vectors of numbers as input. For example, the figure below shows a feature column called “clk_seq_cids” that is represented as a vector of 256 int64 elements (stored as list<int64> in Parquet) where each element represents an ad ID clicked on by a user. This feature is used for tracking user interactions with advertising campaigns over time. Since user tables are often sorted first by user id, and then by timestamp, this particular column often exhibits a sliding window pattern where successive values within the column differ only slightly, as this vector does not change significantly between time snapshots for the same user.
Clearly, a delta encoding compression scheme that is optimized for sliding windows should be used to store this column. This would not only decrease the size of the column in storage, but would also decrease IO times to read in the column and potentially decrease CPU costs for operators that are able to operate directly on delta-compressed data without first decompressing it. Unfortunately, current implementations of delta encoding in most columnar storage formats—whether open-source, such as Parquet and ORC, or proprietary formats used in cloud data warehouses—support only standard primitive types (e.g., INT, BIGINT, DATE, TIMESTAMP, DECIMAL, etc.).
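As an illustration of the kind of vector-aware delta encoding being argued for here (this is a toy sketch, not a feature of Parquet or ORC, and real sliding-window-aware schemes would be more sophisticated), each row's vector can be stored as an element-wise delta against the previous row's vector, which is mostly zeros when successive snapshots overlap heavily:

```python
import numpy as np

def delta_encode_vectors(rows):
    """Store the first vector verbatim and every later vector as a delta."""
    rows = [np.asarray(r, dtype=np.int64) for r in rows]
    return [rows[0]] + [curr - prev for prev, curr in zip(rows, rows[1:])]

def delta_decode_vectors(encoded):
    """Reconstruct the original vectors by cumulatively applying the deltas."""
    decoded = [encoded[0]]
    for delta in encoded[1:]:
        decoded.append(decoded[-1] + delta)
    return decoded

# Two consecutive snapshots of one user's 256-element click sequence differ in
# only a few positions, so the delta row is mostly zeros and compresses well.
snapshot_1 = np.arange(256, dtype=np.int64)
snapshot_2 = snapshot_1.copy()
snapshot_2[:3] = [9001, 9002, 9003]   # a few newly clicked ad IDs

encoded = delta_encode_vectors([snapshot_1, snapshot_2])
assert np.array_equal(delta_decode_vectors(encoded)[1], snapshot_2)
print(np.count_nonzero(encoded[1]))   # 3: only the changed elements are non-zero
```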
Given how frequently vectors are used in machine learning algorithms, the inability of existing column-store formats to natively handle these types, and to apply to them the same optimizations that they apply to primitive types, results in significant performance shortfalls. This shortfall again stems from the fact that existing column-store formats originally targeted SQL workloads, in which vectors are discouraged and rare. Nonetheless, column storage for machine learning workloads must include optimized native support for vector types moving forward.
The problem of column-oriented data in the face of data compliance requirements
As a third example, personalization and recommendation systems collect large quantities of user data, which are used to create training datasets that build detailed user profiles. However, recent privacy and data compliance regulations worldwide, such as the EU's GDPR, California's CCPA and CPRA, and Virginia's VCDPA, mandate the physical deletion of user data within specified timeframes. These deletions cause problems for data stored in existing columnar formats for two reasons. First, since each column in the deleted row is stored separately, a single request to delete a row results in many separate modifications: one for each column in that row. Second, these formats often use block-based compression, which complicates direct modifications of individual rows.
Without optimizations, deleting a single row in columnar storage may require rewriting the entire file, leading to significant I/O and resource consumption. Therefore, out-of-place paradigms are often used, such as deletion vectors that employ bitmaps to mark rows for deletion. Deletion vectors record changes without requiring the entire file to be rewritten; instead, when data is read, the vectors are merged with the raw data, preventing deleted rows from reaching downstream operators. Unfortunately, since this approach does not physically delete the data (it merely hides it from subsequent reads), it often does not satisfy data compliance regulations that require timely physical deletion. Therefore, in practice, many organizations avoid deletion vectors and similar out-of-place methods, and instead immediately delete data despite the high cost.
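A simplified sketch of the deletion vector idea is shown below: a bitmap marks deleted row positions and is merged with the data at scan time, so downstream operators never see the deleted rows, while the underlying values remain physically present. The class and method names here are hypothetical, for illustration only.

```python
import numpy as np

class DeletionVector:
    """Bitmap of deleted row positions, merged with the data at read time."""

    def __init__(self, num_rows):
        self.deleted = np.zeros(num_rows, dtype=bool)  # one flag per row

    def mark_deleted(self, row_positions):
        self.deleted[row_positions] = True             # cheap metadata-only write

    def apply(self, column_values):
        # Scan-time merge: filter out rows flagged as deleted so they never
        # reach downstream operators; the underlying file is not modified.
        return column_values[~self.deleted]

user_ids = np.array([101, 102, 103, 104, 105])
dv = DeletionVector(len(user_ids))
dv.mark_deleted([1, 3])        # "delete" users 102 and 104
print(dv.apply(user_ids))      # [101 103 105], yet 102 and 104 still exist on disk
```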
The next generation of column-store formats will thus need better support for in-place deletes than is currently available in Parquet and ORC. Compression algorithms and checksums need to be used that do not require decompressing and rewriting large chunks of data just to delete a single row. Rather, these formats should use dictionary, bit-packed, delta, and run-length encoding schemes that maintain a clear delineation between rows, so that a row can be deleted without decompressing the surrounding data. Although this will lead to slightly poorer compression ratios, the performance benefit of enabling in-place deletes (and updates) far outweighs this downside.
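As a toy illustration of what row-delineated encoding enables (again hypothetical, and ignoring details such as page statistics and checksums that a real format would also need to update), a column whose values occupy fixed-width slots allows a single row to be physically overwritten in place without touching its neighbors:

```python
import numpy as np

class FixedWidthColumn:
    """Values packed into fixed-width slots, so each row's bytes are addressable."""

    def __init__(self, values):
        self.slots = np.asarray(values, dtype=np.uint16)

    def delete_row(self, row_index, tombstone=0):
        # Physically overwrite only this row's slot; no neighboring rows need
        # to be decompressed or rewritten.
        self.slots[row_index] = tombstone

column = FixedWidthColumn([500, 501, 502, 503])
column.delete_row(2)
print(column.slots)   # [500 501   0 503]: row 2's value is physically gone
```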
Conclusion
The CIDR paper cited above gives several other examples of areas in which Parquet and ORC fail to fully meet the requirements of modern machine learning workloads. In none of these areas do they completely fail; in fact, we still recommend using these open formats based on what is available today (especially when Starburst is deployed, which is highly optimized for these types of open data formats). The problem is simply that these columnar formats were originally designed for a completely different set of workloads, and certain design decisions would have been made differently if the workloads we run today had been as prevalent then as they are now.
It is clear that the time has come for new open columnar formats designed specifically for modern machine learning workloads. Whether this is Bullion (whose design is described in the CIDR paper mentioned above), Nimble/Alpha (from Meta), which is described in a CIDR paper from last year, or some other approach, there is simply too much performance left on the table when using previous-generation formats.