Schema on Read


Data analytics platforms use schemas to define table structures, properties, and relationships so query engines can process data efficiently. For example, a schema will describe the number and names of a table’s columns along with metadata to help queries find relevant information in vast datasets.
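To make this concrete, here is a minimal sketch (not tied to any particular platform) of what a table schema captures: column names, types, and nullability for a hypothetical orders table, expressed with PySpark's StructType.

```python
# A minimal sketch of a table schema: column names, types, and nullability
# for a hypothetical "orders" table. Query engines rely on this metadata to
# plan queries and skip irrelevant data.
from pyspark.sql.types import (
    StructType, StructField, LongType, StringType, DoubleType, TimestampType
)

orders_schema = StructType([
    StructField("order_id",    LongType(),      nullable=False),
    StructField("customer_id", LongType(),      nullable=False),
    StructField("status",      StringType(),    nullable=True),
    StructField("amount",      DoubleType(),    nullable=True),
    StructField("created_at",  TimestampType(), nullable=False),
])
```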

The traditional approach, schema-on-write, defines table structure at the time of ingestion, whereas the more modern approach, schema-on-read, defines table structure only at query time.

This difference has a profound impact on the analytics platform’s performance and scalability, particularly given big data’s rising volume, velocity, and variety. We’ve discussed the implications of these two approaches in other posts, but never given the schema-on-write vs. schema-on-read comparison the full attention it deserves. This post will.

What is schema-on-write?

As its name implies, the schema-on-write approach applies structure to data as it’s written to storage. It came into use with the first relational database management systems (RDBMS) and continues today in products like IBM Db2 and MySQL. Data warehouses evolved from the RDBMS world, so services like Snowflake use the database schema approach.

Databases and data warehouses depend on consistent data formats, encodings, and other properties to make analysis as straightforward as possible. These systems handle this up front by transforming data during ingestion to comply with the target table’s predefined schema.
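As a rough illustration of schema-on-write, the sketch below coerces raw records into a predefined target schema at ingestion time and drops rows that cannot comply. The paths and table names are hypothetical.

```python
# A minimal schema-on-write sketch: raw records are cast and reshaped to match
# the target table's predefined schema during ingestion, and non-compliant rows
# are filtered out before the write. Paths and table names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("schema-on-write-sketch").getOrCreate()

raw = spark.read.option("header", True).csv("s3://example-bucket/raw/orders/")

conformed = (
    raw
    .withColumn("order_id", F.col("order_id").cast("bigint"))
    .withColumn("amount", F.col("amount").cast("double"))
    .withColumn("created_at", F.to_timestamp("created_at"))
    # Enforce the schema up front: drop rows whose casts failed.
    .filter(F.col("order_id").isNotNull() & F.col("created_at").isNotNull())
    .select("order_id", "amount", "created_at")
)

# Append into the warehouse table whose schema was defined before ingestion.
conformed.writeTo("warehouse.sales.orders").append()
```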

Making schema-on-write work

Defining schemas is an essential aspect of database design. Architects must identify all the different ways data consumers may want to look at the data and create table schemas that support these uses.

Work doesn’t stop once the database is complete. The ETL data pipelines that populate the repository require time-consuming development and maintenance. Data engineering teams must fully understand the contents of data sources to ensure data modeling accurately identifies the necessary transformations. Any change in the source data could disrupt the pipelines, so engineers must constantly monitor the end-to-end process.

Schema-on-write disadvantages for analytics

All of that work goes toward delivering an accessible resource for data consumers, yet schema-on-write has inherent limitations, including:

Inflexibility: Schemas freeze a company’s analytics capacity by dictating the data analysis it can perform. New questions require more complex queries and data pipelines to work around schema inflexibility.

Data poverty: Transformations applied at ingestion strip meaning from the incoming raw data. As a result, analysts may miss insights that could improve business performance.

Data monotony: Databases and data warehouses, with their rigid schemas and data structures, aren’t equipped to handle the semi-structured and unstructured data flowing from today’s operational systems.

How Apache Iceberg relates to schema-on-write

Some organizations are replacing their warehouse vendor’s proprietary table format with Apache Iceberg’s open-source table format to reduce data warehousing costs and limit exposure to vendor lock-in. For example, Snowflake now lets its customers access Iceberg tables and data stored on cloud platforms. Several features of Iceberg support schema-on-write to keep data compatible with the data warehouse.

Schema enforcement

Data pipelines can use an Iceberg table’s schema during validation to enforce schema-on-write and identify non-compliant data for rejection or additional transformation.
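One way a pipeline might enforce this, sketched below under assumed names: compare an incoming batch’s columns against the Iceberg table’s current schema and route non-compliant data to a quarantine location instead of writing it.

```python
# A sketch of pipeline-side schema enforcement against an Iceberg table:
# compare the incoming batch's columns with the table's current schema and
# quarantine the batch if they diverge. Assumes a Spark session configured
# with an Iceberg catalog named "lake"; all names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-enforcement-sketch").getOrCreate()

target = "lake.sales.orders"
expected_columns = set(spark.table(target).schema.fieldNames())

incoming = spark.read.parquet("s3://example-bucket/landing/orders/")
actual_columns = set(incoming.schema.fieldNames())

if actual_columns != expected_columns:
    # Set non-compliant data aside for review or additional transformation.
    incoming.write.mode("append").parquet("s3://example-bucket/quarantine/orders/")
else:
    incoming.writeTo(target).append()
```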

Schema evolution

Iceberg tables use metadata to define their schema, so changes don’t require rewriting the entire data table.
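For example, schema changes like these are metadata-only operations in Iceberg’s Spark DDL; the sketch below assumes a Spark session configured with an Iceberg catalog named “lake”, and the table and column names are illustrative.

```python
# A sketch of Iceberg schema evolution: these DDL statements change only the
# table's metadata, so existing data files are not rewritten. Assumes a Spark
# session with an Iceberg catalog named "lake"; names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-evolution-sketch").getOrCreate()

spark.sql("ALTER TABLE lake.sales.orders ADD COLUMN discount double")
spark.sql("ALTER TABLE lake.sales.orders RENAME COLUMN status TO order_status")
# Safe type promotions (e.g., int to bigint) are also metadata-only changes.
spark.sql("ALTER TABLE lake.sales.orders ALTER COLUMN order_id TYPE bigint")
```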

Data partitioning

Hidden partitioning keeps partition details out of queries: partition values are derived from column transforms, so changes to the table’s partition layout do not impact existing queries.
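The sketch below shows the idea under assumed names: the table is partitioned by a daily transform of a timestamp column, while the query filters on the column itself and never mentions the partition.

```python
# A sketch of Iceberg hidden partitioning: the table is partitioned by a daily
# transform of created_at, but queries filter on created_at directly and never
# reference the partition. Assumes an Iceberg catalog named "lake"; all names
# are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-partitioning-sketch").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.sales.events (
        event_id   bigint,
        payload    string,
        created_at timestamp
    )
    USING iceberg
    PARTITIONED BY (days(created_at))
""")

# The filter targets the source column; Iceberg prunes partitions behind the
# scenes, so later changes to the partition layout won't break this query.
spark.sql("""
    SELECT count(*) AS event_count
    FROM lake.sales.events
    WHERE created_at >= TIMESTAMP '2024-01-01 00:00:00'
""").show()
```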

What is schema-on-read?

Schema-on-read approaches only apply a schema when a query accesses a table. Any required transformations happen at runtime. As a result, the underlying data can remain in its original raw state. This on-demand transformation is what makes a data lake such an effective platform for modern analytics.

Schema-on-read advantages for analytics

Since a schema only goes into effect at runtime, tables with different schemas can use the same raw data, as the sketch following this list shows. Schema-on-read’s advantages include:

Flexibility: Should new questions arise that existing tables can’t support, engineers create new tables without duplicating or re-ingesting source data.

Data richness: Analysts can extract more meaning from raw data than from aggregated and transformed data forced into a static schema.

Data variety: Freed from any predefined schemas, data lakes can store a broader variety of unstructured data that may lead to deeper insights.

Streamlined management: The data lake’s extract, load, transform (ELT) model almost eliminates complex, brittle ETL data integration pipelines that burden data warehouse management.

Scalable affordability: By allowing data to remain in its raw state, schema-on-read lets data lakes leverage low-cost object storage services like Amazon S3 and Azure Blob Storage.
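Here is a minimal sketch of that first point: the same raw JSON files are read with two different schemas, each applied only at query time. The paths and field names are hypothetical.

```python
# A minimal schema-on-read sketch: the same raw JSON files are read with two
# different schemas, applied only at query time. Paths and field names are
# illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

raw_path = "s3://example-bucket/raw/clickstream/"

# An analyst's view: just enough fields for revenue reporting.
reporting_schema = StructType([
    StructField("user_id", StringType()),
    StructField("amount",  DoubleType()),
    StructField("ts",      TimestampType()),
])

# A data scientist's view: keep the raw payload for feature engineering.
science_schema = StructType([
    StructField("user_id", StringType()),
    StructField("ts",      TimestampType()),
    StructField("payload", StringType()),
])

reporting_df = spark.read.schema(reporting_schema).json(raw_path)
science_df   = spark.read.schema(science_schema).json(raw_path)
```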

Using schema-on-read with Iceberg

Although Apache Iceberg can support data warehouses, the open table format comes into its own when implemented on a data lake.

At an architectural level, Iceberg gives data teams optionality. They can use the open file format most appropriate for their organization, whether it’s Avro, ORC, or Parquet. Likewise, Iceberg lets engineers select among open SQL query engines, including Apache Spark and Trino, to get the best performance for any given project.

By implementing data structure as metadata, schema-on-read Iceberg tables are far more flexible than the static schema-on-write structures of a data warehouse. Data scientists and business intelligence analysts can use the same raw data with different Iceberg tables optimized for their specific analytics needs.

Schema-on-read examples

Schema-on-read enriches an organization’s data architecture, as seen in the following use cases:

PostgreSQL to data lake

New companies typically run their analytics on their operational databases, which are more often than not built on PostgreSQL. They’ve already invested in these relational databases. Plus, their data engineers know how to use PostgreSQL and have enough on their plates without learning a new data analytics platform.

However, PostgreSQL isn’t built for analytics. Its rigid schema, poor handling of unstructured data, and weak performance on analytical queries make RDBMS-based analytics inefficient and expensive.

At a certain point, it makes more sense to migrate analytics from PostgreSQL to a data lake. Data teams can switch from complex, brittle, and maintenance-heavy ETL pipelines to more straightforward and efficient ELT workflows. Data lake query performance is significantly faster thanks, in part, to an open table format’s schema-on-read approach. Finally, the data lake gives users more options when working with data.
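A rough ELT sketch of that migration, under assumed connection details and table names: extract the operational table via JDBC, load it into the lake as-is, and leave transformation to query time.

```python
# A rough ELT sketch, assuming a Spark session with an Iceberg catalog named
# "lake": extract the operational table via JDBC, land it in the lake without
# reshaping it, and transform at query time. Connection details, paths, and
# table names are illustrative.
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("postgres-elt-sketch").getOrCreate()

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/shop")
    .option("dbtable", "public.orders")
    .option("user", "reporting")
    .option("password", os.environ["PGPASSWORD"])
    .load()
)

# Load: land the raw rows in the lake untransformed.
orders.writeTo("lake.raw.orders").createOrReplace()

# Transform: shape the data at query time instead of at ingestion.
daily_revenue = spark.sql("""
    SELECT date_trunc('day', created_at) AS day, sum(amount) AS revenue
    FROM lake.raw.orders
    GROUP BY 1
""")
daily_revenue.show()
```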

Kafka and data lakes

Internet of Things, e-commerce, and other streaming data systems have become standard elements of enterprise data architectures. Real-time data management systems like Kafka often use complex, dynamic schemas that are particularly challenging for data warehouses. Although Kafka’s ecosystem provides basic analytics capabilities, they aren’t at the level needed for data-driven decision-making.

Ingesting streaming data into a data lake lets an organization analyze streaming data in near real-time. Iceberg provides the schema definition at ingestion. Moreover, the Iceberg tables can evolve with the streaming source without rewriting the lake’s existing data.
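As a sketch of that pattern, the example below reads a Kafka topic with Spark Structured Streaming, applies a schema to the JSON payload at ingestion, and appends continuously to an Iceberg table. The broker, topic, and table names are hypothetical.

```python
# A sketch of near-real-time ingestion: read a Kafka topic with Spark
# Structured Streaming, apply a schema to the JSON payload, and append to an
# Iceberg table. Broker, topic, and table names are illustrative; assumes an
# Iceberg catalog named "lake" and the Spark Kafka connector on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-to-iceberg-sketch").getOrCreate()

event_schema = StructType([
    StructField("device_id", StringType()),
    StructField("reading",   DoubleType()),
    StructField("ts",        TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker.example.com:9092")
    .option("subscribe", "iot-readings")
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Continuous appends into the Iceberg table; if the source adds fields later,
# the table's schema can evolve without rewriting existing data.
query = (
    events.writeStream.format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/iot-readings/")
    .toTable("lake.iot.readings")
)
```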

Schema discovery on a data lake

One of the challenges that data lake early adopters ran into was the effort required to find useful data. All too often, the data lake became a data swamp.

Starburst’s fully managed open data lakehouse analytics platform, Galaxy, provides schema discovery functionality that streamlines the exploratory process. It automatically detects and organizes schemas in your data lake, and not just for your existing datasets: schema discovery also catalogs schemas for new datasets as they arrive.

With Starburst Galaxy’s schema discovery, you can:

  • Streamline workflows
  • Break down data silos
  • Optimize data
  • Enforce governance consistently

Check out our Schema Discovery tutorial to learn more about using Starburst Galaxy to search for schemas in your object store.