Open Data Warehouse

Companies use open-source frameworks, particularly Apache Iceberg tables, to build enterprise-class data analysis solutions that are more affordable, more scalable, and better suited to their specific use cases.

Open-source data warehouses give enterprises cost-effective data portability and scalable query performance that proprietary systems don’t. Although going open source increases data team responsibilities, it also allows more control over the data powering innovative decision-making.

This guide will introduce the open-source data warehouse, explain its components, and show how Starburst’s open data warehouse analytics platform evolves analytics beyond a centralized data repository.

An open data warehouse is often seen as a combination of an SQL query engine and a data lake

Data warehouse vendors provide storage, management, and analytics in one package, which imposes tradeoffs. Companies work with a single vendor that takes responsibility for its product’s development and provides advanced data warehousing tools. However, these vendors can’t match the hyper-scale capabilities of commodity cloud service providers. In addition, vendor lock-in subjects customers to high storage and compute pricing.

Open data warehouses essentially combine a data lake’s flexible, scalable storage with the high performance of a massively parallel SQL query engine. While proprietary warehouses have been limited to structured data, the open route lets companies store all types of data in a data lake, allowing data consumers to access semi-structured and unstructured data for more advanced analytics.

How do you build a data warehouse for free? (Or, more practically, a cost-efficient one)

Data teams can leverage an ecosystem of open-source tools to build their own data warehouse on top of a cost-effective data lake.

Open query engine (e.g., Trino)

The SQL engine is how everyone in the company will interact with the data warehouse, so it’s important to choose options that are accessible to a variety of end users.

Apache Hive became popular for providing an SQL-like interface for MapReduce queries in Hadoop. Spark addressed some of Hive’s performance disadvantages.

Similar frustrations led to Presto’s development, but the momentum has since shifted to a Presto fork called Trino. Designed for interactive and ad-hoc data analytics, Trino delivers low-latency query results at petabyte scales.
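
To make this concrete, here is a minimal sketch of the kind of interactive query Trino is designed for. The catalog (lake), schema (web), and table (page_views) are hypothetical names standing in for whatever your deployment exposes.

```sql
-- Ad-hoc Trino query: daily page views over the past week.
-- Catalog, schema, and table names are hypothetical.
SELECT
  date_trunc('day', view_ts) AS view_day,
  count(*) AS views
FROM lake.web.page_views
WHERE view_ts >= current_timestamp - INTERVAL '7' DAY
GROUP BY 1
ORDER BY view_day;
```

Trino parallelizes the scan and aggregation across its cluster, which is what keeps latency low even at petabyte scales.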

Open file formats (e.g., Apache Parquet)

Data warehouses use file formats that store data from each table column close together. This approach differs from row-oriented relational database management systems (RDBMS), which are optimized for record retrieval. Columnar file formats are better suited for analytics where queries look for specific values or aggregate values within a targeted column. An open file format’s structure accelerates query performance, especially when processing petabytes of data.

Apache’s Parquet and ORC file formats enable similar data warehouse capabilities, including efficient compression, schema evolution, and query optimizations like data skipping.
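
As a brief sketch of how this looks in practice, the following Trino statement creates a Parquet-backed table through the Hive connector; all catalog, schema, and table names are hypothetical. Partitioning by day is what lets the engine skip files that cannot match a date filter.

```sql
-- Create a Parquet-backed, day-partitioned table (hypothetical names).
-- Queries filtering on view_day can skip entire partitions.
CREATE TABLE lake.web.page_views_parquet
WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['view_day']
)
AS
SELECT
    user_id,
    url,
    view_ts,
    date(view_ts) AS view_day  -- partition column must come last
FROM lake.web.page_views_raw;
```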

Open table formats (e.g., Apache Iceberg)

Data lakes provide efficient object storage but hold limited metadata about the data stored in each object. Open-source table formats make a data lake work like a warehouse by adding an abstraction layer of rich metadata that describes each table’s schema and partitions, as well as the data files that make up the table.

Several open table formats, including Hudi, Delta Lake, and Iceberg, can form the basis of a data warehouse. Apache Iceberg has seen increased enterprise adoption thanks, in part, to its large, active developer community. Iceberg efficiently delivers petabyte-scale data processing. Its key features include schema and partition evolution, time travel, and support for popular open file formats.
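
As a short sketch, these Trino statements exercise two of those features on a hypothetical Iceberg table: adding a column is a metadata-only change, and time travel reads the table as it existed at an earlier moment.

```sql
-- Schema evolution: add a column without rewriting existing data files.
ALTER TABLE lake.web.page_views ADD COLUMN referrer varchar;

-- Time travel: query the table as of an earlier point in time.
SELECT count(*) AS views_then
FROM lake.web.page_views
FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC';
```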

Open to all sources (e.g., data federation)

Commodity object storage, combined with open file formats, table formats, and query engines, comprises the core data warehouse architecture. However, an open-source data warehouse won’t support every kind of enterprise analytics, simply because no centralized repository can store all enterprise data.

Even a warehouse built on a data lake can’t store everything. Some sources generate so much data that moving it into the warehouse isn’t practical. Other sources may be governed by data privacy and sovereignty regulations that prevent ingestion across national borders.

Data federation virtualizes the dozens or hundreds of data sources that don’t feed the data warehouse. Rather than burdening data engineers with hundreds of complex ETL pipelines, federated data integration happens on demand with each query. Federation also broadens data usage, since consumers get direct, real-time access to more sources.
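
For illustration, here is a sketch of a federated Trino query joining historical data on the lake with live records in an operational database. The catalogs (lake and crm) and tables are hypothetical; crm would point at a source such as PostgreSQL through a connector.

```sql
-- Federated query: join lake-resident order history with live
-- customer records from an operational database (hypothetical names).
SELECT
    c.region,
    sum(o.order_total) AS revenue
FROM lake.sales.orders_history AS o
JOIN crm.public.customers AS c
    ON o.customer_id = c.customer_id
GROUP BY c.region
ORDER BY revenue DESC;
```

No pipeline moves the customer records ahead of time; the query engine reads them from the source at execution.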

Open to decentralization of responsibilities (e.g., data mesh)

The ultimate data warehouse evolution is to open enterprise data management. The data mesh model distributes data responsibilities to domains that manage their datasets with their own engineers. Central data teams define company-wide data quality standards, metadata frameworks, and other policies governing data mesh management.

A data mesh’s self-serve platform lets data consumers discover and access data products, much as a software developer uses a library’s APIs. Evolving an open-source data warehouse into a data mesh lets companies push data-driven decision-making throughout the organization without overwhelming data engineering teams.

Benefits of an open-source data warehouse

Combining a data lake with open data warehouse tools eliminates vendor lock-in, streamlines warehouse management, and significantly reduces costs. Yet, open source’s most important benefit is how a modern data warehouse empowers business users at every level.

Open query engines make data accessible through standard SQL, which many business intelligence (BI) analysts and most data scientists and engineers have extensive experience with. For instance, data scientists working on machine learning projects can explore, discover, and iterate as they refine their data models using SQL in their preferred development environment.

Less technical users also benefit from an open-source data warehouse. They can pull data into user-friendly BI tools like Tableau or even Excel and produce reports without calling the data team’s help desk. Workflow automations and business dashboards can use the query engine to improve data-driven decision-making further.

Related reading: Alternative to Snowflake

Related reading: BestSecret’s data journey: Moving beyond Snowflake

PostgreSQL to open data warehouse

For many companies, the analytical limitations of their transactional databases, like PostgreSQL, provide the impetus for developing a data warehouse. Although PostgreSQL is a robust technology with lots of support, it is optimized for transactional workloads. Queries become slower and slower as these databases grow, frustrating decision-makers with extended turnaround times.

Ingesting PostgreSQL data into a data warehouse lets the database focus on what it does best while giving analysts quick access to transactional data in a system optimized for analysis.
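
As a sketch of that ingestion under assumed names, a single Trino CREATE TABLE AS statement can copy a table from a PostgreSQL catalog (pg) into an Iceberg table on the lake; a production setup would typically schedule incremental loads instead.

```sql
-- One-time copy (hypothetical names): move orders out of PostgreSQL
-- into a Parquet-backed Iceberg table so analysts stop querying the
-- transactional database directly.
CREATE TABLE lake.analytics.orders
WITH (format = 'PARQUET')
AS
SELECT *
FROM pg.public.orders;
```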

Examples of open-source data warehouses

Many open-source tools emerged from companies operating on the big data frontier. Facebook was the source of Hive and Presto. Netflix engineers developed Iceberg, while Twitter employees helped create Parquet. These companies needed rapid access to insights from a flood of data that their existing data warehouse tools couldn’t handle effectively.

While your company’s data requirements may pale in comparison with those of Netflix and Facebook, remember that they created these tools ten years ago, when their data volumes were a fraction of today’s. Data warehouse limitations make going open source a compelling option for any company. For example, check out why German fashion innovator BestSecret replaced Snowflake with an open-source data warehouse.

Data warehousing tools

Despite proprietary data warehouse systems’ limitations, other considerations may rule out an open-source strategy. Not least is the additional burden placed on data teams to develop and maintain an open-source system. Companies with existing warehouses can use Starburst to benefit from open source software’s performance enhancements. Trino is the basis for Starburst’s data warehouse analytics platform, delivering low-latency query results from federated data sources across the enterprise. Here are a few ways Starburst can help different data warehouse implementations.

Avoid Snowflake data lock-in with Apache Iceberg tables and save money

Although Snowflake supports Iceberg tables, the jury is still out on whether this will reduce data lock-in by making Snowflake storage more open. During periods of dynamic growth, for instance, data centralization within Snowflake may constrain the organization more than it helps. That was one of the reasons BestSecret reduced Snowflake’s role, lowering costs by 70%, after using Starburst to implement Trino with Iceberg on their Azure Blob Storage data lake.

Amazon Redshift’s data warehousing platform

Amazon Redshift is an obvious choice for companies using AWS cloud infrastructure and looking for a proprietary data warehouse, since it integrates easily with the rest of their AWS systems. Starburst supports Redshift catalogs through a dedicated connector, so our massively parallel SQL query engine’s performance and cost optimizations will enhance any Redshift warehouse built on Amazon S3.
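
As a rough sketch under assumed values, exposing an existing Redshift cluster to federated queries comes down to a Trino catalog configuration like the one below; the endpoint, database, and credentials are placeholders.

```properties
# Hypothetical Trino catalog file: etc/catalog/redshift.properties
connector.name=redshift
connection-url=jdbc:redshift://example.cluster.us-east-1.redshift.amazonaws.com:5439/analytics
connection-user=analytics_user
connection-password=${ENV:REDSHIFT_PASSWORD}
```

Once the catalog is loaded, Redshift tables are queryable with standard SQL and can be joined to any other configured source.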

Google BigQuery warehouses

Similarly, companies building their infrastructure on Google Cloud may go the proprietary route by creating a Google BigQuery data warehouse. Starburst’s data federation capabilities let them aggregate data from other sources into their BigQuery system to eliminate data silos.

Microsoft Azure Synapse warehouses

Starburst Galaxy supports Azure Data Lake Storage (ADLS) catalogs, so you can connect our Trino-based analytics platform with your Azure Synapse data warehouse. Together, Starburst and Azure Synapse become the center of gravity for data transformation by letting users access Synapse and other data sources from a single query.