Why an open-source data architecture is the future of data analytics

Traditional relational databases like Postgres and MySQL, optimized for transactional data, can’t deliver the fast query performance business analysts need. Proprietary data warehouses do better but keep driving data costs higher. Open-source data architectures provide the affordable, flexible performance demanded by the velocity, volume, and variety of modern data.

Once, only companies like Google, Netflix, and Facebook had the talent and resources to patch together open-source tools for hyperscale data processing. A decade later, the open-source data ecosystem is remarkably diverse, with multiple options for each stage of the data lifecycle. At the same time, every organization now generates unprecedented amounts of data that could drive innovation.

Open-source technologies should now be the bedrock of every data strategy.

In this article, we’ll see how the open data stack delivers substantial benefits despite its challenges. With the right tools, extending an analytics platform beyond the warehouse creates an open data architecture that unlocks insights across the enterprise.

What is an open data architecture?

An open data architecture is a big data analytics technology stack that uses open-source tools for the ingestion, orchestration, data transformation, and visualization of enterprise data at scale. Data teams use these tools to streamline data management, reduce costs, and improve end-users’ data access.

Challenges adopting OSS

The open-source paradigm promises to be a better alternative to proprietary software, but any enterprise application, open or closed, involves trade-offs. Companies must weigh several challenges before adopting open-source software.

Expertise, recruiting, and retention

An immediate question is whether the data team has the skills to develop an in-house data architecture. Can you assign knowledgeable engineers to each tool? Do your managers have end-to-end experience with the full stack?

If not, you have to hire for the project in a job market where skilled data engineers are in high demand. This task gets more challenging if you choose safe but aging open-source projects. The best engineers won’t sign up if they can’t gain valuable skills working for your company.

Budgets, resources, and commitment

Building it yourself requires considerable time, effort, and resources from the data team, which must also maintain high service levels for all legacy systems. Executive sponsorship is essential both for securing the budget and resources and for persuading stakeholders throughout the company to accept changes to their analytics practices.

Paradox of choice

You must select from multiple options at every level of the technology stack. Some will solve a few specific problems. However, many open-source projects have complex feature sets. You must understand the nuances distinguishing each project’s version of those features. Other factors also come into play, including ease of use, license terms, scalability, and interoperability.

Support ecosystem

Proprietary and open-source software have different support models. In theory, vendors provide dedicated, well-trained support staff and rich resources for troubleshooting issues with their products. In theory, the open-source community is always available with friendly advice on forums and well-written documentation in repositories.

In practice, you can find best-in-class experiences and support nightmares with either approach. Open-source architectures, however, will always be more fragmented than a monolithic solution.

Implementation and maintenance

That fragmentation becomes more apparent as you integrate different open-source projects into a single architecture. All development responsibilities now fall on the data team, as do ongoing maintenance and internal support tasks. Moreover, your team must manage a multi-year development roadmap to keep up with business demands.

What are the benefits of open-source data architecture?

Despite the challenges presented by significant changes to business practices, transitioning to an open-source data architecture offers considerable benefits.

Affordability

Open-source software is free to download and use, with no license fees. In addition, open data warehouses run on commodity cloud computing platforms without the unpredictable and expensive pricing of a proprietary solution.

Of course, the open-source model also adds development, support, and maintenance costs. Before adopting this approach, you must assess the total cost of ownership to decide whether the returns are worth the investment.
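A total cost of ownership comparison can start as simple arithmetic. The sketch below is only an illustration: the CostModel class, its cost categories, and every dollar figure are made-up placeholders, not benchmarks or vendor prices. Replace them with your own estimates.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    """Yearly cost inputs for one option. All figures are placeholders."""
    license_fees: float      # vendor license or subscription fees per year
    infrastructure: float    # cloud compute and storage per year
    engineering: float       # staff time for development and maintenance per year
    support: float           # paid support contracts or training per year
    one_time_migration: float = 0.0  # up-front cost of moving to this option

    def total_cost(self, years: int) -> float:
        """Total cost of ownership over the given number of years."""
        yearly = (self.license_fees + self.infrastructure
                  + self.engineering + self.support)
        return self.one_time_migration + yearly * years


# Hypothetical inputs chosen only to show the shape of the comparison.
proprietary = CostModel(license_fees=400_000, infrastructure=100_000,
                        engineering=150_000, support=0)
open_source = CostModel(license_fees=0, infrastructure=150_000,
                        engineering=350_000, support=50_000,
                        one_time_migration=300_000)

for years in (1, 3, 5):
    print(years, proprietary.total_cost(years), open_source.total_cost(years))
```

With these placeholder inputs, the open-source option costs more in the first year because of the migration, breaks even at three years, and is cheaper after that. Different inputs can reverse the result, which is why the assessment matters.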

Flexibility

Data warehouses like Snowflake or BigQuery tie your analytics capabilities to your provider's services. If the product doesn't support something, you have to do without it. Open-source ecosystems let you design an architecture that meets today's business needs while laying a foundation for the future.

Scalability

Hadoop and other early open data platforms started in the data center. Recent projects like Apache Iceberg are designed for commodity cloud services, decoupling compute from storage so each can scale independently. Gone are the days when you overpaid a vendor for compute because you needed a little more storage, or the reverse.

Accessibility

Open source can drive data democratization, letting data consumers at any skill level source data from more datasets with less reliance on over-burdened data engineering teams. Expert users can leverage APIs to develop machine learning and other big data science applications. Less sophisticated users can access real-time metrics on data dashboards to make more effective decisions.

How to think about build vs buy

Ultimately, the choice between building and buying a data warehouse comes down to each organization's context. What makes sense for your business may not make sense for another. As you consider which path to take, keep the following tips in mind:

Define requirements: Before jumping into the open source pool, define the business, regulatory, and other requirements driving the need for change.

Understand costs: Open source is free to download, but it isn’t without cost. Likewise, solution providers with high fees may offer a reasonable return on investment. Evaluate costs thoroughly.

Get buy-in: Moving to an open data architecture requires process and cultural changes that depend on executive support and stakeholder acceptance.

Design for the future: Your data architecture will define how your company innovates for years to come. Don’t just follow the market. Design an architecture that builds a data-driven organization.

Finding the right tools for your data

If open source is the best choice, the next step is finding the right tools among the myriad options. Need to automate workflows and data pipelines? There’s Airflow. Need optimized, scalable SQL queries? Look at Trino (formerly PrestoSQL). Many enterprise-class options live within the Apache Software Foundation’s stable of open-source projects.

Apache Iceberg built for Trino

Apache Iceberg, for example, was originally created at Netflix for use with Trino, and it was tailored to optimize performance and scalability from the outset. Other projects are cultivated by corporate sponsors, as with Databricks' support for Delta Lake.

What’s important is not limiting yourself to the tools your team already knows how to use. The right tool for your data may impose a steep learning curve, but that investment will pay dividends when you provide better services to your data customers.

Open, closed, or both?

Deciding to develop an open-source architecture doesn’t mean you should always take the free option. For instance, why force your company’s business intelligence analysts to use Apache Superset in the name of open-source purity when they’re already expert Tableau users?

Similarly, the paid enterprise implementation of an open-source project may better fit your needs. Trino is a massively parallel SQL query engine that federates data from multiple sources. Founded by Trino’s creators, Starburst offers performance optimizations and management features that streamline Trino-based analytics.
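To make "federates data from multiple sources" concrete, the sketch below joins tables that live in two separate databases with a single SQL statement. It is an analogy only: SQLite's ATTACH stands in for a federated engine, and the crm and lake databases, the table names, and the sample rows are all hypothetical. Trino itself addresses tables as catalog.schema.table and reaches real external systems through connectors, which this toy example does not attempt to reproduce.

```python
import sqlite3

# One connection plays the role of the query engine.
engine = sqlite3.connect(":memory:")

# Attach two independent in-memory databases as separate "sources".
engine.execute("ATTACH DATABASE ':memory:' AS crm")
engine.execute("ATTACH DATABASE ':memory:' AS lake")

# Hypothetical tables and rows, one table per source.
engine.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT)")
engine.execute("CREATE TABLE lake.orders (customer_id INTEGER, amount REAL)")
engine.executemany("INSERT INTO crm.customers VALUES (?, ?)",
                   [(1, "Acme"), (2, "Globex")])
engine.executemany("INSERT INTO lake.orders VALUES (?, ?)",
                   [(1, 120.0), (1, 80.0), (2, 45.5)])

# A single statement joins data across both sources.
FEDERATED_QUERY = """
    SELECT c.name, SUM(o.amount) AS total_spend
    FROM crm.customers AS c
    JOIN lake.orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total_spend DESC
"""


def spend_by_customer(conn):
    """Run the cross-source join and return (name, total_spend) rows."""
    return conn.execute(FEDERATED_QUERY).fetchall()


rows = spend_by_customer(engine)
print(rows)
```

The point of the design is that the analyst writes one query and the engine resolves where each table lives, so no data has to be copied into a central store first.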

Using the right tools together

Interoperability is critical for a successful analytics stack. It won’t matter that you picked the best tools for each job if they don’t play together nicely. Does your preferred data lake platform integrate with your chosen cloud service provider? Can your query engine access your file system’s metadata?

Your open lakehouse platform

Today’s businesses generate more data and more kinds of data with more governance requirements than a conventional data warehouse can ever manage. Yet, easy access to all enterprise data is essential for generating innovative insights and enhancing data-driven decisions. 

An open data stack must provide warehouse-level analytics capabilities while supporting workloads and use cases that depend on a diverse data ecosystem. Starburst extends a Trino-based open lakehouse platform (some might call it an open data warehouse) to become a single point of access with benefits including:

Democratized access: Starburst's support for ANSI-standard SQL and dataframe APIs lets expert and general users alike access the data they need with tools they already know.

Data source integration: Connectors to more than fifty enterprise data sources eliminate silos and open access to all structured, semi-structured, and unstructured data.

Reduced storage costs: Rather than consolidating everything within a central repository, Starburst queries data where it lives to eliminate redundant storage expenses and data duplication.

Balanced performance and compute costs: Optimized query planning, cached views, pushdown, dynamic filtering, and other Starburst query features let you balance performance and cost.

Enhanced governance: Fine-grained role-based and attribute-based access controls let you enforce granular governance policies that improve compliance while allowing appropriate access to record-level or aggregated data.
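The difference between role-based and attribute-based controls in the governance item above can be shown with a small sketch. This is not Starburst's policy syntax or API. The User class, the TABLE_READERS mapping, the table names, and the region attribute are all hypothetical, chosen only to show a role check on a table followed by an attribute filter on its rows.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    roles: set    # used by the role-based rule
    region: str   # attribute used by the attribute-based rule


# Role-based rule: which roles may read which (hypothetical) tables.
TABLE_READERS = {
    "sales.orders": {"analyst", "finance"},
    "hr.salaries": {"hr_admin"},
}


def can_read_table(user, table):
    """True if the user holds at least one role allowed to read the table."""
    return bool(user.roles & TABLE_READERS.get(table, set()))


def visible_rows(user, table, rows):
    """Apply the role check, then keep only rows from the user's region."""
    if not can_read_table(user, table):
        return []
    return [row for row in rows if row["region"] == user.region]


orders = [
    {"id": 1, "region": "EU", "amount": 120.0},
    {"id": 2, "region": "US", "amount": 80.0},
]
alice = User("alice", {"analyst"}, "EU")
bob = User("bob", {"engineer"}, "US")

print(visible_rows(alice, "sales.orders", orders))  # only the EU row
print(visible_rows(bob, "sales.orders", orders))    # empty: no permitted role
```

Here alice passes the role check and sees only the row matching her region, while bob sees nothing because none of his roles may read the table.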

Open-source data architectures are no doubt powerful. To learn more, see how Shopify cut execution time from hours to seconds with Iceberg and Trino.
