Massively Parallel Processing
Trino has become one of the fastest open-source SQL query engines thanks to its application of massively parallel processing (MPP) to the vast amounts of data enterprise analytics depends on.
This guide will explain how parallel processing works and how that translates into Trino’s growing popularity for big data analytics.
How massively parallel processing works
Once limited to scientific supercomputers, parallel processing has become ubiquitous in today’s data architectures. Modern enterprise data analysis wouldn’t be possible without the ability to execute the same task on different data simultaneously.
An MPP query engine relies on two types of compute nodes running in the same cluster. At a high level, a leader node divides tasks among dozens, hundreds, or thousands of processing nodes.
In Trino’s parlance, a coordinator node distributes splits among many worker nodes. The coordinator analyzes the SQL query to create an execution plan and assigns splits to the cluster’s worker nodes, which process them simultaneously and return their results to the coordinator.
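The coordinator and worker pattern described above can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not Trino's implementation: threads stand in for worker nodes, a list of integers stands in for table data, and the names `make_splits`, `worker_partial_sum`, and `coordinator_sum` are hypothetical. The units of work handed to workers are called splits here.

```python
from concurrent.futures import ThreadPoolExecutor


def make_splits(rows, split_size):
    """Coordinator step: cut the input into fixed-size units of work."""
    return [rows[i:i + split_size] for i in range(0, len(rows), split_size)]


def worker_partial_sum(split):
    """Worker step: compute a partial aggregate over one split."""
    return sum(split)


def coordinator_sum(rows, split_size=1000, workers=4):
    """Plan the work, fan it out to workers, then merge the partial results."""
    splits = make_splits(rows, split_size)
    # Threads are a stand-in for worker nodes (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(worker_partial_sum, splits))
    # Final step: the coordinator combines the partial aggregates.
    return sum(partials)


total = coordinator_sum(list(range(10_000)))
```

The same shape applies to other aggregates that can be computed per split and then merged, such as counts, minimums, and maximums.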
Why massively parallel processing is significant
Analytic queries are particularly responsive to parallelization, creating significant business benefits.
High performance analytics
Each step in a query involves performing the same action on thousands or millions of data values. Doing this sequentially at enterprise scale would take far too long to be useful. Spreading the work across many processors reduces latency dramatically and compresses query times into hours or minutes.
Query acceleration generates several business benefits. By getting the organization’s baseline workloads done faster, parallel processing frees up compute capacity for additional analytics projects. In addition, batch processes can be completed in much less time, so dashboards and data warehouses are populated more quickly.
Scalability
MPP systems leverage the dynamic scalability of high-performance cloud computing platforms. Data teams can dial up their compute capacity to process large amounts of data without having to make large investments in fixed computing infrastructure.
Improved fault tolerance
Parallelization is the cornerstone of Trino’s fault tolerance functionality. The coordinator node monitors all the worker nodes in the cluster. When it detects a failure in one of the workers, the coordinator reassigns the failed work to another worker and feeds it buffered data, so the query resumes instead of starting over.
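The retry behavior can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions, not Trino's actual mechanism: workers are plain string IDs, a failure is a raised exception, and the in-memory `splits` list stands in for buffered data that survives a worker failure. All names (`WorkerFailure`, `run_task`, `run_with_retry`) are hypothetical.

```python
class WorkerFailure(Exception):
    """Raised when a simulated worker is unavailable."""


def run_task(worker_id, split, failed_workers):
    """Simulated worker: fails if it is marked down, otherwise sums its split."""
    if worker_id in failed_workers:
        raise WorkerFailure(f"worker {worker_id} is down")
    return sum(split)


def run_with_retry(splits, worker_ids, failed_workers):
    """Simulated coordinator: retry each failed task on the next worker."""
    results = []
    assignments = []
    for index, split in enumerate(splits):
        # Round-robin: start with a preferred worker, then fall back to others.
        start = index % len(worker_ids)
        candidates = worker_ids[start:] + worker_ids[:start]
        for worker_id in candidates:
            try:
                results.append(run_task(worker_id, split, failed_workers))
                assignments.append(worker_id)
                break
            except WorkerFailure:
                continue  # the split is still buffered, so try another worker
        else:
            raise RuntimeError("no healthy worker available")
    return sum(results), assignments


total, assignments = run_with_retry(
    [[1, 2], [3, 4], [5, 6]], ["w1", "w2", "w3"], failed_workers={"w2"}
)
```

In this run the second split is first offered to the failed worker `w2` and then completed by `w3`, so the query still returns the full total.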
Cost-effectiveness
MPP query engines running on cloud computing platforms reduce costs over traditional data warehousing analytics. Companies can schedule the compute resources they need as they need them, thus avoiding the expense and limited flexibility of dedicated on-premises data centers. Moreover, companies aren’t locked into their warehouse vendors’ expensive pricing policies.
Trino’s ability to federate enterprise data sources further reduces costs by limiting the data volumes being moved from one place to another. Trino queries data where it lives, so raw data does not need to be moved or copied into a central repository first. Not only does federation avoid cloud platforms’ data transfer fees, but it also reduces the bandwidth pressures on enterprise networks.
Faster time to insight with big data
The most significant benefit of massively parallel processing query engines is the speed of insight generation. Large-scale data science projects can complete their iterative development cycles in a fraction of the time. Fault-tolerant batch processing makes data access more reliable and ensures business intelligence analysts have the information they need, when they need it. And by cutting query times, MPP systems create new opportunities for near real-time data analytics.
Data-driven organizations depend on this rapid access to insights to give executives the information they need to make better, more responsive business decisions.
Massively parallel processing use cases
Given these benefits, it’s no surprise that massively parallel processing query engines are widely adopted across industries. Some MPP use cases include:
Data warehouse
Adding an MPP query engine to a data warehouse compensates for the warehouse’s inherent weaknesses. For example, warehouse analytics can only explore the structured data contained within the warehouse itself.
A modern query engine like Trino works with structured, unstructured, and semi-structured data stored anywhere within the enterprise. By interconnecting these varied data sources, Trino queries can support more sophisticated analyses that reveal insights unavailable in the warehouse alone.
Data science and machine learning
Data scientists benefit from MPP query engines at every stage of the development process. Ingesting massive datasets, exploration and discovery, extracting training data, and running the final algorithms require an enormous amount of data processing. These workloads would be impossible to complete without parallelization’s ability to accelerate query execution.
Business intelligence and analytics
Analytics isn’t always about long-term data science initiatives. Decision-makers need answers to immediate business questions. However, analysts can’t respond quickly if they must compete for constrained compute resources.
Trino’s massively parallel processing, combined with its support for ANSI-standard SQL, makes big data analytics more accessible. Analysts can explore datasets, run queries, and get results faster to support the company’s day-to-day decision-making.
Real-time data processing
Data warehouses and data lakes have become important tools for long-term and near-term decision-making. However, their dependence on data ingestion pipelines and overnight batch processing has always limited these centralized repositories to less immediate analytical tasks.
The speed of an MPP query engine like Trino becomes invaluable for real-time and near-real-time use cases like monitoring cybersecurity or financial transaction logs for compliance violations.
Technology terms related to massively parallel processing
MPP query engines depend on a number of technologies to make these use cases possible. Some you may run into include:
Parallel computing
Initially developed for supercomputer applications, parallel computing is the general process of dividing large computational workloads into smaller tasks for simultaneous execution.
Sequential processing
Before parallelization became ubiquitous, computer systems could only handle one piece of data at a time. Improving performance required more powerful, and more power-hungry, central processing units (CPUs). This brute-force approach to performance is impractical at petabyte scales.
Cluster computing
This parallel computing architecture connects multiple computers within a high-bandwidth, low-latency network. Computer clustering forms the basis of cloud computing business models. Customers pay for a share of the provider’s compute resources bundled in a cluster that runs dozens or hundreds of parallel processing nodes.
Distributed computing
This parallel architecture divides tasks across a more loosely connected network like an office network or the internet. Trino’s ability to push down parallelized queries to federated data sources is an example of distributed computing that reduces the amount of data flowing across enterprise networks.
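The effect of pushing work down to the source can be shown with a small comparison. This is an illustrative sketch, not Trino code: an in-memory SQLite database stands in for a remote data source, the `orders` table and its contents are made up for the example, and the number of rows fetched stands in for data crossing the network.

```python
import sqlite3


def build_source():
    """Create a stand-in data source with 1,000 rows of sample orders."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(i, "EU" if i % 4 == 0 else "US", float(i)) for i in range(1000)],
    )
    return conn


def without_pushdown(conn):
    """Fetch every row, then filter in the query engine."""
    rows = conn.execute("SELECT id, region, amount FROM orders").fetchall()
    kept = [row for row in rows if row[1] == "EU"]
    return len(rows), kept


def with_pushdown(conn):
    """Let the source apply the filter so only matching rows are transferred."""
    rows = conn.execute(
        "SELECT id, region, amount FROM orders WHERE region = ?", ("EU",)
    ).fetchall()
    return len(rows), rows


conn = build_source()
moved_all, kept = without_pushdown(conn)
moved_pushed, pushed = with_pushdown(conn)
```

Both approaches return the same result set, but the pushed-down version transfers only the rows that match the filter.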
Multiprocessing
Modern CPU architectures combine multiple processing cores within a single piece of silicon. Operating systems and applications split tasks across these cores for simultaneous execution. Multiprocessing shouldn’t be confused with multitasking, in which a single processor switches rapidly between tasks so that they appear to run at once while still executing sequentially.
Data lakes
While proprietary data warehouses do leverage parallel architectures, data lakes are the only way to realize the full benefit of MPP query engines. These centralized repositories of structured and unstructured data run on a cloud computing provider’s object storage service. Although inherently more scalable and affordable than closed warehouse systems, data lakes on their own cannot match the warehouse’s analytics capabilities.
MPP architecture: Why Starburst users would be interested in massively parallel processing
The developers who created the open-source Trino project founded Starburst to make massively parallel processing more accessible to enterprises. Both Starburst Enterprise and Starburst Galaxy leverage Trino’s capabilities to deliver dramatic improvements in query performance.
Understanding massively parallel processing helps Starburst users learn how the system works and how to optimize their queries. When users write a SQL query in the Starburst interface, they are actually interacting with the coordinator node. This node interprets the user’s SQL statements and converts them into an execution plan that balances the number and cost of worker nodes. Users should be aware of where that balance is set to ensure their projects are completed in a reasonable time and at a reasonable cost.
Starburst open data lakehouse and massively parallel processing
Starburst builds upon Trino’s open-source MPP query engine to create an efficient, performant data lakehouse analytics solution that combines the affordable scalability of a data lake with the sophisticated analytics of a data warehouse. Starburst’s architecture consists of four layers that unify enterprise data sources within a single point of access.
Data access layer: Connectors make over fifty kinds of warehouses, lakes, databases, and other enterprise data sources part of Starburst’s analytics platform.
Security and governance layer: Role-based and attribute-based access controls and other governance features enforce granular access policies to ensure compliance.
Query engine: Starburst enhances Trino with performance and cost-based optimizations.
Modeling/semantic layer: Leveraging ANSI-standard SQL, Starburst streamlines analytics and data products by providing access to materialized views, catalogs, and other data exploration resources.