Why wait for sequential processing when parallel is faster?
Cindy Ng
Sr. Manager, Content
Starburst
This guide will explain sequential and parallel processing, their differences, and why processing enterprise data depends on parallelization.
What is sequential processing?
Sequential processing is a method for executing a group of tasks, such as the statements in a software algorithm, in series, only beginning the next task after the current task is complete. Until the advent of multi-core processors, sequential computing was synonymous with general-purpose computing. Even with the rise of parallel computing, sequential coding remains commonplace.
What is the difference between sequential and parallel processing?
Whereas sequential processing executes each step in an algorithm one after the other, parallel processing executes the same steps simultaneously on different pieces of data. Each approach has strengths and weaknesses. Some tasks must happen sequentially, such as when a task depends on the output of an earlier task. However, many data operations lend themselves to parallelization.
Parallel vs. concurrent processing
Note that parallel processing and concurrent processing, while often treated synonymously, aren’t the same. Parallel processing is concurrent, but concurrent processing is not necessarily parallel. A concurrent system processes tasks around the same time. That could mean the processing happens simultaneously, but it could also happen sequentially, as when a multitasking operating system juggles commands from various applications.
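The distinction can be made concrete with a small sketch. The toy scheduler below (the function name and task lists are invented for illustration) interleaves steps from two tasks on a single "processor": the tasks overlap in time, so the system is concurrent, yet only one step ever runs at any instant, so it is not parallel.

```python
# Concurrency without parallelism: one worker interleaves two task queues
# round-robin, so the tasks overlap in time but never run simultaneously.
from collections import deque

def run_concurrently(*task_queues):
    """Interleave steps from several tasks on a single 'processor'."""
    queues = deque(q for q in (deque(t) for t in task_queues) if q)
    trace = []
    while queues:
        q = queues.popleft()
        trace.append(q.popleft())  # run exactly one step of this task
        if q:
            queues.append(q)       # task not finished: put it back in line
    return trace

trace = run_concurrently(["a1", "a2"], ["b1", "b2"])
# Steps from tasks a and b alternate: concurrent, but strictly one at a time.
```

A truly parallel system would instead run steps from both tasks in the same instant on separate processors.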
Processing as a sequence of operations
Modern desktops, laptops, and smartphones use multi-core CPUs with graphic cards that would once have been considered supercomputers in their own right. Cloud computing takes it to the next level by grouping millions of multi-core computers into a virtual pool of computing resources.
Whether running on-device or in the cloud, however, a sequential program still runs as if it were on a single computer with a single processor, working through a long series of individual operations one by one. A FOR loop that executes five statements ten million times becomes at least fifty million sequential steps. The real count is far higher, since even human-friendly syntax like SELECT FirstName, LastName FROM Customers gets translated into multiple execution steps.
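To see how quickly the steps add up, here is a minimal sketch (the step accounting is a simplification invented for illustration; real engines do far more per row) that counts the sequential operations behind a simple projection:

```python
# A naive sequential evaluation of SELECT FirstName, LastName FROM Customers:
# every row costs several discrete steps, executed strictly one after another.
def select_names(rows):
    results, steps = [], 0
    for row in rows:                           # one iteration per row
        first = row["FirstName"]; steps += 1   # step: fetch first column
        last = row["LastName"];   steps += 1   # step: fetch second column
        results.append((first, last)); steps += 1  # step: emit the row
    return results, steps

rows = [{"FirstName": "Alice", "LastName": "Smith"},
        {"FirstName": "Bob", "LastName": "Jones"}]
names, steps = select_names(rows)  # 2 rows x 3 steps = 6 sequential steps
```

Two rows already cost six steps; ten million rows cost thirty million, all executed back-to-back on one processor.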
What happens when you query a dataset?
For example, consider how a query would create a string combining the FirstName and LastName columns from a customer table:
| Row | FirstName | LastName |
|-----|-----------|----------|
| 1 | Alice | Smith |
| 2 | Bob | Jones |
| 3 | Wendy | Doe |
| 4 | Charlie | Williams |
A sequential algorithm would execute the following steps in a single processor:
Processor 1:
1. Get FirstName from row 1; get LastName from row 1; combine FirstName, “ ”, LastName into FullName; save FullName to row 1 in the data file.
2. Get FirstName from row 2; get LastName from row 2; combine FirstName, “ ”, LastName into FullName; save FullName to row 2 in the data file.
3. Get FirstName from row 3; get LastName from row 3; combine FirstName, “ ”, LastName into FullName; save FullName to row 3 in the data file.
4. Get FirstName from row 4; get LastName from row 4; combine FirstName, “ ”, LastName into FullName; save FullName to row 4 in the data file.
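The sequential algorithm above can be sketched in a few lines of Python (the function and column names mirror the table; the in-memory list of dicts stands in for the data file):

```python
# Sequential version: a single processor walks the rows strictly in order,
# combining FirstName and LastName into FullName one row at a time.
customers = [
    {"Row": 1, "FirstName": "Alice",   "LastName": "Smith"},
    {"Row": 2, "FirstName": "Bob",     "LastName": "Jones"},
    {"Row": 3, "FirstName": "Wendy",   "LastName": "Doe"},
    {"Row": 4, "FirstName": "Charlie", "LastName": "Williams"},
]

def add_full_names(rows):
    for row in rows:                          # rows handled one after another
        first = row["FirstName"]              # get FirstName from the row
        last = row["LastName"]                # get LastName from the row
        row["FullName"] = first + " " + last  # combine and save FullName
    return rows

add_full_names(customers)  # now every row carries a FullName column
```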
What is trivial with four records takes a lot of time when processing millions of records. The only way to speed up this process is to use a faster computer, but that runs into physical, thermal, and financial limits.
Parallel processing executes similar tasks simultaneously. In our example, the data selection, combination, and saving happen for every record in the table. Here’s how parallelization gets the job done in a fourth of the time by spreading the workload across four processors:
| Processor 1 | Processor 2 | Processor 3 | Processor 4 |
|---|---|---|---|
| Get FirstName from row 1; get LastName from row 1; combine FirstName, “ ”, LastName into FullName; save FullName to row 1 in the data file | Get FirstName from row 2; get LastName from row 2; combine FirstName, “ ”, LastName into FullName; save FullName to row 2 in the data file | Get FirstName from row 3; get LastName from row 3; combine FirstName, “ ”, LastName into FullName; save FullName to row 3 in the data file | Get FirstName from row 4; get LastName from row 4; combine FirstName, “ ”, LastName into FullName; save FullName to row 4 in the data file |
Of course, parallel processing does introduce synchronization overhead as the system must divide the tasks and integrate the results. However, the extra work pales compared to the time savings you get by doing the job in parallel.
Technology terms related to sequential processing
Sequential algorithm
A sequential algorithm comprises a series of tasks, instructions, or commands a process must execute in order.
Sequential execution
Sequential execution is the consecutive completion of each task in a series.
Parallel execution
Parallel execution is the simultaneous completion of a single task, or set of tasks, on multiple data elements.
Massively parallel processing query engine
A massively parallel processing query engine interprets a query statement and parallelizes it into tasks for simultaneous execution.
What is parallel processing?
Although sequential processing remains a fundamental element of computer science, parallel programming is taking on the most demanding workloads thanks to the increasing number of processors and cores in modern computers.
Data-driven enterprises depend on the insights they gain by analyzing vast data pools at petabyte scales. This kind of big data analysis would be impractically slow with sequential processing but becomes readily achievable with commodity cloud compute resources and modern query engines.
Parallelization and splits
Parallelization works by having the query engine take a single task to be applied to a large dataset and divide it into hundreds or thousands of discrete work units called splits. The query engine distributes these splits to multiple processing nodes that execute them simultaneously.
Query processing, schema-on-write, and schema-on-read
Whether processing tasks sequentially or in parallel, a query engine needs to understand the structure of each table it accesses to execute tasks effectively. The schema defines table properties like the number of columns in a table or the relationship between the table’s columns and columns in other tables. How a query uses this information to process data will depend on the schema design philosophy underpinning the data repository: schema-on-write or schema-on-read.
Schema-on-write
Engineers use schema-on-write when developing relational databases, data warehouses, and similar repositories. The schema defines the type, format, and other properties of each data column in a table. Nothing can go into the table until the schema is defined, and nothing can go in unless it complies with that schema.
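A toy illustration of this gatekeeping (the schema, function, and column names are invented for the example): every row is validated against the schema at write time, and non-conforming rows never reach storage.

```python
# Schema-on-write in miniature: the schema is fixed before any data arrives,
# and every incoming row must match it exactly or the write is rejected.
schema = {"FirstName": str, "LastName": str, "CustomerId": int}

def write_row(table, row):
    if set(row) != set(schema):
        raise ValueError("row does not match the table's columns")
    for column, expected_type in schema.items():
        if not isinstance(row[column], expected_type):
            raise ValueError(f"{column} must be of type {expected_type.__name__}")
    table.append(row)  # only schema-compliant rows are stored

table = []
write_row(table, {"FirstName": "Alice", "LastName": "Smith", "CustomerId": 1})
# write_row(table, {"FirstName": "Bob"})  # rejected: missing columns
```

In a real warehouse, the ETL pipeline performs this validation and transformation before the data ever reaches the table.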
This is why extract, transform, and load (ETL) data pipelines play such integral roles in maintaining traditional enterprise data repositories. Raw data must be transformed and structured to meet the schema requirements before becoming available for analysis — even if nobody ever uses the data.
Schema-on-write makes databases extremely reliable. All the work that goes into the ETL pipelines ensures that all the data is in the format users expect, making it easier to discover, access, and analyze.
Schema-on-read
Schema-on-read takes a different approach. Raw data flows into repositories like data lakes without any processing regardless of data structures, formats, or values. A schema only gets created when a query accesses the data. Unlike the one-schema-to-rule-them-all approach of schema-on-write, schema-on-read takes a schema-of-one approach, effectively creating as many schemas as there are queries.
Schema-on-read makes data lakes extremely flexible by leaving data in its raw state and letting each data user define the schema most appropriate for their queries.
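The schema-of-one idea can be sketched as follows (the raw records and the `query` helper are invented for illustration): raw records land in the lake untouched, and each query applies its own projection when it reads, tolerating fields the schema-on-write world would have rejected.

```python
# Schema-on-read in miniature: raw, irregular records sit in the lake as-is;
# the schema (here, just a column projection) is applied only at query time.
import json

raw_lake = [
    '{"FirstName": "Alice", "LastName": "Smith", "signup": "2021-03-04"}',
    '{"FirstName": "Bob", "LastName": "Jones", "plan": "pro"}',
]

def query(lake, columns):
    """Each query defines its own schema: keep only the requested columns."""
    for line in lake:
        record = json.loads(line)                  # raw data parsed on read
        yield {c: record.get(c) for c in columns}  # absent columns become None

names = list(query(raw_lake, ["FirstName", "LastName"]))
```

Note that the two records have different extra fields; neither was rejected or transformed at ingestion, and a different query could project entirely different columns.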
Schema-on-write vs. schema-on-read
Schema-on-write prioritizes consistency and predictability over flexibility and cost. Data consumers can complete their analyses quickly since they spend less time searching for and preprocessing data.
However, schema-on-write imposes significant costs. Data architects must spend considerable time developing a consensus schema that meets as many competing demands as possible. Changing requirements may require significant re-engineering.
Schema-on-write consumes engineering resources to maintain the ETL pipelines that enforce schema compliance. And since all data gets transformed at ingestion, the company spends much of its compute budget processing data unnecessarily.
Schema-on-read avoids these headaches. There’s no need to transform data during ingestion which conserves resources and simplifies pipeline design. Moreover, schema-on-read doesn’t impose structure on the data lake’s contents. Each query defines its own schema which makes data lakes more responsive to changing business needs.
Massively parallel processing query engine
Starburst’s open data lakehouse analytics platform builds upon the high performance and low latency of the Trino open-source project’s massively parallel processing query engine to speed up query execution on petabyte-scale datasets.
In addition, performance enhancements and cost-based optimizations allow Starburst to meet the needs of all data consumers, from the business intelligence analysts running ad hoc investigations to the data scientists developing innovative machine learning applications.