What is Apache Spark?
Lester Martin
Developer Advocate
Starburst


Any conversation about the data engineering space should begin with a discussion of what Apache Spark does and how it fits (or does not fit) into a particular organization’s data architecture plans. This post will present an overview of Spark, with a particular focus on how it relates to Starburst. It will explore this important clustering technology, show how it can coexist with Trino, and ultimately describe our initial plans for integrating Spark into Starburst products.
What is Apache Spark?
Apache Spark was initially created as a replacement for Machine Learning (ML) workloads executing on Apache Hadoop, and as such it is often associated with the Hadoop ecosystem. It can yield up to 100x better performance than Hadoop-based ML processing of extremely large-scale datasets. These gains come primarily from Spark retaining CPU and memory resources until a job completes, as well as from caching training datasets in memory for the duration of the analysis. A decade after its inception, Spark is still considered the de facto clustering technology for ML processing on large-scale datasets.
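To make that concrete, the caching described above is a single method call in Spark. Below is a minimal PySpark sketch (the file path and dataset are hypothetical) that pins a training dataset in memory so repeated passes avoid re-reading it from storage:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Load a (hypothetical) training dataset from the data lake.
features = spark.read.parquet("s3://example-bucket/training/features/")

# Pin it in cluster memory; subsequent actions reuse the cached copy
# instead of re-reading from object storage on every pass.
features.cache()

# Each pass over the full dataset now hits the in-memory cache,
# which is where much of Spark's ML speedup comes from.
print(features.count())
features.describe().show()
```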
Spark and data transformation use cases
Spark also accounts for a significant percentage of the transformation processing workloads in production today. Transformation workloads do not benefit from dataset caching the way ML processing does, but holding onto compute resources until all work is complete still offers 5-10x better performance than classical Hadoop solutions. In this respect, Trino and Spark have much more comparable performance characteristics, since neither returns and reallocates resources within a job’s execution.
Spark presents itself as a set of APIs for these popular programming languages: Python, Scala, Java, and R. Regardless of language, the primary interface is called the DataFrame API. The DataFrame Programming Guide captures examples of how programmers use this popular framework.
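As a small taste of that interface, here is a minimal PySpark sketch (file name and columns are invented for illustration) that reads, filters, and aggregates a dataset with DataFrame calls:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# Read a (hypothetical) CSV file of orders into a DataFrame.
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

# Filter, group, and aggregate with chained DataFrame transformations.
totals = (
    orders
    .filter(F.col("status") == "COMPLETE")
    .groupBy("region")
    .agg(F.sum("amount").alias("total_amount"))
)

totals.show()
```

The equivalent code in Scala, Java, or R follows the same shape, which is part of the DataFrame API’s appeal.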
Spark and Trino
It’s tempting to consider Spark and Trino together, but what are their differences and similarities?
Spark vs Trino
The most obvious difference between Spark and Trino is that Trino users primarily interact with it using standard SQL, and clients typically connect to Trino through JDBC/ODBC (xDBC) applications, including popular BI tools. Conversely, Spark users interact with it through a programming-language API.
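The contrast is easiest to see side by side. In the sketch below (hostnames, credentials, and table names are placeholders), the same aggregation is sent to Trino as plain SQL through the open-source Trino Python client and then expressed in Spark through DataFrame calls:

```python
# --- Trino: standard SQL submitted over a client connection ---
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",  # placeholder endpoint
    port=8080,
    user="analyst",
    catalog="hive",
    schema="sales",
)
cur = conn.cursor()
cur.execute("SELECT region, sum(amount) FROM orders GROUP BY region")
print(cur.fetchall())

# --- Spark: the same logic via the DataFrame API ---
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
orders = spark.table("sales.orders")  # assumes a configured catalog
orders.groupBy("region").agg(F.sum("amount")).show()
```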
How to use Spark and Trino together
Using Spark and Trino together has clear advantages: it combines the complementary strengths of these two technologies.
But how do you do it?
One possible approach is to leverage Spark for the initial streaming of raw data, persisting cleansed and standardized data into a modern table format such as Apache Iceberg. That same data architecture could then use Trino, or a Starburst environment, to build views representing aggregated values and ultimately to query this business-ready consumption tier. This is just one possible data architecture.
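A minimal sketch of that pattern follows; it assumes a Kafka topic named raw_events, a Spark session already configured with an Iceberg catalog named lake, and placeholder bucket paths:

```python
from pyspark.sql import SparkSession

# Assumes the session was launched with the Iceberg runtime and a
# catalog named "lake" configured via spark.sql.catalog.* settings.
spark = SparkSession.builder.appName("ingest-raw-events").getOrCreate()

# Spark side: stream raw events from Kafka into an Iceberg table.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka.example.com:9092")  # placeholder
    .option("subscribe", "raw_events")
    .load()
)

# A trivial stand-in for real cleansing/standardization logic.
cleansed = raw.selectExpr("CAST(value AS STRING) AS payload", "timestamp")

query = (
    cleansed.writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/raw_events")
    .toTable("lake.silver.events")
)
query.awaitTermination()

# Trino/Starburst side: a view over the business-ready tier, e.g.:
# CREATE VIEW lake.gold.events_per_hour AS
# SELECT date_trunc('hour', timestamp) AS hour, count(*) AS events
# FROM lake.silver.events
# GROUP BY 1;
```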
Data lakehouse medallion architecture and Spark
The figure below presents the industry-endorsed and technology-agnostic data lakehouse medallion architecture, where bronze holds the raw data as received, silver holds the validated and enriched data that forms the core data of a given organization, and gold holds business-optimized datasets that foster high-performance decision-making.
Much of the processing to support this architecture takes place in three domains, which are listed in the diagram above. These include:
- Ingestion – Spans the extract and load tasks of getting data from remote sources and landing it in the data lake. This includes scheduled batch activities as well as near real-time streaming approaches. The results of this type of processing land in the bronze layer.
- Transformation – A wide bucket that encompasses traditional processes focused on validating, standardizing, and enriching data to solidify the silver layer, as well as more analytical work of joining and/or aggregating data to produce a high-performance gold layer. Transformation processing can be done programmatically as well as with a SQL approach, and organization-specific logic can be introduced via User Defined Functions (UDFs) in both approaches (see the sketch after this list).
- Data Products – Curating and presenting business-relevant and performance-oriented gold layer datasets is the backbone of data products. Ideally, data products should provide UI and API access for humans and processes to discover what is available, understand usage characteristics, review usage scenarios, and more.
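To make the UDF point concrete, here is a minimal PySpark sketch (the masking rule, table, and column names are invented for illustration) that applies organization-specific logic as a UDF while promoting silver data to gold:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()

# A (hypothetical) organization-specific rule: mask all but the
# last four characters of an account number.
def mask_account(account):
    if account is None:
        return None
    return "*" * max(len(account) - 4, 0) + account[-4:]

mask_account_udf = F.udf(mask_account, StringType())

# Read from the silver layer, apply the UDF, and write to gold.
silver = spark.table("lake.silver.accounts")
gold = silver.withColumn("account_masked", mask_account_udf(F.col("account_number")))
gold.writeTo("lake.gold.accounts").createOrReplace()
```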
Spark vs Trino vs Starburst
The following table details how well Spark, Trino, and Starburst frameworks natively address each of these technical domains. This table does not show how including additional open-source and commercial frameworks and products would further address these domains.
| | Spark | Trino | Starburst [1] |
| --- | --- | --- | --- |
| Ingestion | Can read from a variety of data sources. Has a sophisticated complex-event [2] streaming framework. | Can read from a variety of data sources. | Same as Trino, plus includes Kafka and file simple-event [2] ingest frameworks. |
| Transformation | Programming-oriented transformation processing abilities, including creating sophisticated UDFs. | SQL-oriented transformation processing abilities. UDFs can be created in-stream or be added administratively from compiled Java code. | Same as Trino, plus includes a job scheduler feature to schedule SQL executions such as transformation tasks. |
| Data Products | Apache Spark has no formal approach to this. | Open source Trino has no formal approach to this. | Mature Data Products functionality available. |
[1] Starburst offers end-to-end security & governance features from ingestion to data products.
[2] Simple-event streaming/data ingestion engines process records in a mutually exclusive manner, while complex-event engines can additionally perform analysis and calculations across a wider set of records, such as calculating rolling averages or maintaining high/low values for various groupings.
Using Apache Spark in the Starburst stack
As the comparisons above suggest, Apache Spark and Starburst are complementary technologies that very often already work together in unison. Starburst’s vast array of data connectors has long been the backbone of accessing external data, but Starburst’s automated ingest frameworks are still relatively new. A plethora of ingest frameworks is available today, but it is fair to say that Spark’s Structured Streaming is a mature and successful offering for streaming sources. For non-streaming (i.e., batch) ingestion, Spark and Starburst offer similar features.
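On that batch side, the similarity is easy to illustrate. The sketch below (connection details and table names are placeholders) lands the same operational table in the lake first with Spark’s JDBC reader and then, shown as a comment, with the equivalent single SQL statement in Trino/Starburst using configured connectors:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-ingest").getOrCreate()

# Spark: pull a (hypothetical) PostgreSQL table and land it in the lake.
customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/crm")  # placeholder
    .option("dbtable", "public.customers")
    .option("user", "ingest")
    .option("password", "<redacted>")
    .load()
)
customers.write.mode("overwrite").parquet("s3://example-bucket/bronze/customers/")

# Trino/Starburst: the same batch ingest as one SQL statement, assuming
# postgresql and iceberg catalogs are configured:
# INSERT INTO iceberg.bronze.customers
# SELECT * FROM postgresql.public.customers;
```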
In data engineering, familiarity is king
This “similar features” assessment holds just as true for transformation processing. The biggest deciding factor between Starburst and Spark is the preference and familiarity of the data engineers involved. Does a team prefer to perform the majority of its transformation jobs with SQL or via a programming language? In some organizations, internal standards dictate this choice of SQL-based or programming-based data engineering.
Fortunately, Starburst Galaxy and Starburst Enterprise customers whose data engineering teams feel strongly that they need a Spark DataFrame API can run Spark alongside Starburst, either by deploying Spark themselves or by leveraging a dedicated Spark service provider.
Dell + Starburst + Spark
Dell and Starburst recently announced that Apache Spark will become part of the Dell Data Lakehouse appliance. This will provide a single software stack that includes Spark and Starburst together in a managed environment from a single vendor. Spark will support Dell’s AI ambitions by enabling the processing of unstructured data in preparation for Generative AI (GenAI) initiatives. When complete, this will represent the ultimate “Spark and Starburst together” story.
Additionally, there are other DataFrame API options that work with Starburst. You can learn more about these options from prior blog posts on PyStarburst and Ibis. Both of these solutions run their transformations in the Trino SQL engine that Starburst is powered by.
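As one illustration, the Ibis Trino backend lets you write DataFrame-style Python whose work compiles to SQL and executes inside Trino (hostname, credentials, and table names below are placeholders):

```python
import ibis

# Connect Ibis to a Trino (or Starburst) endpoint; every parameter
# here is a placeholder for a real deployment.
con = ibis.trino.connect(
    host="trino.example.com",
    port=8080,
    user="analyst",
    database="hive",   # the Trino catalog
    schema="sales",
)

orders = con.table("orders")

# This expression compiles to SQL and runs in the Trino engine;
# only the result comes back to the client.
totals = orders.group_by("region").aggregate(total=orders.amount.sum())
print(totals.execute())
```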
The case for Starburst + Spark
In summary, many organizations already run, or will want to run, Starburst and Spark together to fuel their overall architecture and meet their business needs. As you have seen, there are multiple options for these popular frameworks to coexist, and Starburst will continue to make Spark integration even more seamless over time.