Last Updated: 2024-07-11
When using Starburst Galaxy to run a query, only the necessary data for the query is read from the query engine. The coordinator then divides the workload into splits, and allocates them to worker nodes. Collectively, workers execute the workload by processing, in parallel, their assigned splits and only keeping in memory the data that is not yet ready to send to the next stage. This approach increases performance compared to other frameworks that spill this intermediary data to disk.
However, some queries require a large amount of memory. In such cases, problems can arise if individual workers run out of memory. A similar situation can happen if a worker is disconnected from the group due to a network error. This situation often leads to query failures because the scheduled resources prove unavailable when the time comes to execute the workload.
To address this issue, Starburst has introduced a fault-tolerant execution (FTE) feature that enhances the resilience of Starburst platforms when handling memory-intensive workloads. This model allows intermediary data to be stored into an immensely scalable persistence layer, typically by pushing this data to disk that would normally be held only in memory. This prevents the out of memory conditions and saves the query from failure.
Additionally, fault-tolerant query execution allows for failed tasks to be reprocessed. This allows long-running workloads to complete without worrying about query failures, even with limited hardware resources. This means that failed workers or network errors no longer result in query failure.
In this tutorial, you will learn how to enable fault-tolerant execution mode on a cluster. You will also have the chance to observe FTE in action and compare it to a cluster without FTE enabled.
You need a Starburst Galaxy account to complete this tutorial. Please be sure to complete the tutorial titled Starburst Galaxy: Getting started before attempting this tutorial.
Upon successful completion of this tutorial, you will be able to:
Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.
As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.
Chryse Corp. is a large retail organization that heavily relies on data analytics to drive business decisions. They have implemented Starburst Galaxy as their data analytics platform to query and analyze vast amounts of data from various sources. To ensure uninterrupted query execution and prevent costly disruptions, Chryse Corp. has decided to leverage Starburst's fault-tolerant execution mode. This mode enhances resilience by reprocessing failed tasks and ensures that long-running queries can be executed without the fear of query failures, regardless of the hardware limitations.
Help the Chryse Corp. team test the fault-tolerant execution feature by creating two clusters - one with FTE mode enabled and one without. Run a resource-intensive query on both clusters to compare the results.
To get started, you need to create a new cluster with fault-tolerant execution enabled. This process will be similar to the creation of other clusters that do not have fault-tolerance engaged.
Begin by creating a new cluster. This follows the usual cluster creation process seen in other tutorials.
You're going to start by giving the new cluster a name. This should be something descriptive of its use and location.
fte-aws-us-east-2
, but you are free to choose a different name if you'd like.Now it's time to choose which catalogs to include in the cluster. For this tutorial, the only catalog you need for testing purposes is the tpch
catalog.
tpch
. Next, you'll need to add the remaining cluster details to set up the new cluster. We recommend using the following configuration.
It is considered best practice to confirm that the new cluster has started before proceeding.
The cluster will start up automatically, but can take 3 to 5 minutes to start.
Now it's time to create a standard cluster without FTE enabled. This cluster will be used for comparison purposes in the next section.
Begin by creating a new cluster, this will become the non-FTE cluster.
Now it's time to provide a cluster name. This will be used to identify the cluster, so it should be descriptive.
no-fte-aws-us-east-2
, but you are free to choose a different name if you'd like.Next, you'll need to add the necessary catalogs to the cluster. For this tutorial, the only catalog you need is tpch
.
tpch
. Finally, you'll need to add the remaining cluster details to set up the new non-FTE cluster. We recommend using the following configuration.
Just as you did for the FTE cluster, it is considered best practice to confirm that the new cluster has started before proceeding.
Now it's time to test out the differences between FTE and non-FTE clusters so you can see the difference first hand.
To do this, you'll run the same query on each cluster and then compare the results.
Let's get going!
You're going to launch your query from the query editor.
no-fte-aws-us-east-2.
Next, you need to enter a suitable query that will be run on the non-FTE cluster, and then again on the FTE cluster. The query below will fit this role perfectly because it requires a lot of memory to execute. This is exactly the scenario that FTE is designed to assist with, so we would expect this non-FTE cluster to encounter problems.
EXPLAIN ANALYZE SELECT
nation,
o_year,
SUM(amount) AS sum_profit
FROM (
SELECT
n.name AS nation,
extract(YEAR FROM o.orderdate) AS o_year,
l.extendedprice * (1 - l.discount) - ps.supplycost * l.quantity AS amount
FROM
tpch.sf1000.part AS p,
tpch.sf1000.supplier AS s,
tpch.sf1000.lineitem AS l,
tpch.sf1000.partsupp AS ps,
tpch.sf1000.orders AS o,
tpch.sf1000.nation AS n
WHERE
s.suppkey = l.suppkey
AND ps.suppkey = l.suppkey
AND ps.partkey = l.partkey
AND p.partkey = l.partkey
AND o.orderkey = l.orderkey
AND s.nationkey = n.nationkey
AND p.name LIKE '%green%'
) AS profit
GROUP BY
nation,
o_year
ORDER BY
nation,
o_year DESC;
Once the query has been executed, you'll see that it does indeed encounter a problem. Specifically, it runs out of available memory. The error you get should be similar to the image below.
Starburst Galaxy allows you to view more details about queries. This is used to troubleshoot scenarios like this one, where a query failed, or to help analyze query performance.
You'll need to navigate to the query details section to view this information.
Now it's time to run the same query on the FTE cluster and see the difference.
fte-bootcamp-aws-us-east-2
.Your testing is complete, and you can now delete your clusters so that you don't incur any unnecessary charges.
Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.
Now that you've completed this tutorial, you should have a better understanding of how and when to use fault-tolerant execution mode for a cluster in Starburst Galaxy.
At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.
Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.
Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!