Last Updated: 2024-01-28
Finding the correct cluster size is an exercise in balance between the improved performance of larger clusters and the lower costs of smaller ones. However, determining the optimal cluster size quickly becomes complicated owing to variability in the number, type, and size of queries running on the organization's clusters.
In such cases, data architects and data engineers must either construct clusters large enough to handle periods of high activity or accept that queries will take longer to run during those periods.
To address this challenge, Starburst Galaxy offers cluster autoscaling. This feature allows clusters to scale dynamically and automatically according to the current query load. With cluster autoscaling, you choose a range of cluster sizes that suits your needs, and the system scales within that range as conditions change. This results in both cost and time savings.
In this tutorial, you will learn how to use non-disruptive cluster autoscaling capabilities to automatically adjust the size of your Starburst Galaxy cluster based on workload demands. You will also explore the manual options. This will allow you to optimize resource utilization and enhance scalability.
You need a Starburst Galaxy account to complete this tutorial. Please be sure to complete the tutorial titled Starburst Galaxy: Getting started before attempting this tutorial.
Upon successful completion of this tutorial, you will be able to:
Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.
As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.
You are part of a team responsible for managing the infrastructure of Tharsis, a popular e-commerce business that uses Starburst Galaxy as its data lake analytics platform. During peak shopping seasons, such as Black Friday or holiday sales, the platform experiences a significant surge in traffic and transactions.
To ensure a seamless shopping experience for customers and maintain high-performance levels, you need to implement cluster autoscaling to dynamically scale the infrastructure based on workload demands. Help your team by testing out the different scaling methods in Starburst Galaxy.
By engaging in this real-world scenario, you will gain practical experience in implementing autoscaling and manual scaling for a Starburst Galaxy cluster.
Starburst Galaxy's approach to cluster scaling is non-disruptive, increasing and decreasing resources as needed without shutting down the cluster.
These include the following three methods:
The first way to enable non-disruptive cluster scaling is through cluster autoscaling. This is by far the most automatic option. It occurs in two ways:
Starburst Galaxy identifies when a cluster requires extra resources by measuring the CPU usage of all workers in the cluster.
If this combined CPU usage exceeds 60%, workers are added one at a time, with a new worker added every 4 minutes, until usage drops below the 60% threshold. If CPU usage later climbs above 60% again, the process repeats until the MAX number of workers is reached.
The process also works in reverse. Starburst Galaxy monitors clusters to determine whether CPU usage across all workers consistently drops below the 60% threshold.
In such cases, workers will be removed from the cluster one-by-one, until the usage increases. Again, this process occurs iteratively, and the process to remove each worker takes approximately 15 minutes.
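The scale-up and scale-down rules described above can be sketched as a small decision function. This is only an illustration of the threshold logic as the tutorial describes it, not Starburst Galaxy's actual implementation. The names `ClusterState` and `next_worker_count` are hypothetical, and the sketch ignores the timing between steps and the requirement that CPU usage stay below the threshold consistently before a worker is removed.

```python
from dataclasses import dataclass

CPU_THRESHOLD = 0.60  # the 60% threshold quoted in this tutorial


@dataclass
class ClusterState:
    """Hypothetical snapshot of a cluster's size and its configured limits."""
    workers: int
    min_workers: int
    max_workers: int


def next_worker_count(state: ClusterState, avg_cpu: float) -> int:
    """Return the worker count after one (simplified) autoscaling evaluation."""
    if avg_cpu > CPU_THRESHOLD and state.workers < state.max_workers:
        return state.workers + 1  # add one worker, never exceeding MAX
    if avg_cpu < CPU_THRESHOLD and state.workers > state.min_workers:
        return state.workers - 1  # remove one worker, never going below MIN
    return state.workers  # at the threshold or at a limit: no change


# Walk a cluster with MIN=1 and MAX=4 through a rise and fall in CPU usage.
state = ClusterState(workers=1, min_workers=1, max_workers=4)
for cpu in [0.95, 0.80, 0.55, 0.20]:
    state.workers = next_worker_count(state, cpu)
    print(f"cpu={cpu:.2f} -> workers={state.workers}")
```

Each call models a single evaluation, so a sustained spike adds workers one step at a time rather than jumping straight to the maximum.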
Starburst Galaxy also allows for manual cluster scaling. Similar to autoscaling, this process is non-disruptive. Resources can be added or removed in running clusters.
Clusters are configured with a MAX and MIN value. These values can be manually updated to scale resources up or down. In each case, the cluster will be updated dynamically, with the new resources either added or removed.
The MIN value controls the minimum number of resources allocated to a cluster.
The MAX value controls the maximum number of resources allocated to a cluster.
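One way to picture the effect of editing MIN and MAX is as a clamp on the current worker count. The function below is a hypothetical sketch of that idea, not a Starburst Galaxy API. It assumes the cluster moves toward the nearest value inside the new range.

```python
def target_workers(current: int, min_workers: int, max_workers: int) -> int:
    """Clamp the current worker count into the configured [MIN, MAX] range."""
    if min_workers > max_workers:
        raise ValueError("MIN must not be greater than MAX")
    return max(min_workers, min(current, max_workers))


# Raising MIN from 1 to 3 on a one-worker cluster implies growing to 3 workers.
print(target_workers(current=1, min_workers=3, max_workers=4))

# Lowering MAX to 2 on a four-worker cluster implies shrinking to 2 workers.
print(target_workers(current=4, min_workers=1, max_workers=2))
```

Under this model, setting MIN and MAX to the same value pins the cluster at a fixed size.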
Starburst Galaxy also allows running clusters to dynamically shift between the free tier and other tiers.
To facilitate this transition, the following process takes place:
Cluster scaling requires the creation of a new cluster with a different resource allocation. You're going to create this new cluster, and then test it using autoscaling.
Watch the following video for a run-through of the tutorial. In the next sections, you will complete the steps on your own.
You're going to begin by signing in to Starburst Galaxy and setting your role before you begin working with autoscaling.
This is a quick step, but an important one.
Sign into Starburst Galaxy in the usual way. If you have not already set up an account, you can do that here.
Starburst Galaxy separates users by role. Creating and configuring a new cluster requires access to a role with appropriate privileges. Today, you'll be using the accountadmin role.
Your current role is listed in the top right-hand corner of the screen.
In this section, you will create a new test cluster that is capable of autoscaling. You have determined that during peak usage, your platform requires at least four workers to handle the cluster workload. Therefore, you will set the range of workers between 1 and 4.
In the next section, you will run several simultaneous queries. This will force the workers to scale up, enabling you to see autoscaling in action.
Next, you need to configure the new cluster. You'll need to ensure that the correct minimum and maximum cluster size is selected to facilitate your autoscaling tests later in this tutorial.
Use the following configuration.
Note: You will be using the tpch catalog for testing.
Cluster name: aws-us-east-2-autoscale
Region: us-east-2
Clusters take several minutes to initialize. While the cluster is being created, its status will be set to Starting.
In this section, you will execute SQL to force your cluster to scale up and down. You will monitor the cluster while this is happening.
Watch the video below, then complete the steps on your own.
Ok, your cluster is created! It's time to look inside it to see what's going on.
Starburst Galaxy includes a Cluster Overview section to view cluster activity. This includes information about the Min and Max cluster scaling.
Click the ellipses for the aws-us-east-2-autoscale cluster to display the cluster menu.
It's time to stress your cluster out by adding a large workload. This will invoke the conditions needed to see autoscaling in action.
To do this, you are going to simultaneously run the same query in three Query editor tabs. This will create a lot of sudden work for the cluster, which will force autoscaling.
aws-us-east-2-autoscale cluster.
It's time to run the first query. Remember that you're going to run this same query three times simultaneously in three separate query tabs to create a spike in workload.
To do this, you'll need to complete this step and the two next steps in quick succession.
SELECT SUM(quantity) FROM tpch.sf100000.lineitem;
Quickly run the second query in a second query tab. This will cause both queries to run simultaneously.
Select the aws-us-east-2-autoscale cluster from the cluster drop-down menu.
SELECT SUM(quantity) FROM tpch.sf100000.lineitem;
Now it's time to add the third query in a third query tab. Running all three queries simultaneously will cause a spike in workload, triggering autoscaling.
Select the aws-us-east-2-autoscale cluster from the cluster drop-down menu.
SELECT SUM(quantity) FROM tpch.sf100000.lineitem;
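The tutorial submits the three queries by hand in separate Query editor tabs. For readers who would rather script the same spike, the following sketch shows one way to submit identical statements at the same time using Python's standard library. The `run_query` callable is an assumption: in practice it would wrap whatever SQL client you use to connect to your cluster, which is not shown here. The `fake_run_query` stand-in exists only so the sketch runs without a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# The same statement used in the three Query editor tabs.
QUERY = "SELECT SUM(quantity) FROM tpch.sf100000.lineitem"


def run_concurrently(
    run_query: Callable[[str], object], sql: str, copies: int = 3
) -> List[object]:
    """Submit the same SQL `copies` times at once; return results in submission order."""
    with ThreadPoolExecutor(max_workers=copies) as pool:
        futures = [pool.submit(run_query, sql) for _ in range(copies)]
        return [future.result() for future in futures]


def fake_run_query(sql: str) -> str:
    """Stand-in for a real client call; it only echoes the SQL it was given."""
    return f"ran: {sql}"


if __name__ == "__main__":
    for line in run_concurrently(fake_run_query, QUERY):
        print(line)
```

Using one thread per copy keeps the submissions close together in time, which is the point of the exercise: the cluster should see all three queries as a single burst.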
You are going to watch as autoscaling occurs, adding an extra worker to compensate for the CPU load caused by the three simultaneous SQL statements.
It takes between 4 and 5 minutes to add the second worker.
The three SQL statements will cause the CPU on the single worker to reach 60% very quickly. It takes between 4 and 5 minutes to see a second worker because that is the time it takes a cloud vendor to both respond to a provisioning request and complete the hardware validation checks to ensure the worker is properly online.
You've seen how resources can be increased. Now it's time to see them decrease.
To do this, you are going to reduce the workload for the cluster below the 60% threshold, triggering the shutdown of the extra worker that was added in the previous step.
You've created the conditions for the cluster to begin an automatic scaledown. It will take between 15 and 20 minutes for this process to complete. As noted at the beginning of this tutorial, scaling down takes longer than scaling up to prevent the unwanted behavior known as flapping.
All scaledown events allow for the non-disruptive shutdown of workers. If this were a production cluster running SQL, it could take longer to scale down if SQL fragments were running across all workers.
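The timings quoted in this tutorial can be turned into a rough estimate of how long a scaling change might take. The sketch below assumes that workers are added or removed strictly one at a time, so the per-worker durations simply add up. That is a simplification, and real times depend on the cloud vendor and on any queries still running. The function name `estimate_scaling_minutes` is hypothetical.

```python
from typing import Tuple

# Per-worker ranges, in minutes, as quoted in this tutorial.
SCALE_UP_MINUTES = (4, 5)
SCALE_DOWN_MINUTES = (15, 20)


def estimate_scaling_minutes(current: int, target: int) -> Tuple[int, int]:
    """Return a (low, high) estimate in minutes for moving between worker counts."""
    steps = abs(target - current)
    low, high = SCALE_UP_MINUTES if target > current else SCALE_DOWN_MINUTES
    return steps * low, steps * high


# Growing from 1 to 4 workers takes three scale-up steps.
print(estimate_scaling_minutes(1, 4))

# Shrinking from 2 workers to 1 takes a single scale-down step.
print(estimate_scaling_minutes(2, 1))
```

The asymmetry between the two ranges reflects the tutorial's point that scaling down is deliberately slower than scaling up.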
It's time to turn your attention to manual cluster scaling. Like autoscaling, this process is also non-disruptive. Unlike autoscaling, there is more direct involvement in the manual process.
To test this, you are going to monitor your cluster while you manually scale it up and down.
Watch the video below to see the process demonstrated, then complete the following steps on your own.
You're going to use the same cluster that you created earlier in this tutorial to test manual scaling.
aws-us-east-2-autoscale cluster.
Time to test manual scaling by forcing a manual increase in the minimum number of workers.
Although there are no queries running, the cluster will add workers because you increased the minimum number manually. This differs from autoscaling, which takes into account cluster workload.
As before, it takes between 4 and 5 minutes for the new worker(s) to be added.
Time to try the process in reverse, reducing the minimum number of workers.
As before, this process begins by editing the cluster.
aws-us-east-2-autoscale cluster.
When you manually adjust the min setting on the cluster configuration, the cluster will work to scale down in a non-disruptive manner.
All scaledown events allow for the graceful shutdown of workers. If this were a production cluster running SQL, it could take longer to scale down if SQL fragments were running across all workers.
The changes you've made to the cluster will be reflected in the Cluster Overview dashboard.
Remember that scaledown takes longer to execute than scaleup, so it will take 30 minutes or more for the changes to take effect.
Starburst Galaxy allows non-disruptive cluster scaling between cluster sizes.
To test this, you will set the min and max to the same value so that the only scaling that will occur is the manual scaling you are invoking by changing the cluster size.
You will use the same cluster from earlier in this tutorial, and edit it to invoke scaling of cluster size.
aws-us-east-2-autoscale cluster.
You're going to begin by manually changing the cluster size to Medium.
It takes between 4 and 5 minutes for the new worker(s) to be added.
Now you're going to go in the opposite direction, switching from the Medium tier to the Free tier.
Again, you will need to edit the cluster to make this change. In the Starburst Galaxy tab, click the ellipses to view the cluster menu for the aws-us-east-2-autoscale cluster.
Switching between a Medium cluster and a free one requires a new cluster to be created with a different number of workers and different associated costs.
In such cases, the old cluster will remain running for up to four hours, allowing time for any queries already running to complete.
Time to flip back to the Cluster Overview to see the changes take effect.
You will notice 1 worker as soon as the new Free-Tier cluster is started.
Now that you are done with testing, you can delete the cluster you created. This will help ensure that you don't incur any unexpected costs.
aws-us-east-2-autoscale cluster.
Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.
Now that you've completed this tutorial, you should have a better understanding of non-disruptive cluster scaling with Starburst Galaxy.
At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.
Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.
Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!