Compute Cost Best Practices: How to Optimize Data Costs Across All Architectures

  • Evan Smith, Technical Content Manager, Starburst Data

Rising compute costs are often the unwanted price of success in the data world. Usage-based pricing models mean that the more data you use, the more it costs you.

But what if you want to reduce your compute costs as much as possible?  While it is not possible to eliminate cloud compute costs entirely, there are ways to get the most for your money by reviewing your data architecture and optimizing it for cost. 

This article will help you unpack the best strategies for reducing your compute costs and maximizing the benefits of the cloud data lakehouse architecture, so you can take advantage of emerging competition in compute.

What you need to know about cloud compute costs

To understand cloud costs, it’s important to understand the cloud computing ecosystem itself, because any savings you achieve ultimately come from how you provision and use cloud resources.

Unpacking the cloud

As far as data analytics is concerned, the cloud services market is dominated by three players:

  • Amazon Web Services (AWS)
  • Microsoft Azure
  • Google Cloud Platform (GCP)

These three cloud vendors compete for the majority of cloud storage and compute workloads. Because most costs in the cloud ultimately derive from the pricing models of these platforms, understanding these costs will help you understand your total bill. 

Separation of storage and compute saves money

One of the architectural selling points of cloud computing is the ability to scale storage and compute resources independently. This allows you to adjust compute as needed, fine-tuning your precise allocation of resources.
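To make this concrete, here is a minimal back-of-the-envelope sketch in Python. The unit prices are placeholders rather than real vendor rates; the point is that storage is billed continuously while decoupled compute is billed only for the hours it runs.

```python
# Illustrative cost model showing why decoupled compute saves money.
# All unit prices are placeholders, not real vendor rates.

STORAGE_PRICE_PER_TB_MONTH = 23.00   # object storage, $/TB-month (placeholder)
COMPUTE_PRICE_PER_NODE_HOUR = 0.50   # one worker node, $/hour (placeholder)

def monthly_cost(storage_tb: float, nodes: int, hours_per_day: float) -> float:
    """Storage is billed continuously; compute is billed only while it runs."""
    storage = storage_tb * STORAGE_PRICE_PER_TB_MONTH
    compute = nodes * COMPUTE_PRICE_PER_NODE_HOUR * hours_per_day * 30
    return storage + compute

# Compute that cannot scale down independently runs 24/7:
print(monthly_cost(storage_tb=100, nodes=10, hours_per_day=24))  # 5900.0
# The same data, with compute running only an 8-hour business day:
print(monthly_cost(storage_tb=100, nodes=10, hours_per_day=8))   # 3500.0
```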

Retaking control of your cloud compute

Although usage-based pricing can drive up costs, scaling workloads properly can help you push them down again. This general approach is the best solution for optimizing your data costs across all architectures. To be successful, you need to analyze your usage, identify areas where you have allocated too many resources, and reduce these resources accordingly. 

How should you get started? The following strategies outline 4 practical ways to reduce your cloud computing costs. 

4 practical ways to reduce your compute costs

Compute costs scale quickly, but you still have some control over them. Because compute pricing is usage-based, a good starting point is to find areas where you are paying for compute you don’t actually need.

We can summarize this approach in 4 actionable tips:

1) Review your cloud data architecture holistically

Consider your compute costs holistically and determine your total cost of ownership. Is your current strategy less expensive overall than the alternatives? 

Chances are, your cloud data architecture falls into one of the categories below, and each of these generates compute costs at a different rate. To help get a handle on this, you should periodically review your architecture to ensure that it provides the best combination of cost and performance for your needs.

Cloud data warehouse

This is the most traditional and compute-hungry of the data architectures. Data warehouses typically require ETL on data entering the warehouse using a schema-on-write process. This ratchets up your compute costs, even for data you may not ultimately need or use. Once ingestion is complete, your compute costs continue: every query, whether recurring or ad hoc, adds to the bill. Data warehouses exist to run queries, so you can expect high query costs on top of your ETL costs.

For this reason, the cloud data warehouse is often the most expensive data architecture available, particularly if you are transforming a lot of data and running a lot of recurring and ad hoc queries. 

Cloud data lake

Data lakes are designed to offer a lower-cost alternative to data warehouses. They store large amounts of data cheaply using object storage. Data is stored in its raw state, following a schema-on-read approach: transformation is performed only when the data is needed. This saves on compute costs by targeting resources at the data you actually use.

One area where data lakes are less efficient for compute is Hive. The Apache Hive table format is used in non-lakehouse data lakes. Compared to modern table formats like Apache Iceberg, Delta Lake, or Hudi, Hive collects far less metadata about individual files when they are updated or deleted, so updating records requires entire folders to be rewritten. This adds to storage costs and drives up compute.

For this reason, data lakes do not represent the most efficient way to use compute resources when compared to data lakehouses. 

Cloud data lakehouse

Data lakehouses are built on the same object storage used in data lakes but use one of three modern table formats instead of Hive: 

  • Apache Iceberg
  • Delta Lake 
  • Apache Hudi

Each of these table formats differs in the details, but they all share the ability to update and delete individual records, making them ACID-compliant, along with advanced mechanisms like partitioning, partition evolution, and time travel.
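As an illustration, here is a hedged sketch of what record-level updates and time travel look like on an Iceberg table queried through the Trino Python client (pip install trino). The hostname, catalog, schema, and table names are hypothetical; adjust them for your own Starburst or Trino deployment.

```python
# Sketch: record-level updates and time travel on an Iceberg table via Trino.
import trino

conn = trino.dbapi.connect(
    host="starburst.example.com",  # hypothetical coordinator address
    port=8080,
    user="analyst",
    catalog="iceberg",
    schema="sales",
)
cur = conn.cursor()

# Record-level update: only the affected data files are rewritten,
# not whole folders as with Hive tables.
cur.execute("UPDATE orders SET status = 'shipped' WHERE order_id = 1234")

# Time travel: query the table as it existed at an earlier point in time.
cur.execute(
    "SELECT count(*) FROM orders "
    "FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC'"
)
print(cur.fetchone())
```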

From a compute costs perspective, data lakehouses are the most judicious in their use of resources. Because changes in the state of the dataset are tracked, compute workloads are always run off the latest version, vastly improving efficiency. 

Data lakehouses can optimize compute costs due to their efficient use of resources and advanced table formats. For this reason, if you are reassessing your architecture, it is highly recommended that you consider a data lakehouse, particularly one deploying an Icehouse architecture using Apache Iceberg. 

2) Eliminate over-provisioning

Over-provisioning happens when you allocate more compute resources than you need to complete a given query or workload in the desired time. To reduce costs, treat right-sizing as an experiment (a minimal code sketch follows the steps below): 

  1. Begin with a hypothesis: You are paying for more compute than you need. 
  2. Proceed to a test: Reduce compute by a moderate amount and observe the effect. 
  3. Gather results: For example, reducing compute caused no noticeable increase in query time, but a large reduction in cost. 
  4. Repeat: Keep reducing compute until you notice negative effects that outweigh the benefits of the reduction. 
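Here is a minimal sketch of that experiment loop in Python. The resize_cluster and run_benchmark_queries helpers are stubs standing in for whatever scaling API and representative query workload your own platform provides.

```python
# Sketch of the right-sizing experiment, with stubbed-out helpers.

def resize_cluster(nodes: int) -> None:
    print(f"resizing cluster to {nodes} nodes")  # stub: call your scaling API

def run_benchmark_queries() -> float:
    return 42.0  # stub: wall-clock seconds for a representative workload

def find_right_size(current_nodes: int, baseline_seconds: float,
                    tolerance: float = 1.10) -> int:
    """Step compute down until query time degrades past the tolerance."""
    nodes = current_nodes
    while nodes > 1:
        candidate = nodes - 1
        resize_cluster(candidate)
        elapsed = run_benchmark_queries()
        # Stop when the slowdown outweighs the savings; revert the last step.
        if elapsed > baseline_seconds * tolerance:
            resize_cluster(nodes)
            break
        nodes = candidate
    return nodes

print(find_right_size(current_nodes=10, baseline_seconds=40.0))
```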

Shut down unneeded running services

Another part of the over-provisioning problem is services that keep running after they are no longer needed. This might happen because you neglected to shut a service off, or because you are keeping it running in anticipation of a future need that has not materialized. In either case, the solution is the same: audit each running service, identify any that have outlived their purpose, and shut down those that do not make the cut. Consider a regular cadence of review, for instance quarterly, semi-annual, or annual. 
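As a starting point for such an audit, here is a hedged sketch that flags running AWS EC2 instances with consistently low CPU utilization as shutdown candidates. It assumes AWS credentials are already configured, and the idleness threshold is illustrative.

```python
# Sketch: flag running EC2 instances that were nearly idle all week.
# (pip install boto3)
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

reservations = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)["Reservations"]

for reservation in reservations:
    for instance in reservation["Instances"]:
        instance_id = instance["InstanceId"]
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
            StartTime=now - timedelta(days=7),
            EndTime=now,
            Period=86400,          # daily averages over the past week
            Statistics=["Average"],
        )
        averages = [point["Average"] for point in stats["Datapoints"]]
        if averages and max(averages) < 5.0:  # illustrative threshold
            print(f"{instance_id}: avg CPU under 5% all week -- shut down?")
```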

Identify and reduce redundant storage

Compute and storage costs are intertwined. As you store more data and expose that data to queries, your compute costs increase. Often this increase is desirable, but you should regularly review whether the data you store is actually needed and whether queries need to include it. You might be storing data you don’t need and exposing it to queries by mistake, forcing your compute resources to work harder than necessary to return results. 

To fix this, audit your datasets and the queries that touch them. Regularly reviewing your data and trimming redundant storage lowers the workload for your compute resources, saving you money. 
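One simple way to begin, sketched below under the assumption that your data lives in S3, is to total up objects that have not been modified in over a year as candidates for archival or deletion. The bucket name is hypothetical.

```python
# Sketch: find S3 objects untouched for over a year.
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3")
cutoff = datetime.now(timezone.utc) - timedelta(days=365)

paginator = s3.get_paginator("list_objects_v2")
stale_bytes = 0
for page in paginator.paginate(Bucket="my-data-lake"):  # hypothetical bucket
    for obj in page.get("Contents", []):
        if obj["LastModified"] < cutoff:
            stale_bytes += obj["Size"]
            print(f"stale: {obj['Key']} ({obj['Size']} bytes)")

print(f"total stale data: {stale_bytes / 1e9:.1f} GB")
```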

3) Remove unnecessary components

Different tools are designed for different scenarios, workflows, and user types. To help pick the best data stack for your own needs, you need to begin by understanding your workflow from beginning to end. 

To find these inefficiencies, review your needs end to end. Try to understand the ideal user of each tool, and replace any tools that are not designed for your workflows. 

Also review your storage strategy and identify any areas where data is copied more times than it needs to be. Eliminating these copies will save you money on storage costs and reduce your compute costs too. 
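As one hedged example of such a review, the sketch below flags S3 objects that are likely byte-for-byte copies of each other by comparing size and ETag (for single-part uploads, the ETag is an MD5 of the content). The bucket name is hypothetical.

```python
# Sketch: group S3 objects by (ETag, size) to spot likely duplicates.
from collections import defaultdict
import boto3

s3 = boto3.client("s3")
copies = defaultdict(list)

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-data-lake"):  # hypothetical bucket
    for obj in page.get("Contents", []):
        copies[(obj["ETag"], obj["Size"])].append(obj["Key"])

for (etag, size), keys in copies.items():
    if len(keys) > 1:
        print(f"{len(keys)} likely copies of the same {size}-byte object: {keys}")
```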

4) Consider pricing models

Let’s say you’ve already identified the obvious ways you can reduce your compute costs, but your costs still seem too high. 

What else can you do? One approach is to review compute pricing models associated with the tools you use, and understand how those charges are calculated. 

For example, consider the pricing models below for Elasticsearch and AWS Athena. 

Elasticsearch indexing charges

If you use Elasticsearch, you should review the costs associated with indexing. Because indexing consumes compute on every write, it can be a silent source of compute cost increases that might otherwise go unnoticed. 

To fix this, review your indexing activity and the charges behind it, and assess whether they fit within your budget. 
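As a hedged starting point, the sketch below pulls cluster-wide indexing statistics with the official Elasticsearch Python client so you can see how much work indexing is actually doing. The endpoint is hypothetical, and a real cluster will also need authentication.

```python
# Sketch: inspect cluster-wide indexing activity.
# (pip install elasticsearch)
from elasticsearch import Elasticsearch

es = Elasticsearch("https://es.example.com:9200")  # hypothetical endpoint

stats = es.indices.stats(metric="indexing")
indexing = stats["_all"]["total"]["indexing"]
print(f"documents indexed: {indexing['index_total']}")
print(f"time spent indexing: {indexing['index_time_in_millis'] / 1000:.0f} s")
```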

AWS Athena data scanning charges

AWS Athena charges by the amount of data each query scans. This can also be a silent source of compute cost increases for Athena users. If your Athena bills are larger than you expect, take a look at your scanning charges and assess whether alternatives to Athena might solve the problem. 
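To make those charges visible, here is a minimal sketch that estimates what a single Athena query cost based on the bytes it scanned. The query execution ID is hypothetical, and the $5-per-TB figure is Athena’s commonly cited rate; confirm it against current AWS pricing.

```python
# Sketch: estimate the cost of one Athena query from its scan statistics.
# (pip install boto3)
import boto3

athena = boto3.client("athena")
PRICE_PER_TB_SCANNED = 5.00  # commonly cited rate; verify current pricing

execution = athena.get_query_execution(
    QueryExecutionId="11111111-2222-3333-4444-555555555555"  # hypothetical ID
)
scanned_bytes = execution["QueryExecution"]["Statistics"]["DataScannedInBytes"]
cost = scanned_bytes / 1e12 * PRICE_PER_TB_SCANNED
print(f"scanned {scanned_bytes / 1e9:.2f} GB, approx ${cost:.4f}")
```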

Use Starburst to help reduce your compute costs

Starburst is designed to give you control over your data stack, and that extends to managing compute costs. It’s part of our approach to giving you options when it comes to using your data. 

If you want to know more about cloud data lakehouses, especially those using an Icehouse architecture, Starburst has a lot of resources to help get you started.