Do you have a data velocity problem?
Evan Smith
Technical Content Manager
Starburst Data
We all know that the volume of data is increasing. In fact, data compiled by IDC and Statista shows the total amount of data in the world growing to 394 zettabytes by 2028, more than double today’s total. But it isn’t just the size of data that’s increasing; it’s also the velocity of that data. To understand how that issue impacts big data analytics, it’s worth diving directly into the question of data velocity.
How did we get here?
Rapid growth is taxing existing data management systems, which were not built to handle current data volumes or data velocities. This deficiency causes real problems: query performance decays as your infrastructure struggles to keep up with your datasets’ size and speed, and your team may end up spending more and more money on compute just to keep query, update, and delete operations within acceptable thresholds.
In this article, we’ll examine the two types of data velocity issues, how to distinguish them, and the best data architecture for achieving the performance you need.
What is data velocity?
Data velocity describes the speed at which data is generated, updated, or deleted. It’s an important term and one of the five Vs of big data that drive data complexity.
Data velocity rises quickly alongside data growth as businesses move toward a data-driven decision-making culture. This is especially true as companies add support for real-time scenarios.
How capturing high-speed data helps your business
For example, you may want to build a social media sentiment analysis engine, analyzing public conversations discussing your company’s product and judging whether they’re positive or negative. This is a sizable undertaking. In any given minute, Facebook users post 293,000 status updates and 510,000 comments. Meanwhile, on YouTube, users watch 4,146,600 videos every minute. Moreover, this data is updated constantly. It’s an evolving landscape that only gets more complex with time. Handling this firehose of data requires systems that can capture, store, and analyze data at a torrential rate.
Similarly, you may want to capture data from Internet of Things (IoT) devices to perform activities such as detecting anomalies and ensuring production safety in manufacturing equipment. To succeed, you’ll need to collect a constant stream of data from hundreds or thousands of sensors and analyze all of it using algorithms to determine if the devices are operating normally.
Why you might have a data velocity problem
In the past, solutions like data warehouses were built on the assumption that datasets would change infrequently, both in structure and in the underlying data itself. Often, in the early days of data, this was true. Data rarely changed its schema, and updates and deletions to records occurred predictably. These systems were built to support reporting use cases. In these scenarios, the incoming data only had to be updated periodically through batch ingestion at a regular cadence, typically once a day, week, or month.
All of this has changed with modern datasets, which draw on new data sources operating at high velocity and unprecedented scale. To make matters worse, teams often use the same architecture they built for reporting to handle this new data landscape. Many of these new workloads are write-heavy, involving high-velocity and real-time data, and reporting-era systems often can’t keep up.
This results in one of two issues:
- Your data velocity is too slow
- Your data velocity is too fast
Let’s look at these scenarios one by one.
Your data velocity is too slow
In this case, your system is largely used to handling one type of data: highly structured data stored in a data warehouse like Snowflake or Amazon Redshift, or in a Hadoop-based system such as Hive. This often prevents you from even considering real-time scenarios because you have no efficient way to capture and store the data – which keeps you stuck in the world of “slow” data.
Example of data velocity that’s too slow
For example, imagine an e-commerce site that decides it wants customers to be able to track their delivery status. Next, let’s assume this company is the scale of Amazon, which processes over 660,000 orders an hour. This scale requires gathering real-time GPS sensor data from hundreds of thousands of drivers across thousands of different delivery partners.
Even if the company is 1/10th of Amazon’s scale, it might have trouble absorbing all of this data with a traditional data warehouse solution. A solution like Hive, for example, is built to work at a folder level, which means both reads and updates slow down as your volume of data grows. Additionally, Hive doesn’t track the fine-grained metadata that more modern solutions do and use to boost CRUD performance.
After prototyping its current systems, the company realizes its data architecture isn’t built to store all this data. As a result, it abandons the project, staying stuck in the world of slow data.
Alternatively, suppose a company decides to move its sales reporting from its current format to something more dynamic, where each deal type captures data in its own structure. This requires changing the underlying data schema.
Unfortunately, this company also uses a solution like Hive, where schema changes require rewriting the entire dataset and performing manual partitioning. It can be done – but it’ll be a long and expensive project, and in the end, leaders decide the cost and effort required isn’t worth it.
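To make that cost concrete, here is a rough sketch (with hypothetical host, table, and column names) of the classic workaround: create a second table with the new layout and copy the entire dataset into it, expressed here as a CREATE TABLE AS SELECT issued through Trino’s Hive connector from Python. Every row gets rewritten, which is exactly why the project is long and expensive.

```python
import trino  # pip install trino

conn = trino.dbapi.connect(
    host="trino.example.com",  # hypothetical coordinator
    port=8080,
    user="data-eng",
    catalog="hive",
    schema="sales",
)
cur = conn.cursor()

# A Hive-style "schema change": build a new table with the desired layout and
# rewrite every existing row into it. Partitioning must be chosen manually,
# and the partition column has to come last in the select list.
cur.execute("""
    CREATE TABLE deals_v2
    WITH (format = 'ORC', partitioned_by = ARRAY['deal_type'])
    AS
    SELECT deal_id, account_id, amount, closed_at, deal_type
    FROM deals
""")
cur.fetchall()  # drain the result so the CTAS fully completes
```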
In both situations, data teams face the same problem. Fearing that their infrastructure isn’t up to the task of ingesting fast-moving data, they intentionally keep their data slow. As a result, they close themselves off to any scenarios that would require high-velocity data.
Your data velocity is too fast
Meanwhile, other teams may have a related but different problem. Often your data velocity is already high – but your architecture doesn’t enable you to capture and work with the data quickly enough. Most of this data isn’t the highly structured data of the data warehouse but rather semi-structured or unstructured data.
Teams in this situation often try to use an architecture built for batch processing to handle near real-time data scenarios requiring frequent writes and updates of large amounts of data. This includes use cases such as:
- Transportation and logistics
- Real-time fraud detection
- Internet of Things (IoT) device data acquisition (e.g., factory floor sensors)
- Real-time analytics
Example of data velocity that’s too fast
Let’s say the e-commerce company mentioned above did some testing and decided that the streaming features added to Hive over the years could deliver the performance it needed to build a real-time delivery status system on top of its existing infrastructure.
It launches the feature. It works at current data volumes. Customers love it – so much so that new customers flood the site and orders double within a month.
Suddenly, the system starts to struggle. The architecture’s limitations make it difficult to keep the solution performant. The data team can capture the data but can’t efficiently deliver it to consumers or build new solutions on top of it.
As a result, this company’s data team spends more and more time not on building new features and capabilities but on working around the architecture’s limitations to get the performance it needs. Most days, engineers find themselves in a mad scramble to keep everything in production operational.
How do you recognize if you’re getting to this point? Practically speaking, one sign that a team is in this situation is that it’s throwing more compute at the problem to keep data analysis moving efficiently. This works for a while. Eventually, however, the team hits a point where the cost of the additional compute exceeds the business value it delivers.
The best way to improve data velocity
These two data velocity issues are slightly different. However, they both have a similar solution: moving workloads to an open data lakehouse.
What to do if your data velocity is too slow
An open data lakehouse is an evolution of the data lake. It allows you to work with warehouse-like analytics at scale across different file formats and data types. Data lakehouses are particularly useful for housing structured data, unstructured data, and semi-structured data in a single location.
The open data lakehouse uses an open table format, such as Apache Iceberg, as a wrapper around your object storage, providing an intermediary layer of metadata that improves performance for both queries and updates of your data. In particular, Iceberg addresses a number of the limitations built into Hive (a brief sketch of these capabilities in action follows the list below):
- A file-based architecture that yields better overall performance and efficiency compared to Hive’s folder-based architecture. This architecture is capable of handling hundreds of petabytes of data in a single table.
- Richer metadata than Hive, including two levels of metadata filtering – manifest files and a manifest list – to improve both query and update performance.
- Changes tracked in snapshot files to enable faster updates and deletes.
- In-place schema evolution that guarantees structural changes are independent and free of side effects.
- More advanced partitioning, including hidden partitioning (so data consumers don’t need to understand data layouts to work with data efficiently) and partition layouts that can evolve over time.
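As a minimal sketch of what hidden partitioning, in-place schema evolution, and snapshot tracking look like in practice, the example below uses the Python trino client against an Iceberg catalog. The host, catalog, schema, table, and column names are hypothetical, and it assumes a Trino cluster with an Iceberg catalog already configured.

```python
import trino  # pip install trino

# Connect to a Trino coordinator (hypothetical host and catalog names).
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=8080,
    user="data-eng",
    catalog="iceberg",
    schema="logistics",
)
cur = conn.cursor()

# Hidden partitioning: the table is partitioned by day(event_time), but
# consumers simply filter on event_time; Iceberg prunes files for them.
cur.execute("""
    CREATE TABLE IF NOT EXISTS delivery_events (
        order_id   bigint,
        driver_id  bigint,
        lat        double,
        lng        double,
        event_time timestamp(6)
    )
    WITH (format = 'PARQUET', partitioning = ARRAY['day(event_time)'])
""")

# In-place schema evolution: a metadata-only change, no table rewrite.
cur.execute("ALTER TABLE delivery_events ADD COLUMN delivery_status varchar")

# Snapshot tracking: every commit is recorded and queryable.
cur.execute('SELECT snapshot_id, committed_at FROM "delivery_events$snapshots"')
print(cur.fetchall())
```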
What to do if your data velocity is too fast
There are workarounds that will enable Hive and traditional data warehouses to handle high-velocity data. However, since these work against Hive’s basic architecture, they’re often only achieved at great time and expense. That’s why most teams decide not even to go down this road and, instead, remain stuck in the world of low-velocity data.
By contrast, the open data lakehouse is built from the ground up to handle high-velocity and real-time data – and to do so without complicated or hard-to-maintain workarounds. By using an Icehouse architecture – Iceberg in conjunction with the massively scalable Trino open query engine – you can create reliable ETL data pipelines and ingest data from various real-time sources. You can then make this data available to consumers, who can rest assured that it’s accurate and up-to-date.
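As a rough sketch of that ingestion path (not Starburst’s specific implementation), the example below upserts a micro-batch of delivery status events into the Iceberg table with a single MERGE statement issued through Trino’s Python client. The staging table, column names, and host are hypothetical, and it assumes a recent Trino version with MERGE support for the Iceberg connector.

```python
import trino  # pip install trino

conn = trino.dbapi.connect(
    host="trino.example.com",  # hypothetical coordinator
    port=8080,
    user="data-eng",
    catalog="iceberg",
    schema="logistics",
)
cur = conn.cursor()

# Upsert the latest micro-batch: update events we already have, insert the
# ones we don't. Iceberg's snapshot isolation keeps readers on a consistent
# view of the table while the merge commits.
cur.execute("""
    MERGE INTO delivery_events AS t
    USING recent_delivery_events AS s
      ON t.order_id = s.order_id AND t.event_time = s.event_time
    WHEN MATCHED THEN
      UPDATE SET delivery_status = s.delivery_status, lat = s.lat, lng = s.lng
    WHEN NOT MATCHED THEN
      INSERT (order_id, driver_id, lat, lng, event_time, delivery_status)
      VALUES (s.order_id, s.driver_id, s.lat, s.lng, s.event_time, s.delivery_status)
""")
cur.fetchall()  # drain the result so the DML statement fully completes
```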
How to migrate your workloads
No matter which data velocity issue you’re facing, moving the impacted workload to an open data lakehouse is a solution that allows every member of your team—data scientists, engineers, and business decision-makers—to work with data at scale more quickly than ever before. That, in turn, gives your business a competitive advantage over companies still stuck on outdated architectures.
This doesn’t mean you have to move all of your workloads. The key is adopting a solution that gives you the flexibility to choose where and how you run your most critical data projects.
Starburst is a flexible open data lakehouse solution that combines Trino and Apache Iceberg to enable fast operations over hundreds of petabytes of data. It can integrate data from numerous data sources – including Hive – to power solutions for various data workloads. Starburst also supports data products, real-time analytics, and artificial intelligence to help you get the most out of your data.
Using Starburst, you can achieve the balance you need between cost and performance. For example, using Starburst Galaxy, you can use cluster autoscaling to achieve the mix you need without manual fine-tuning.
To learn more about what a Trino + Iceberg architecture can do for your data team, read about why the Icehouse exists and the problems it’s designed to solve.