Presto & Data Science: Getting Data Into the Hands of Data Scientists (Faster)

A few days ago I read a Gartner report stating that data scientists spend 23% of their time on data collection and preparation. I think that’s low. At my previous company I specialized in ETL, and based on that experience, and the conversations I’ve had with the data scientists at some of our Starburst customers, I’d guess it’s closer to 50%.

But let’s say that 23% figure is accurate. That’s still roughly two hours of an eight-hour day spent on data wrangling. This is not the kind of work data scientists envisioned when they signed up. They want to spend their time on data analytics, building machine learning models and running regressions, not cleansing data. You don’t spend years working on your PhD in mathematics only to spend hours every day reformatting rows and columns of data. That’s like asking an emergency room doctor to spend a quarter of her day organizing shelves of medical supplies.

R, Python, and Presto

This isn’t just a problem for data scientists. At a higher level, the business loses. If your highly trained data scientists can’t access the data they need to generate results, you’re tanking their efficiency.

Generally, data scientists are trained in, and work in, languages like R and Python, which are well suited to complex mathematical work. So how can a high-performance SQL query engine like Presto help?

I realize that, as a data scientist, you may shy away from SQL, either because it isn’t suited to your needs or because it isn’t part of your standard skill set. But one of the fundamental issues here is that you don’t have fast access to all your data, which often resides in multiple places, such as cloud data lakes, Hadoop, proprietary data warehouses, and Delta Lake. Querying data where it resides is Presto’s specialty, and the solution is set up so that you can focus your time on analysis rather than on data acquisition and wrangling. You can remain in RStudio or Jupyter and let Presto operate behind the scenes, quickly serving up the data you need.

This is something we probably don’t talk about enough, but Starburst Enterprise Presto, in particular, is agnostic not just to the data source, but also to the target application or language. Starburst queries data in multiple sources where it resides, then serves this data up to a wide range of applications and tools, including RStudio, Jupyter, and Tableau.

How & Why It Works

Let’s say you’re operating in RStudio. There’s an easy-to-download, open source R Presto package available here. You download that, establish the connection to the data sources you want to query, and get started. From this point on, everything happens in R, so you have the full power of the language at your disposal. For Python and Jupyter Notebook users, there is an analogous presto-python-client here.
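
For Python users, the connect-and-query workflow described above might look roughly like the sketch below. It assumes the presto-python-client package is installed (it exposes a module named prestodb with a standard DB-API interface). The port, user, catalog, and schema defaults are placeholders, not values from any real deployment.

```python
def connect_presto(host, port=8080, user="data-scientist", catalog="hive", schema="default"):
    """Open a DB-API connection to a Presto coordinator.

    Requires `pip install presto-python-client`; every default here is a placeholder.
    """
    import prestodb  # imported lazily so fetch_records can be reused with any DB-API driver

    return prestodb.dbapi.connect(
        host=host, port=port, user=user, catalog=catalog, schema=schema
    )


def fetch_records(conn, sql):
    """Run one query and return the result as a list of {column: value} dicts."""
    cur = conn.cursor()
    cur.execute(sql)
    # Fetch first: some drivers only fill in column metadata once results arrive.
    rows = cur.fetchall()
    columns = [col[0] for col in cur.description]
    return [dict(zip(columns, row)) for row in rows]
```

In a notebook you would call connect_presto with your own coordinator's hostname, then pass the connection and a SQL string to fetch_records. The resulting list of dicts can be handed straight to pandas.DataFrame if you prefer to work with data frames.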

 

Meanwhile, Presto runs in the background, querying data in Hadoop, Oracle, the cloud data lake of your choice, or any one of Starburst’s dozens of supported data sources. You get all the benefits of our federated query engine, but the tool is embedded in the R or Python code. This gives you:

  • Faster Results. First of all, you’re going to get roughly two hours of your day back. Starburst Presto is designed to run high-performance queries, eliminating the need for ETL. One of our customers had to wait 10 minutes for results when he tried to query his data lake. With Starburst the operation takes him 47 seconds.
  • Better Results. The pressure to deliver insights sometimes forces data scientists to run their analyses on whatever portion of data they can access quickly, but leave out the data in a separate warehouse or data lake that would take too long to load. Or they might disregard a data lake because the tools they use to access that data are just too slow. Since Starburst allows you to quickly query all your data, you end up with more reliable, comprehensive results.
  • More Creativity. As a data scientist, being able to remain in your chosen language, access more data, and reclaim hours in your day frees you to do the high-quality work and creative thinking that make data science such a critical piece of today’s enterprises in the first place. Additionally, the ability to easily add and access new data sources opens up further opportunities for data investigation and data mining that would be either impossible or slow to materialize with traditional methods.
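
To make the federated idea concrete, here is a minimal sketch of a single query that joins a data lake table with a warehouse table. Presto addresses tables as catalog.schema.table, which is what lets one statement span sources. Every catalog, schema, table, and column name below is hypothetical and chosen only for illustration.

```python
def qualified(catalog, schema, table):
    """Return a fully qualified Presto table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def build_federated_query(lake_table, warehouse_table, limit=1000):
    """Build one SQL statement that joins tables living in two different sources."""
    return (
        "SELECT c.customer_id, c.segment, count(s.session_id) AS sessions\n"
        f"FROM {warehouse_table} AS c\n"
        f"LEFT JOIN {lake_table} AS s ON s.customer_id = c.customer_id\n"
        "GROUP BY c.customer_id, c.segment\n"
        f"LIMIT {int(limit)}"
    )


# Hypothetical catalogs: "hive" for the data lake, "oracle" for the warehouse.
sql = build_federated_query(
    lake_table=qualified("hive", "web", "sessions"),
    warehouse_table=qualified("oracle", "crm", "customers"),
)
```

The resulting string can be passed to the execute call of whichever Presto client you use from R or Python. The join itself runs inside the query engine, so the notebook receives only the aggregated result.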

How Our Data Science Customers Use Starburst

We’re seeing data scientists across our customer base take advantage of Starburst Presto, and here are some common use cases:

  • Commercial banks are analyzing risk to get a better understanding of their market exposure for various investments.
  • Financial services analysts are modeling trading activity to understand market trends and trading patterns.
  • Data scientists at pharmaceutical giants are digging into manufacturing and supply chain networks to find ways to improve operational efficiency.

Understanding customer attrition is another common use case. Given access to the right data, regardless of where it resides, data scientists at our customers have been able to model how likely specific customers are to leave for a competitor, and which sales or marketing campaigns might entice them to stay.

The Gartner report also addresses the preference for open source tools among data scientists, arguing that proprietary offerings of those tools should be considered, since they end up saving data scientists additional time and effort. This, once again, allows you to focus on more interesting projects. Obviously, we’re biased, as we provide an enterprise-ready distribution of open source Presto. But go ahead and try the open source version first, and start freeing up some of your day for the work you were meant to do in the first place. Then let us know what you think, and how Starburst can help you take your data science projects to the next level.