Inside Starburst’s hackathon: Into the great wide open(AI)

Generative AI has been taking the tech world by storm lately. Everyone from indie developers to large enterprises is experimenting with its impact, not only on day-to-day jobs but also on new feature innovation.

Starburst is no different. We’re always looking for ways to improve our technology and make the lives of data engineers easier. The current buzz around OpenAI lined up perfectly with our yearly Hack-a-Trino, and our team came up with some pretty cool concepts. 

Check out the top three projects below and let us know which one is your favorite by voting on r/dataengineering. Who knows, you might just see the winning concept become a reality.

  • Project A: Automatic Data Classification (LLM-based tagging)
  • Project B: Trino AI Functions
  • Project C: No Code Querying with ChatGPT

Project A: Automatic Data Classification

Built by Alex Breshears, Cody Zwiefelhofer, David Shea, Elvis Le

What is it? Automatic Data Classification is a feature that analyzes and predicts the content of registered data sources based on a subset of the data itself. This project uses OpenAI’s LLM APIs to take samples of datasets and propose potential tags to data admins.

Tags in Starburst Galaxy allow users to associate attributes with one or more catalogs, schemas, tables, views, or columns. Tags can be combined with attribute-based access control policies to ensure that each role has the appropriate access rights to perform actions on entities in the system.

Why? This model automatically identifies data structures within the data. As such, it takes less time for data admins to learn what kind of data has been ingested and appropriately mark that data – increasing productivity and accuracy of tags.

Architectural Considerations: A critical requirement for this feature is that the data itself must be analyzed, not just the metadata. There are many reasons for this, but the most obvious is that column names rarely describe what the columns actually contain.
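
To make the "sample the data, then propose tags" idea concrete, here is a minimal Python sketch. It is not the team's implementation: the function names, the prompt wording, and the `ask_llm` callable (a stand-in for whatever client calls the LLM API) are all assumptions made for illustration.

```python
import json
import random
from typing import Any, Callable, Dict, List


def sample_column_values(rows: List[Dict[str, Any]], sample_size: int = 5,
                         seed: int = 0) -> Dict[str, List[Any]]:
    """Collect a small sample of values per column from a list of row dicts."""
    rng = random.Random(seed)
    picked = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    columns: Dict[str, List[Any]] = {}
    for row in picked:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def build_tagging_prompt(samples: Dict[str, List[Any]],
                         allowed_tags: List[str]) -> str:
    """Build a prompt from sampled values, not from column names alone."""
    return (
        "For each column, choose the best tag from this list: "
        + ", ".join(allowed_tags)
        + ". Reply with a JSON object mapping column name to tag.\n"
        + json.dumps(samples, default=str)
    )


def propose_tags(rows: List[Dict[str, Any]], allowed_tags: List[str],
                 ask_llm: Callable[[str], str]) -> Dict[str, str]:
    """Return proposed tags for a data admin to review.

    `ask_llm` is hypothetical: any function that takes a prompt string
    and returns the model's text reply.
    """
    samples = sample_column_values(rows)
    proposed = json.loads(ask_llm(build_tagging_prompt(samples, allowed_tags)))
    # Drop anything the model invented: unknown columns or unknown tags.
    return {col: tag for col, tag in proposed.items()
            if col in samples and tag in allowed_tags}
```

The result is only a proposal. In the project as described, a data admin still decides which tags to apply.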

Project B: Trino AI Functions

Built by Karol Sobczak, Mainak Ghosh

What is it? Trino AI Functions (aka LLMs in the context of complex aggregations) uses ChatGPT and Hugging Face to translate natural language text into complex aggregation logic that can be executed directly on the Trino engine.

Why did we build it? SQL is heavily used for data analysis, and at the same time, the democratization of ML means that customers now want to do more with Trino queries. Trino AI Functions will help customers write expressive queries that can do language translation, fraud detection, sentiment analysis, and other NLP tasks.

Architectural Considerations: One powerful feature of LLMs is that you can ask a model to return a structured answer, such as a JSON document, from unstructured data.
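
As a rough illustration of asking for a structured answer, the Python sketch below builds a prompt that requests fixed JSON fields and then validates the reply before anything downstream uses it. The field names, the prompt wording, and the `ask_llm` callable are assumptions for this example and are not part of Trino AI Functions.

```python
import json
from typing import Callable, Dict

# Hypothetical output schema: field name -> expected Python type.
REQUIRED_FIELDS = {"sentiment": str, "language": str}


def build_extraction_prompt(text: str) -> str:
    """Ask the model for a JSON object with exactly the required fields."""
    fields = ", ".join(REQUIRED_FIELDS)
    return (
        f"Read the text below and reply with only a JSON object "
        f"containing these string fields: {fields}.\n\nText: {text}"
    )


def parse_structured_reply(reply: str) -> Dict[str, str]:
    """Validate the model's reply; raise ValueError if it is malformed."""
    data = json.loads(reply)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"missing or invalid field: {field}")
    # Keep only the fields we asked for.
    return {field: data[field] for field in REQUIRED_FIELDS}


def analyze(text: str, ask_llm: Callable[[str], str]) -> Dict[str, str]:
    """`ask_llm` is a hypothetical function: prompt string in, reply text out."""
    return parse_structured_reply(ask_llm(build_extraction_prompt(text)))
```

Validating the reply matters because a model can return text that is not valid JSON or that omits a field, and a query engine needs predictable types.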

Project C: No Code Querying with ChatGPT

Built by Lukas Grubwieser and Rob Anderson

What is it? No code querying with ChatGPT is a feature that would let Galaxy users write natural language queries against data sources connected to Starburst Galaxy. 

Why did we build it? This would allow business analysts to ask questions of Galaxy without having to write complex SQL or understand the underlying data architecture. It brings Starburst one step closer to our vision of data democratization. 

Architectural Considerations: This project requires a three-part architecture: a frontend, an OpenAI engine (with a backend server), and Starburst. The frontend takes the question, OpenAI translates it to SQL, and Starburst then processes that SQL.
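
The question-to-SQL-to-result flow can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the hackathon code: `translate_to_sql` stands in for the OpenAI-backed backend, `run_query` stands in for a Starburst client, and the SELECT-only guard is an extra precaution added here.

```python
from typing import Any, Callable, List, Tuple


def answer_question(
    question: str,
    schema_hint: str,
    translate_to_sql: Callable[[str], str],
    run_query: Callable[[str], List[Tuple[Any, ...]]],
) -> Tuple[str, List[Tuple[Any, ...]]]:
    """Turn a natural language question into SQL and execute it.

    Both callables are hypothetical stand-ins:
    - translate_to_sql: prompt string in, SQL text out (the LLM step)
    - run_query: SQL text in, result rows out (the query engine step)
    """
    prompt = (
        f"Schema: {schema_hint}\n"
        f"Write one SQL SELECT statement that answers: {question}"
    )
    sql = translate_to_sql(prompt).strip().rstrip(";")
    # Guard added for this sketch: never run generated non-SELECT statements.
    if not sql.lower().startswith("select"):
        raise ValueError("refusing to run a non-SELECT statement")
    return sql, run_query(sql)
```

Returning the generated SQL alongside the rows lets the frontend show users exactly what was run, which helps when the translation is wrong.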

Honorable Mention: Accelerating Cyber Analytics with Aderas, Starburst, and ChatGPT

While this next effort wasn’t part of our internal hackathon, we thought it was too cool not to share. Our partner Aderas built a POC of an insider threat investigation model using Starburst and ChatGPT for cyber security analytics.

Check out the demo on YouTube.

Starburst’s Learnings on Gen AI

While we built a lot of cool things during the hackathon, we also learned a lot. We documented a couple of our team’s key learnings below:

  • It’s really shown us how quickly we can generate new features and ideas that traditionally were a lot harder for a business like ours to innovate on.
  • LLMs definitely won’t solve everything, but they’re a good starting point.
  • It’s been fun to iterate on, but we’re also largely waiting for our model’s query generation to become more correct (since we think correctness matters!). Still, it’s exceptionally cool to see what the newer models can do when it comes to generating syntactically correct ANSI SQL.