Understanding data analytics vs AI
Evan Smith
Technical Content Manager
Starburst Data


When it comes to analytics, deriving value from data has always been the goal. Specifically, the aim is to access, transform, store, and query data at high volume and low latency, producing accurate, up-to-date information that can drive business decision-making.
Artificial Intelligence vs Data Analytics
Where does Artificial Intelligence (AI) fit into this, and where does that leave data analytics? One way to think of AI is as yet another way to derive value from data. In this sense, it’s complementary to analytics. Supporting AI workloads doesn’t change this fundamental goal of deriving value from data. What it does change is the steps you’ll take to get there. In effect, it opens a second front for your data. It’s an opportunity to get more value than ever before.
Fortunately, this doesn’t require a fundamentally new approach to handling data. Rather, it presents an opportunity to extend our data architectures to handle analytics and AI workloads together.
In this article, we’ll compare and contrast data architecture for AI and data architecture for analytics. We’ll also look at how to extend your existing analytics data architecture to support both, to help meet the growing demands of AI in your industry.
Data architecture for analytics
Data architecture built for analytics centers around queries. You can think of the query as the basic building block of the analytic world. A query is essentially a question asked of a dataset. However, it's also a very specific kind of question posed in a very particular way: a linear method of extracting data that produces a deterministic result.
Although there are many different kinds of queries, all analytic queries rely on three key components of your data architecture to succeed:
Data storage
You can’t query data unless you store it somewhere. Traditionally, this involves bringing data from multiple sources together into a single location for efficient querying. It also usually means storing data in a highly structured format, such as a table in a data warehouse structured into a schema that incorporates fact and dimension tables. Increasingly, it also means using a data lake or data lakehouse.
Data ingestion and transformation
Raw data is rarely ready to be queried directly. To generate value from it, you need to apply data transformation techniques. This stage also focuses on data quality, ensuring the data is accurate and free from anomalies such as missing values, malformed fields, duplicates, and inconsistencies.
This is done using one of two approaches:
- A fixed batch process that runs at regular intervals (hourly, daily, etc.)
- A real-time data stream that updates data as it’s created (e.g., from factory sensors, transaction processing systems, etc.)
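The batch variant of this cleaning step can be sketched in a few lines of Python. The record shape and field names here are illustrative, not from any particular pipeline:

```python
# A minimal sketch of a batch data-quality step: drop rows with missing
# required fields and deduplicate on a record identifier.
def clean_batch(rows, required=("order_id", "amount")):
    seen = set()
    cleaned = []
    for row in rows:
        # Skip rows missing required fields (basic quality check)
        if any(row.get(field) in (None, "") for field in required):
            continue
        # Skip duplicates, keyed on the record's identifier
        key = row["order_id"]
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned

raw = [
    {"order_id": 1, "amount": 9.99},
    {"order_id": 1, "amount": 9.99},   # duplicate
    {"order_id": 2, "amount": None},   # missing value
    {"order_id": 3, "amount": 24.50},
]
print(clean_batch(raw))  # keeps orders 1 and 3
```

In a real pipeline this logic would run inside a scheduled job or a stream processor, but the quality checks themselves look much the same.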
Data analysis and exploration
Finally, analysts and business users apply data discovery techniques to turn transformed data into insights. They layer on additional business logic using ad-hoc queries, issued via a SQL client or a SQL query engine like Trino, and render the results in tables, charts, and other representations using Business Intelligence (BI) tools.
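An ad-hoc analytic query of this kind can be sketched with Python's built-in sqlite3 module standing in for a warehouse or a query engine such as Trino; the table and values are invented for illustration:

```python
import sqlite3

# Hypothetical sales table; in production this query would run against
# a warehouse or an engine like Trino rather than in-memory SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("east", 50.0), ("west", 75.0)])

# An ad-hoc analytic query: aggregate revenue per region.
rows = conn.execute(
    "SELECT region, SUM(amount) AS revenue "
    "FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 150.0), ('west', 75.0)]
```

The deterministic result, the same answer every time for the same data, is the hallmark of the analytics side of the architecture.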
Data architecture for AI models: How to get more from your data
Analytics systems enable business users to mine data for business insights in a deterministic fashion. In contrast, AI systems are autonomous, self-improving probabilistic systems that can provide insights and make predictions based on their training.
“AI” is a general term that can mean one of two things: traditional (“weak”) AI based on a prediction model or neural network (e.g., a recommendation engine); or Generative AI, which creates new and unique outputs based on existing patterns (e.g., responding to a customer query in a human language such as English).
Analytics systems typically use a declarative query language, such as SQL. Sometimes, this is combined with a more flexible imperative programming language such as Python. By contrast, AI engines receive inputs in multiple formats, including natural language queries and machine-generated queries.
Additionally, in analytics, data is primarily text and numerical data in a highly structured format. By contrast, AI may involve multiple types of data, such as semi-structured data and multimodal data, including images, video, and others.
Importantly, data in AI systems is used in one of two key ways:
Large Language Model (LLM) training
LLMs are learning models trained on massive data sets that can understand human language and generate text, images, videos, and other artifacts. Some of the most widely used include the GPT series, Claude, Gemini, Llama, and Mistral.
Whereas data in analytics architectures is directly queried, in an LLM it's preprocessed and fed into a deep learning framework. After a model is trained, it's fine-tuned and refined through a number of test cases. An LLM's transformer architecture can understand context, which is what gives LLMs the ability to generate human language through probabilistic analysis in response to user queries.
This approach makes LLMs versatile. Models like GPT and Llama can respond to a wide variety of questions, not just to the specific ones that they’re trained on.
Format of data for LLM training
The format in which you store data for an LLM will differ depending on whether you're fully training your own model or fine-tuning an existing one.
For full training, you’ll use an approach that shards documents, computes hashes, and adds metadata annotations. One example of how to implement this is the CCNet pipeline.
Many teams will opt instead for fine-tuning, which takes an existing LLM and retrains it on a subset of its initial parameters. This costs much less and takes far less time than fully training your own model. Data formats used for fine-tuning depend on the algorithm used, but can include one-shot learning (sets of correct question/answer pairs), direct preference optimization, and RLHF (Reinforcement Learning from Human Feedback).
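Question/answer pairs for supervised fine-tuning are often stored as JSON Lines, one example per line. The field names below (`prompt`, `completion`) are a common convention but vary by framework, and the example content is invented:

```python
import json

# Hypothetical fine-tuning examples in JSONL shape; exact field names
# depend on the fine-tuning framework you use.
examples = [
    {"prompt": "What is a data lakehouse?",
     "completion": "An architecture combining data lake storage "
                   "with warehouse-style management."},
    {"prompt": "What does RAG stand for?",
     "completion": "Retrieval-augmented generation."},
]

# Serialize one JSON object per line, ready to write to a .jsonl file.
jsonl = "\n".join(json.dumps(e) for e in examples)
print(jsonl.splitlines()[0])
```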
Retrieval-augmented generation (RAG)
LLMs are great as general language generation models and excel at answering generalized questions based on common knowledge. However, they lack the context needed to answer questions specific to your business use cases. Part of the challenge of creating GenAI applications is providing this domain-specific context in an efficient manner to yield accurate answers from an LLM.
You could address this by building your own LLM that directly incorporates your proprietary data. But this is so expensive and time-consuming that most companies don’t bother. Even fine-tuning requires a large investment — and, in many cases, it doesn’t even give you the boost in accuracy you were hoping for.
The easier, cost-effective way to add your data is via retrieval-augmented generation (RAG). With RAG, you store pertinent context in an external database that you can query to find related information. You then include this information in the prompts you send to the LLM.
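The retrieve-then-prompt flow can be sketched as follows. For simplicity this sketch scores documents by keyword overlap; a real RAG system would use vector similarity (covered below), and all documents and queries here are made up:

```python
# A minimal RAG sketch: find the most relevant stored passages, then
# embed them as context in the prompt sent to the LLM.
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available weekdays from 9am to 5pm.",
    "Premium plans include priority support and extended returns.",
]

def retrieve(query, docs, k=2):
    # Rank documents by how many query words they share (toy scoring;
    # production systems use embedding similarity instead).
    q_words = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query, docs):
    context = "\n".join(retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("What is the refund policy?", documents)
print(prompt)
```

The key point is that the proprietary context travels in the prompt at request time, so the LLM itself never has to be retrained on your data.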
Why RAG matters for AI architecture
RAG results in more accurate and more timely results from LLMs. At the same time, it requires a fraction of the time to implement compared to training your own LLM or even fine-tuning an existing one.
You can further refine RAG by using agents, autonomous code that leverages RAG and other external tools to provide dynamic responses to user requests. Agents built using RAG can provide greater accuracy by refining and optimizing their data, and combining data from other external tools (e.g., external APIs), to specialize in requests related to a specific problem domain. You can then build advanced GenAI solutions by creating networks of specialized, interconnected agents.
You can evolve agents even further using agentic RAG, which leverages LLMs themselves to create custom query plans for each request based on its complexity. Agentic RAG systems can adopt workflows to meet complex requirements and use iterative refining and contextual understanding to provide even greater accuracy than static RAG pipelines.
Format of data for RAG
Because RAG typically doesn’t use structured data like an analytics data architecture, it needs to store data in a different format that enables finding similar results across a large data set. This usually means using either a vector database or a graph database.
Both storage formats require an approach to breaking up data into semantically useful units, a process known as chunking. Approaches to chunking include fixed-size chunking, content-aware chunking, and specialized chunking that takes advantage of semi-structured formats such as Markdown and LaTeX.
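Fixed-size chunking, the simplest of these strategies, slides a window of a set length across the text, usually with some overlap so that sentences split at a boundary still appear whole in at least one chunk. A minimal sketch, with illustrative sizes:

```python
# Fixed-size chunking with overlap: each chunk is at most `size`
# characters, and consecutive chunks share `overlap` characters.
def chunk_text(text, size=40, overlap=10):
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
    return chunks

doc = "Retrieval-augmented generation stores context in an external database."
chunks = chunk_text(doc)
print(len(chunks), "chunks; first:", repr(chunks[0]))
```

Content-aware strategies replace the character window with boundaries drawn from the text itself (sentences, paragraphs, or Markdown/LaTeX section markers), but the output is the same: a list of semantically useful units to embed.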
For vector databases, we then convert each chunk into a vector: an array of numbers that positions it in a multi-dimensional embedding space. This conversion is done by an embedding model, a neural network trained so that semantically similar text maps to nearby points in that space. User queries are then themselves turned into vectors, and results are discovered by finding similar vector embeddings using a nearest neighbor search.
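Nearest neighbor search over embeddings typically uses cosine similarity. The sketch below uses tiny 3-dimensional vectors with made-up values; real embedding models produce vectors with hundreds or thousands of dimensions:

```python
import math

# Cosine similarity: 1.0 means the vectors point the same way,
# 0.0 means they are unrelated (orthogonal).
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "vector database": chunk labels mapped to illustrative embeddings.
embeddings = {
    "refund policy": [0.9, 0.1, 0.0],
    "support hours": [0.1, 0.9, 0.2],
    "pricing tiers": [0.0, 0.2, 0.9],
}

query_vector = [0.8, 0.2, 0.1]  # the user query, already embedded
best = max(embeddings, key=lambda k: cosine(query_vector, embeddings[k]))
print(best)  # "refund policy" is the nearest neighbor
```

Production vector databases add approximate-nearest-neighbor indexes so this search stays fast across millions of vectors, but the similarity measure is the same idea.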
Graph databases take a different approach, formatting data as nodes (entities) linked by relationships (edges). The graph can either be an entity graph, which identifies real world entities in the underlying data set, or a content graph, which identifies links between documents (e.g., web pages). You can then find context related to a given node by traversing this graph structure.
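Traversing an entity graph to gather context can be sketched with a breadth-first search over an adjacency list. The entities and relationships below are invented for illustration:

```python
from collections import deque

# A toy entity graph: nodes are entities, edges are relationships.
graph = {
    "Acme Corp": ["Jane Doe", "Widget X"],
    "Jane Doe": ["Acme Corp"],
    "Widget X": ["Acme Corp", "Factory 7"],
    "Factory 7": ["Widget X"],
}

def related_entities(start, depth=2):
    """Collect entities reachable within `depth` hops of `start` (BFS)."""
    seen = {start}
    queue = deque([(start, 0)])
    related = []
    while queue:
        node, dist = queue.popleft()
        if dist == depth:
            continue  # don't expand beyond the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                related.append(neighbor)
                queue.append((neighbor, dist + 1))
    return related

print(related_entities("Jane Doe"))  # entities within two hops
```

A graph-backed RAG system would attach the text associated with each discovered node as context for the prompt, rather than just the node names.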
The challenges in building data architecture for analytics and AI
Analytics and AI use cases store and leverage data in different ways. But the fundamental challenges of managing that data — ensuring data quality, data security, data governance, and so on — remain the same.
Ideally, you want a hybrid data architecture. Usually, we use “hybrid” to denote a combination of cloud and on-premises architectures. However, there's a new sense of hybrid taking hold: data architectures that serve both analytics and AI. Think of it as building two highways that branch off in different directions based on your use cases.
There are numerous challenges in getting to this ideal state, including:
Data access
It may be difficult to make the best use of the data in your company because you simply don’t know where it is or how to make use of it.
For analytics, lack of a discoverability mechanism for data means that data goes “dark.” In other words, it lies dormant, not generating business value — and even costs you money in terms of compute and storage. Similarly, for AI, data may be locked in legacy systems because no one has the bandwidth to port it into a more modern data format.
Data collaboration
Data needs to be analyzed and stored in an environment that works for you and your business. Many of today's data scenarios — fraud detection, personalization, real-time analytics, etc. — require processing data in real time. That strains existing data architectures (in truth, most existing analytics implementations) that are built on a batch processing model.
Data governance
Heterogeneous data environments, in which data is stored across dozens if not hundreds of different data storage systems and modalities, have always made data governance a challenge. AI, unfortunately, makes data governance and data security even more challenging:
- People may submit personally identifiable information (PII) to AI models. This may become part of your future data set, inadvertently end up in cleartext logs, etc.
- Hackers now have new ways to pry data out of systems, such as cajoling an LLM to violate its programming and leak PII or intellectual property.
Creating a data architecture for both analytics and AI models
The bad news is that older data architectures aren't up to solving this problem. Data warehouses, data lakes, and conventional data lakehouses can't deliver data quickly, accurately, and cost-effectively enough for both sets of use cases.
The good news is that you can evolve your existing data stack to support this hybrid approach to data without gutting it and starting from scratch. An open data lakehouse solution like Starburst works with your existing data architecture to support both modern analytics and emerging AI use cases:
- It’s built on the Icehouse — the combination of the Apache Iceberg open data format and the Trino SQL query engine — for openness and superior performance
- It’s decentralized — you can choose which workloads to bring into the Icehouse and which to leave in place for later migration, while still providing accessibility to your existing data in legacy systems
- It’s suitable for real-time use cases — you can feed new data quickly to both your analytics pipelines and your AI data stores
To learn more about how Starburst can help you migrate to a hybrid data architecture that serves both analytics and AI, contact us today.