SQL: Structured Query Language
Structured Query Language (SQL) is a programming language designed to manage and query datasets. Originally, it was used exclusively to query structured data held in relational databases and organized into tables, rows, and columns using a defined schema. Today, it is also used to query and manage data held in data warehouses, data lakes, and data lakehouses, whether structured, semi-structured, or unstructured data. It also plays a pivotal role in AI models preparing data for consumption. Overall, SQL’s versatility and historical impact on the data industry has made it the dominant language of data science and engineering.
What is a query?
SQL’s primary job is to ask questions about a dataset. We call these questions queries. Queries retrieve and compare structured data from database tables.
For example, a financial institution might use SQL in several different ways. First, an economic research team may use it to analyze data and look for financial trends. Meanwhile, the sales team might use it to identify customers who might benefit from particular products. At the same time, the security team might guard against financial wrongdoing by noticing trends and flagging them. Finally, the data team might also use SQL to perform operations on datasets, transforming data from different sources so it can be compared by the data scientist.
In each case, SQL queries allow users to access their data, transform it into a useful state, and derive analytic value from it. This journey, from raw data to business insights, is the true purpose of SQL and the main value that it provides businesses.
How does SQL let you access data?
Data is more common than ever before. In fact, every website, app, and interaction generates data. Meanwhile, although the rate of data is always increasing, it is only valuable if you can access and use it effectively. In this sense, even the largest datasets are useless without a way to extract, analyze, and interpret them.
This is where SQL comes in. First, it translates raw data created by applications into meaningful insights using data analysis. Second, it is also used to move data between systems and change the structure of that data using an ETL process. Finally, it is used across the full spectrum of data analytic technologies, including data lakehouses, data lakes, and data warehouses.
SQL and relational database management systems (RDBMS)
Traditionally, SQL plays a large role in relational databases, such as PostgreSQL and MySQL. In these systems, database administrators use queries to manage structured data conforming to a schema. Typically, each table represents a specific entity (e.g., customers, orders). This structure is known as the schema.
For more information, check out Starburst Academy’s Exploring databases course for more information on using relational databases.
Online transaction processing (OLTP)
Importantly, SQL is also an integral part of online transaction processing (OLTP) systems. This type of database handles transactional data, a fast-moving, high-velocity data type. In this context, queries power high-speed, real-time data operations.
OLTP prioritizes fast, concurrent processing, often using indexes and primary keys for rapid access. This makes SQL the backbone for many critical, high-volume applications, from banking systems to e-commerce platforms.
Data warehouses
Building on SQL’s role in relational databases, data warehouses also rely heavily on queries for structured data management. This is often done through online analytical processing (OLAP), a process that enables multidimensional analysis of large datasets, ideal for tasks like sales forecasting and trend analysis.
Data warehouses use SQL to query and analyze data across multiple tables. This enables complex reporting and analysis, making it a vital tool for deriving insights at an enterprise scale and supporting everything from sales forecasting to performance analysis in a streamlined, efficient manner.
Examples of data warehouses that use SQL include:
Snowflake: Snowflake is a SQL-based cloud data warehouse (CDW).
Amazon Redshift: Amazon Redshift is a managed data warehouse service offered by AWS and powered by SQL.
Google BigQuery: Google BigQuery is a serverless data warehouse that uses SQL as its main language.
Microsoft Azure Synapse Analytics: Microsoft Azure Synapse Analytics is a unified analytics platform.
Non-relational data
SQL is also used to query non-relational data sources, and this is one of the main uses of query engines like Starburst. Queries of this type can access data held in different formats, including JSON, Parquet, and Avro.
Here are a few popular examples:
- MongoDB: A non-relational database that uses a document-oriented architecture.
- Cassandra: A distributed NoSQL database designed for high availability and scalability.
- Redis: An in-memory data store often used for caching, session storage, and real-time analytics.
- Neo4j: A graph database that specializes in storing and querying interconnected data.
Use in data science and data engineering
It’s worth considering the virtues of SQL as a language compared to other data engineering languages. Typically, its versatility is one of its core benefits, as a single language can be used to perform multiple tasks, including:
- Data Definition Language (DDL): SQL can be used as a definition language. When used in this way, queries perform certain operations, including creating, altering, or deleting tables and schemas.
- Querying and data manipulation language (DML): You can also use SQL as a data manipulation language. Unlike a DDL, a DML doesn’t create or delete the structures that define the data. Instead, it moves it around, modifying and manipulating it using commands. In some ways, DML remains one of the most popular use cases for queries today.
- Data analytics: SQL is used to perform complex calculations and statistical analysis.
- Data transformation: SQL can also be used to clean, format, and integrate data from various sources.
- Easy optimization: SQL is highly modular and customizable. This means that it can be optimized easily.
Why do data analysts use SQL?
SQL is also used by data analysts, the end users of the data ecosystem. Typically, data scientists use SQL to perform either ad hoc or recurring queries, joining datasets from multiple sources to perform analysis. At the end of this process, the analysis is fed into business intelligence tools and dashboards. Increasingly it is also used for AI models.
CRUD Operations
Many SQL queries take the form of CRUD (Create, Read, Update, Delete). SQL statements that use these operations form the backbone of database management, allowing you to create, retrieve, modify, and delete data.
ANSI SQL
Part of SQL’s power is its standardization. Although the language exists in different versions, known as SQL dialects, there remains a standardized version of the SQL language known as ANSI SQL, maintained by an independent standards agency to improve compatibility across different systems.
SQL vs Python
SQL is often compared to Python, the other dominant language of the data world. Although they can be used to perform similar tasks, they can also be complementary, with SQL controlling queries and Python being used more often in data pipelines.
One core difference is usability. As a user-friendly language, SQL appeals to a wide audience with a variety of technical abilities. On the other hand, Python is a general-purpose programming language that can be customized and adapted in many ways using User-Defined Functions (UDF).
Ultimately, SQL and Python are likely to remain the two pillars of data analytics. Importantly, they are also used in the construction of AI models and generative AI.
SQL and data federation
Today, big data poses many unique challenges for the scale and distribution of analytic insights. One of the large problems stems from the fact that data is frequently scattered across various sources. SQL can be used to access data from multiple sources using a technique called data federation.
Data federation takes data from different locations and makes it accessible through a single location or single point of access. Starburst, powered by Trino, uses SQL in this way. Using this approach, users can leverage SQL queries to access data spread across multiple systems. Because the data does not have to be moved to be accessed, data federation is a powerful alternative to data centralization. In fact, using federation, users are able to centralize as much or as little data as they like, opening up choices over their data architecture.
SQL and AI
Throughout its history, SQL has proven an adaptable tool. Today, this process continues with the adoption of AI and machine learning (ML) workflows. Additionally, SQL plays a critical role in the development of generative AI models. For these models to hold value, data must be fed to them in predictable ways. Importantly, both data analytics and AI modeling require robust data pipelines to power the models that train AI.
Once created, SQL can also be used as a method of querying AI models. In fact, Starburst currently employs both Text-to-SQL and SQL-to-Text features that make it easier than ever to use SQL and AI together.
SQL and Starburst
SQL and Starburst work together to provide seamless access to data across multiple platforms. Starburst is powered by Trino, a SQL query engine that takes the simplicity of SQL and applies it to all of the processes in the big data ecosystem. Using Starburst, you can use SQL to query, power analytics dashboards, perform ETL and data pipeline tasks, and feed your data into AI models. You can use it to access one data source or may incorporate other languages like Python for the best of both worlds.
This approach eliminates the costs and challenges of data migration while allowing for real-time analysis with SQL queries. As a result, businesses can access unified data and make faster, data-driven decisions without the need for constant updates or monitoring.
Interested in learning SQL? Check out this this video.