Last Updated: 2024-01-23
If you've always wanted to learn SQL, this is the tutorial for you! It provides simple, easy-to-follow examples to get you up and running with SQL in no time.
Often, a drawback to learning SQL is the difficulty in finding an environment in which to practice. There is no need to worry about that in this case. Your Starburst Galaxy account comes equipped with sample data, so you'll be ready to go in no time.
This tutorial uses Starburst Galaxy to teach you basic SQL techniques. It is designed both for beginners and those wanting to refresh their SQL knowledge. Special emphasis will be placed on understanding how SQL operates under real usage conditions.
Once you've completed this tutorial, you will be able to:
You need a Starburst Galaxy account to complete this tutorial. Please see Starburst Galaxy: Getting started for instructions on setting up a free account.
Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.
As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.
Structured Query Language (SQL) is a powerful and standardized language used in the management of relational databases, data warehouses, and data lakes. It serves as the foundational language for querying database management systems (DBMS) and is widely adopted across various platforms.
ANSI SQL is a standardized version of SQL that establishes the official features and syntax for SQL adopted by other dialects.
While ANSI SQL provides a common framework, it's important to note that different data systems may implement their own dialects, resulting in variations in syntax and functionality. Despite these differences, the fundamental principles of SQL remain consistent, enabling users to interact with and extract valuable insights from databases regardless of the specific dialect in use. Whether you're working with MySQL, PostgreSQL, or other relational database systems, understanding SQL empowers individuals to efficiently query, manipulate, and manage data.
Before you start using Starburst Galaxy to write SQL queries, it's essential to grasp three key terms: cluster, catalog, and schema.
A cluster represents a collective of computational resources collaborating to process queries efficiently. You'll use a cluster when you enact a query in Starburst Galaxy.
A catalog serves as a conduit to a data source. Catalogs come in many different types and configurations. For example, you might set up a cluster that accesses modern data lake storage technologies like AWS S3 or Azure ADLS, or more traditional data warehouses like Snowflake or Oracle. Catalogs unleash the power of Starburst Galaxy to connect to all of these data sources at once.
If you're interested in learning more about different catalog connections, we have a series of tutorials to walk you through the process step-by-step.
A schema delineates the organizational structure of data within a database. Schemas have a lot of use in Starburst Galaxy too and are one of the main building blocks of your data.
Individual tables are the smallest component of this data architecture. They are used to structure relational databases in a system of columns and rows. They are a traditional data structure and have been the bedrock of Relational Database Management Systems (RDBMS) for decades.
The relationship between these components is hierarchical. A cluster may encompass multiple catalogs, a catalog may encapsulate several schemas, and a schema may comprise multiple tables.
The image below depicts this relationship in more detail.
Before we get started, it's time to set up the environment needed to run this tutorial.
This will make sure that you have the correct account, as well as the cluster, catalog, and schema needed to proceed further.
The tutorial requires you to have a Starburst Galaxy account set up.
When you sign in, you should see the cluster called free-cluster
, sample
catalog inside it, and demo
schema inside that.
These are the cluster, catalog, and schema you will use for the SQL exercises in this tutorial.
When you write a query, Starburst Galaxy needs to know which datasets you're accessing, and which schemas and tables you need.
Usually this means using the fully-qualified table name, which is written in a specific format: catalog_name.schema_name.table_name
. This follows the same hierarchical naming convention that you saw before.
You can think of this as the full name of the table. But just like a full name for a person, the fully-qualified table name is long. To avoid having to write this out each time, you can easily set the cluster, catalog, and schema for your session ahead of time and operate using shorter queries.
free-cluster
.sample
.demo
. SQL has evolved from its origins in relational databases to become the preferred language for querying data across various sources. It is now employed in diverse environments, ranging from modern data lakes and data warehouses to traditional databases.
This section will show you how to use SQL to query a sample dataset, with a focus on data manipulation.
SQL exists for many purposes. Overall, you can think of SQL being used for two distinct purposes:
DDL operations create and define both structures and objects in a database and define the metadata within a DBMS. Some examples of DDL operations include CREATE TABLE, CREATE INDEX, ALTER TABLE
, and DROP VIEW
.
DML operations modify data within a database. These are the kinds of operations you will be using most in this tutorial. DML is used to manipulate data. Records , or rows, are created, modified, and accessed via DML. Some examples of DML operations are INSERT INTO table VALUES (‘val1', ‘val2', ‘val3'), SELECT...FROM...JOIN...WHERE
.
Let's look more closely at an example of a DML command. The SELECT
statement specifies the output of a SQL query. In its simplest form, it identifies both the table and columns to select data from. The data returned from the SELECT
statement is called the result set.
The general syntax for a basic SELECT
statement is:
SELECT column1, column2,...
FROM table_name;
In the Starburst Galaxy query editor, it's time to view the data available for analysis in the astronauts table. This is one of the tables included in the sample dataset.
To do this, type the following query into the query editor and click the Run (limit 1000) button.
SELECT * FROM astronauts;
Now, imagine that you wanted to read the same data, but only from specific columns. Selecting all of the data would be an inefficient way of doing that. Luckily, you can restrict the columns that are returned by your query.
To do this, you will need to specify the columns by name that you would like to include. This is called projection.
The example below shows projection in action. Notice that the column names are separated by commas:
SELECT
name,
nationality,
mission_title,
mission_number,
hours_mission
FROM
astronauts;
LIMIT
and AS
clausesYou can control the output of a query in a number of different ways. For instance, you can restrict the number of results returned using a LIMIT, or change the names of the results columns themselves to make them more meaningful or useful using the AS clause.
Selecting all results works well, but it returns a lot of records. Sometimes, this might present a problem. Luckily, the LIMIT
clause exists. It specifies the number of rows that you would like to return.
In the code below, this is set to 10.
You can also use the AS
keyword to create an alias for each column.
Sometimes the existing column names are very long or aren't meaningful enough. Using AS
is a simple way to set the column name in the results to something different.
This alias only exists for the duration of the query.
SELECT
name AS full_name,
nationality,
mission_title,
mission_number,
hours_mission AS mission_duration
FROM
astronauts
LIMIT 10;
WHERE
clauseImagine that you wanted to restrict the results of your query to ensure that only results that satisfy a given condition were returned.
This would be an example of conditionality, and would apply logical conditions to your search results. If a record fits those conditions, it will be returned; if it does not fit those conditions, it will not be returned.
To accomplish this, you should use the WHERE
clause. The syntax for the command is shown below:
SELECT
column1,
column2, ...
FROM
table_name
WHERE
condition;
WHERE
clauseIt's time to try this out. The code below uses the WHERE
clause to query the missions
table and only return missions that are classified as a success.
SELECT
company_name,
status_rocket,
cost,
status_mission
FROM
missions
WHERE
status_mission = 'Success';
WHERE
and NOT LIKE
Do you notice anything interesting? There are many astronauts from both the U.S. and U.S.S.R. You can use the WHERE
clause in conjunction with the NOT LIKE
operator to return only the astronauts from countries other than the U.S. or U.S.S.R.
To do this, you'll also need to include a wildcard, which substitutes characters in a string. Two of the most common wildcards used with the LIKE
(or NOT LIKE
) operator are the percent sign (%
) and underscore (_
).
%
represents zero, one, or multiple characters._
represents one single character. If you are trying to search for countries that are not the U.S. or U.S.S.R. which wildcard do you think is needed? Notice that both countries start with the letters "U.S.", but one has extra characters where the other does not. For this reason, the %
wildcard is suitable.
Run the following query to see all of this in action:
SELECT
name,
nationality,
mission_title,
mission_number,
hours_mission
FROM
astronauts
WHERE
nationality NOT LIKE 'U.S.%';
There are several operators that can be used in the WHERE
clause. Use this table as your guide.
Operator | Description |
| greater than |
| less than |
| equal |
| greater than or equal |
| less than or equal |
| not equal |
| specifies the set of possible values |
| specifies a range of possible values |
| specifies a pattern |
ORDER BY
clauseImagine that you wanted to control the order in which results were displayed. To do this, you can use the ORDER BY
clause to sort a result set by one or more output expressions. These results are returned in ascending order by default. To sort the records in descending order, use the DESC
keyword, as shown in the example below.
It's time to test out this command.
ORDER BY
to view the results of the previous query in descending order of nationality and name.SELECT
name,
nationality,
mission_title,
mission_number,
hours_mission
FROM
astronauts
WHERE
nationality NOT LIKE 'U.S.%'
ORDER BY
nationality,
name DESC;
Aggregate functions operate on a set of values to compute a single result.
Some examples are:
COUNT()
- Returns the number of rows that satisfy a specific criterionAVG()
- Returns the average value of a numeric columnSUM()
- Returns the total sum of a numeric columnMAX()
- Returns the largest value in a specified columnMIN()
- Returns the smallest value in a specified columnIn the following example, you'll use one query to find three different aggregate values: the number of trips, the longest mission in hours, and the smallest mission in hours from countries outside of the U.S. and U.S.S.R.
Run the query to see the results:
SELECT
COUNT() as trips,
MAX(hours_mission) as longest_mission,
MIN(hours_mission) as shortest_mission
FROM
astronauts
WHERE
nationality NOT LIKE 'U.S.%';
GROUP BY
commandThe queries you've been using so far are great if you want to evaluate all the data together.
Now imagine that you want to increase the granularity level and evaluate the aggregates specifically for each country.
You can do this using the GROUP BY
clause, which groups rows that have the same values into summary rows.
Try it out yourself using the code below.
SELECT
nationality,
COUNT() AS number_trips,
MAX(hours_mission) AS longest_time,
MIN(hours_mission) AS shortest_time
FROM
astronauts
WHERE
nationality NOT LIKE 'U.S.%'
GROUP BY
nationality;
ORDER BY
commandNext, you're going to change the query to restrict the order in which the results are returned.
Add an ORDER BY
clause and sort the results by most trips per country. In the case of a tie, sort additionally by longest mission.
The result should resemble the code below.
SELECT
nationality,
COUNT() AS number_trips,
MAX(hours_mission) AS longest_time,
MIN(hours_mission) AS shortest_time
FROM
astronauts
WHERE
nationality NOT LIKE 'U.S.%'
GROUP BY
nationality
ORDER BY
number_trips DESC,
longest_time DESC;
ORDER BY
commandTime to try something new. Revisit the previous missions query, and add an aggregate function to determine who spent the most money on successful missions.
Order this query by the total highest cost using the code below.
SELECT
company_name,
SUM(cost),
status_rocket
FROM
missions
WHERE
status_mission = 'Success'
GROUP BY
company_name,
status_rocket
ORDER BY
SUM(cost) DESC;
The following exercises will allow you to practice what you've just learned.
If you're stumped, the solutions are provided for you in the next section of this tutorial.
SELECT
?The following answers provide the solutions to the questions asked in the last section of this tutorial.
Use them to check your answers or unblock yourself if you're stuck.
The * returns all columns from a table. To improve query performance, It is better practice to identify which columns you would like, rather than using the *.
The answer is 70. To get this answer, run the following query:
SELECT
name,
year_of_birth
FROM
astronauts
WHERE
year_of_birth <= 1930
The answer is Georgi Beregovoi and John H. Glenn Jr. To get this answer, you can run the following query:
SELECT
name,
year_of_birth
FROM
astronauts
ORDER BY year_of_birth;
The answer to Question 4 should use code similar to the following.
SELECT
id,
detail,
cost
FROM
missions
ORDER BY
cost DESC,
id DESC
LIMIT
100;
SELECT
count()
FROM
astronauts
WHERE
sex = 'female';
You can use the SELECT DISTINCT
statement to get the information you need. The query should return 56 rows, indicating 56 distinct companies.
SELECT DISTINCT
company_name
FROM
missions;
One of the most powerful uses of SQL is joins. Data is typically stored in multiple locations, and whether that means multiple tables, different schemas, or different data sources, connecting that data will usually require a join.
For Starburst Galaxy, joins are especially important because they allow you to federate your query across data residing in multiple storage locations. For example, you can join data from a data warehouse with data from a data lake. Although this data is housed in entirely different systems, you can access it as if it resided in a single system.
Excited? Let's begin exploring SQL joins.
A SQL JOIN
clause is used to combine data from two or more tables based on a common field or column between them.
There are four basic types of joins that will be covered in this tutorial: INNER JOIN, LEFT JOIN, RIGHT JOIN
, and FULL JOIN
.
(INNER) JOIN
returns records that have matching values in both tables.LEFT (OUTER) JOIN
returns all records from the left table and the matched records from the right table, adding NULL
for missing matches on the right side.RIGHT (OUTER) JOIN
returns all records from the right table and the matched records from the left table, adding NULL
for missing matches on the left side.FULL (OUTER) JOIN
combines the results of LEFT JOIN
and RIGHT JOIN
. The results contain all rows from both tables, adding NULL
for missing matches on either side.Now it's time to push further in your SQL journey with Starburst Galaxy.
In this section, you will use a different catalog and schema that are more suited to illustrating joins.
The tpch
and tpcds
datasets are standard sample data used for benchmarking and testing. You can read more about the origin of these datasets and the specific data structures in the TPC documentation. Both datasets are included in Starburst Galaxy's standard sample dataset.
The tpch.sf1
schema contains the following tables: customer, lineitem, nation, orders, part, partsupp, region
, and supplier
.
tpch
and the schema to sf1
. INNER JOIN
An INNER JOIN
selects records that have matching values in two tables. This type of join can be depicted in the Venn diagram below. For those familiar with set theory, an INNER JOIN
is the intersection of the two tables.
The syntax for an INNER JOIN
is:
SELECT
column_name (s)
FROM
table1
INNER JOIN table2 ON table1.column_name = table2.column_name;
customer
and orders
tablesLet's take a look at this join in action using the customer
and orders
tables.
Suppose you wanted to join the customers and their orders from the two tables.
First, run the following query to take a look at the first five rows from each of these tables. Don't forget that the data is being generated dynamically, so some of the entries may look a bit strange.
SELECT *
FROM customer
LIMIT 5;
SELECT *
FROM orders
LIMIT 5;
customer
tableThe output you return should look similar to the following.
order
tableThe output you return should look similar to the following.
customer
and order
table using INNER JOIN
A quick glance at the two tables shows that custkey
is the common column across both tables, so that can be used to join them.
Run the following query to try it out:
SELECT
orders.orderkey,
customer.name
FROM
orders
INNER JOIN customer ON orders.custkey = customer.custkey;
LEFT (OUTER) JOIN
A LEFT (OUTER) JOIN
returns all records from the left table and the matched records from the right table. The left table is the first table that is listed in the query, while the right table is the second.
The syntax for a LEFT JOIN
is:
SELECT
column_name(s)
FROM
table1
LEFT JOIN table2 ON table1.column_name = table2.column_name;
The following Venn Diagram is a visual representation of a LEFT JOIN
:
This example will look very similar to the one for an INNER JOIN
. This time, the query will return all customers and any orders they might have, ordered by customer name. Because it is a LEFT JOIN
, all records from the left table will be returned, even if there are no matching records on the right table. Run the following query to see the results:
SELECT
customer.name,
orders.orderkey
FROM
customer
LEFT JOIN orders ON customer.custkey = orders.custkey
ORDER BY
customer.name;
RIGHT (OUTER) JOIN
A RIGHT (OUTER) JOIN
returns all records from the right table and the matched records from the left table.
The syntax for a RIGHT JOIN
is:
SELECT
column_name (s)
FROM
table1
RIGHT JOIN table2 ON table1.column_name = table2.column_name;
The following Venn Diagram is a visual representation of a RIGHT JOIN
:
This example will utilize another table from the tpch.sf1
schema - the nations
table.
Take a look at the first five rows of the table to get an idea of its composition:
nationkey | name | regionkey | comment |
0 | ALGERIA | 0 | haggle. carefully final deposits detect slyly agai |
1 | ARGENTINA | 1 | al foxes promise slyly according to the regular accounts. bold requests alon |
2 | BRAZIL | 1 | y alongside of the pending deposits. carefully special packages are about the ironic forges. slyly special |
3 | CANADA | 1 | eas hang ironic, silent packages. slyly regular packages are furiously over the tithes. fluffily bold |
18 | CHINA | 2 | c dependencies. furiously express notornis sleep slyly regular accounts. ideas sleep. depos |
RIGHT JOIN
on customer
and nations
tablesNotice that the nationkey
column is also found in the customer
table. Let's use that to join the customer
and nations
tables.
A RIGHT JOIN
will return all customers and their nation of origin, if that information is available.
Run the following query to experiment with a RIGHT JOIN
.
SELECT
nation.name,
customer.name
FROM
nation
RIGHT JOIN customer ON nation.nationkey = customer.nationkey
ORDER BY
customer.name;
A FULL OUTER JOIN
combines the results of a LEFT JOIN
and RIGHT JOIN
. In other words, it returns all records from both tables, regardless of whether there is a match.
The syntax for a full join is:
SELECT
column_name (s)
FROM
table1
FULL OUTER JOIN table2 ON table1.column_name = table2.column_name;
The following Venn Diagram is a visual representation of a FULL JOIN
:
FULL
(OUTER)
JOIN
on customer
and nations
tablesNow it's time to try out FULL
(OUTER) JOIN
on our dataset.
Run the following query to return the first twenty rows from the customer
and orders
tables, with or without matches.
SELECT
customer.name,
orders.orderkey
FROM
customer
FULL JOIN orders ON customer.custkey = orders.custkey
ORDER BY
customer.name
LIMIT 20;
The following exercises will allow you to practice what you've just learned.
If you're stumped, the solutions are provided for you in the next section of this tutorial.
customer
as table 1 and orders
as table 2?customer
and orders
tables. You ran it before during the tutorial. This time, you're going to look at it again to find out some new information. NULL
value in the orderkey
column in the results set?SELECT
customer.name,
orders.orderkey
FROM
customer
FULL JOIN orders ON customer.custkey = orders.custkey
ORDER BY
customer.name
LIMIT 20;
custkey
14.The following answers provide the solutions to the questions asked in the last section of this tutorial.
Use them to check your answers or unblock yourself if you're stuck.
The results of the join would be the same, regardless of join order.
Because this is a FULL JOIN
, all rows are returned, regardless of whether there are matches. The NULL
value means that for that particular customer, there are no orders.
Use an INNER JOIN
for this one, joining the part, partsupp,
and supplier
tables.
SELECT
supplier.name AS supplier,
part.name AS part,
part.retailprice AS price
FROM
part
INNER JOIN partsupp ON part.partkey = partsupp.partkey
INNER JOIN supplier ON partsupp.suppkey = supplier.suppkey
WHERE
part.retailprice < 920
This is a LEFT JOIN
, with orders as the left table.
SELECT
customer.name,
COUNT(orders.orderkey) AS order_count
FROM
orders
LEFT JOIN customer ON customer.custkey = orders.custkey
WHERE
customer.custkey = 14
GROUP BY
customer.name;
Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.
You have learned how Starburst Galaxy uses SQL to enact powerful queries, including different types of joins. Although all of the exercises in this tutorial focused on joining data within a single database, Starburst also allows you to easily combine data from multiple data lakes, data warehouses, and databases, even those that are non-relational. Check out some of our other tutorials for more information.
At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.
Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.
Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!