Last Updated: 2024-03-04
Google Sheets is a web-based spreadsheet application developed by Google as part of the Google Suite. Like other spreadsheet applications, it stores data in rows and columns.
Starburst Galaxy is able to connect to Google Sheets data using a Google Sheets catalog. Once connected, this data can be queried just like any other datasource.
In this tutorial, you will learn how to configure a Google Sheets catalog in Starburst Galaxy. To facilitate testing, we've provided sample data in a CSV file.
The first part of this tutorial will walk you through the process of uploading the CSV file to a new Google Sheet. You will then create a catalog in Starburst Galaxy and connect it to your Google Sheet. Finally, you will run a few queries against that catalog to learn more about the nuances of querying a spreadsheet.
Once you've completed this tutorial, you will be able to:
Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.
As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.
It's time to get started, and you're going to begin by getting familiar with the data. The sample data for this tutorial is stored as a CSV file in Google Drive.
You're going to open this CSV file using Google sheets.
You're going to begin by signing in to your Google account and accessing the sample CSV file using Google Drive.
A Google service account JSON key is required to configure secure access between Starburst Galaxy and your Google Sheet. As part of this access configuration, you need to share your Google Sheet with your Google service account email address. This step will show you how to find that email address.
Now that you have your service account email, it's time to share your new Google Sheet with your service account.
Now it's time to pivot over to Starburst Galaxy to begin the process of connecting your Google Sheets spreadsheet.
Starburst Galaxy separates users by role. Configuring a new catalog will require access to a role with appropriate privileges. Today, you'll be using the accountadmin role.
This is a quick step, but an important one.
Sign into Starburst Galaxy in the usual way. If you have not already set up an account, you can do that here.
Your current role is listed in the top right-hand corner of the screen.
Now it's time to create a new Google Sheets catalog in Starburst Galaxy. This catalog will allow Starburst Galaxy to connect to the data held in your Google Sheet.
The steps below will show you how to start the process of configuring a new catalog.
Starburst Galaxy allows the creation of catalogs for a number of different data sources. In this case, you are going to create a new catalog in the Google Sheets category.
The new catalog needs both a name and description. This ensures that you can find it later.
When you connect Starburst Galaxy to a new datasource, it is necessary to undergo an authentication process. This helps ensure that you are connecting the right data source and that you have the appropriate permissions.
Now it's time to provide connection details. In this case, you'll need the JSON key from your Google service account to proceed.
You're almost there! Time to test the connection and then complete the process of creating your new Google Sheets catalog.
Notice that the read-only option is already selected for you, and you are unable to change it. This is because Google Sheets is a read-only catalog.
The catalog is now created, but you are not able to use it until you add it to a cluster. Clusters provide the resources to run queries against catalogs.
You can add your catalog to an existing cluster or create a new cluster.
Now it's time to test out your new catalog by writing queries to access data from Google Sheets.
When you query a Google Sheet catalog in Starburst Galaxy, the basic format of the query looks like this:
SELECT *
FROM
TABLE (catalog_name.system.sheet (id => ‘googleSheetId'));
The two pieces of information you must supply are the catalog_name
and googleSheetId
. Let's begin by locating your googleSheetId
for the sample CSV file.
A Google Sheet ID is a unique string identifier associated with an individual Google Sheets file. Each Google Sheet has one unique ID, which can be used as a way of referencing that Google Sheet to query the data inside it.
The easiest way to find the Google Sheet ID is by inspecting the URL of the Google Sheet in question. The ID is a part of every Google Sheets URL. This string can then be copied directly from your browser.
You're all set to query your Google Sheets catalog. Let's begin with a simple SELECT *
to return all columns.
catalog_name
with your Google Sheets catalog name and googleSheetId
with the Google Sheet ID that you recorded in the last step.SELECT *
FROM
TABLE (catalog_name.system.sheet (id => ‘googleSheetId'));
So far, so good. But in production environments, you're more likely to query a range of cells, rather than the entire dataset.
To do this, you will need to add the tab name to your SQL statement.
Let's write a query that will return cells A1 through B4 from the Google Sheet.
catalog_name
with your Google Sheets catalog name.googleSheetId
with your Google Sheet ID.TabName
with the name of the tab you're querying in the Google Sheet. SELECT *
FROM
TABLE(catalog_name.system.sheet(
id => 'googleSheetId',
range => 'TabName!A1:B4'));
When specifying a range of cells in a query, it is important to be careful. The Google Sheets connector assumes that the first row in the range is the header row. This can sometimes have unintended consequences.
This is especially important when you are joining a Google Sheet to a table in another data source.
Let's take a look at an example for reference.
catalog_name
with your Google Sheets catalog name. googleSheetId
with your Google Sheet ID. TabName
with the name of the tab you're querying in the Google Sheet. SELECT *
FROM
TABLE(catalog_name.system.sheet(
id => 'googleSheetId',
range => 'TabName!A2:B5'));
One of the most powerful things about Starburst Galaxy is its ability to easily join data in Google Sheets with other data sources.
Let's check out this functionality by joining data from the customer
Google Sheet with data from the tpch
catalog built into Starburst Galaxy.
catalog_name
with your Google Sheets catalog name. googleSheetId
with your Google Sheet ID. SELECT t.name, t.phone, o.orderkey, o.orderstatus
FROM
TABLE(catalog_name.system.sheet(
id => 'googleSheetId')) t
JOIN tpch.tiny.orders o ON t.custkey = t.custkey;
Now it's time to run one more federated query, but this time you'll add some conditional logic to restrict the results to the top ten most expensive orders in descending order.
catalog_name
with your Google Sheets catalog name. googleSheetId
with your Google Sheet ID. SELECT t.name, t.phone, o.orderkey, o.orderstatus, o.totalprice
FROM
TABLE(catalog_name.system.sheet(
id => 'googleSheetId')) t
JOIN tpch.tiny.orders o ON t.custkey = t.custkey
WHERE o.orderstatus = 'O'
AND o.totalprice > 100000
ORDER BY o.totalprice DESC
LIMIT 10;
Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.
You're all set! Now you can query the data in your Google Sheets spreadsheet.
At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.
Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.
Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!