A bird and a bunny walk into a bar…
The bird says, “I’m the Python dataframe library with tons of optionality”. The bunny says, “I’m the SQL engine with heaps of optionality”. They instantly became BFFs. The end.
OK… that “joke” did stink, but using Ibis with Trino as the backend is the complete opposite. It’s optionality² and that’s a pretty sweet thing!!
This post retraces pystarburst (the dataframe api), but this time using Ibis. As before, it is NOT attempting to teach you everything you need to know about the DataFrame API, but it does provide some insight into the subject matter. Let’s get into it!
Setup your environment
Like before, I’m assuming you already have Python3 and pip installed (go here if you need some help). We can start by installing Ibis and dependencies for the Trino backend.
$ pip install 'ibis-framework[trino]'
I’m using these instructions to continue on with the setup activities and will be using my personal Starburst Galaxy tenant to test with. Unlike in the instructions, I’m just hard-coding my connection details in my boilerplate code (masked, of course).
import os
import ibis

# Print expression results eagerly instead of deferring execution
ibis.options.interactive = True

# Masked Starburst Galaxy connection details
user = "lXXXXXXXXm/aXXXXXXXXXn"
password = "<password>"
host = "lXXXXXXXXXo"
port = "443"
catalog = "tpch"
schema = "tiny"

# Connect to the Trino backend (Starburst Galaxy) pointed at the tpch.tiny schema
con = ibis.trino.connect(
    user=user, password=password, host=host, port=port,
    database=catalog, schema=schema
)

# Grab the region table and preview the first five rows
tiny_region = con.table("region")
print(tiny_region[0:5])
Test the boilerplate code
I’m using the CLI, but you could easily run this in your favorite web-based notebook.
$ python3 ibis-main.py
Yep, we basically ran a simple SELECT statement and we can verify in Starburst Galaxy’s Query history screen that it executed.
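If you’re curious what SQL actually got shipped to the engine, Ibis can compile an expression without running it. Here’s a quick sketch using ibis.to_sql; the rendered text should line up with (though may not exactly match) what Query history shows.
# Compile the preview expression to Trino SQL without executing it
print(ibis.to_sql(tiny_region[0:5]))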
Explore the API
There’s a lot of solid documentation on the Ibis site, but the Basic operations page serves as good a starting point as any to get your hands dirty writing some code.
If you’re interested in a side-by-side comparison of the Ibis DataFrame API code with the PyStarburst version, just pull up my pystarburst (the dataframe api) blog alongside this one, since I’m porting that code to work with Ibis.
Select a full table
custDF = con.table("customer")
print(custDF.head(10))
Use projection
projectedDF = custDF.select("name", "acctbal", "nationkey")
print(projectedDF.head(10))
Filter the rows
filteredDF = projectedDF.filter(projectedDF["acctbal"] > 9900.0)
print(filteredDF.head(100))
Notice that even though up to 100 records were requested to be displayed, only 7 records meet this criterion.
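If you’d rather have the engine confirm that number instead of counting rows in the output, the standard count() aggregation does the trick; a quick sketch:
# Ask Trino how many customers clear the 9900.0 balance threshold
print(filteredDF.count())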
Select a second table
Let’s create a DataFrame from the nation table to later join with customer. In the example below, we are chaining methods together instead of assigning each output to a distinct variable as we have done up until now.
# Grab new table, drop 2 cols, and rename 2 others
nationDF = con.table("nation") \
    .drop("regionkey", "comment") \
    .rename(
        dict(
            nation_name="name",
            n_nationkey="nationkey"
        )
    )
print(nationDF.head(10))
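As a side note, in recent Ibis releases rename() also accepts the substitutions directly as new_name="old_name" keyword arguments, so the dict wrapper isn’t required. An equivalent sketch:
# Same drop/rename, expressed with keyword arguments instead of a dict
nationDF = con.table("nation") \
    .drop("regionkey", "comment") \
    .rename(nation_name="name", n_nationkey="nationkey")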
Join the tables
This is the EXACT same syntax used in PyStarburst (and yes, PySpark, too).
joinedDF = filteredDF.join(nationDF,
    filteredDF.nationkey == nationDF.n_nationkey)
print(joinedDF.head(10))
As you can see, the join results have all of the columns from both DataFrames.
Project the joined result
Like before, we can just discard the unwanted columns.
projectedJoinDF = joinedDF.drop("nationkey", "n_nationkey")
print(projectedJoinDF.head(10))
Apply a sort
This approach looks a little bit different than PyStarburst’s, but it is easy enough to follow.
orderedDF = projectedJoinDF.order_by([ibis.desc("acctbal")])
print(orderedDF.head(10))
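If you prefer working with column expressions over the ibis.desc() string helper, the column’s own desc() method produces the same ordering; a small sketch:
# Equivalent sort written against the column expression itself
orderedDF = projectedJoinDF.order_by(projectedJoinDF["acctbal"].desc())
print(orderedDF.head(10))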
Put it all together
While the creation of multiple DataFrame objects was used above, in practice (as discussed when fetching the nation table) most DataFrame API programmers chain many methods together to look a bit more like this.
nationDF = con.table("nation") \
    .drop("regionkey", "comment") \
    .rename(
        dict(
            nation_name="name",
            n_nationkey="nationkey"
        )
    )
custDF = con.table("customer") \
    .select("name", "acctbal", "nationkey") \
    .filter(ibis._.acctbal > 9900.0)
apiSQL = custDF.join(nationDF,
        custDF.nationkey == nationDF.n_nationkey) \
    .drop("nationkey", "n_nationkey") \
    .order_by([ibis.desc("acctbal")])
print(apiSQL.head(10))
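One last practical note: interactive mode is only printing previews. When you want the rows back in your Python program, an Ibis expression can be executed into a pandas DataFrame, assuming pandas is available in your environment; a minimal sketch:
# Materialize the final 10 rows into a pandas DataFrame on the client
resultDF = apiSQL.head(10).to_pandas()
print(resultDF)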
Comparing with PyStarburst
The results are the same as in the original PyStarburst post whose code we just ported over to Ibis. The APIs are different enough that you would likely want to pick one and stick with it instead of trying to use both daily. For me personally, I’m currently in the PyStarburst camp.
Sure, I work at Starburst, which has SOMETHING to do with that, but it is really because it lines up more closely with the PySpark implementation I have spent a number of years working with. If I were starting from scratch, I wouldn’t have the prior experience to drive me to think this way, and I would likely do more personal research and comparison.
This surely is NOT a dig against Ibis and the optionality it offers of running the same DataFrame program against multiple backend SQL engines. That is an INCREDIBLE foundational feature of this framework. Fortunately, Trino’s connector architecture and long list of integrations offer that same kind of optionality at the SQL engine layer.
Of course, if there is a backend on Ibis’s supported list that you need this flexibility for and that is not an existing Trino integration (quack-quack comes to mind, but it isn’t the only one on that list not currently integrated with Trino), then you’ll have to take a hard look at Ibis.
For Trino (and Starburst) fans like myself, it is surely a win if you choose to go down the Ibis path, as we do believe in this integration and only want to make it better over time.
As for performance & optimization… I was running all these simple examples on the TPCH connector (and the tiny schema at that), which absolutely does not allow any inference to be made from the limited set of examples I ran for this blog post. One would assume that the CBO would ultimately decide to tackle the problem the same way regardless of which DataFrame API implementation was used.
In fact, we ended up getting a VERY SIMILAR query plan for PyStarburst and Ibis as expected. The DAG on the left is from PyStarburst and the one on the right from the Ibis invocation.
If those pictures look like hieroglyphics and all that CBO & DAG talk was mumbo-jumbo, and you want to learn more, check out these free training modules from Starburst Academy.
Back to the visualizations. Yes, the text is amazingly small (and fuzzy) and is almost completely unreadable, but I do see something on the left that didn’t happen on the right.
The PyStarburst execution ended up running ScanFilterProject operators instead of the plain ScanFilter operators that Ibis produced. Again… do NOT read anything into this, especially with the data generator connector I was using. It just lets me know I need to do some more side-by-side research.
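If you want to poke at the plans yourself without squinting at screenshots, one option is to compile the Ibis expression and hand it to EXPLAIN. A rough sketch, assuming you use the connection’s raw_sql escape hatch (it returns a cursor that should be closed):
# Compile the chained expression to Trino SQL and ask the engine for its plan
sql = ibis.to_sql(apiSQL)
cur = con.raw_sql(f"EXPLAIN {sql}")
for row in cur.fetchall():
    print(row[0])
cur.close()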
For that, pushing the TPCH-generated data into a modern table format like Iceberg and using a bigger schema could then offer some real testing opportunities.
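For anyone who wants to try that, the same raw_sql hook can run a cross-catalog CTAS to copy the generated data over. A hypothetical sketch, where lakehouse.demo stands in for an Iceberg-backed catalog and schema you would create in Galaxy:
# Copy a TPCH table into a (hypothetical) Iceberg catalog for more realistic testing
cur = con.raw_sql(
    "CREATE TABLE lakehouse.demo.customer AS SELECT * FROM tpch.tiny.customer"
)
cur.close()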
All in all, I’m happy I stayed up until 2:00 am on a work night doing some exploration and sharing my initial findings.
Originally published here.