I am trying to understand support for Python in Trino/Starburst. Based on the Python Clients page in the Starburst documentation, it seems that Python support is based on an implementation of the Python Database API (DBAPI).
Questions -
- Does this implementation allow the use of Python DataFrames and packages like NumPy and scikit-learn in the client app?
- How different is the recently launched support for PySpark from what existed earlier? Does it mean that PySpark apps can run against Starburst without needing Spark JARs and JDBC drivers on the client side? This page still mentions the following -
> PySpark requires Spark JARs as well as a JDBC driver. This leaves your SQL query two layers removed from a direct DBAPI implementation.
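For reference, the two-layer setup that quote describes looks like this in practice - a PySpark read going through the Trino JDBC driver. The hostname, user, table, and JAR path below are placeholders, not values from the docs:

```python
from pyspark.sql import SparkSession

# Spark itself plus the Trino JDBC driver JAR must be present on the client.
spark = (
    SparkSession.builder
    .appName("trino-jdbc-example")
    .config("spark.jars", "/path/to/trino-jdbc-<version>.jar")  # placeholder path
    .getOrCreate()
)

# The query travels Spark -> JDBC -> Trino: two layers removed from DBAPI.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:trino://trino.example.com:8443/tpch/tiny")  # placeholder endpoint
    .option("driver", "io.trino.jdbc.TrinoDriver")
    .option("dbtable", "nation")
    .option("user", "alice")  # placeholder user
    .load()
)
df.show()
```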
The Python library being referred to provides DataFrame APIs that execute lazily on the cluster. It is a goal to have it interoperate with Python libraries like the ones you're discussing.
The current DBAPI-based Python client libraries are purely for SQL access, and it is not recommended to use PySpark with them because of the performance impact (the JDBC and Spark dependencies).
This is true - but without Spark, the libraries have no performance impact - you're just writing SQL to access Trino.
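A minimal sketch of that point using the open-source `trino` DBAPI client (`pip install trino`); the connection details here are hypothetical. The client just ships SQL and fetches plain Python rows, and packages like pandas and NumPy then work on the results as they would on any other Python data:

```python
import trino
import pandas as pd

# Hypothetical coordinator details; any Trino/Starburst cluster works the same way.
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=8080,
    user="alice",
    catalog="tpch",
    schema="tiny",
)

cur = conn.cursor()
cur.execute("SELECT nationkey, name, regionkey FROM nation")

# Standard DBAPI: fetch rows, read column names from cursor.description.
rows = cur.fetchall()
columns = [desc[0] for desc in cur.description]

# From here on it is ordinary client-side Python: pandas, NumPy, scikit-learn, etc.
df = pd.DataFrame(rows, columns=columns)
print(df.describe())
```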
The new Python library, announced in February and in private preview right now, will interoperate with standard Python packages like NumPy, etc., and will allow native execution of PySpark.
The new library doesn't leverage Spark at all. It implements DataFrame APIs that "compile" to SQL to run against Trino. Some of the APIs may look similar to PySpark's DataFrame APIs, but Spark itself is not involved here.
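To make the "compile to SQL" idea concrete, here is a toy sketch - not the preview library's actual API, just an illustration of how DataFrame-style calls can lazily build up a SQL statement that only reaches Trino at execution time:

```python
class LazyFrame:
    """Toy DataFrame that records operations and compiles them to SQL lazily."""

    def __init__(self, table, columns="*", predicates=None):
        self.table = table
        self.columns = columns
        self.predicates = predicates or []

    def select(self, *cols):
        # No query runs; we just record the projection.
        return LazyFrame(self.table, ", ".join(cols), self.predicates)

    def filter(self, predicate):
        # No query runs; we just record the filter.
        return LazyFrame(self.table, self.columns, self.predicates + [predicate])

    def to_sql(self):
        # Only here is SQL produced; a real library would send it to Trino.
        sql = f"SELECT {self.columns} FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(self.predicates)
        return sql

frame = LazyFrame("tpch.tiny.nation").filter("regionkey = 1").select("name")
print(frame.to_sql())
# SELECT name FROM tpch.tiny.nation WHERE regionkey = 1
```

The chained calls look Spark-like, but everything reduces to a SQL string executed by Trino; no Spark runtime, JARs, or JDBC driver is involved.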