Starburst Lakehouse: Data Warehouse Functionality, Without The Cost
Dan Brault
Director, Product Marketing
Starburst
Next-Gen data management and analytics strategies
We’ve all lived it. Heard it. Adapted to it. The next analytics strategy with numerous ‘modern’ technologies to solve the endless pain of data management and analytics.
Many data and analytics professionals have long recognized that innovation in data management and analytics technologies has outpaced our ability to uniformly adopt them. We aren’t all “_______” users, although it may feel like that at times. This is a good thing!
Data and analytics needs vary from enterprise to start-up
Every company is different, and there is no one-size-fits-all solution for the uniqueness of every data decision and the complexity of our business problems. The analytic needs, and the technologies that best solve them, may look very different for a large global enterprise on a five-year digital transformation journey than for a cloud-native start-up.
Generally, many organizations aren’t ‘in a camp’. They don’t have one stack. They aren’t modern. They aren’t a perfect ‘mesh’ (yet). And they certainly don’t have a single source of truth.
Lakehouse leverages the benefits of a data warehouse and a data lake
Our goal is to meet customers where they are today, and help them move forward tomorrow with the analytics strategy that best serves them. Where is one intersection? The Starburst lakehouse, which brings the functionality of the warehouse to end users with the scale and lower cost of a data lake.
Here, the business logic is applied directly to the data in the lake, rather than after it has been moved to a staging area for a cloud data warehouse. Data ‘lands’, is ‘structured’, and is ‘consumed’ all in one location; this is a transition from ETL to ELT.
By combining the benefits of a data warehouse with the benefits of a data lake, organizations can reduce costs in addition to achieving a higher level of efficiency and performance.
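The land, structure, and consume flow described above can be sketched in a few lines. This is only an illustration, not Starburst code: it uses Python's built-in sqlite3 module as a stand-in for the lake and query engine, and the table names (raw_orders, orders_clean), the columns, and the sample rows are all hypothetical.

```python
import sqlite3


def run_elt(raw_rows):
    """Land raw rows, structure them with SQL in place, and return the consumable result."""
    conn = sqlite3.connect(":memory:")  # stand-in for the lake plus its query engine

    # Land: load the data as-is, with no transformation on the way in.
    conn.execute("CREATE TABLE raw_orders (order_id TEXT, region TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

    # Structure: apply the business logic where the data already lives.
    conn.execute(
        """
        CREATE TABLE orders_clean AS
        SELECT CAST(order_id AS INTEGER) AS order_id,
               UPPER(TRIM(region))       AS region,
               CAST(amount AS REAL)      AS amount
        FROM raw_orders
        WHERE amount IS NOT NULL AND amount <> ''
        """
    )

    # Consume: query the structured table directly, in the same location.
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM orders_clean GROUP BY region ORDER BY region"
    ).fetchall()
    conn.close()
    return rows


# Hypothetical raw data: untrimmed regions and one row with a missing amount.
raw = [("1", " emea ", "10.5"), ("2", "EMEA", "4.5"), ("3", "apac", ""), ("4", "apac", "20")]
result = run_elt(raw)
print(result)
```

The point of the sketch is the ordering: the raw data is loaded first and transformed afterwards with SQL, instead of being transformed in a separate staging system before loading.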
Capabilities and benefits of a data lakehouse
Admittedly, a data lakehouse may not be the best analytics strategy for all companies, but it has far-reaching benefits for many. It is particularly useful to organizations with a more robust data science team, data-native companies, and those who use data as a differentiator to create a better experience for their customers. The efficiency of analyzing data at the source, or as close to the source as you can get, has a profound impact on time-to-insight, meeting the demands of today’s competitive market.
Optionality to transform any data lake into actionable insights
In most situations, having options is a great thing. It gives you leverage to move in different directions given future information to achieve the best possible outcome. Your data, and the technology investments to turn that data into actionable insights, are no different.
The first critical step toward freedom of choice and optionality in a lakehouse is embracing the separation of storage and compute. This brings a new level of cost and performance efficiency: data can stay at the source in object storage (Amazon S3, ADLS, GCS, Cloudera, IBM COS, MinIO, and more), and you simply scale your compute up and down as needed.
Starburst’s connectivity to all major data lakes allows our customers to only pay for what they use, minimize data duplication, control the dial on cost/performance, and use multiple platforms to access the same data.
Open file formats allow organizations to perform data warehouse-like queries in a data lake. These file formats (ORCFile, Parquet, Avro, SequenceFile, CSV, etc.) allow you to store data in a file system, and the columnar, read-optimized ones such as ORCFile and Parquet still deliver incredibly fast performance when performing analytics via SQL.
Open table formats like Apache Iceberg and Delta Lake allow users to interact with the data lake as easily as they would with a database, using SQL. Coupled with Hive, these open table formats try to solve problems typically associated with data lakes. Apache Iceberg and Delta Lake allow more analytics to be served out of the lake and reduce the need for data movement/migration, which provides substantial cost savings.
Additionally, these open table formats provide performance benefits over traditional Hive-style tables built directly on ORCFile and Parquet files. Data skipping and improved partition handling are just some of the many benefits offered by Iceberg and Delta Lake. Combined with Starburst’s best-in-class query engine, these open table formats can unlock the data lakehouse for organizations.
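To make data skipping concrete, here is a minimal sketch of the general idea: the table metadata keeps minimum and maximum values per data file, so a query can rule out files whose value range cannot match its filter. This is a simplified illustration and not how Iceberg, Delta Lake, or Starburst implement it; the FileStats class, the files_to_scan function, and the file names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for one column, as a table format might record them."""
    path: str
    min_value: int
    max_value: int


def files_to_scan(files, lower, upper):
    """Return only the files whose [min, max] range overlaps the filter range [lower, upper]."""
    return [f for f in files if f.max_value >= lower and f.min_value <= upper]


# Hypothetical metadata for three data files of one table.
manifest = [
    FileStats("part-0.parquet", 1, 100),
    FileStats("part-1.parquet", 101, 200),
    FileStats("part-2.parquet", 201, 300),
]

# A filter such as "value BETWEEN 150 AND 180" only needs the middle file.
selected = files_to_scan(manifest, 150, 180)
print([f.path for f in selected])
```

Files that are skipped are never opened at all, which is why this kind of metadata can matter more as a table grows.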
Starburst Galaxy’s Great Lakes connectivity makes working with different file and table formats seamless by providing unified connectivity that handles all of these formats (and whatever comes in the future…ahem… Hudi) and allows you to query them using regular SQL. In Starburst Galaxy, you simply choose which object store to connect with, and we handle it all behind the scenes for you.
Maybe you’re pretty committed to Delta and S3. That’s great. But will you always be? What if there’s a better storage source and file format in the future? What if you acquire a company on a different cloud, in a different region, or, heaven forbid, on-premises? Decoupling users from the infrastructure prepares you for whatever the future brings. Not only does Starburst support access to your sources and formats today, but you can be sure we’ll support future innovations here as well.
Furthermore, Starburst Stargate provides a gateway for unlocking data access across geographies while ensuring that access controls and data residency requirements are honored. This means we can link catalogs and data sources supported by one cluster (e.g., AWS West) to catalogs and data sources in remote clusters (e.g., GCP East). No more cloud data lock-in, and customers take back control of high egress fees.
Future-proof architecture. That’s real optionality.
Batch processing saves valuable time
We recently announced the public preview of fault-tolerant query execution in both Starburst Galaxy and Starburst Enterprise. These capabilities enable lakehouse use cases that include building large rollup tables, preparing datasets for machine learning models, and wrangling data that feeds into data applications.
Trino is able to achieve incredibly fast speeds by prioritizing in-memory execution, along with lots of other performance enhancements like highly optimized ORC/Parquet readers, columnar reads, predicate pushdown, and lazy reads to name a few.
Data engineers’ time is valuable, so we focus on letting data engineers write business logic at the speed of thought. Because interactive and ETL analytics share one standard SQL dialect, engineers no longer need to learn different SQL dialects depending on the size of the job.
Data engineers can iteratively test SQL queries as they develop complex data pipelines because the coordinator is always up and waiting for a query, as opposed to the lazily started coordinator that many other engines use, which takes longer to boot up.
Starburst Enterprise also provides query plan analysis so that data engineers can quickly debug issues, as well as access controls, audit logs, and query history to ensure regulatory compliance. We offer integrations with popular ELT tools like Airflow and dbt, so that you can use the tools you love with one engine for interactive and long-running batch workloads.
Security and data access ensure data is accessible, available, discoverable, and secure
Of course, data access and security go hand-in-hand. We recently received ISO 27001 certification, a globally recognized security standard with a heavy focus on risk management based on controls, tracking, and addressing concerns. It applies to the entire Starburst business, whereas the SOC 2 certification looks at a product.
While Starburst had already obtained SOC 2 Type 1 certification, we decided to move forward with a Type 2 because the Type 1 covers a point in time, while a Type 2 is more rigorous in showing that we have used these controls over a span of time.
This certification also means that we have passed the auditor’s review of the controls that we put in place to safeguard our customer data, and of how well those controls work.
Starburst Enterprise offers fine-grained access control that meets the strictest enterprise standards. This includes various cluster security options and numerous authentication methods, to meet our customers where they are with security as well.
Our customers are not forced to adopt our security protocols: we integrate with Apache Ranger, Sentry, and robust security and governance platforms like Immuta and Privacera. Our native security meets the needs of most organizations, as the built-in role-based access control (RBAC) system provides table-, column-, and row-level access control. It’s easy to configure and apply the correct access rights, from catalogs down to individual schemas, tables, and views, ensuring the right people have access to the right data.
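As a rough illustration of what table-, column-, and row-level access control means in practice, the toy sketch below filters a table according to a role's grant. It does not use or reflect Starburst's actual access control API; the read_table function, the grants mapping, the role names, and the sample data are all hypothetical.

```python
def read_table(role, table, rows, grants):
    """Return only the rows and columns that the role's grant on the table allows."""
    grant = grants.get((role, table))
    if grant is None:
        # Table level: no grant means no access at all.
        raise PermissionError(f"role {role!r} has no access to table {table!r}")
    allowed_columns, row_filter = grant
    return [
        # Column level: drop any column the role may not read.
        {col: value for col, value in row.items() if col in allowed_columns}
        for row in rows
        # Row level: keep only rows the role may see.
        if row_filter(row)
    ]


# Hypothetical table contents.
customers = [
    {"id": 1, "region": "EU", "email": "a@example.com"},
    {"id": 2, "region": "US", "email": "b@example.com"},
]

# Hypothetical grant: EU analysts see id and region, for EU rows only.
grants = {
    ("eu_analyst", "customers"): ({"id", "region"}, lambda row: row["region"] == "EU"),
}

visible = read_table("eu_analyst", "customers", customers, grants)
print(visible)
```

In a real deployment these rules are declared once as policy and enforced by the engine for every query, rather than applied in application code as shown here.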
Starburst Galaxy is also designed with security in mind, supporting many authentication types for both data sources (AWS IAM, Azure service principal, Okta, etc.) and client integrations (OAuth, JWT, etc.). It includes a powerful role-based access control system that spans from the cluster to tables and schemas. With end-to-end encryption, and more security enhancements in development, Starburst Galaxy provides the foundation for a secure and flexible lakehouse.
Performance and SQL data transformations
Starburst and Trino are ANSI SQL compliant, which allows users to use standard SQL and their favorite data tools, including BI and ETL tools, with any underlying data source. The documentation includes a complete SQL statement and syntax reference covering the numerous SQL functions and operators available. In addition to globally available statements and read operations, many write operations, including data management, materialized view management, schema and table management, and view management, also support more typical ‘data warehousing’ analytics on the data lake.
Support for these features, such as INSERT, DELETE, OPTIMIZE, and VACUUM (among others), extends to all file formats and sources. They enable the more advanced, but required, capabilities of a lakehouse: supporting streaming workflows, making large updates to a table while keeping it consistent and available for ETL workflows, or even querying a table as it was at some arbitrary time in the past (time travel).
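Time travel is easier to picture with a small model. The sketch below assumes a table that records a new snapshot on every commit, so an earlier state remains readable by its snapshot id. It is a toy model only: real table formats track snapshots through metadata that points at data files rather than copying rows, and the SnapshotTable class and its methods are hypothetical.

```python
class SnapshotTable:
    """Toy table that keeps every committed version so older versions stay readable."""

    def __init__(self):
        self._snapshots = [()]  # snapshot 0 is the empty table

    def commit(self, rows):
        """Append rows as a new snapshot and return the new snapshot id."""
        self._snapshots.append(self._snapshots[-1] + tuple(rows))
        return len(self._snapshots) - 1

    def read(self, snapshot_id=None):
        """Read the latest snapshot, or an earlier one when a snapshot id is given."""
        if snapshot_id is None:
            snapshot_id = len(self._snapshots) - 1
        return list(self._snapshots[snapshot_id])


table = SnapshotTable()
first = table.commit([("order-1", 10)])
second = table.commit([("order-2", 25)])

print(table.read())       # current state of the table
print(table.read(first))  # the table as it was after the first commit
```

Because existing snapshots are never modified, readers of an older snapshot keep a consistent view while new commits land, which is the same property that keeps a table available during large updates.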
All these transformations shouldn’t slow you down. Starburst also provides the performance and speed you demand. Numerous performance optimizations, an accelerated Parquet reader, advanced joins and partitions, pushdown, and caching all contribute to performance gains that shorten discovery and improve time to insight.
For the lakehouse, our recent acquisition of Varada combines Starburst’s best-in-class query engine with patented indexing technology that sets a new benchmark in data lake analytics, empowering organizations to derive greater insights from their data more quickly and efficiently. Smart indexing not only speeds up queries, it also dynamically and autonomously updates the cache based on analytical workload patterns. This gives customers the best performance for the most important data without overpaying for performance on data that’s infrequently accessed. No more unnecessary cloud compute costs.
Data products
Modern technologies such as Starburst are attractive since they obviate the need to move data around nearly as much. By leaving data at its source, or closer to the source, analytics can get closer to real-time, and with less infrastructure to worry about. As a component of the ‘consume’ portion of the lakehouse, Starburst Enterprise includes a module to create, discover, publish, manage, and share data products based on multiple datasets.
Curated, high-quality datasets shorten the path from raw data to trusted insights. Data engineers are empowered to easily create and share data products, and consumers can trust those data products and easily leverage them with the tools they already use. Although data products can create the repeatability and efficiency for a successful lakehouse, they can also benefit any analytics journey, large or small, on-premises or across multiple clouds, in a warehouse or a lake, and they federate across all locations.
Partner ecosystem
Starburst also enables easy discovery and consumption of high-quality data through the creation and management of data products, data catalogs, and materialized views. Queried results can be consumed with client connections and integrations with your favorite BI platforms (Tableau, PowerBI, ThoughtSpot, Looker, Metabase, Superset, and more).
Because we dramatically improve the speed and concurrency of ad-hoc, batch, and reporting queries run live against your lakehouse, there’s less dependence on IT or data engineering. Starburst enables you to do all your BI tasks at immense scale, without sacrificing performance, directly against cloud data lake storage. Additionally, with numerous ML/AI/workflow management integrations (dbt Labs, Airflow, Python, R, DBeaver, and more), data teams can continue to use the tools they know and love while getting more value out of their data.
Management and user experience
Every organization has unique needs and infrastructure complexities. Starburst Enterprise’s open architecture works with a robust ecosystem of data sources, clients, and security and authorization tools. Customers can tailor and enhance their deployment with custom integrations and extensions.
Starburst’s clean, intuitive interface includes a query editor with grammar and function autofill, and monitoring dashboards for query and cluster history, usage metrics, and more. Starburst also scales elastically, reducing the need for a dedicated operational team.
Starburst Galaxy is cloud native, frictionless, and fully managed, and is the fastest path from big data to better decisions. Kickstart your data lakehouse strategy in minutes. Leverage the agility to run interactive and ELT workloads in one query engine, with Great Lakes connectivity unifying table formats in all the most popular data lakes.
Test drive your Starburst Lakehouse today
Sure, modern data technology innovations are ‘better’, but is the modern data stack really that modern? Despite these new tools and infrastructure options, we’re left with the same centralization paradigm. It’s a legacy paradigm with modern tools.
Instead, move your analytics closer to where the data is stored. Give yourself the optionality of the data lake, location, and file format that work for you. You can reduce, or remove, the need for a centralized warehouse by transforming and curating the data for consumption all within the Starburst Lakehouse.
Try Starburst Galaxy today to kickstart your lakehouse in minutes!