At AWS re:Invent 2023, we announced several features that help simplify and accelerate development on the data lake. In this post, we will look at streaming ingest, automatic data classification, automatic data maintenance, secure data sharing, and natural language processing (NLP) in Starburst Galaxy.
The data lake analytics platform
As the amount of data processed for application analytics continues to grow, more and more developers are turning to the data lake as a scalable and cost-efficient solution. However, building, governing, maintaining, and scaling a data lake requires specialized expertise and supporting technologies to get it right. For instance, a single use case could require an ingestion process, object storage, a compute engine, a governance tool, and a data catalog. And most data lakes support more than one use case.
Piecing these different tools together creates a complex and brittle data lake architecture that is not feasible for many data teams.
Starburst Galaxy was built to address these challenges. It provides an all-in-one, open data lake analytics platform that removes the burden of learning, integrating, and maintaining separate systems while you retain ownership of your data.
Streaming ingest
Streaming ingest in Starburst Galaxy enables you to continuously ingest data from Kafka into your data lake in near real time, ensuring your data is ready for analysis within minutes of initial collection. This is especially important for latency-sensitive use cases that require fast ingestion and processing to accelerate anomaly detection and decision making.
With streaming ingest in Starburst Galaxy, engineers no longer need to write expensive custom code and stitch together commercial and open-source tools to land streaming data in their lake. Streaming ingest also automatically transforms and partitions the data into Iceberg tables, which improves query performance, supports schema evolution, and ensures transactional consistency.
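To make the "transform and partition" step concrete, here is a minimal sketch of the general idea. It is not Galaxy's implementation: the message shape, the `event_time_ms` field, and the hourly partition scheme are all assumptions chosen for illustration, and no Kafka client or Iceberg writer is involved.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(event_time_ms: int) -> str:
    """Derive an hourly partition label (UTC) from an epoch-millisecond timestamp."""
    ts = datetime.fromtimestamp(event_time_ms / 1000, tz=timezone.utc)
    return ts.strftime("%Y-%m-%d-%H")


def batch_by_partition(raw_messages):
    """Parse JSON messages and group the resulting rows by hourly partition."""
    partitions = defaultdict(list)
    for raw in raw_messages:
        row = json.loads(raw)  # transform: raw text -> structured row
        partitions[partition_key(row["event_time_ms"])].append(row)
    return dict(partitions)


# Hypothetical messages, standing in for records read from a Kafka topic.
messages = [
    '{"event_time_ms": 1701388800000, "sensor": "a", "value": 1.5}',
    '{"event_time_ms": 1701390600000, "sensor": "b", "value": 2.0}',
    '{"event_time_ms": 1701392400000, "sensor": "a", "value": 9.9}',
]
batches = batch_by_partition(messages)
for key in sorted(batches):
    print(key, len(batches[key]))
```

Grouping rows by a time-derived key before writing is what lets a query engine skip partitions that fall outside a query's time range.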
ABAC with AI-powered data classification
Once new data lands in the lake, you need to be able to quickly identify, secure, and govern that data. With attribute-based access control (ABAC), now generally available in Gravity (Starburst Galaxy’s universal governance layer), you can govern your data down to the row and column level by using tags. However, tagging is often a tedious, manual process.
Automatic data classification in Galaxy aims to remove that burden by proactively suggesting relevant tags, drawn from more than 20 classifications, that administrators can accept or reject. This automation is particularly useful for teams handling sensitive data like personally identifiable information (PII). Now, as soon as PII lands in the lake, Galaxy can identify that data and help restrict access to it.
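As a rough illustration of the suggest-then-review workflow, the sketch below proposes tags for a column based on sampled values. It is an assumption-laden toy, not how Galaxy classifies data: the three regex classifiers, the tag names, and the 80% match threshold are hypothetical.

```python
import re

# Hypothetical classifiers: tag name -> pattern that must match a whole value.
CLASSIFIERS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "us_ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "ipv4": re.compile(r"(\d{1,3}\.){3}\d{1,3}"),
}


def suggest_tags(samples, threshold=0.8):
    """Suggest each tag whose pattern fully matches at least `threshold`
    of the non-empty sample values. Suggestions are not applied here;
    an administrator would still accept or reject them."""
    values = [s for s in samples if s]
    if not values:
        return []
    suggestions = []
    for tag, pattern in CLASSIFIERS.items():
        hits = sum(1 for v in values if pattern.fullmatch(v))
        if hits / len(values) >= threshold:
            suggestions.append(tag)
    return suggestions


print(suggest_tags(["a@example.com", "b@example.org", ""]))
print(suggest_tags(["123-45-6789", "987-65-4321"]))
```

Keeping suggestion separate from enforcement mirrors the review step described above: nothing is tagged until someone accepts the proposal.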
Automatic data optimization
Modern table formats like Apache Iceberg have made data warehouse-like performance within a data lake a reality. The Iceberg table format natively supports a series of maintenance operations, such as compacting small data files, expiring old snapshots, and removing orphan files, that keep storage efficient and queries fast.
The new data optimization features in Starburst Galaxy allow you to configure and execute these maintenance tasks either on demand or on a schedule, leveraging the Iceberg API under the hood.
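For a sense of what these tasks look like when run by hand, the sketch below builds the maintenance statements exposed by the open-source Trino Iceberg connector (`optimize`, `expire_snapshots`, `remove_orphan_files`). It assumes a Galaxy cluster accepts the same SQL; the table name and the threshold values are hypothetical, and the code only builds strings rather than connecting to a cluster.

```python
def maintenance_statements(table, retention="7d", file_size_threshold="128MB"):
    """Build Trino-style Iceberg maintenance statements for one table.

    `table` is interpolated directly, so it must come from a trusted source.
    """
    return [
        # Compact files smaller than the threshold into larger ones.
        f"ALTER TABLE {table} EXECUTE optimize(file_size_threshold => '{file_size_threshold}')",
        # Drop snapshots older than the retention window.
        f"ALTER TABLE {table} EXECUTE expire_snapshots(retention_threshold => '{retention}')",
        # Delete files no longer referenced by any table metadata.
        f"ALTER TABLE {table} EXECUTE remove_orphan_files(retention_threshold => '{retention}')",
    ]


# Hypothetical fully qualified table name: catalog.schema.table
for statement in maintenance_statements("lake.sales.orders"):
    print(statement)
```

Scheduling these in the platform removes the need to maintain a script like this and a separate job runner for every table.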
Universal data sharing
Data teams are increasingly faced with the daunting task of integrating third-party data into their analytics. This process is complex and time-consuming because of the technical limitations and security concerns involved.
With Starburst Galaxy, you can easily package data sets into shareable data products to power end-user applications, regardless of source, format, or cloud provider. New functionality now allows users to securely share these high-quality data products with third parties, such as partners, suppliers, or customers.
Self-service analytics powered by AI
Not only are data lakes notoriously hard to manage, but the majority of data teams are understaffed. New AI-powered experiences in Galaxy, like text-to-SQL processing, will enable data teams to offload basic exploratory analytics to business users, freeing up their time to build and scale data pipelines.
Why wait?
With a host of new features designed to make it easier to build, manage, and scale a data lake, Starburst Galaxy is the perfect choice for companies looking to power data-intensive applications at a fraction of the cost of a typical warehouse model.
Whether you’d like guidance on getting started with Starburst Galaxy, would like to join one of our private preview programs, or just want to get up and running fast, we have a path for you.