Azure data lake with Starburst Galaxy for better analytics
Tom Nats
Director of Customer Solutions
Starburst
Microsoft Azure Data Lake is a cloud-based platform for managing enterprise data analytics at petabyte scales. Microsoft’s global cloud infrastructure lets companies build their data lake architectures on a platform with high availability, low latency, and flexible pricing.
Below, we will introduce Azure Data Lake and its benefits before explaining how you can combine it with Starburst’s modern data lake analytics platform (along with a demo) to give your analytics users a self-service query platform for simple ad hoc research and complex machine learning projects.
The three components of Azure Data Lake
A data lake is a centralized repository for an enterprise’s structured and unstructured data, giving engineers, data scientists, and business users a single source for data generated across the company. A modern data lake, or a data lakehouse, combines the data lake’s efficient storage and scalability with the data warehouse’s accessibility and performance. This architecture makes enterprise data more accessible for business users, analysts, and data scientists to speed time to insight, foster innovation, and drive data-driven decision-making.
Azure Data Lake lets companies design the optimal data lake architecture by combining several Microsoft storage and analytics services, all running on Microsoft’s cloud computing platform.
The three core components of an Azure Data Lake are:
- Azure Data Lake Storage (ADLS),
- Azure Synapse Analytics, and
- Azure HDInsight.
In addition, companies can integrate other Azure services, Microsoft enterprise technologies, open-source frameworks, and third-party applications to create a complete data lake technology stack.
What is Azure Data Lake Storage (ADLS)?
Azure Data Lake Storage (ADLS) is a cloud-based, low-cost enterprise data lake storage solution. ADLS provides high performance along with a high level of redundancy, making it well suited to serve as a company’s default storage location.
As companies migrate off Hadoop and legacy data warehouses, they are choosing ADLS as the new home for this data. Different processing engines can then work on the same data to support business functions such as analytics.
Originally, ADLS was a separate cloud service within the Azure portfolio that let companies store structured and unstructured data. The next-generation Azure Data Lake Storage Gen2 merges the data lake platform with Microsoft’s binary large objects (blobs) storage service, Azure Blob Storage. Any business with an Azure Storage account can activate these capabilities by enabling hierarchical namespaces.
Azure Data Lake Storage Gen2 and Azure Blob Storage
Since ADLS Gen2 is now a capability of Blob Storage, companies get instant access to most of the features they may already use within their Azure-based cloud infrastructure. For example, pay-as-you-go pricing, object-level tiering, and automated lifecycle management let companies efficiently manage their data storage costs.
Microsoft designed Azure Data Lake Storage for the realities of enterprise data. Where the capacity constraints of traditional data warehouses struggle with growing data volume and velocity, a data lake built on top of Blob Storage ingests modern data flows effortlessly. You can store trillions of files — any of which can be more than a petabyte in size.
Elements of ADLS Gen2
An Azure Data Lake is compatible with the Apache Hadoop Distributed File System (HDFS), the open-source big data storage layer. Since it’s built to an open standard, an Azure Data Lake Storage solution can tap into the deep ecosystem of open source and third-party applications. For example, big data analytics query engines like Apache Spark or Trino (which lies at the core of Starburst) can directly access ADLS Gen2 data.
Azure Data Lake Storage replaces Blob Storage’s default flat structure with a hierarchical namespace for performant access to enterprise-scale object storage. Like the file system on Windows, ADLS Gen2 organizes objects and files in a directory and subfolder hierarchy rather than embedding paths into the file names. Hierarchical namespaces let you rename or move directories atomically without rewriting millions of objects, significantly reducing the work that big data analytics jobs must perform.
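To make the difference concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the ADLS API: the function names, the Directory class, and the example paths are all hypothetical. It only illustrates why renaming a "directory" touches every object in a flat namespace but a single entry in a hierarchical one.

```python
from dataclasses import dataclass, field
from typing import Dict


def rename_prefix_flat(objects: Dict[str, bytes], old: str, new: str) -> int:
    """Flat namespace: a 'directory' is only a key prefix, so every matching
    object has to be re-keyed. Returns the number of objects touched."""
    matching = [key for key in objects if key.startswith(old + "/")]
    for key in matching:
        objects[new + key[len(old):]] = objects.pop(key)
    return len(matching)


@dataclass
class Directory:
    """Hierarchical namespace: a directory is a real node with children."""
    files: Dict[str, bytes] = field(default_factory=dict)
    subdirs: Dict[str, "Directory"] = field(default_factory=dict)


def rename_dir_hierarchical(parent: Directory, old: str, new: str) -> int:
    """Re-link one directory entry; the files underneath are not touched.
    Returns the number of entries changed (always 1 in this toy model)."""
    parent.subdirs[new] = parent.subdirs.pop(old)
    return 1
```

With 1,000 files under raw/2023, the flat rename in this model performs 1,000 re-key operations, while the hierarchical rename changes one directory entry regardless of how many files sit beneath it.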
Migrating to Azure Data Lake Storage Gen2
Many Starburst customers have used our analytics platform to facilitate their migration from an on-premises data lake solution to the Azure Data Lake Store. Starburst creates a virtualized access layer that federates disparate enterprise data sources into a single point of access. This level of abstraction relieves data consumers from knowing the intricate details of each source, like its structure, data formats, or SQL variants.
Rather than a massive — and potentially disruptive — lift-and-shift initiative, a more considered approach would simplify your ADLS migration without impacting your users. For instance, some data in your on-premises solution may not need to move. Also, you may want to include data from recently-added sources or previously inaccessible data silos.
Starburst’s connectors integrate disparate enterprise data sources, abstract their details, and make them accessible within a single interface. When data migrates from one source to another, your Starburst admins will change the connector. From the user’s perspective, nothing has changed. Their queries still work. Data products are unaffected. With Starburst, the migration to Azure Data Lake Storage becomes transparent.
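The idea of a stable name in front of a changing source can be sketched as follows. This is a toy Python model, not Starburst's configuration format or API: CatalogRegistry, ConnectorConfig, the catalog name, the on-premises host, and the storage account name are all hypothetical. The only real convention used is the abfss:// URI scheme for ADLS Gen2 paths.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ConnectorConfig:
    """Hypothetical stand-in for the settings behind one catalog."""
    location: str  # where the data physically lives


class CatalogRegistry:
    """Maps the catalog names that users query to their backing storage."""

    def __init__(self) -> None:
        self._catalogs: Dict[str, ConnectorConfig] = {}

    def register(self, name: str, config: ConnectorConfig) -> None:
        # Registering an existing name again swaps the backing source.
        self._catalogs[name] = config

    def resolve(self, table: str) -> str:
        """Turn 'catalog.schema.table' into a physical path."""
        catalog, schema, name = table.split(".")
        return f"{self._catalogs[catalog].location}/{schema}/{name}"


registry = CatalogRegistry()
user_table = "sales.orders.line_items"  # what users write in their queries

# Before migration: the catalog points at the on-premises cluster.
registry.register("sales", ConnectorConfig("hdfs://onprem-namenode:8020/warehouse"))
before = registry.resolve(user_table)

# After migration: an admin repoints the same catalog name at ADLS Gen2.
registry.register(
    "sales",
    ConnectorConfig("abfss://lake@exampleaccount.dfs.core.windows.net/warehouse"),
)
after = registry.resolve(user_table)
```

The table name in the user's query never changes. Only the mapping behind it does, which is the property that makes the migration invisible to data consumers.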
What is Azure Synapse Analytics?
With the transition to ADLS Gen2, Microsoft replaced the original Azure Data Lake Analytics service with Azure Synapse Analytics, combining multiple enterprise analytics services to streamline data management and accelerate time to insight.
Azure Synapse Analytics provides several ways to analyze data at scale. The service includes a SQL engine based on T-SQL, Microsoft’s proprietary version of SQL, which lets Synapse access data warehouses, Microsoft SQL Server relational databases, and other data stores. Data teams can use Apache Spark to conduct big data analytics, while Azure Data Explorer lets engineers analyze semi-structured logs and telemetry data.
Synapse Studio ties everything together through its visual, code-free development environment. Data users can take advantage of Synapse Studio’s simple interface to bring data into business intelligence software, explore data sets for big data analytics projects, or manage ETL pipelines.
What is Azure HDInsight?
Azure HDInsight is a managed cluster platform that lets businesses run open-source data frameworks on Azure’s efficient, scalable cloud computing platform. HDInsight supports Spark, Hive, Kafka, and more to process large data volumes quickly without driving up compute costs.
HDInsight does not lock Azure Data Lake customers into a particular environment or language. In addition to Microsoft’s Visual Studio, data users can build their systems in Eclipse, IntelliJ, Jupyter, or Zeppelin. Naturally, they can code with .NET, but Azure Data Lake users can program in Java, Python, R, or Scala if those languages are more appropriate to the task.
What are the benefits of Azure Data Lake?
Data lake solutions solve many challenges imposed by a data warehouse’s limitations and proprietary nature. An Azure Data Lake delivers on the architecture’s promises.
- Offering more flexible data storage: Lakes can ingest structured and semi-structured data as well as unstructured data, like the real-time streaming data generated by Internet of Things (IoT) systems.
- Optimizing performance and cost: By decoupling compute from storage, data lakes allow companies to optimize their investments for each without imposing tradeoffs. Commodity cloud computing platforms let storage and compute scale with their respective demand. Analytics get the performance they need, while storage budgets become more predictable.
- Making data more accessible: By providing a central repository for all enterprise data regardless of its structure, data lakes help democratize data access. The right analytics layer lets engineers, scientists, analysts, and business users access the data they need with the tools they already know how to use.
In addition to the core strengths of the architecture, Azure Data Lake offers additional benefits — especially for enterprises already committed to an Azure-based cloud infrastructure.
Leverage Microsoft’s global cloud footprint
Since Azure Data Lake Storage Gen2 builds upon the core capabilities of Azure Blob Storage, these data lakes automatically gain the reliability and scalability of an established cloud storage platform. With over three hundred data centers in over sixty regions, Azure combines localized performance and compliance with a performant, low-latency global footprint.
Accelerated time to insight
Azure Synapse provides a unified interface for machine learning development and business analysis. Data scientists can use the programming language of their choice, including Python, R, and .NET, to develop advanced analytics projects. Integrations with analytics tools like Microsoft’s Power BI visualization software democratize data access. More insightful analysis of business questions makes the organization more agile and decision-making more effective.
Streamlined data lake management
Companies can minimize storage costs without sacrificing performance since Azure Data Lake Storage Gen2 uses the same hot, cool, cold, and archive tiers as Blob Storage. The cool and cold tiers provide a low-cost option for less frequently accessed data but with the same access speed as Blob Storage’s hot tier.
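A tiering rule of this kind can be sketched in a few lines of Python. The tier names come from Blob Storage, but the idle-day thresholds below are hypothetical values chosen only for illustration, and the function is a local model of the decision rather than a call into Azure's lifecycle management service.

```python
from datetime import date

# Hypothetical thresholds (minimum idle days, tier), checked from coldest up.
TIER_RULES = [
    (180, "archive"),
    (90, "cold"),
    (30, "cool"),
]


def choose_tier(last_accessed: date, today: date) -> str:
    """Pick a storage tier from how long an object has gone unaccessed."""
    idle_days = (today - last_accessed).days
    for min_idle_days, tier in TIER_RULES:
        if idle_days >= min_idle_days:
            return tier
    return "hot"
```

In practice a policy like this would be expressed as a lifecycle management rule on the storage account so that objects move between tiers automatically.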
Improved big data security and compliance
Azure customers benefit from Microsoft’s global investment in cybersecurity infrastructure, including its staff of over 3,500 security experts. Azure Data Lake further enhances security by letting customers build policies based on role-based access control (RBAC) and Portable Operating System Interface (POSIX) compliant access control lists (ACLs). Integrations with other Microsoft security services include Entra ID (formerly Azure Active Directory).
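As a rough illustration of how a POSIX-style ACL decides access, the sketch below parses the short text form (entries such as user::rwx or group:analysts:r--) and evaluates a request. It is a simplified model written for this article: it ignores RBAC role assignments, superusers, default ACLs, and the sticky bit, and the user and group names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Acl = Dict[Tuple[str, str], Set[str]]


def parse_acl(text: str) -> Acl:
    """Parse 'scope:qualifier:perms' entries separated by commas."""
    acl: Acl = {}
    for entry in text.split(","):
        scope, qualifier, perms = entry.split(":")
        acl[(scope, qualifier)] = set(perms) - {"-"}
    return acl


def is_allowed(acl: Acl, user: str, groups: Iterable[str],
               owner: str, owning_group: str, wanted: str) -> bool:
    """Return True if `user` holds every permission letter in `wanted`."""
    need = set(wanted)
    mask = acl.get(("mask", ""), {"r", "w", "x"})

    # 1. The owning user is checked against user:: and is not masked.
    if user == owner:
        return need <= acl[("user", "")]
    # 2. A named user entry is limited by the mask.
    if ("user", user) in acl:
        return need <= (acl[("user", user)] & mask)
    # 3. Any matching group entry (owning or named) may grant access.
    group_list = list(groups)
    matches = []
    if owning_group in group_list and ("group", "") in acl:
        matches.append(acl[("group", "")])
    matches += [acl[("group", g)] for g in group_list if ("group", g) in acl]
    if matches:
        return any(need <= (perms & mask) for perms in matches)
    # 4. Everyone else falls through to other::.
    return need <= acl.get(("other", ""), set())
```

Note how the mask entry caps what named users and groups can do: an entry of user:alice:rw- combined with mask::r-x leaves alice with read access only.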
Starburst Galaxy enables querying Azure data lakes with just a few clicks. It delivers self-service analytics by offering an industry-leading SQL query engine as a fully managed service.
How to create a cost-effective Azure lakehouse data strategy
Azure Data Lake Storage (ADLS) is a great technology that provides low-cost, redundant storage.
Why use Starburst Galaxy with your Azure data lake?
We just announced the launch of Starburst Galaxy on Microsoft’s Azure cloud service. Starburst Galaxy is the new fully-managed SaaS service from Starburst that provides enterprise-ready Starburst clusters which can be deployed within minutes.
With one of the lowest TTQ (Time to Query) engines on the market, Starburst Galaxy lets you provide your analytical users with a self-service query platform against Azure Data Lake Storage (ADLS) and other Azure data sources without dedicating valuable staff to deploying and managing the underlying infrastructure.
The Starburst engine powers Starburst Galaxy. It is the best-in-class MPP SQL engine originally based on open source Trino with enterprise enhancements made by Starburst. Developed out of Facebook as a replacement for Hive to query their massive data lake, Trino has quickly become the de facto data lake and lakehouse query engine over numerous different storage systems such as ADLS.
0 to Querying ADLS in less than 5 mins
Starburst Galaxy can provision new clusters within minutes into any Azure region that houses your data.
In the diagram below, Starburst SQL query clusters are deployed in the same region as the data, eliminating egress costs and ensuring maximum performance.
Creating new clusters and adding data sources takes just a few clicks. From there, you can query data using the query editor provided within Galaxy, any BI or SQL tool such as Power BI or DBeaver, or data science notebooks such as Jupyter.
Starburst Galaxy makes it very easy to query your Azure ADLS data lake with 4 easy steps:
If you have data in MySQL or PostgreSQL, you can query those data sources and even federate data between them if needed. For more information about these features, please see our previous blog post titled: Azure Data Mesh: Powered by Starburst.
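A federated query of this kind might look like the sketch below. The catalog, schema, table, and column names (adls_lake, postgres_crm, and so on) are hypothetical, and the host and credentials passed to run_on_cluster are placeholders that would come from your own cluster's connection details. The optional runner assumes the open-source trino Python client is installed; the query builder works without it.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Build a quoted three-part name: "catalog"."schema"."table"."""
    return ".".join(f'"{part}"' for part in (catalog, schema, table))


def build_federated_query() -> str:
    """Join orders stored in the data lake with customers in PostgreSQL."""
    orders = qualified("adls_lake", "sales", "orders")            # hypothetical
    customers = qualified("postgres_crm", "public", "customers")  # hypothetical
    return (
        "SELECT c.region, sum(o.amount) AS revenue\n"
        f"FROM {orders} AS o\n"
        f"JOIN {customers} AS c ON o.customer_id = c.customer_id\n"
        "GROUP BY c.region"
    )


def run_on_cluster(sql: str, host: str, user: str, password: str) -> list:
    """Submit the query over HTTPS with basic authentication."""
    # Imported lazily so the builder above has no third-party dependency.
    from trino.auth import BasicAuthentication
    from trino.dbapi import connect

    conn = connect(host=host, port=443, user=user, http_scheme="https",
                   auth=BasicAuthentication(user, password))
    cursor = conn.cursor()
    cursor.execute(sql)
    return cursor.fetchall()
```

The point of the example is the naming: each table is addressed by its catalog first, so one SQL statement can span the data lake and an operational database without copying data between them.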
“We’re excited to partner with Starburst to provide our customers with the power of Microsoft’s Azure cloud platform alongside Starburst Galaxy. The combination of our two products will help our customers add tremendous value to their data ecosystem,” said Ken Miller, General Manager, West Region Cloud, Microsoft.
Azure ADLS and Starburst Galaxy Demo
In the demo below, we show you how easy it is to create a Galaxy account and start querying your Azure ADLS data within a few minutes:
As you can see from the demo, Galaxy provides one of the fastest TTQ (time to query) SQL query engines for Azure. Some other features provided by Galaxy are:
- Suspend mode – Clusters can be set to go into “suspend” mode while not being used. While a cluster is suspended, no charges are incurred. When a new query is executed against the suspended cluster, it starts up and completes the query.
- Blue/Green deployments – Galaxy deploys new clusters behind the scenes for configuration changes and upgrades, ensuring a seamless, uninterrupted experience for end users.
- Query Editor – A built-in, easy-to-use, web-based query editor. There is no need to install extra software just to query your ADLS and other data sources.