Data pipelines and data lakes

Evan Smith, Technical Content Manager, Starburst Data


ETL operates as the engine behind the data pipeline process, moving data from a raw state to a consumable one. Let’s unpack how this typically works in a modern data lake or data lakehouse. Later, we’ll take a tour to see how Starburst Galaxy fits into this picture and how it can be used to construct the Land, Structure, and Consume layers typical of a modern data lake.

Understanding the modern data lake

Data pipelines operate in modern data lakes in a particular way. Understanding this begins with what makes these data lakes unique compared to other storage systems.

Traditional data lakes

Traditional data lakes operate using cloud object storage and hold structured, semi-structured, and unstructured data. Importantly, they also make use of the Hive table format, an older technology still widely used in many data lakes today. Hive allows data lakes to be queried using SQL, but does not allow for any of the advanced features of modern data lakes, also known as data lakehouses.

Traditional data lakes can be constructed using either a three-part or two-part data pipeline architecture. 

Two-part structure

  • Land layer: Data lands in its raw state.
  • Consume layer: Data is transformed into a finished, consumable state.

Three-part structure

  • Land layer: Data lands in its raw state.
  • Structure layer: Data is transformed into an intermediate state.
  • Consume layer: Data is transformed into a finished, consumable state in a third layer.

Open data lakehouse

Open lakehouses use modern table formats, such as Apache Iceberg and Delta Lake, that allow for features not possible using Hive.

These include:

  • ACID transactions
  • Schema evolution
  • Time travel
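
As a minimal sketch of one of these features, the query below uses time travel to read an earlier snapshot of a table, assuming an Iceberg table with hypothetical catalog, schema, and table names; the timestamp is also illustrative.

```sql
-- Read the current state of a hypothetical Iceberg table.
SELECT count(*) FROM lake.structure.orders;

-- Time travel: read the same table as it existed at an earlier point,
-- something the Hive table format cannot do.
SELECT count(*)
FROM lake.structure.orders
FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC';
```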

Modern data lakes are typically constructed using the three-part data pipeline architecture. 

File formats

Both traditional and modern data lakes make use of a variety of file formats. These control how the data is stored in the lake.

Traditional data lake file formats include: 

  • JSON 
  • XML
  • CSV

Modern data lake file formats include:

  • Parquet
  • ORC

Land layer

Let’s unpack how ETL operates at the heart of the data pipeline. Data begins in source systems, and you can think of these systems as the start of the pipeline. These systems vary but generally conform to the following pattern.

Human-generated and machine-generated data

Source data originates from multiple sources, including: 

  • Human-generated data, for example, website click data.
  • Machine-generated data, for example, sensor data or data from instrumentation.

Multiple data structures

As in all data lakes, data in the Land layer arrives in multiple structures:

  • Structured data
  • Semi-structured data
  • Unstructured data

The raw storage principle

Data in the Land layer is always stored in the same format in which it was ingested. This preserves a record of the data in its original form, before any transformations or cleansing take place. If there is ever a need to return to this original state, this copy is a valuable resource.

You can think of this as similar to a photographer shooting photos in the RAW photo format. The RAW format includes all available information from the sensor, even information that may not actually be used. Nonetheless, the RAW format is a good choice for photography because post-production editing might make use of that information at a later date. 

In a similar way, the land layer preserves the maximum amount of information, even if it is not needed. This ensures that all available data is retained for future use.

Batch ingestion vs. streaming ingestion

Data is typically copied to the land layer using one of two methods: batch ingestion or streaming ingestion. 

Batch ingestion

Batch ingestion is a classic approach that involves copying a group of files at a regularly scheduled interval or after a trigger condition has been met.

For example, a batch-ingested workflow may add new data to the land layer every four hours, every hour, or every fifteen minutes.

Streaming ingestion

The alternative to batch ingestion is streaming ingestion. Streaming captures new data in real time. To do this, data is divided into tiny micro-batches that are processed continuously as they arrive.

Streaming is typically handled by technologies that specialize in ingestion, including:

Kafka

Kafka is a distributed messaging platform that acts as an intermediary between two systems, allowing the source system to transfer data to the Land layer reliably. Starburst Galaxy and Starburst Enterprise both integrate with Kafka.

Flink

Flink is a stream-processing platform designed to facilitate data ingestion. Unlike Kafka, it is not considered a messaging platform. It can also be used for batch processing, making it a versatile tool. Starburst Galaxy and Starburst Enterprise both integrate with Flink.

Apache Spark

Spark is a general-purpose processing framework that allows for custom data ingestion tasks. It is not a dedicated streaming or messaging platform; instead, it works more like a general-purpose scripting environment. In that sense, it is both powerful and customizable, but also technical and complex. Many data pipeline ingestion workflows run on Spark.

Advantages and disadvantages of batch ingestion and streaming ingestion

Batch ingestion

Advantages:

  • Easy to audit.
  • Particularly suitable for compliance, for example, GDPR.
  • Simple to troubleshoot and fix if something goes wrong.

Disadvantages:

  • There is a delay between the batch being ingested and its availability for consumption.
  • It takes time before data can be queried by consumers downstream.

Streaming ingestion

Advantages:

  • Data is available in near-real time.
  • Data can be queried very quickly.
  • Very useful for rapidly updated information.

Disadvantages:

  • Only suitable for more specific use cases, including:
    • Click-stream data
    • Data that has to be ingested into the data lake in formats such as JSON or Avro
    • Situations with specific data quality requirements that call for streaming

Structure layer

When data is transformed from the Land layer into a new location, that new location is called the Structure layer. 

In a modern data lake or data lakehouse, the Structure layer is the source of truth for all operations and most of the major transformations occur between the Land layer and Structure layer. 

The exact transformations depend on the datasets and structures involved, but they typically include the following steps.

Normalization

Data from multiple sources usually arrives in different structures. Normalization harmonizes these differences into a common structure. To do this, divergent fields must be mapped to a common schema.

For example, the date format from one organization might conform to the mm-dd-yyyy structure, while the date format from another organization might conform to the dd-mm-yyyy structure. Or a Customer ID from one system might include a different number of digits than the Customer ID from another. 

If you want to be able to query data from both datasets, you will have to normalize the data into a single structure with defined rules.

In the modern data lake, this typically happens after the data is queried using schema-on-read. Starburst Galaxy and Starburst Enterprise both allow you to use SQL to enact these transformations, including INSERT INTO statements that write the normalized data to its new location. The exact degree of normalization depends on how much the structures of the datasets involved differ.
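
As a minimal sketch of what this normalization might look like in SQL, the statement below reads raw records and writes a harmonized copy into a Structure-layer table; the catalog, schema, table, and column names are hypothetical placeholders.

```sql
-- Normalize divergent source formats into one agreed structure.
-- All catalog, schema, table, and column names are hypothetical.
INSERT INTO lake.structure.customers
SELECT
    -- Harmonize date strings such as '12-31-2023' (mm-dd-yyyy)
    -- into a proper DATE value.
    CAST(date_parse(order_date, '%m-%d-%Y') AS DATE) AS order_date,
    -- Pad customer IDs to a common width so both source systems match.
    lpad(customer_id, 10, '0') AS customer_id,
    customer_name
FROM lake.land.customers_raw;
```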

Data validation

Data validation is also used as a tool in the Structure layer. This step checks that all data conforms to the agreed normalization rules and sets down rules for cleansing any data that does not conform.

Data validation also uses SQL, and both Starburst Galaxy and Starburst Enterprise make it easy to perform validation in a data pipeline. To do this, IF statements are written to check for certain conditions using boolean logic. If the conditions are true, the data is valid; if they are false, the data is not valid.
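
A minimal sketch of this kind of check, assuming the same hypothetical Structure-layer table as above, might look like the following; the validation rules themselves are illustrative.

```sql
-- Return any rows that break the agreed validation rules,
-- so they can be cleansed or quarantined.
-- Table names, column names, and rules are hypothetical.
SELECT
    customer_id,
    order_date,
    if(customer_id IS NOT NULL AND length(customer_id) = 10, 'valid', 'invalid') AS id_check,
    if(order_date <= current_date, 'valid', 'invalid') AS date_check
FROM lake.structure.customers
WHERE customer_id IS NULL
   OR length(customer_id) <> 10
   OR order_date > current_date;
```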

Data enrichment

The Structure layer also involves data enrichment. This occurs when a dataset’s schema needs to be widened with additional columns to accommodate new data. This might happen if a schema evolves beyond its initial design or if a new column is being added for other reasons.

For example, a customer record might include an additional column for credit scores only when such a check is necessary. Adding this column is enrichment. Notably, enriched data is often drawn from outside systems beyond the Land layer.

Like the other transformations in the Structure layer, enrichment can be achieved using SQL in Starburst Galaxy or Starburst Enterprise. To do this, the business logic controlling the enrichment must be translated into SQL code, often involving conditional logic and IF statements. 

Sometimes this involves the use of specialized functions known as user-defined functions (UDFs), which encapsulate the logic used to look up and attach the enriched data.
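
As a rough sketch of enrichment in SQL, the query below joins in a credit score from an external source only when a check is required, mirroring the example above; every catalog, table, and column name here is a hypothetical placeholder.

```sql
-- Widen the customer dataset with a credit score drawn from an
-- external system. All names are hypothetical placeholders.
CREATE TABLE lake.structure.customers_enriched AS
SELECT
    c.customer_id,
    c.customer_name,
    c.order_date,
    -- Only attach a score when a credit check is actually required.
    CASE
        WHEN c.credit_check_required THEN s.credit_score
        ELSE NULL
    END AS credit_score
FROM lake.structure.customers AS c
LEFT JOIN partner.bureau.credit_scores AS s
    ON c.customer_id = s.customer_id;
```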

Technical transformation

The final step is technical transformation. Once the other transformations are complete, the data has to be saved in a new, optimized format. This is the moment when the Structure layer is created.
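
A minimal sketch of this step might use a CREATE TABLE AS statement that writes the transformed data out in a columnar format such as Parquet; the names are hypothetical, and the exact table properties depend on the connector and table format in use.

```sql
-- Persist the transformed data in an optimized columnar format.
-- Names are hypothetical; available properties vary by connector.
CREATE TABLE lake.structure.customers_final
WITH (format = 'PARQUET') AS
SELECT *
FROM lake.structure.customers_enriched;
```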

Consume layer

The Consume layer contains data that has finished its journey through the data pipeline. At this stage, it is ready to be used by data consumers. 

How is data consumed?

Data is consumed in a number of different ways. These include: 

Queries

The most basic way in which data is consumed is through ad hoc queries. These might be: 

  • Views, which return data matching the query. 
  • Materialized views, which cache that data for use at a later date. 
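
A minimal sketch of these two options, using hypothetical table and view names, might look like the following; materialized view support depends on the connector in use.

```sql
-- A view returns matching data each time it is queried.
CREATE VIEW lake.consume.recent_orders AS
SELECT customer_id, order_date, order_total
FROM lake.structure.orders
WHERE order_date >= current_date - INTERVAL '30' DAY;

-- A materialized view caches the result for later reuse.
CREATE MATERIALIZED VIEW lake.consume.recent_orders_mv AS
SELECT customer_id, order_date, order_total
FROM lake.structure.orders
WHERE order_date >= current_date - INTERVAL '30' DAY;
```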

Business Intelligence (BI) tools

Data from the Consume layer is also used to feed BI tools, including:

  • Tableau
  • Power BI

Data products

Data products are curated datasets that allow you to tag, save, and share data across teams. This exciting development empowers data consumers to take charge of their own datasets in new ways. Starburst Galaxy and Starburst Enterprise both allow use of data products.

Typical data consumption activities

Users consume data in a number of different ways, including: 

Aggregation

Aggregations involve grouping data and computing summary values for each group. Typically, these analytical queries use the SQL GROUP BY clause, often combined with conditions that limit which groups are returned.
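
A small sketch of an aggregation of this kind, using hypothetical table and column names, might look like this:

```sql
-- Total sales and order counts per region and month,
-- returning only groups above a minimum size.
-- Table and column names are hypothetical.
SELECT
    region,
    date_trunc('month', order_date) AS order_month,
    sum(order_total)                AS total_sales,
    count(*)                        AS order_count
FROM lake.consume.orders
GROUP BY region, date_trunc('month', order_date)
HAVING count(*) > 100;
```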

Rollups and CUBEs

Sometimes multiple aggregations are necessary. In these instances, ROLLUP can be used to compute several levels of GROUP BY aggregation in a single query.

A related construct, CUBE, goes further and aggregates across every combination of the grouping columns, generating a multi-dimensional model.
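
As a brief sketch, the queries below contrast the two, again using hypothetical names:

```sql
-- ROLLUP produces subtotals at each level of the hierarchy:
-- (region, month), (region), and a grand total.
SELECT region, order_month, sum(order_total) AS total_sales
FROM lake.consume.monthly_orders
GROUP BY ROLLUP (region, order_month);

-- CUBE produces every combination of the grouping columns,
-- giving a multi-dimensional view of the same data.
SELECT region, order_month, sum(order_total) AS total_sales
FROM lake.consume.monthly_orders
GROUP BY CUBE (region, order_month);
```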

Windowing 

Another activity involves the use of window functions. These can be used to create rolling averages and other calculations that relate each row to a set of surrounding rows.

For example, windowing might be used to track rolling stock prices or average sales over time. Data consumers typically use this to create reports, find trends, and build actionable business intelligence.
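
A minimal sketch of a rolling average computed with a window function, using hypothetical names, might look like this:

```sql
-- Seven-row rolling average of daily sales per region.
-- Table and column names are hypothetical.
SELECT
    region,
    order_date,
    daily_sales,
    avg(daily_sales) OVER (
        PARTITION BY region
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_avg_7d
FROM lake.consume.daily_sales
ORDER BY region, order_date;
```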

Tutorial 1: Lakehouse reference architecture

Setting the stage

Let’s unpack the modern data lake by exploring a data lakehouse reference architecture in more detail. The video below introduces the topic and outlines the key role played by each component in the data pipeline process.

Tutorial 2: Common vocabulary

Understanding the terminology

Modern data lakes and data lakehouses have a lot of complex vocabulary associated with them. The video below unpacks this terminology in more detail.

Tutorial 3: Technologies and tools

Knowing your toolset

Data pipelines can be constructed using a number of different technologies and tools. The video below outlines which tools you are likely to encounter in this space and where Starburst Galaxy fits.
