Data pipelines and data lakes

Evan Smith, Technical Content Manager, Starburst Data


ETL operates as the engine behind the data pipeline process, moving data from a raw state to a consumable one. Let’s unpack how this typically works in a modern data lake or data lakehouse. Later, we’ll take a tour to see how Starburst Galaxy fits into this picture and how it can be used to construct the Land, Structure, and Consume layers typical of a modern data lake.

Understanding the modern data lake

Data pipelines operate in modern data lakes in a particular way. Understanding this begins with what makes these data lakes unique compared to other storage systems.

Traditional data lakes

Traditional data lakes operate using cloud object storage and hold structured, semi-structured, and unstructured data. Importantly, they also make use of the Hive table format, an older technology still widely used in many data lakes today. Hive allows data lakes to be queried using SQL, but does not allow for any of the advanced features of modern data lakes, also known as data lakehouses.

Traditional data lakes can be constructed using either a three-part or two-part data pipeline architecture. 

Two-part structure

  • Land layer: Data lands in its raw state.
  • Consume layer: Data is transformed into a finished, consumable state.

Three-part structure

  • Land layer: Data lands in its raw state.
  • Structure layer: Data is transformed into an intermediate state.
  • Consume layer: Data is transformed into a finished, consumable state in a third layer.

Open data lakehouse

Open lakehouses use modern table formats, such as Apache Iceberg and Delta Lake, that allow for features not possible using Hive.

These include:

  • ACID transactions
  • Schema evolution
  • Time travel
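
As a minimal sketch of one of these features, the query below uses time travel to read an earlier snapshot of a table, assuming an Iceberg table with hypothetical catalog, schema, and table names; the timestamp is also illustrative.

```sql
-- Read the current state of a hypothetical Iceberg table.
SELECT count(*) FROM lake.structure.orders;

-- Time travel: read the same table as it existed at an earlier point,
-- something the Hive table format cannot do.
SELECT count(*)
FROM lake.structure.orders
FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC';
```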

Modern data lakes are typically constructed using the three-part data pipeline architecture. 

File formats

Both traditional and modern data lakes make use of a variety of file formats. These control how the data is stored in the lake.

Traditional data lake file formats include: 

  • JSON 
  • XML
  • CSV

Modern data lake file formats include:

  • Parquet
  • ORC

Land layer

Let’s unpack how ETL operates at the heart of the data pipeline. Data begins in source systems, and you can think of these systems as the start of the pipeline. These systems vary but generally conform to the following pattern.

Human-generated and machine-generated data

Source data originates from multiple sources, including: 

  • Human-generated data, for example, website click data.
  • Machine-generated data, for example, sensor data or data from instrumentation.

Multiple data structures

As in all data lakes, data in the Land layer arrives in multiple structures:

  • Structured data
  • Semi-structured data
  • Unstructured data

The raw storage principle

Data in the Land layer is always stored in the same format in which it was ingested. This preserves a record of the data in its original form, before any transformations or cleansing take place. If there is ever a need to return to this original state, this copy is a valuable resource.

You can think of this as similar to a photographer shooting photos in the RAW photo format. The RAW format includes all available information from the sensor, even information that may not actually be used. Nonetheless, the RAW format is a good choice for photography because post-production editing might make use of that information at a later date. 

In a similar way, the land layer preserves the maximum amount of information, even if it is not needed. This ensures that all available data is retained for future use.

Batch ingestion vs. streaming ingestion

Data is typically copied to the land layer using one of two methods: batch ingestion or streaming ingestion. 

Batch ingestion

Batch ingestion is a classic approach that involves copying a group of files at a regularly scheduled interval or after a trigger condition has been met.

For example, a batch-ingested workflow may add new data to the land layer every four hours, every hour, or every fifteen minutes.

Streaming ingestion

The alternative to batch ingestion is streaming ingestion. Streaming captures new data in real time. To do this, data is divided into tiny micro-batches that are processed continuously as they arrive.

Streaming is typically handled by technologies that specialize in ingestion, including:

Kafka

Kafka is a distributed messaging platform that acts as an intermediary between two systems, allowing the source system to transfer data to the Land layer reliably. Starburst Galaxy and Starburst Enterprise both integrate with Kafka.

Flink

Flink is a stream-processing platform designed to facilitate data ingestion. Unlike Kafka, it is not considered a messaging platform. It can also be used for batch processing, making it a versatile tool. Starburst Galaxy and Starburst Enterprise both integrate with Flink.

Apache Spark

Spark is a general-purpose processing framework that allows for custom data ingestion tasks. It is not a dedicated streaming or messaging platform; instead, it works more like a general-purpose scripting environment. In that sense, it is both powerful and customizable, but also technical and complex. Many data pipeline ingestion workflows run on Spark.

Advantages and disadvantages of batch ingestion and streaming ingestion

Batch ingestion

Advantages:

  • Easy to audit.
  • Particularly suitable for compliance, for example, GDPR.
  • Simple to troubleshoot and fix if something goes wrong.

Disadvantages:

  • There is a delay between the batch being ingested and its availability for consumption.
  • It takes time before data can be queried by consumers downstream.

Streaming ingestion

Advantages:

  • Data is available in near-real time.
  • Data can be queried very quickly.
  • Very useful for rapidly updated information.

Disadvantages:

  • Only suitable for more specific use cases, including:
    • Click-stream data
    • Data that has to be ingested into the data lake in formats such as JSON or Avro
    • Situations with specific data quality requirements that call for streaming

Structure layer

When data is transformed from the Land layer into a new location, that new location is called the Structure layer. 

In a modern data lake or data lakehouse, the Structure layer is the source of truth for all operations and most of the major transformations occur between the Land layer and Structure layer. 

The exact transformations depend on the datasets and structures involved, but they typically include the following steps.

Normalization

Data from multiple sources usually arrives in different structures. Normalization harmonizes these differences into a common structure. To do this, divergent fields must be mapped to a common schema.

For example, the date format from one organization might conform to the mm-dd-yyyy structure, while the date format from another organization might conform to the dd-mm-yyyy structure. Or a Customer ID from one system might include a different number of digits than the Customer ID from another. 

If you want to be able to query data from both datasets, you will have to normalize the data into a single structure with defined rules.

In the modern data lake, this typically happens after the data is queried using schema-on-read. Starburst Galaxy and Starburst Enterprise both allow you to use SQL to enact these transformations, including INSERT INTO statements that write the normalized data to its new location. The exact degree of normalization depends on how much the structures of the datasets involved differ.
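
As a minimal sketch of what this normalization might look like in SQL, the statement below reads raw records and writes a harmonized copy into a Structure-layer table; the catalog, schema, table, and column names are hypothetical placeholders.

```sql
-- Normalize divergent source formats into one agreed structure.
-- All catalog, schema, table, and column names are hypothetical.
INSERT INTO lake.structure.customers
SELECT
    -- Harmonize date strings such as '12-31-2023' (mm-dd-yyyy)
    -- into a proper DATE value.
    CAST(date_parse(order_date, '%m-%d-%Y') AS DATE) AS order_date,
    -- Pad customer IDs to a common width so both source systems match.
    lpad(customer_id, 10, '0') AS customer_id,
    customer_name
FROM lake.land.customers_raw;
```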

Data validation

Data validation is also used as a tool in the Structure layer. This step checks that all data conforms to the agreed normalization rules and sets down rules for cleansing any data that does not conform.

Data validation also uses SQL, and both Starburst Galaxy and Starburst Enterprise make it easy to perform validation in a data pipeline. To do this, IF statements are written to check for certain conditions using boolean logic. If the conditions are true, the data is valid; if they are false, the data is not valid.
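
A minimal sketch of this kind of check, assuming the same hypothetical Structure-layer table as above, might look like the following; the validation rules themselves are illustrative.

```sql
-- Return any rows that break the agreed validation rules,
-- so they can be cleansed or quarantined.
-- Table names, column names, and rules are hypothetical.
SELECT
    customer_id,
    order_date,
    if(customer_id IS NOT NULL AND length(customer_id) = 10, 'valid', 'invalid') AS id_check,
    if(order_date <= current_date, 'valid', 'invalid') AS date_check
FROM lake.structure.customers
WHERE customer_id IS NULL
   OR length(customer_id) <> 10
   OR order_date > current_date;
```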

Data enrichment

The Structure layer also involves data enrichment. This occurs when a dataset’s schema needs to be widened with additional columns to accommodate new data. This might happen if a schema evolves beyond its initial design or if a new column is being added for other reasons.

For example, a customer record might include an additional column for credit scores only when such a check is necessary. Adding this column is enrichment. Notably, enriched data is often drawn from outside systems beyond the Land layer.

Like the other transformations in the Structure layer, enrichment can be achieved using SQL in Starburst Galaxy or Starburst Enterprise. To do this, the business logic controlling the enrichment must be translated into SQL code, often involving conditional logic and IF statements. 

Sometimes this involves the use of specialized functions known as user-defined functions (UDFs), which encapsulate the logic used to look up and attach the enriched data.
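
As a rough sketch of enrichment in SQL, the query below joins in a credit score from an external source only when a check is required, mirroring the example above; every catalog, table, and column name here is a hypothetical placeholder.

```sql
-- Widen the customer dataset with a credit score drawn from an
-- external system. All names are hypothetical placeholders.
CREATE TABLE lake.structure.customers_enriched AS
SELECT
    c.customer_id,
    c.customer_name,
    c.order_date,
    -- Only attach a score when a credit check is actually required.
    CASE
        WHEN c.credit_check_required THEN s.credit_score
        ELSE NULL
    END AS credit_score
FROM lake.structure.customers AS c
LEFT JOIN partner.bureau.credit_scores AS s
    ON c.customer_id = s.customer_id;
```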

Technical transformation

The final step is technical transformation. Once the other transformations are complete, the data has to be saved in a new, optimized format. This is the moment when the Structure layer is created.
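
A minimal sketch of this step might use a CREATE TABLE AS statement that writes the transformed data out in a columnar format such as Parquet; the names are hypothetical, and the exact table properties depend on the connector and table format in use.

```sql
-- Persist the transformed data in an optimized columnar format.
-- Names are hypothetical; available properties vary by connector.
CREATE TABLE lake.structure.customers_final
WITH (format = 'PARQUET') AS
SELECT *
FROM lake.structure.customers_enriched;
```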

Consume layer

The Consume layer contains data that has finished its journey through the data pipeline. At this stage, it is ready to be used by data consumers. 

How is data consumed?

Data is consumed in a number of different ways. These include: 

Queries

The most basic way in which data is consumed is through ad hoc queries. These might be: 

  • Views, which return data matching the query. 
  • Materialized views, which cache that data for use at a later date. 
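
A minimal sketch of these two options, using hypothetical table and view names, might look like the following; materialized view support depends on the connector in use.

```sql
-- A view returns matching data each time it is queried.
CREATE VIEW lake.consume.recent_orders AS
SELECT customer_id, order_date, order_total
FROM lake.structure.orders
WHERE order_date >= current_date - INTERVAL '30' DAY;

-- A materialized view caches the result for later reuse.
CREATE MATERIALIZED VIEW lake.consume.recent_orders_mv AS
SELECT customer_id, order_date, order_total
FROM lake.structure.orders
WHERE order_date >= current_date - INTERVAL '30' DAY;
```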

Business Intelligence (BI) tools

Data from the Consume layer is also used to feed BI tools, including:

  • Tableau
  • Power BI

Data products

Data products are curated datasets that allow you to tag, save, and share data across teams. This exciting development empowers data consumers to take charge of their own datasets in new ways. Starburst Galaxy and Starburst Enterprise both allow use of data products.

Typical data consumption activities

Users consume data in a number of different ways, including: 

Aggregation

Aggregations involve grouping data and computing summary values for each group. Typically, these analytical queries use the SQL GROUP BY clause, often combined with conditions that limit which groups are returned.
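
A small sketch of an aggregation of this kind, using hypothetical table and column names, might look like this:

```sql
-- Total sales and order counts per region and month,
-- returning only groups above a minimum size.
-- Table and column names are hypothetical.
SELECT
    region,
    date_trunc('month', order_date) AS order_month,
    sum(order_total)                AS total_sales,
    count(*)                        AS order_count
FROM lake.consume.orders
GROUP BY region, date_trunc('month', order_date)
HAVING count(*) > 100;
```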

Rollups and CUBEs

Sometimes multiple aggregations are necessary. In these instances, ROLLUP can be used to compute several levels of GROUP BY aggregation in a single query.

A related construct, CUBE, goes further and aggregates across every combination of the grouping columns, generating a multi-dimensional model.
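
As a brief sketch, the queries below contrast the two, again using hypothetical names:

```sql
-- ROLLUP produces subtotals at each level of the hierarchy:
-- (region, month), (region), and a grand total.
SELECT region, order_month, sum(order_total) AS total_sales
FROM lake.consume.monthly_orders
GROUP BY ROLLUP (region, order_month);

-- CUBE produces every combination of the grouping columns,
-- giving a multi-dimensional view of the same data.
SELECT region, order_month, sum(order_total) AS total_sales
FROM lake.consume.monthly_orders
GROUP BY CUBE (region, order_month);
```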

Windowing 

Another activity involves the use of window functions. These can be used to create rolling averages and other calculations that relate each row to a set of surrounding rows.

For example, windowing might be used to track rolling stock prices or average sales over time. Data consumers typically use this to create reports, find trends, and build actionable business intelligence.
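
A minimal sketch of a rolling average computed with a window function, using hypothetical names, might look like this:

```sql
-- Seven-row rolling average of daily sales per region.
-- Table and column names are hypothetical.
SELECT
    region,
    order_date,
    daily_sales,
    avg(daily_sales) OVER (
        PARTITION BY region
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_avg_7d
FROM lake.consume.daily_sales
ORDER BY region, order_date;
```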

Tutorial 1: Lakehouse reference architecture

Setting the stage

Let’s unpack the modern data lake by exploring a data lakehouse reference architecture in more detail. The video below introduces the topic and outlines the key role played by each component in the data pipeline process.

Tutorial 2: Common vocabulary

Understanding the terminology

Modern data lakes and data lakehouses have a lot of complex vocabulary associated with them. The video below unpacks this terminology in more detail.

Tutorial 3: Technologies and tools

Knowing your toolset

Data pipelines can be constructed using a number of different technologies and tools. The video below outlines which tools you are likely to encounter in this space and where Starburst Galaxy fits.
