Data Lake Storage

Here, we explore the three types of data that data lakes store (structured, semi-structured, and unstructured) and how data lakes leverage each of them. We also cover the operational mechanisms that make a data lake function efficiently: schema management, data organization, and metadata storage.

Structured data

Structured data, meaning data organized into fixed rows and columns, makes up a significant share of the data held in most storage systems, and data lakes are no exception. Although data lakes are best known for accepting raw data of any type, they can and do store structured data alongside other formats.

Semi-structured data

Semi-structured data has some organizing structure, such as tags or key-value pairs, but not enough to fit neatly into a relational database or other traditional data storage system. Examples include JSON, XML, and CSV files.

This is where data lakes help: they can store semi-structured data alongside structured data, so both types coexist in the same system.

Unstructured data

Unlike structured data and semi-structured data, unstructured data does not conform to any preset schema or format. Examples include free text, images, audio, and video. As such, unstructured data is poorly suited to storage in a relational database and is typically stored in a data lake instead.

Schema-on-read vs schema-on-write

Schemas define the structure of a dataset.

There are two main methods of organizing schemas that impact the storage of data in a data lake: schema-on-write and schema-on-read.

Schema-on-write

Schema-on-write is a data management construct where the schema is defined before any data is written. When data later enters the system, it must comply with this schema from the outset. Data that does not fit the schema is rejected by the system.
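As a rough illustration of the idea, the sketch below checks each record against a schema before it is stored. The schema, the table, and the write_record helper are hypothetical names invented for this example; a real system would enforce the schema inside the database engine.

```python
# Hypothetical schema: column name -> required Python type.
ORDERS_SCHEMA = {"order_id": int, "customer": str, "total": float}


def write_record(table: list, record: dict, schema: dict) -> None:
    """Append a record only if its columns and types match the schema."""
    if set(record) != set(schema):
        raise ValueError(
            f"columns {sorted(record)} do not match schema {sorted(schema)}"
        )
    for column, expected_type in schema.items():
        if not isinstance(record[column], expected_type):
            raise ValueError(f"column {column!r} must be {expected_type.__name__}")
    table.append(record)


orders = []  # stands in for a table; an assumption of this sketch

# Accepted: every column is present and has the expected type.
write_record(orders, {"order_id": 1, "customer": "Ada", "total": 19.99}, ORDERS_SCHEMA)

try:
    # Rejected at write time: "total" is a string, not a float.
    write_record(orders, {"order_id": 2, "customer": "Bob", "total": "n/a"}, ORDERS_SCHEMA)
except ValueError as err:
    rejected_reason = str(err)
```

The bad record never reaches storage, which is the trade-off of this approach: stored data is reliably clean, but anything unexpected is turned away at the door.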

This construct is closely tied to relational database management systems, and is useful in cases where the data in question already fits a specific, predictable format known in advance. Although used in some data lakes, this approach is more often associated with data warehouses.

Schema-on-read

Schema-on-read is a data management construct where the schema is applied when the data is read. Unlike schema-on-write, no schema validation takes place when data is written to the data lake, so data can be stored in its raw form. Instead, the data is interpreted and validated against a schema only when it is queried. Schema-on-read is often associated with data lakes.
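A minimal sketch of this behavior follows. Raw JSON lines are stored without any checks, and a schema chosen by the reader is applied only when the data is read. The field names, the READ_SCHEMA mapping, and the read_with_schema helper are assumptions made for illustration, not part of any particular query engine.

```python
import json

# Raw lines are stored exactly as they arrived; nothing is checked on write.
raw_lines = [
    '{"order_id": 1, "customer": "Ada", "total": "19.99"}',
    '{"order_id": 2, "customer": "Bob"}',
    '{"order_id": "3", "customer": "Cy", "total": 5, "coupon": "X1"}',
]

# Hypothetical schema chosen by the reader: column name -> converter.
READ_SCHEMA = {"order_id": int, "customer": str, "total": float}


def read_with_schema(lines, schema):
    """Parse each raw line and project it onto the schema at read time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for column, convert in schema.items():
            value = record.get(column)
            # Missing fields become None; fields outside the schema are ignored.
            row[column] = convert(value) if value is not None else None
        rows.append(row)
    return rows


rows = read_with_schema(raw_lines, READ_SCHEMA)
```

Because the schema lives with the reader rather than the writer, a different consumer could read the same raw lines with a different schema, for example one that also keeps the coupon field.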

Indexing, partitioning, and bucketing

Traditionally, data lakes do not have built-in indexing capabilities of the kind found in relational databases. Some vendors, including Starburst, are addressing this gap with software enhancements.

By their nature, data lakes store large volumes of data. Without indexes or some other way to narrow the search, a SQL query may have to read every row of every table it touches. This makes queries slow and hampers overall system efficiency.

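Partitioning and bucketing address this by dividing large datasets into smaller groupings. Partitioning groups data by the values of a chosen column, and you can think of each partition as a sub-directory in a larger directory system. Bucketing distributes rows into a fixed number of files based on a hash of a column. Both let a query skip data it does not need, which improves the speed and efficiency of the system.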
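The sketch below shows one way these groupings can be laid out, using the key=value directory naming commonly seen in Hive-style tables. It only builds path strings and does not write any files. The table name, column names, bucket count, and the use of CRC32 as the hash are all assumptions chosen for illustration; real engines use their own hash functions.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical bucket count


def bucket_for(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a key to a bucket with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def file_path(record: dict) -> str:
    """One directory per partition value, one file per bucket."""
    bucket = bucket_for(record["customer_id"])
    return f"orders/order_date={record['order_date']}/bucket_{bucket:02d}.parquet"


def prune(paths: list, order_date: str) -> list:
    """Keep only the files in the partition a query actually needs."""
    prefix = f"orders/order_date={order_date}/"
    return [path for path in paths if path.startswith(prefix)]


records = [
    {"order_date": "2024-01-01", "customer_id": "c-100"},
    {"order_date": "2024-01-01", "customer_id": "c-200"},
    {"order_date": "2024-01-02", "customer_id": "c-100"},
]
paths = [file_path(record) for record in records]

# A query filtered on order_date = '2024-01-02' reads one file, not all three.
to_scan = prune(paths, "2024-01-02")
```

Partition pruning helps when queries filter on the partition column, while bucketing helps when queries look up or join on the bucketed column, since rows with the same key always land in the same bucket.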


Storing metadata

The data inside a data lake can be stored in a number of different formats. Querying that data requires a record of those differences: which files exist, what format each one uses, and how they are organized. This kind of descriptive information is called metadata, and the storage of metadata is an integral aspect of any data lake.

Here, we’ll learn how metadata is stored, and why it is critical in data lakes.

Structuring metadata

Metadata contains information showing how all of the data files in the data lake are organized. For this reason, even though the data lake itself is not structured, the metadata about that data is always structured and held in a separate repository.

To query data in a data lake, we need to understand how the data is structured. Since structure wasn’t imposed on the data when it was brought into the data lake, we must supply this metadata via a metastore before the data can be effectively queried.

Metastores

A metastore is a special, dedicated repository used to store metadata relating to the data held in the data lake. It operates as a separate datastore that keeps track of the metadata for the system and fields requests about the structure of a given dataset. Typical metadata includes information relating to the storage system such as the file format, directory structure, and location of data within its files.

Two popular metastores are the Hive Metastore and the AWS Glue Data Catalog (often referred to simply as AWS Glue). These services act as an intermediary between the user’s request and the datasets held in the data lake. When a request is made, information about the structure of the dataset in question is retrieved from the metastore. This information is then used to retrieve the source data from the data lake.
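The lookup flow described above can be sketched with a toy example. Here the "data lake" is a temporary directory holding one raw CSV file, and the "metastore" is a plain Python dictionary. Neither reflects the actual Hive or AWS Glue APIs; the table name, columns, and scan helper are hypothetical and exist only to show the order of operations.

```python
import csv
import tempfile
from pathlib import Path

# Toy "data lake": a temporary directory with one raw CSV file and no header row.
lake = Path(tempfile.mkdtemp())
data_file = lake / "orders" / "part-0000.csv"
data_file.parent.mkdir(parents=True)
data_file.write_text("1,Ada,19.99\n2,Bob,5.00\n")

# Toy "metastore": table name -> location, file format, and column definitions.
metastore = {
    "orders": {
        "location": str(data_file.parent),
        "format": "csv",
        "columns": [("order_id", int), ("customer", str), ("total", float)],
    }
}


def scan(table_name: str) -> list:
    """Look up the table's metadata first, then use it to read the raw files."""
    meta = metastore[table_name]
    if meta["format"] != "csv":
        raise NotImplementedError(meta["format"])
    rows = []
    for path in sorted(Path(meta["location"]).glob("*.csv")):
        with open(path, newline="") as handle:
            for values in csv.reader(handle):
                # The raw file carries no names or types; the metastore supplies both.
                rows.append(
                    {name: cast(value) for (name, cast), value in zip(meta["columns"], values)}
                )
    return rows


orders = scan("orders")
```

The raw file on its own is just comma-separated text. It becomes a queryable table only once the metadata says where it lives, what format it is in, and what each column means.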