Cloud Object Storage

A data lake built on object storage stores its data as a large collection of objects. This affects what information, or metadata, can be stored about each object.
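As a rough illustration of the object model, the sketch below treats a bucket as a flat mapping from keys to objects, where each object carries its bytes plus a small metadata dictionary. The names `StoredObject`, `Bucket`, `put`, `head`, and `list_keys` are hypothetical and exist only for this example; they are not any provider's API.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """One object: opaque bytes plus descriptive metadata."""
    data: bytes
    metadata: dict = field(default_factory=dict)


class Bucket:
    """Hypothetical in-memory stand-in for an object store bucket."""

    def __init__(self):
        self._objects = {}  # flat namespace: key -> StoredObject

    def put(self, key, data, **metadata):
        # Record the size alongside any caller-supplied metadata.
        meta = {"size": len(data), **metadata}
        self._objects[key] = StoredObject(data, meta)

    def head(self, key):
        # Return a copy of the metadata only, not the object's bytes.
        return dict(self._objects[key].metadata)

    def list_keys(self, prefix=""):
        # "Folders" are simply shared key prefixes.
        return sorted(k for k in self._objects if k.startswith(prefix))


bucket = Bucket()
bucket.put("sales/2024/01.csv", b"id,amount\n1,10\n", content_type="text/csv")
bucket.put("sales/2024/02.csv", b"id,amount\n2,20\n", content_type="text/csv")
bucket.put("logs/app.log", b"started\n")
```

In this toy model, `bucket.head("sales/2024/01.csv")` returns the size and content type without touching the data, and `bucket.list_keys("sales/")` lists only the keys that share that prefix.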

Object storage cost considerations

Object storage is significantly cheaper than the alternatives, which can mean substantial savings for organizations. It is often the most economical way to house large amounts of data, giving businesses a strong economic incentive to move to object storage. Although HDFS seemed inexpensive in its time, cloud object storage represented a paradigm shift that affected the industry as a whole.

Distributed storage and concurrency

The ability of object storage to hold vast quantities of data also makes it ideal for workloads that read large amounts of distributed data concurrently. This mirrors the distributed storage seen with HDFS, but improves on it considerably. Object storage offers good concurrency because it allows multiple storage servers to handle reads in parallel, which makes it well suited to parallel processing applications. Query engines like Starburst make processing the data held in object storage faster and more efficient. How data lakes work with object storage, including their use of metadata, has improved in recent years with new technological paradigms.
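The idea of parallel reads can be sketched in a few lines. This is a minimal illustration, not how any particular query engine is implemented: `FAKE_BUCKET` is a hypothetical in-memory dictionary standing in for a real bucket, and `read_object` stands in for what would be a network request in practice.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a bucket: object key -> object bytes.
FAKE_BUCKET = {
    f"part-{i:04d}.csv": f"row-{i}\n".encode() * 100 for i in range(8)
}


def read_object(key):
    # In a real deployment this would be a network request to the store.
    return key, FAKE_BUCKET[key]


def read_all_parallel(keys, max_workers=4):
    # Each object is fetched independently, so the reads can overlap.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(read_object, keys))


objects = read_all_parallel(sorted(FAKE_BUCKET))
total_bytes = sum(len(data) for data in objects.values())
```

Because no object depends on another, the number of workers can be raised or lowered without changing the result, only how many reads are in flight at once.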

Examples of cloud-based object storage

The three largest providers of cloud-based object storage technologies are:

  1. Amazon S3 (AWS)

  2. Microsoft Azure Blob Storage

  3. Google Cloud Storage

Advantages of cloud object storage

Data lakes constructed using cloud storage technology operate differently than their HDFS-based counterparts. These differences reflect the nature of object storage itself, and the cloud environment in which these systems run. Below we outline some of the advantages that cloud object storage has over HDFS installations.

Increase data footprint

Cloud systems allow resources to be increased to meet peak demand as needed. For example, imagine that a system exceeds its storage capacity. In an on-premises installation, additional resources would need to be purchased, hardware shipped and configured, and physical space allocated within one’s own organization. Cloud computing solves this problem by hosting vast numbers of servers around the world. Additional storage is added automatically, within seconds, in the background.

This is also true of compute resources. If additional processing power is required for analysis, this can be purchased and added to a cluster within seconds.

Decrease data footprint

Storage adaptability also occurs in reverse. If additional storage or compute resources are no longer needed, they can easily be disconnected. This helps ensure that resources are managed precisely, helping to control costs and enhancing the surge capacity of cluster deployments.
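Scaling in both directions can be pictured as recomputing the required capacity whenever demand changes. The sketch below is a simplification under stated assumptions: `nodes_needed` is a hypothetical helper, and the demand figures and per-node capacity are made up for illustration.

```python
import math


def nodes_needed(demand_units, units_per_node, min_nodes=1):
    # Smallest node count that covers the demand, never below the floor.
    return max(min_nodes, math.ceil(demand_units / units_per_node))


hourly_demand = [120, 900, 2400, 300]  # made-up workload figures
plan = [nodes_needed(d, units_per_node=500) for d in hourly_demand]
```

The resulting plan grows for the peak hour and shrinks again afterwards, which is the behavior an on-premises cluster sized for peak demand cannot reproduce.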

Enhanced separation of compute and storage

Both object storage and cloud computing further enhance the separation between compute and storage. Originally, compute and storage were not separated because both occurred within the same HDFS nodes. Over time, separation was introduced as part of the shift to object storage because there was no longer a need to run jobs directly on the HDFS storage nodes.

Cloud storage enforces the separation further because it was never designed to combine compute with storage. Services like Amazon S3, Azure Blob Storage, and Google Cloud Storage are purely storage technologies and do not include a query engine. This decoupling means that you pay for compute and storage separately, and only for what you use, which can lead to significant cost savings. It also makes robust query engines, like Starburst, an essential component of cloud object storage.
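A small worked example shows what separate billing means in practice. The prices and the `monthly_cost` function below are placeholders invented for this sketch, not real provider rates or a real billing API.

```python
# All prices below are made-up placeholders, not real provider rates.
STORAGE_PRICE_PER_GB_MONTH = 0.02   # hypothetical
COMPUTE_PRICE_PER_NODE_HOUR = 0.50  # hypothetical


def monthly_cost(stored_gb, node_hours):
    # Storage and compute are billed as independent line items.
    storage = stored_gb * STORAGE_PRICE_PER_GB_MONTH
    compute = node_hours * COMPUTE_PRICE_PER_NODE_HOUR
    return {"storage": storage, "compute": compute, "total": storage + compute}


# Same data volume, very different amounts of querying.
quiet_month = monthly_cost(stored_gb=10_000, node_hours=40)
busy_month = monthly_cost(stored_gb=10_000, node_hours=400)
```

Under these assumed rates the storage line is identical in both months, and only the compute line moves with how much querying was done.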

On-premises installations

With the emphasis on cloud computing often seen today, it’s important to remember that on-premises installations are still an important part of the industry. This approach uses in-house servers and dedicated computing resources to establish the infrastructure necessary to set up the data lake. Deployments of this nature have traditionally been based on HDFS, though object storage is increasingly used and has overtaken HDFS in many businesses.

Because all servers must be located outside of the cloud, scaling is limited to the resources that an organization can acquire in-house. Any additional compute and storage capacity must be added manually. This can be both time-consuming and cumbersome to implement, and is often seen as a limiting factor for on-premises data lake installations.