Building data lakes using AWS S3 object storage
Shaun Bruno
Marketing
Starburst
With Amazon’s Simple Storage Service (Amazon S3), the object storage solution from Amazon Web Services (AWS), you can build a scalable, cost-efficient data lake that supports advanced analytics and data-driven decision-making. For many of our customers, Starburst’s enterprise distribution of the open source Trino query engine powers the analytics of their Amazon S3 data lakes, enabling high-performance queries and reducing the complexity of data ingestion.
This guide to Amazon S3 data lakes will explain their benefits and features as well as the role of Starburst’s open data lakehouse analytics solution.
What is Amazon S3?
Amazon S3 is an object-based storage service provided through the cloud infrastructure of AWS. Able to store structured, semi-structured, and unstructured data types, this service lets companies of any size build cloud-native applications, powerful analytics resources, and enterprise-class storage. Amazon S3 is known for its security, performance, and scalability, with a global footprint that makes it accessible anywhere.
Tutorial: Configure an AWS S3 catalog
Learn how to configure a catalog in Starburst Galaxy that connects to AWS S3 object storage.
Is Amazon S3 a data lake? Why use S3 as a data lake?
By itself, Amazon S3 is not a data lake — but it is an important component of data lakes at some of the largest corporations in the world.
Components of a data lake
A data lake consists of a few core elements. Users and applications interact with the lake’s access layer, powered by massively parallel query engines like Spark or Trino. A second element improves query performance: open file formats like Parquet or ORC and open table formats like Iceberg or Delta Lake, all designed to speed big data processing.
Underpinning this is the commodity storage and compute infrastructure of cloud computing services like AWS. Data lakes decouple compute from storage. Data teams can manage their storage systems and computing needs separately without sacrificing one for the other.
Typically, the storage side of a data lake solution uses an object-based architecture. Rather than storing data as files or blocks, these systems encapsulate data in objects along with rich metadata. Object-based storage uses a flat structure to store objects efficiently. These characteristics make object storage more cost-effective and performant.
Amazon S3 object storage in a data lake
Amazon S3 is an object storage solution that combines object storage’s properties with the near-infinite capacity of the cloud. Scalable, performant, and secure, Amazon S3 is an ideal foundation for a data lake’s big data analytics capabilities.
Other Amazon S3 use cases
Data lakes are not the only ways enterprises apply the object storage capabilities of Amazon S3. Object storage is well-suited to the needs of backup and data archiving processes. In these use cases, data gets written once and only read occasionally. Data type flexibility lets an object storage service save everything in one place to restore backups during disaster recovery or pull data from archive storage to support a business request.
What is the function of S3?
Amazon S3 lets companies store and retrieve any amount of data of any type and access it from anywhere in the world. The system holds S3 objects in a flat pool called an S3 bucket. Each object can be up to five terabytes, allowing buckets to store an enormous amount of data.
In addition to structured, semi-structured, or unstructured data, S3 objects contain prefixes, object tags, and metadata. Prefixes are shared names that allow the grouping of objects in the equivalent of a folder.
Object tags are key-value pairs that help categorize data. For example, a tag identifying objects that store personally identifiable information allows for more granular access control rules and better data privacy enforcement.
Object metadata are name-value pairs that describe the object and its contents. System-defined metadata, some Amazon-controlled and some user-controlled, records information about the object, such as its creation date or storage class. Amazon’s customers can assign user-defined metadata to make objects easier to manage and their contents faster to discover and retrieve.
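To make the object model above concrete, here is a minimal in-memory sketch of objects carrying keys, tags, and metadata, and of how a shared prefix groups objects folder-style within a flat namespace. The class, field names, and sample keys are hypothetical illustrations, not the S3 API; real access would go through an SDK such as boto3.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class S3Object:
    """Illustrative stand-in for an S3 object: a key plus its
    object tags and user-defined metadata."""
    key: str                                       # e.g. "sales/2024/q1.parquet"
    tags: dict = field(default_factory=dict)       # key-value object tags
    metadata: dict = field(default_factory=dict)   # name-value metadata

def group_by_prefix(objects, delimiter="/"):
    """Group object keys by their first segment, the way a shared
    prefix acts as a folder-like grouping in flat object storage."""
    groups = defaultdict(list)
    for obj in objects:
        prefix = obj.key.split(delimiter, 1)[0]
        groups[prefix].append(obj.key)
    return dict(groups)

objects = [
    S3Object("sales/2024/q1.parquet", tags={"pii": "false"}),
    S3Object("sales/2024/q2.parquet", tags={"pii": "false"}),
    S3Object("hr/payroll.parquet", tags={"pii": "true"},
             metadata={"owner": "hr-team"}),
]

print(group_by_prefix(objects))
# {'sales': ['sales/2024/q1.parquet', 'sales/2024/q2.parquet'],
#  'hr': ['hr/payroll.parquet']}
```

Note how the `pii` tag on the payroll object is exactly the kind of marker that access control rules or privacy tooling could key on.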
The S3 platform offers various data management tools. S3 Versioning, for example, lets you keep every version of an object for future retrieval. If a user or application error corrupts an object, for example, S3 Versioning lets you retrieve and restore a particular version.
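The recovery workflow S3 Versioning enables can be sketched with a toy in-memory versioned store. This is loosely modeled on the behavior described above, not on the S3 API; the class name and integer version ids are illustrative (real S3 version IDs are opaque strings).

```python
from collections import defaultdict

class VersionedBucket:
    """Toy versioned object store: every put keeps the prior versions."""
    def __init__(self):
        self._versions = defaultdict(list)  # key -> bodies, oldest first

    def put(self, key, body):
        self._versions[key].append(body)
        return len(self._versions[key]) - 1  # stand-in for a version id

    def get(self, key, version=None):
        """Return the latest version by default, or a specific one."""
        bodies = self._versions[key]
        return bodies[-1] if version is None else bodies[version]

bucket = VersionedBucket()
v0 = bucket.put("report.csv", "good data")
bucket.put("report.csv", "corrupted data")    # accidental bad overwrite

print(bucket.get("report.csv"))               # latest: corrupted
print(bucket.get("report.csv", version=v0))   # restore point: good data
```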
What are the features of S3?
Amazon has developed an extensive S3 feature set that lets startups and globe-spanning enterprises build their cloud infrastructure. Some features include:
S3 storage classes
Amazon S3 uses a tiered pricing structure that balances benefits like performance, accessibility, and durability with cost. S3 customers can assign objects to appropriate storage classes to optimize the performance and expenses associated with their data lakes. Some Amazon S3 storage classes include:
S3 Standard: the default option for frequently accessed data.
S3 Express One Zone: for low-latency retrieval of frequently accessed data.
S3 Intelligent-Tiering: a storage class that monitors access patterns and automatically moves objects to lower-cost access tiers.
S3 One Zone-Infrequent Access (S3 One Zone-IA): stores infrequently-accessed data in one availability zone rather than redundantly across three zones.
S3 Glacier Deep Archive: the lowest cost storage option for rarely-accessed data.
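A simple decision rule illustrates how a data team might map objects onto the classes above. The access-frequency thresholds here are entirely made up for illustration; in practice this mapping is usually delegated to S3 Lifecycle rules or Intelligent-Tiering rather than hand-rolled.

```python
def pick_storage_class(days_since_last_access: int, needs_multi_az: bool) -> str:
    """Hypothetical tiering rule: hotter data stays in higher-cost,
    lower-latency classes; cold data sinks toward archive storage."""
    if days_since_last_access <= 30:
        return "S3 Standard"                 # frequently accessed
    if days_since_last_access <= 180:
        # One Zone-IA trades multi-AZ redundancy for lower cost
        return "S3 Standard-IA" if needs_multi_az else "S3 One Zone-IA"
    return "S3 Glacier Deep Archive"         # rarely accessed

print(pick_storage_class(5, True))      # S3 Standard
print(pick_storage_class(90, False))    # S3 One Zone-IA
print(pick_storage_class(400, True))    # S3 Glacier Deep Archive
```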
Access control
By default, Amazon S3 limits access to buckets and objects to the identities that created them, and its Block Public Access settings can block public access at the bucket or account level. More refined access control rules are possible using the service’s identity and access management (IAM) policies, access control lists (ACLs), and other features to handle authentication and permission assignment.
Data management
Amazon S3 Object Lambda lets data teams add custom code to S3’s GET, HEAD, and LIST requests. This code can modify data while it’s being returned. Simple changes include resizing images for display on mobile devices. More complex Lambda applications could mask or delete data tagged as sensitive.
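The sensitive-data masking idea can be sketched as a pure transform function. The field names and masking logic are hypothetical; in a real deployment this kind of function would run inside an AWS Lambda handler wired to an Object Lambda Access Point, returning the modified body while the GET request is in flight.

```python
# Hypothetical set of field names considered sensitive.
SENSITIVE_FIELDS = {"ssn", "email"}

def mask_sensitive(record: dict) -> dict:
    """Return a copy of the record with sensitive fields redacted,
    the kind of in-flight rewrite an Object Lambda function applies."""
    return {
        key: ("***" if key in SENSITIVE_FIELDS else value)
        for key, value in record.items()
    }

print(mask_sensitive({"name": "Ada", "ssn": "123-45-6789"}))
# {'name': 'Ada', 'ssn': '***'}
```

Because the rewrite happens at read time, the stored object itself is unchanged; callers without the Object Lambda path would still see the raw data, which is why masking is usually paired with access controls on the underlying bucket.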
Amazon CloudWatch provides automated monitoring that helps keep S3 costs under control. CloudWatch tracks AWS resource usage and can raise alerts when forecast charges approach pre-defined thresholds.
What are the advantages of using S3?
Amazon S3 offers several benefits for organizations developing their cloud computing infrastructure. These general benefits are particularly important for constructing an effective data lake.
Single data repository
Amazon S3’s ability to store any kind of data at petabyte scales lets it deliver the core premise of a data lake architecture: a central repository for all enterprise data. As the company generates more data from more data sources, an S3-based data lake can scale to meet the demand.
Cost and performance optimization
With eight storage classes, a suite of management tools, and efficient object storage, Amazon S3 architectures keep storage costs under control. At the same time, S3 users can allocate objects to storage classes that deliver the right balance of accessibility and latency.
Security and compliance
A secure infrastructure, support for granular access control policies, and a global footprint give Amazon S3 customers the resources they need to secure and protect their data. Object tags and metadata can reinforce regulatory compliance by, for example, ensuring data complies with European Union data sovereignty requirements.
Enterprise-scale data lake analytics
As part of a data lake architecture, Amazon S3 provides the globally accessible, scalable object storage needed to place all enterprise data at users’ fingertips. Individual data consumers can use analysis applications to uncover the trends impacting their business. Data scientists can leverage S3’s capacity to drive the most data-hungry machine learning projects.
Can I use S3 as a data warehouse?
In certain situations, companies may need to run data warehouses on top of data lakes. Warehouses offer robust analytics capabilities and may be better suited to meeting strict service-level expectations. In this scenario, the data lake becomes the central repository from which the warehouse extracts its data. Since the lake’s flat storage holds raw data while the warehouse needs data structured to its schema, engineers must develop ETL pipelines to handle the transfer.
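A single step of such a lake-to-warehouse ETL pipeline might look like the toy transform below: raw JSON events from the lake are conformed to a fixed warehouse schema. The column names, field names, and normalization choices are hypothetical illustrations.

```python
import json

# Hypothetical target schema for the warehouse table.
WAREHOUSE_COLUMNS = ("order_id", "amount_usd", "country")

def to_warehouse_row(raw_json: str) -> tuple:
    """Conform one raw lake record to the warehouse schema:
    parse, coerce types, and standardize values."""
    event = json.loads(raw_json)
    return (
        int(event["order_id"]),                     # enforce integer key
        round(float(event.get("amount", 0.0)), 2),  # normalize precision
        event.get("country", "unknown").upper(),    # standardize casing
    )

raw = '{"order_id": "42", "amount": "19.991", "country": "de"}'
print(to_warehouse_row(raw))  # (42, 19.99, 'DE')
```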
Many companies choose Amazon Redshift’s data warehousing platform when developing this architecture in AWS, partly because Redshift’s no code/low code integration options can reduce the need for hand-built ETL pipelines.
Starburst’s Redshift connector and support for Redshift catalogs let companies bring our performance and cost optimizations to their S3+Redshift data warehouse analytics.
Querying data in your Amazon S3 data lake
Starburst Galaxy, our modern data lakehouse platform, supports Amazon S3 catalogs, turning Starburst into the access layer for your S3-based data lake and taking the time and complexity out of your S3 data management workloads. Starburst also simplifies Trino deployment by automating the open source query engine’s provisioning, configuration, and tuning.
In addition, connectors for over fifty other enterprise storage systems let you turn Starburst into a single point of access to data for your entire organization. Non-technical data consumers can access the data lake through visualization apps like Tableau. Trino’s support for ANSI SQL lets skilled analysts and data scientists query your S3 data lake directly, exploring its data sets and discovering data relevant to their projects.
Running Starburst Galaxy on Amazon S3 delivers unparalleled query performance and accelerated analytics as your data demand scales to exabytes.
“The decision to deploy Starburst Enterprise was made simpler because it has proven to be a reliable, fast, and stable query engine for S3 data lakes.”
— Alberto Miorin, Engineering Lead, Zalando