Extract, Transform, Load (ETL)
Extract: Data extraction is acquiring data from different sources, whether Oracle relational databases, Salesforce’s customer relationship management (CRM) system, or a real-time Internet of Things (IoT) data stream.
Transform: Data transformation involves converting data from its source schema, format, and other parameters to match the standards set for the destination. Additional data processing steps may cleanse, correct, or enrich the final data sets.
Load: The loading process deposits data, whether processed or not, into its final data store, where it will be accessible to other systems or end users, depending on its purpose.
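Taken together, a minimal sketch of these three stages might look like the following Python example. It is an illustration only, assuming a simple orders table and hypothetical table and column names, not a prescribed implementation.

```python
# Minimal ETL sketch. Table and column names are hypothetical placeholders.
import sqlite3
from datetime import datetime, timezone

def extract(source: sqlite3.Connection) -> list[tuple]:
    """Pull raw order rows from the source system."""
    return source.execute("SELECT order_id, amount, created_at FROM orders").fetchall()

def transform(rows: list[tuple]) -> list[tuple]:
    """Standardize timestamps to UTC and amounts to integer cents."""
    cleaned = []
    for order_id, amount, created_at in rows:
        # Naive timestamps are treated as local time before conversion to UTC.
        ts = datetime.fromisoformat(created_at).astimezone(timezone.utc).isoformat()
        cleaned.append((order_id, round(float(amount) * 100), ts))
    return cleaned

def load(warehouse: sqlite3.Connection, rows: list[tuple]) -> None:
    """Deposit the conformed rows into the warehouse table."""
    warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    warehouse.commit()
```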
Commonly used to populate a central data repository like a data warehouse or a data lake, ETL and ELT pipelines have become essential tools for supporting data analytics.
This guide will compare and contrast these data integration methods, explain their use cases, and discuss how data virtualization can streamline pipeline development and management — or replace pipelines altogether.
What is ELT?
ELT pipelines serve the same purpose as ETL pipelines but execute the steps in a different order. These pipelines extract and load data into the target system, after which transformation occurs. ELT pipelines are routinely used to feed raw data into data lake platforms like Microsoft’s Azure Data Lake.
Evolution from ETL to ELT
As enterprise data sources proliferated, information architectures fragmented into isolated data silos. Each database management system formatted data differently, used different querying methods, and did not share data easily. Moreover, inconsistent governance allowed domains to build their data systems to varying quality standards, schema, format, and other criteria. Data became less accessible, requiring considerable time, effort, and expense to generate valuable insights about the business.
Data warehousing attempted to resolve this situation by creating a central repository from which data analysts could easily discover and use data to support decision-makers. Data engineers had to pull information from the company’s diverse sources and impose consistency to make the data warehouse useful. Over time, a rich ecosystem of ETL tools arose to streamline and automate data warehouse ingestion.
As the era of big data emerged, this architecture began to show its weaknesses. Data warehouses lacked the flexibility and scalability to handle the ever-growing volumes, velocities, and complexities of enterprise data. Data lakes became increasingly popular. Able to leverage commodity, cloud-based data storage, lakes are more scalable and affordable. In addition, lakes can store more diverse data types in their raw formats. Because lakes store raw data, pipelines no longer need to transform data upon ingestion, leaving data processing to each use case. This is why data engineers build ELT pipelines to fill their data lakes.
ETL vs. ELT
Wading in a little deeper than superficial name differences, the ETL vs. ELT comparison comes down to how these pipelines handle data and the data management requirements driving their development.
ETL is an established method for transferring primarily structured data from sources and processing the data to meet a data warehouse’s schema, format, and other requirements. Standardization with ETL pipelines makes data analysis more accessible to non-technical and expert users.
ELT is a more recent method for transferring structured, semi-structured, and unstructured data from sources to the raw data store of a data lake. Leaving data in its natural state creates optionality for data-driven organizations. Data scientists and business intelligence analysts can transform this data on demand to meet the requirements of their analytics projects while leaving the original data intact.
What is an ETL process?
Traditional ETL processes are multi-stage operations that ensure data entering a repository consistently meets governance and quality standards. A business need triggers pipeline development. That need may be the acquisition of a new data source, changes to an existing source, or decision-makers’ demands for insights into business questions.
Prepare
The data management team will lead the resulting project, collecting input from stakeholders to define the project’s goals, objectives, and metrics. With clear direction, an engineer will explore the company’s on-premises and cloud infrastructure for the most relevant source systems and data sets.
The engineer will create data models describing the source data’s schema, data types, formats, and other characteristics. These models will define the data processing steps needed to make the source data conform to the data warehouse’s requirements.
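In practice, that model often takes the form of a source-to-target mapping the pipeline code can read. The snippet below shows one hypothetical way to capture it; every table, column, and type name is illustrative.

```python
# Hypothetical source-to-warehouse mapping captured during data modeling.
# Every table, column, and type name here is illustrative.
SOURCE_TO_WAREHOUSE = {
    "crm.contacts": {
        "target_table": "dim_customer",
        "columns": {
            "cust_id":    {"target": "customer_id", "type": "BIGINT"},
            "full_nm":    {"target": "full_name",   "type": "VARCHAR(200)"},
            "created_dt": {"target": "created_at",  "type": "TIMESTAMP", "transform": "to_utc"},
        },
    },
}
```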
At the same time, the engineer will request the creation of a staging area, a data store separate from sources or destinations where transformation can proceed without impacting other systems.
Extract
In this phase, the pipeline will pull data from the discovered sources and copy it into the staging area. Extraction is rarely as simple as it sounds. Each source has its own way of exporting data. Modern systems will have APIs that allow external systems to request data programmatically.
Relational database management systems will have internal query processes that generate data sets for export. Even when these databases use a common query language, such as SQL, their proprietary implementations mean pipelines must use different queries to accomplish the same task on different sources.
Legacy systems, in particular, may not offer easily accessible methods for extracting data. These scenarios require engineers with specialized expertise in those systems to develop ways to unlock the data.
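As a rough illustration of these source-specific methods, the sketch below stages data from one REST API and one relational database. The endpoint, file paths, and table names are assumptions made for the example.

```python
# Extract sketch: one REST source and one relational source land in the staging area.
# The endpoint, database, and table names are assumptions for the example.
import csv
import sqlite3
import requests

def extract_from_api(staging_dir: str) -> None:
    """Request records from a hypothetical REST API and stage them as JSON."""
    resp = requests.get("https://api.example.com/v1/customers", timeout=30)
    resp.raise_for_status()
    with open(f"{staging_dir}/customers.json", "w", encoding="utf-8") as f:
        f.write(resp.text)

def extract_from_database(staging_dir: str, db_path: str) -> None:
    """Run a source-specific query and stage the result as CSV."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("SELECT id, status, updated_at FROM orders").fetchall()
    finally:
        conn.close()
    with open(f"{staging_dir}/orders.csv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
```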
Transform
Once the pipeline deposits raw data from the sources into the staging area, it transforms data as specified by the data model. Simple transformations include standardizing time and date formats or converting column names and values to consistent measurement units.
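A hedged sketch of such simple transformations, assuming the staged data sits in a pandas DataFrame with illustrative column names:

```python
# Simple transformations on the staged data: one date format, one unit of measure.
# Assumes pandas and illustrative column names.
import pandas as pd

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Parse date strings and render them in a single ISO-8601 format.
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce").dt.strftime("%Y-%m-%d")
    # Convert weights recorded in pounds to kilograms.
    df["weight_kg"] = df["weight_lb"] * 0.45359237
    # Rename to the warehouse's column naming convention.
    return df.drop(columns=["weight_lb"]).rename(columns={"order_date": "order_dt"})
```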
However, ETL pipelines must usually perform more complex transformations. To meet the warehouse’s data quality standards, the pipeline must identify incorrect data and mitigate the errors by, for example, re-querying the source or flagging the issues for the data team. Data cleansing may also fill in missing values, which may require pulling data from another source.
Given the storage constraints of data warehouse systems, ETL pipelines often simplify the extracted data to reduce the final data set’s size. ETL tools can filter out-of-range values and perform deduplication to remove data irrelevant to downstream analysis.
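For example, a filtering and deduplication step might look like the following sketch, again with assumed thresholds and column names:

```python
# Reducing the staged data set: drop out-of-range values and duplicates.
# The thresholds and key columns are assumptions for the example.
import pandas as pd

def reduce_dataset(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only rows whose amount falls inside the plausible range.
    df = df[df["amount"].between(0, 1_000_000)]
    # Keep the most recent record per order_id, discarding earlier duplicates.
    return df.sort_values("updated_at").drop_duplicates(subset="order_id", keep="last")
```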
ETL pipelines also enrich data sets by joining data from other tables, deriving values of more use to warehouse users, and adding additional metadata to facilitate discovery.
Governance and security compliance measures can play a role in the extract, transform, and load process. For example, data privacy regulations limit who may see personally identifiable information. Pipelines can strip street addresses or replace names with anonymized identifiers. Aggregation combines individual data into cohorts or other groupings to ensure warehouse use does not lead to privacy violations. Pipelines can also address security policies — for instance, by encrypting data sets.
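The sketch below illustrates two of these privacy measures: anonymizing names with a salted hash and aggregating individuals into cohorts. The salt handling and column names are assumptions for the example.

```python
# Privacy-oriented transforms: anonymized identifiers and cohort-level aggregation.
# The salt and column names are assumptions for the example.
import hashlib
import pandas as pd

SALT = "example-salt"  # in practice, managed as a secret outside the code

def anonymize(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Replace names with a salted hash so records cannot be tied back to a person.
    df["customer_key"] = df["full_name"].map(
        lambda name: hashlib.sha256((SALT + name).encode("utf-8")).hexdigest()[:16]
    )
    return df.drop(columns=["full_name", "street_address"])

def cohort_totals(df: pd.DataFrame) -> pd.DataFrame:
    # Aggregate spend by age band and region instead of exposing individual rows.
    return df.groupby(["age_band", "region"], as_index=False)["amount"].sum()
```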
Regardless of the data model’s other requirements, the transformation stage has a primary job: match the transformed data set’s schema to the target system’s schema.
Load
With the final transformation complete, it’s time to load data into the target database. Naturally, the first load requires moving a complete data set into the warehouse. Using a full load the next time the pipeline runs is inefficient, so data teams will use alternative loading processes with shorter load times. Incremental loads append new data to existing tables whenever an update occurs. Batch loads add fresh data each time the pipeline runs.
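As a hedged sketch, an incremental load might append only the rows with timestamps newer than the last successful run; the table layout and timestamp convention below are assumptions.

```python
# Incremental load sketch: append only rows newer than the last successful load.
# The table layout and ISO-8601 timestamp convention are assumptions.
import sqlite3

def incremental_load(warehouse: sqlite3.Connection, staged_rows: list[tuple]) -> None:
    last_loaded = warehouse.execute(
        "SELECT MAX(loaded_at) FROM fact_orders"
    ).fetchone()[0] or "1970-01-01T00:00:00"
    # Each staged row ends with its ISO-8601 timestamp, so string comparison works.
    fresh = [row for row in staged_rows if row[-1] > last_loaded]
    warehouse.executemany(
        "INSERT INTO fact_orders (order_id, amount, loaded_at) VALUES (?, ?, ?)", fresh
    )
    warehouse.commit()
```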
Often, the loading process will overwrite data. The capacity constraints of on-premises data storage make holding old data impractical. Cloud data warehouses are a little more flexible, but scaling their storage capacities is expensive. Data teams must balance physical and budgetary limits against the business’ need for relevant data.
ETL pipelines may also overwrite data to meet compliance requirements. The company’s document retention rules determine how many months or years certain data may remain in the warehouse. Likewise, pipelines can support regulatory compliance by deleting unnecessary personal information.
Validation and audit
Throughout the ETL process, engineers must validate that everything works as expected by creating test cases to evaluate and debug pipeline execution. ETL tools can automate data validation during development and in operations. For example, pipelines can automatically confirm the transformed data set’s statistics meet expectations for size, range, minimum and maximum values, and other criteria.
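A few such automated checks might look like the following sketch, with thresholds and column names chosen purely for illustration:

```python
# Automated validation sketch: confirm the transformed data set meets expectations
# before loading. The thresholds and column names are illustrative.
import pandas as pd

def validate(df: pd.DataFrame) -> None:
    assert len(df) > 0, "transformed data set is empty"
    assert df["order_id"].is_unique, "duplicate order_id survived deduplication"
    assert df["amount"].between(0, 1_000_000).all(), "amount outside expected range"
    assert not df["order_dt"].isna().any(), "missing order dates survived cleansing"
```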
Once an ETL pipeline works, there’s no guarantee that it will continue working. Changes at the source could keep a pipeline from transforming data correctly. Insufficient processing power or memory capacity may cause an ETL job to fail. Engineers must periodically audit every active pipeline to prevent bad data from contaminating the warehouse.
What is an ELT process?
The ELT process is not simply the ETL process with its steps executed in a different order. These pipelines serve different purposes, migrate data between different systems, and handle different kinds of data. As with ETL pipelines, however, ELT pipeline development begins with a business need — new sources or new analytical requirements.
Prepare
Requirements setting, modeling, and planning are essential to the ELT process. Data teams need to understand the business requirements driving the project, the kinds of data users need, as well as the type of analysis the data will support.
However, a detailed data model is less critical to ELT pipeline development since the lake will store raw data.
Extract
The data extraction stage can be just as, if not more, complicated for ELT pipelines as it is for ETL pipelines. In addition to the source-specific methods for retrieving data from structured databases, data teams must find ways to pull semi-structured and unstructured data from many different sources.
On the other hand, many of these sources are modern, often open source systems that expose standards-based methods for retrieving data in formats such as JSON and XML.
Load
The handoff from the extraction to the loading phase is the most significant difference between these migration methods. Unlike an ETL pipeline that must use staging areas to minimize negative impacts on the data warehouse, an ELT pipeline drops the raw data straight into the data lake.
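A minimal sketch of that handoff, assuming a REST source and an S3-based lake with hypothetical endpoint, bucket, and key names:

```python
# ELT sketch: raw JSON moves straight from the source into the lake's object storage.
# The endpoint, bucket, and key prefix are hypothetical.
from datetime import datetime, timezone

import boto3
import requests

def extract_and_load() -> None:
    resp = requests.get("https://api.example.com/v1/events", timeout=30)
    resp.raise_for_status()
    key = f"raw/events/{datetime.now(timezone.utc):%Y/%m/%d}/events.json"
    boto3.client("s3").put_object(
        Bucket="example-data-lake",
        Key=key,
        Body=resp.content,  # stored exactly as received, with no transformation
    )
```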
Since data lakes take advantage of commodity cloud storage, there’s less pressure to cull older data. ELT pipelines aren’t as useful for compliance since the raw data may not have sufficient metadata to determine whether to delete older data objects.
Transform
“ELT pipeline” is something of a misnomer in a traditional data lake. The primary pipeline’s job ends once data arrives. Any subsequent transformation happens in separate pipelines created for each analysis project. Data processing needs vary. A data scientist’s machine learning project may run best on raw data, while a business dashboard will require more involved transformation processes to display information usefully to executives.
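As an illustration, a dashboard project might shape the raw data entirely at query time. The sketch below assumes illustrative table and column names, with an engine object standing in for whatever query engine sits on top of the lake.

```python
# On-demand transformation sketch: a dashboard shapes the raw events at query time.
# Table and column names are illustrative; `engine` stands in for whatever query
# engine sits on top of the lake.
DASHBOARD_QUERY = """
SELECT date_trunc('day', event_time) AS event_day,
       region,
       count(*)                      AS events,
       avg(duration_ms)              AS avg_duration_ms
FROM raw_events
WHERE event_time >= current_date - INTERVAL '30' DAY
GROUP BY 1, 2
ORDER BY 1, 2
"""

def refresh_dashboard(engine):
    return engine.execute(DASHBOARD_QUERY)
```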
Validation and audit
Validation processes for the ELT pipelines that fill a data lake are not as complex as those for ETL processes. Data teams must use appropriate ELT tools to ensure data ingestion occurs without errors. Without any transformations to the raw data, however, there’s less to go wrong. Of course, the pipelines developed to support downstream analysis require thorough testing and validation. Regular auditing must occur if those transformation pipelines support ongoing operations.
Benefits of ETL
Within the limits of what they do, ETL pipelines are a reliable method for migrating data into a data warehouse, and they offer multiple benefits, including:
Skills and staffing
For all the challenges of data team staffing, companies can recruit people with extensive experience in ETL pipeline development. Engineers can tap into a well-established ecosystem of tools, resources, and best practices.
Quality and consistency
Data warehouses work best when filled with high-quality, consistently formatted data with rich metadata. ETL pipelines ingest structured data from disparate sources and apply data quality, schema, and other standards before the data reaches the warehouse.
Compliance
When an ETL pipeline loads data into a warehouse, it has already enforced security, privacy, and other governance policies. Encryption of sensitive data mitigates the risk of data breaches. Masking, aggregation, and other transformations prevent privacy violations. Enriched metadata can reinforce compliance further by enabling more granular access control rules.
Performance and efficiency
An ETL pipeline relieves the data sources and the warehouse of processing and storage burdens. A well-planned, well-modeled pipeline extracts only relevant data from sources, reducing network and compute usage while minimizing export costs.
Neither the sources nor the warehouse handles data transformation, testing, and validation. All of that occurs within the staging area’s dedicated compute and storage environment.
Data accessibility
By improving data quality and consistency, ETL data integration processes enhance the warehouse’s utility. The least technical users can find the data they need. Skilled analysts and scientists can use the warehouse’s query engine to collect and process large data sets without worrying about data quality.
Challenges of ETL for data management
ETL pipelines are brittle. Almost any change at the source can break the pipeline: queries may stop working, or transformations may generate errors when they encounter unexpected values or formats. ETL pipelines are also sensitive to compute resources. If the compute cluster runs into processing or memory limits, the run will fail. In some cases, ETL solutions can fail gracefully and pick up where they left off. If they don’t, the data team must rerun the pipeline.
This brittleness imposes a significant maintenance burden on already over-stretched data teams. And the problem compounds as the business requests more analysis of more data from more sources.
Benefits of ELT
Likewise, ELT pipelines yield significant advantages within a data lake architecture. The benefits of ELT pipelines include:
Availability
Extensive transformations take time to complete and check. Since ELT pipelines do not process raw data during ingestion, the data moves from source to destination much faster than through ETL pipelines. As a result, data consumers who work on raw data get to work sooner. This speed is a particular advantage for applications needing access to real-time data.
Reliability
Similarly, bypassing the transformation stage can make ELT pipelines more reliable. They are not dependent on complex transformations in which a single error can terminate a run.
Flexibility
When used with a data lake’s object storage, ELT pipelines make many more kinds of data available for analysis. The business can discover deeper insights that lead to innovations and more effective decision-making.
Affordable scalability
ELT pipelines in data lake architectures help decouple storage and compute. The pipeline loads large amounts of data into the lake’s commodity cloud storage, which scales automatically to the new demand. Any transformations are left to the data lake’s analytics layer with its own compute optimizations. In both cases, pricing from AWS and other cloud providers is much lower than what proprietary data warehouse vendors charge.
ETL and ELT use cases
Data management teams use pipelines whenever data needs to move from one location to another. However, it’s the integration of data with centralized repositories — and the analytics this migration supports — that gives pipelines their utility.
Data integration | How ETL works with a data warehouse
ETL works best when moving data between systems with highly structured sources, which is why these pipelines see widespread application in both on-premises and cloud-based data warehouses.
Data integration | How ELT works with data lakes
By contrast, ELT works best when quickly moving large volumes of data of any type. This is why ELT pipelines are such good matches for data lakes. Adoption will only increase as data volume, velocity, and complexity keep growing, along with the demands of machine learning and other advanced analytics projects.
Other data integration roles
Still, you can find use cases when ETL and ELT pipelines switch places. For example, an ELT pipeline can populate a data mart from a warehouse since the already-transformed data doesn’t need additional changes before loading. The data mart’s dashboards can apply final transformations.
Traditional data lakes lack the warehouse’s analytics resources, forcing many companies to use lakes as sources for a warehouse layer. ETL pipelines handle the lake-to-warehouse migration.
ETL and ELT made better with Starburst
Starburst’s modern data lake analytics platform simplifies pipeline development — and can eliminate data pipelines entirely. Starburst uses over fifty connectors to create a data virtualization layer that gives users access to any enterprise data source. This single point of access yields many benefits, including:
Compliance, security, and privacy
Starburst’s virtualization combines rich metadata with role-based and attribute-based access controls to enforce granular governance policies that prevent security or privacy breaches.
Data democratization
Even with these controls, Starburst makes enterprise data more accessible to more people in the organization. Data scientists and engineers can write ANSI SQL queries to pull data from multiple sources simultaneously. Additionally, analysis and visualization applications use the same SQL to support less technical data consumers.
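As an illustration of what such a federated query can look like, the sketch below joins a table from a relational catalog with one from a data lake catalog; the catalog, schema, and table names are hypothetical.

```python
# Federated query sketch: one ANSI SQL statement joins a relational catalog with a
# data lake catalog. The catalog, schema, and table names are hypothetical.
FEDERATED_QUERY = """
SELECT c.customer_id,
       c.segment,
       sum(o.amount) AS lifetime_value
FROM postgres.crm.customers AS c
JOIN lake.sales.orders      AS o
  ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.segment
"""
```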
Time to insight
Starburst accelerates data exploration and discovery to uncover deeper insights. Self-serve usage models allow people to get the data they need without sending data team requests. All of this compresses project cycles, returns richer results faster, and enables better data-driven decisions.
In many cases, data engineers can use Starburst in place of their traditional ETL and ELT pipelines. Letting people query data on their own means engineers don’t have to develop new pipelines for every request. Furthermore, they can build products that pull data directly from sources without needing interim extract, transform, or load stages.
Using a modern data lake analytics platform like Starburst pushes data to the heart of an organization’s culture, letting data management teams help drive innovation and better strategic decision-making.