Data Transformation
Data transformation is the process through which data pipelines convert data from a source's structure and format into the structure and format its destination requires. Given the fragmented nature of enterprise data storage, transformation is essential to making data accessible enough to generate the holistic insights that support data-driven decision-making.
This guide will introduce the data transformation process, its benefits, and the role it plays in extract, transform, and load data pipelines.
What is data transformation?
Data transformation is the process of converting and cleaning raw data from one data source to meet the requirements of its new location. Also called data wrangling, transforming data is essential to ingestion workflows that feed data warehouses and modern data lakes. Analytics projects may also use data transformation to prepare warehoused data for analysis.
What is the difference between data transformation and data management?
Data management consists of the technologies and practices used to handle enterprise data, from ingestion through storage, use, and disposal.
Data transformation is one element of data management. As part of data integration workflows, various types of data transformation convert the source data into the formats and structures of the destination datasets.
Why is data transformation important?
Data transformation remains indispensable because enterprise data ecosystems remain stubbornly heterogeneous despite decades of centralization and standardization initiatives. Each application and storage system takes a slightly different approach to formatting and structuring data, and further variations in format, structure, and quality emerge as business domains and regional operations develop their own data systems.
Without data transformation, data analysts would have to fix these inconsistencies each time they tried to combine two data sources. This project-by-project approach consumes resources, risks variations between analyses, and makes decision-making less effective.
The process of transforming data from multiple sources to meet a single standard improves the efficiency of a company’s data analysis operations by delivering the following benefits:
Data quality improvement
Cleaning raw data values is a basic function of transformation processes. Missing values, duplicate data, outliers, and corrupted values are fixed or mitigated to improve data quality.
Data enrichment offers further quality improvements. Other sources may replace data missing from the original. Supplemental data sources can add context that improves downstream use cases.
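For illustration only, a cleanup of this kind might look like the following SQL. The catalog, table, and column names (raw.sales.orders_raw and its fields) are hypothetical, and a real pipeline would tailor the rules to its own sources.

```sql
-- Hypothetical cleanup step: deduplicate raw orders and repair missing values.
-- Catalog, table, and column names are illustrative only.
WITH ranked AS (
    SELECT
        order_id,
        customer_id,
        COALESCE(order_total, 0.00)       AS order_total,   -- fill in missing totals
        COALESCE(order_status, 'unknown') AS order_status,  -- flag missing statuses
        ROW_NUMBER() OVER (
            PARTITION BY order_id
            ORDER BY updated_at DESC
        ) AS rn                                              -- rank duplicates, newest first
    FROM raw.sales.orders_raw
    WHERE order_id IS NOT NULL                               -- drop corrupted rows with no key
)
SELECT order_id, customer_id, order_total, order_status
FROM ranked
WHERE rn = 1;                                                -- keep one row per order
```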
Data consistency
Applying company-wide data standards boosts analysis productivity. Business intelligence projects usually draw on multiple sources. Data science relies on the aggregation of big datasets. Something as simple as competing date formats throws obstacles in users’ paths.
Ensuring consistency among the datasets in the company’s data warehouses reduces the time users spend preparing data so they can focus on generating insights.
Data transformation also makes data about the data more consistent. Standardizing the schema, metadata, and other data properties improves the navigability of data infrastructure and speeds discovery.
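As a small, hypothetical example of what that standardization looks like in practice, the query below reconciles two sources that store dates as differently formatted strings into a single DATE column (the tables, columns, and formats are invented):

```sql
-- Hypothetical standardization: two sources store order dates differently;
-- both are parsed and cast to one DATE type before landing in the warehouse.
SELECT
    order_id,
    CAST(date_parse(order_date_text, '%m/%d/%Y') AS DATE) AS order_date  -- US-style strings
FROM raw.crm.orders_us
UNION ALL
SELECT
    order_id,
    CAST(date_parse(order_date_text, '%Y-%m-%d') AS DATE) AS order_date  -- ISO-style strings
FROM raw.erp.orders_eu;
```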
Data integration
Integrating an enterprise's dispersed and varied data sources creates a central repository for systems and people to access in support of the business. Whether built upon a data lake or an on-premises or cloud data warehouse, this repository combines the different sources into a holistic resource for generating business insights.
Data analysis
Data transformation promotes the democratization of data access. Combining disparate data sources is a data engineering challenge. Few users have the skills to clean and enhance data from relational database management systems, much less real-time clickstreams.
Ingesting raw data through transformation pipelines results in clean, consistent data values and metadata that any analyst can use. Data becomes more accessible, empowering more people to use data and creating a data-driven decision-making culture.
Data modeling
Data models are visualizations of data flows within an information system. These models characterize the system’s data sources, define the destination’s format and other data standards, and document the system’s transformation requirements.
A data model is an essential reference for all transformation processes. Data warehousing requires models to ensure the data pool meets quality and consistency standards.
Engineers also rely on data models when working with their customers to develop data projects. For example, they will work with data scientists to define how to transform data for machine learning algorithms.
Security and compliance
Data transformation is not limited to ingestion pipelines or the preparatory phases of analytics projects. Modifying data can improve the security of data moving between systems. This kind of transformation could be something as commonplace as encrypting data transfers. Or it could be a form of data aggregation that anonymizes personal data to comply with privacy regulations.
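As one sketch of the aggregation approach, the query below rolls individual orders up to region and month and suppresses small groups so that no personal identifiers reach downstream users. The table, columns, and suppression threshold are hypothetical:

```sql
-- Hypothetical compliance transform: aggregate away personal identifiers so
-- analysts see counts and sums, never individual customers.
SELECT
    region,
    date_trunc('month', order_date) AS order_month,
    count(*)                        AS order_count,
    sum(order_total)                AS revenue
FROM warehouse.sales.orders_clean
GROUP BY region, date_trunc('month', order_date)
HAVING count(*) >= 10;              -- suppress groups small enough to identify individuals
```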
Storage optimization
Transformation lets data managers optimize the cost and performance of their storage infrastructure. Deduplication and compression help control the amount of storage a data warehouse uses. Using transformation pipelines to improve metadata quality and better index data produces more efficient queries.
Periodically running transformation pipelines that assign usage tiers to data allows data managers to move infrequently used data to lower-cost storage options while reserving expensive, high-performance storage for the most frequently accessed data.
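A tiering job of that kind could start with a query like the sketch below, which tags each month of order data as hot, warm, or cold based on its age; the table name, tier labels, and thresholds are assumptions, and the actual data movement would happen in a separate step:

```sql
-- Hypothetical tiering query: tag each month of data so rarely queried
-- partitions can later be moved to cheaper storage.
SELECT
    order_month,
    count(*) AS row_count,
    CASE
        WHEN order_month >= date_add('month', -3, current_date)  THEN 'hot'
        WHEN order_month >= date_add('month', -12, current_date) THEN 'warm'
        ELSE 'cold'
    END AS storage_tier
FROM warehouse.sales.orders_by_month
GROUP BY order_month;
```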
What is an example of data transformation? Steps to data transformation
A modern data stack uses extract, transform, load (ETL) or extract, load, transform (ELT) data pipelines to ingest data from various sources into a data warehouse. The difference between the two is whether the pipeline transforms the data before (ETL) or after (ELT) it reaches its destination.
Although no generalization of ETL and ELT processes will match every use case exactly, a typical data transformation process follows these steps:
1. Data discovery
Given the number of data sources available to an enterprise, the first step is identifying the most suitable datasets for a project. The data discovery process involves identifying sources that may be relevant and understanding the types of data, schemas, structures, formats, and other characteristics.
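In a SQL engine like Trino, which underpins Starburst, part of this exploration can happen interactively with the engine's metadata statements. The catalog and table names below are invented for illustration:

```sql
-- Hypothetical discovery session: list what is reachable, then inspect a candidate table.
SHOW CATALOGS;                           -- which sources are connected
SHOW SCHEMAS FROM postgres_crm;          -- what the (hypothetical) CRM database exposes
SHOW TABLES FROM postgres_crm.public;    -- candidate datasets
DESCRIBE postgres_crm.public.customers;  -- column names and types for one candidate
```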
2. Data mapping
Engineers then map the data sources to the destination. The project’s data model will have defined how the new dataset should be structured and formatted to support the project’s business outcomes. Data mapping determines what changes the pipeline will apply to table headers, formats, metadata, and other aspects of each source dataset.
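In practice, much of a mapping reduces to renaming, casting, and lightly reshaping columns so the source fits the destination's model. The sketch below assumes a hypothetical CRM table and target naming convention:

```sql
-- Hypothetical mapping: source columns are renamed and cast to match the
-- destination's data model (snake_case names, standard types).
SELECT
    CAST(CustID AS BIGINT)   AS customer_id,
    trim(CustName)           AS customer_name,
    upper(CountryCode)       AS country_code,
    CAST(SignupDate AS DATE) AS signup_date
FROM raw.crm.customers_src;
```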
3. Data extraction
With all the preparation complete, engineers design the pipeline’s extraction phase. Each data source may have a different process for exporting data. Some will use variations of SQL, while others will provide APIs that return formats such as XML or JSON.
Data extraction pipelines will use the appropriate method for each source to bring data into an interim storage location.
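Where the source can be queried with SQL, that interim landing step might look like the sketch below, which copies the most recent day of rows into a staging table. The catalogs, table names, and write support in the staging location are all assumptions:

```sql
-- Hypothetical extraction: copy the latest day of source rows into an interim
-- staging table before any transformation runs.
CREATE TABLE staging.sales.orders_extract AS
SELECT *
FROM postgres_erp.public.orders
WHERE updated_at >= current_date - INTERVAL '1' DAY;
```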
4. Transform
With the data in temporary storage, the pipeline will apply the transformations needed to map the incoming data to its destination. This is where bad or missing data gets cleaned up, normalization reduces redundancy in tables, and additional data enriches the final dataset.
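Continuing the hypothetical staging example from the extraction step, the transform might repair bad values and enrich orders with customer attributes before writing an interim result (normalizing data into separate reference tables would follow the same pattern):

```sql
-- Hypothetical transform over the staged extract: repair bad values and enrich
-- each order with customer attributes, writing the result to an interim table.
CREATE TABLE staging.sales.orders_transformed AS
SELECT
    o.order_id,
    o.customer_id,
    COALESCE(o.order_total, 0.00) AS order_total,  -- repair missing amounts
    c.segment,                                     -- enrichment from a reference table
    c.region
FROM staging.sales.orders_extract AS o
LEFT JOIN warehouse.ref.customers AS c
    ON o.customer_id = c.customer_id
WHERE o.order_id IS NOT NULL;                      -- discard rows with no key
```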
5. Review
While engineers will conduct manual reviews during pipeline development, automation is the only way to monitor the performance of high-volume pipelines. These systems will ensure that data moving from the source system through the pipeline’s data transformation tools is not inadvertently altered or corrupted.
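Automated checks can be as simple as SQL assertions that return rows only when something is wrong. The two hypothetical checks below guard the interim table from the previous step against lost identifiers and accidental duplication:

```sql
-- Hypothetical automated checks: each query returns rows only if a problem
-- is found, so empty results across all checks mean the batch passed.

-- No order should have lost its identifier in transit.
SELECT 'null_order_ids' AS failed_check, count(*) AS offending_rows
FROM staging.sales.orders_transformed
WHERE order_id IS NULL
HAVING count(*) > 0;

-- No order should have been duplicated by the transform logic.
SELECT 'duplicate_order_ids' AS failed_check, count(*) AS offending_rows
FROM (
    SELECT order_id
    FROM staging.sales.orders_transformed
    GROUP BY order_id
    HAVING count(*) > 1
) AS duplicated
HAVING count(*) > 0;
```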
6. Loading
Finally, the pipeline loads the ingested data into the data warehouse. The loading process could involve appending or replacing data in existing tables. In other scenarios, the load phase will create new tables. In either case, updating the destination’s data catalogs and indexes helps improve data discoverability and query performance.
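In the running hypothetical example, the load step appends the reviewed batch to the warehouse table; a first run could instead build the table with a CREATE TABLE ... AS SELECT statement:

```sql
-- Hypothetical load step: append the reviewed batch to the destination table.
INSERT INTO warehouse.sales.orders
SELECT order_id, customer_id, order_total, segment, region
FROM staging.sales.orders_transformed;
```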
Data transformation with Starburst
Starburst is a modern data lake analytics platform that unites disparate enterprise data sources within a virtual access layer. By leaving data at the source and providing a single point of access to enterprise data no matter where it lives, Starburst empowers data users and frees engineers from routine requests.
In particular, data teams can use Starburst’s integration with dbt Cloud in their pipeline development workflows to simplify data transformation. Starburst’s federated queries become the single interface dbt Cloud uses to combine data from multiple sources.
Discovery: Starburst has connectors to more than 50 enterprise data sources, including relational databases, data warehouses, and data lakes. These connectors provide type mapping, SQL support, and other features that let engineers explore any data source in the company.
Mapping: Using the information gathered in the previous stage, engineers can use dbt Cloud to map data sources to destinations.
Extraction: Starburst’s massively parallel query engine extracts data from the sources and delivers it to a temporary landing zone.
Transformation: dbt Cloud then applies its transformations to the extracted data.
Review: Engineers use dbt Cloud to test and monitor pipeline performance.
Loading: Starburst’s query engine loads the transformed data from the landing zone to the destination data warehouse or data lake.
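As a minimal sketch, a dbt model executed through Starburst's federated query layer can join tables from two different connected sources in ordinary SQL. The catalog, schema, table, and column names below are invented, and the project would need a configured Starburst/Trino connection in dbt Cloud:

```sql
-- models/orders_enriched.sql: a hypothetical dbt model run against Starburst.
-- The federated query engine reads both catalogs; dbt materializes the result.
{{ config(materialized='table') }}

SELECT
    o.order_id,
    o.order_total,
    c.segment,
    c.region
FROM postgres_erp.public.orders AS o          -- one federated source
LEFT JOIN salesforce_crm.sales.accounts AS c  -- another federated source
    ON o.account_id = c.account_id
```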
Once in its final location, the data is available to authorized users throughout the company. Dashboards become more effective by incorporating accurate and consistent multi-source data. Analysts become more productive when using their preferred SQL clients to run queries themselves. And data teams have more time for higher-value projects that require complex transformation pipelines.