
Data Glossary

ACID Transactions

ACID (atomicity, consistency, isolation, durability) transactions are methods for ensuring database integrity.
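As a minimal sketch using Python’s built-in sqlite3 module (the table, names, and amounts are invented), atomicity means a failed transfer rolls back entirely instead of leaving a half-applied debit:

```python
import sqlite3

# Toy ledger: a transfer must be atomic -- either both the debit and
# the credit apply, or neither does.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(amount):
    try:
        with conn:  # transaction: commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ?"
                         " WHERE name = 'alice'", (amount,))
            balance = conn.execute("SELECT balance FROM accounts"
                                   " WHERE name = 'alice'").fetchone()[0]
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ?"
                         " WHERE name = 'bob'", (amount,))
    except ValueError:
        pass  # the rollback already undid the partial debit

transfer(500)  # fails and rolls back: alice only has 100
transfer(30)   # succeeds and commits
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The `with conn:` block supplies the atomicity; the remaining ACID properties (consistency, isolation, durability) are enforced by the database engine itself.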


AI Analytics

AI data analytics is the application of artificial intelligence and machine learning technologies to traditional analytics.


AI Data Strategy

Both data strategy and AI strategy are integral to an organization’s success in the modern technological landscape, and yet they serve distinct purposes.


Anti-Money Laundering

Anti-money laundering or AML consists of the regulations and practices used to prevent the abuse of the financial system in support of terrorism and other crimes.


Apache Airflow

Apache Airflow is an open-source data workflow management framework based on Python that makes pipelines more dynamic, extensible, and scalable than traditional approaches.
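Airflow models a pipeline as a directed acyclic graph (DAG) of tasks. As a rough illustration of that idea in plain Python (this is not Airflow’s actual API; the task names are invented), the standard library’s graphlib can derive a valid execution order from task dependencies:

```python
from graphlib import TopologicalSorter

# Hypothetical three-task pipeline: each task maps to the set of tasks
# it depends on -- the same dependency structure an Airflow DAG encodes.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# Topological order: every task runs only after its dependencies finish.
run_order = list(TopologicalSorter(dag).static_order())
```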


Apache Hadoop (HDFS)

Apache Hadoop is an open-source framework for distributed storage and processing of large datasets; the Hadoop Distributed File System (HDFS) is its storage layer.


Apache Hive

Apache Hive is a fault-tolerant data warehouse system built on top of Hadoop’s distributed storage architecture and is used to enable analytics at scale.


Apache Hudi

Apache Hudi (pronounced “hoodie”) is a transactional data lake platform first developed by Uber to bring data warehouse-like analytics capabilities to data lakes.


Apache Iceberg

Apache Iceberg or Iceberg is an open-source table format that adds data warehouse-level capabilities to a traditional data lake.


Apache Impala

Impala is an SQL query engine for Hadoop-based data architectures.


Apache Parquet

The Apache Parquet file format is a way to bring columnar storage to Hadoop-based data lakes. Parquet supports efficient compression and encoding schemes.
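As a toy illustration of why columnar layouts compress well (plain Python, not the actual Parquet format; the data is invented): storing values column by column keeps similar data contiguous, so a simple run-length encoding collapses repeats.

```python
# Row-oriented records, as an application would produce them.
rows = [("US", 1), ("US", 2), ("US", 3), ("EU", 4)]

# Column layout: each field is stored contiguously instead of interleaved.
columns = {"country": [r[0] for r in rows], "id": [r[1] for r in rows]}

def rle(values):
    """Run-length encode a list: repeated adjacent values collapse."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

encoded = rle(columns["country"])
```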


Apache Spark

Apache Spark is an analytics engine built for processing massive datasets.


Attribute-Based Access Control (ABAC)

Attribute-based access control (ABAC) is a method for dynamically applying access policies based on specific attributes of the user, the data, or the system.
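A minimal sketch of the idea, with invented attribute and field names: the decision evaluates attributes of the user, the resource, and the environment rather than a fixed role.

```python
# Hypothetical ABAC policy: department match, sufficient clearance,
# and an allowed region are all required for access.
def allow_access(user, resource, env):
    return (
        user["department"] == resource["owner_department"]
        and user["clearance"] >= resource["sensitivity"]
        and env["region"] in resource["allowed_regions"]
    )

resource = {"owner_department": "finance", "sensitivity": 2,
            "allowed_regions": {"eu"}}

decision = allow_access({"department": "finance", "clearance": 3},
                        resource, {"region": "eu"})
denied = allow_access({"department": "hr", "clearance": 3},
                      resource, {"region": "eu"})
```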


Business Analytics

Although businesses have always crunched the numbers, “business analytics” refers to a more rigorous approach that applies statistical analysis and other quantitative methods to business data.


Centralized Data

Centralized data is the long-established practice of gathering all data the company generates into an enterprise database, a data warehouse, or, more recently, a data lake.


Change Data Capture (CDC)

Change data capture (CDC) is the process of identifying incremental changes to source systems and transmitting those changes in real time to a target system.
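As a toy sketch of the concept (real CDC tools typically read a database’s transaction log rather than comparing snapshots; the function and data here are invented), change capture emits only the rows that differ between two states of a table:

```python
# Diff two snapshots of a table, keyed by primary key, and emit
# insert/update/delete events -- the incremental changes CDC transmits.
def capture_changes(old, new):
    changes = []
    for key, row in new.items():
        if key not in old:
            changes.append(("insert", key, row))
        elif old[key] != row:
            changes.append(("update", key, row))
    for key in old:
        if key not in new:
            changes.append(("delete", key, old[key]))
    return changes

old_snapshot = {1: "alice", 2: "bob"}
new_snapshot = {1: "alice", 2: "bobby", 3: "carol"}
events = capture_changes(old_snapshot, new_snapshot)
```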


Cloud Data

Cloud data covers any data stored or processed on internet-accessible remote servers, whether company-owned or hosted by third-party cloud services.


Cloud Data Lakehouse

A cloud data lakehouse is a data platform that unifies enterprise data sources within a performant, cost-effective cloud architecture.


Cloud Data Migration

Cloud data migration is the process that moves data from legacy systems to cloud platforms.


Cloud Data Warehouse

A cloud data warehouse is a cloud-based version of the traditional on-premises enterprise data warehouse.


Cloud Native

A cloud-native approach to software development takes full advantage of the cloud’s scalability, elasticity, resiliency, and efficiency.


Cloud Object Storage

Cloud computing makes extensive use of object storage. This has many advantages, including cost, speed, and scalability.


Compute Engine

A compute engine, also called a query engine or an execution engine, is a component of a data processing platform. For example, Trino interprets and executes ANSI-standard SQL queries.


Customer Data Platform

A customer data platform unifies customer data from across the organization in support of Customer 360, the strategic priority of creating unified, end-to-end customer experiences.


Dark Data

Dark data is the dormant contents of a company’s data lakes and other repositories.


Data Analytics

Data analytics is the process that converts raw data into actionable insights. In data-driven organizations, analytics increasingly relies on large datasets.


Data Analytics Architecture

A data analytics architecture is a set of policies and standards that guides the organization as it builds analytical processes.


Data Applications

A data application (or data app) processes and analyzes big data to rapidly deliver insights or take autonomous action.


Data Architecture

Data architecture is a framework that guides how to collect, store, manage, and use data in ways that support an organization’s business goals.


Data Blending

Data blending is the process of combining data sets from different data sources to generate actionable insights that answer specific business questions.


Data Catalog

Data catalogs are data source inventories. They collect metadata about the source’s various assets.


Data Classification

Data classification is a framework for organizing data in ways that improve data management, information security, and risk management.


Data Complexity

Data complexity is an emergent property of enterprise data shaped by volume, velocity, variety, veracity, value, and vigilance — the V’s of big data.


Data Compliance

Data compliance consists of the governance processes for meeting the requirements of internal, industry, and regulatory standards for data security and privacy.


Data Democratization

Data democratization is the goal of enabling organizations and employees to quickly and securely access data so that they can analyze it and make data-driven decisions.


Data Discovery

Data discovery is a technique for gathering data, evaluating it for potential insights, and performing advanced analytics to create actionable insights.


Data Engineering

Data engineering emerged as a specialization of software engineering in response to exploding data volumes.


Data Exploration

Data exploration is an essential preliminary step to analyzing large datasets. Analysts use visualization and statistical methods to understand the quality of the data.


Data Fabric

A data fabric is a data management architecture that uses artificial intelligence and machine learning algorithms to automate data ingestion best practices.


Data Federation

Data federation, or federated data, involves the creation of a virtual database that maps an enterprise’s many different sources and makes them accessible through a single interface.


Data Governance

A data governance framework is a concept within the discipline of data management that takes a holistic approach to an organization’s data and its lifecycle.


Data Ingestion

Ingestion lands raw data from external sources into a central repository. From there, integration pipelines will transform data to meet data quality, governance, and other business requirements.


Data Integration

Data integration is a series of data management procedures for bringing datasets from different sources into data lakes, data warehouses, or other data repositories.


Data Lake Storage

Data lake storage houses a wide variety of data types, including structured, semi-structured, and unstructured data. Each of these data types serves different analytical purposes.


Data Lakehouse

A data lakehouse combines a data lake and a data warehouse, creating a centralized data repository that uses cost-effective data storage, usually cloud object storage.


Data Lineage

Data lineage refers to the process and tools used to track the origin, movement, characteristics, and transformations of data as it flows through the organization.


Data Mart

A data mart is a repository of data curated to support the needs of a specific department, line of business, or business function.


Data Mesh

Data Mesh – an approach pioneered by Zhamak Dehghani – refers to a decentralized, distributed approach to enterprise data management.


Data Modernization

Data modernization is the process of moving data from the legacy systems of a fragmented, siloed infrastructure to an interconnected ecosystem of modern cloud-based technologies.


Data Observability

Data observability is the set of practices that help organizations understand data health and performance across the enterprise.


Data Pipeline

A data pipeline moves data from its raw state to another location by executing a series of processing steps, making the data usable by data consumers.


Data Platform

A data platform is a technology stack or single solution for managing enterprise data. This system ingests and prepares data at scale for operational and analytical use cases.


Data Preparation

Data preparation is the process that turns raw data from disparate internal and external sources into usable datasets.


Data Privacy

Data privacy comprises the rights of consumers to control when and how organizations may collect and use their personally identifiable information (PII).


Data Products

Data products are curated collections of datasets and business-approved metadata designed to solve specific, targeted questions.


Data Quality

Data quality is the state of the data, reflected in its accuracy, completeness, reliability, relevance, and timeliness.


Data Security

A data security strategy protects digital information from the consequences of human error, unauthorized access, and cyberattacks.


Data Sharing

Data sharing gives multiple users or applications simultaneous, consistent, and high-fidelity access to the same datasets.


Data Silos

Data silos are partially or wholly inaccessible data sets that result from a combination of technical and cultural forces. Proprietary databases and legacy systems are common technical causes.


Data Sovereignty

Data sovereignty is a legal concept defining jurisdiction over data. Specifically, sovereignty establishes the principle that any data collected or stored within a country’s borders is subject to that country’s laws.


Data Swamp

A data swamp is the inevitable outcome of a company’s misunderstanding of how data lakes work. Without a clear and well-supported big data strategy, a data lake deteriorates into a swamp.


Data Transformation

Data transformation is the process of converting and cleaning raw data from one data source to meet the requirements of its new location.


Data Virtualization

Data virtualization is a solution that creates intermediate layers between data consumers and disparate data source systems. These layers give consumers a unified view of data across sources.


Data Warehouse

A data warehouse is a central repository for structured enterprise data. These systems ingest raw data from various data sources through extract, transform, load (ETL) pipelines.


Data Warehouse Architecture

A data warehouse architecture refers to how data gets loaded from source systems into data warehouses and how it is accessed by data consumers.


Database

A database is a large collection of data organized for rapid search and retrieval by a computer.


Database Management System

A database management system (DBMS) manages a database, enabling users to create, read, update, delete, and secure the data within it.


Decentralized Data

Decentralized data architectures decouple the operational plane — where and how data is stored — from the analytical plane — how the business uses data.


Delta Lake

Delta Lake is an open-source data platform architecture that addresses the weaknesses of data warehouses and data lakes in modern big data analytics.


Distributed Data

Distributed data is a practice that stores data where it lives, empowering business analysis through a single point of access.


Extract, Transform, Load (ETL)

ETL pipelines are automated data migration techniques for the ingestion of data from various sources into a target system.
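A minimal, hypothetical ETL sketch in Python: extract raw records, transform them (parse fields and drop malformed rows), and load the result into a target table, here an in-memory SQLite database.

```python
import sqlite3

# Extract: raw comma-separated records from an invented source;
# the second record is deliberately malformed (missing age).
raw = ["alice,34", "bob,", "carol,29"]

# Transform: parse each record and drop rows with a missing age.
parsed = [line.split(",") for line in raw]
cleaned = [(name, int(age)) for name, age in parsed if age]

# Load: write the cleaned rows into the target system.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE users (name TEXT, age INTEGER)")
target.executemany("INSERT INTO users VALUES (?, ?)", cleaned)
count = target.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```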


Fault Tolerance

Fault tolerance is the degree to which failures in a subsystem do not cause the overall system to stop operating.
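One common building block of fault tolerance, sketched here with invented names, is retrying an operation so that a transient failure in a subsystem does not propagate to the caller:

```python
# Retry wrapper: re-invoke a flaky operation up to `attempts` times,
# re-raising only if every attempt fails.
def with_retries(fn, attempts=3):
    for attempt in range(attempts):
        try:
            return fn()
        except RuntimeError:
            if attempt == attempts - 1:
                raise

calls = {"n": 0}

def flaky():
    """Simulated operation that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = with_retries(flaky)
```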


Financial Analytics

Financial analytics is the application of big data analytics techniques to support data-driven decision-making, improve financial risk management, and ensure regulatory compliance.


Hadoop Cluster

Apache Hadoop clusters let companies manage big data processing on commodity hardware. This distributed computing model provided a more cost-effective alternative to proprietary platforms.


Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is a scalable, open-source file system designed to run on commodity hardware while managing the large amounts of data involved in big data applications.


Hadoop Ecosystem

The Apache Hadoop Ecosystem is a collection of open-source software projects designed to work with Hadoop distributed data processing platforms.


Healthcare Analytics

Healthcare analytics is the application of advanced data analytics solutions to the healthcare industry’s unique requirements.


Hybrid Cloud

Hybrid cloud is an architecture that manages storage, networking, and compute resources across different environments. This structure may include on-premises data centers, private clouds, and public clouds.


Hypothesis-Driven Development

Hypothesis-driven development (HDD), also known as hypothesis-driven product development, is an approach used in software development and product management.


Incident Response

According to the National Institute of Standards and Technology (NIST), incident response is the reaction to violations of computer security policies, acceptable use policies, or standard security practices.


Massively Parallel Processing

Massively parallel processing is an architecture for distributing workloads across hundreds or thousands of separate processors.
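A toy illustration of the MPP pattern, with threads standing in for the independent nodes of a real cluster: partition the data, scan each partition in parallel, then combine the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

# Invented dataset, split into four partitions (as an MPP system would
# distribute data across nodes).
data = list(range(1_000))
partitions = [data[i::4] for i in range(4)]

# Scan all partitions concurrently, then combine the partial results.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, partitions))

total = sum(partial_sums)  # same answer as a serial scan
```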


Multi-cloud

A multi-cloud infrastructure uses cloud services from two or more vendors.


Object Storage

Object storage is an alternative to traditional file systems for storing large amounts of unstructured data in scalable, cost-efficient, and performant ways.


Online Analytical Processing (OLAP)

Online analytical processing (OLAP) systems are data analysis platforms that centralize large amounts of data from disparate sources.


Open Data Lakehouse

An open data lakehouse is a data analytics architecture that combines a data lake’s cost-effective storage with a data warehouse’s robust analytics.


Open Data Warehouse

An open data warehouse is an open source alternative to monolithic, proprietary applications like Teradata or Snowflake.


Open File Formats

An open file format is a specification for the way data gets written to storage.


Open Table Formats

Open table formats are designed to provide enhanced performance and compliance capabilities for data lakes using cloud-based object storage.


PostgreSQL

PostgreSQL is an open-source relational database management system (RDBMS) with a rich feature set, reliability, and performance that competes with commercial alternatives.


Presto

Presto (formerly PrestoDB) and Trino (formerly PrestoSQL) are both SQL query engines designed for high-performance, distributed SQL analytics.


Query Acceleration

Query acceleration is a set of techniques for minimizing data processing workloads when analyzing a large amount of data.


Query Engine

A query engine takes a request for data, translates it from human to machine language, and then fulfills the request by retrieving specific data.


Real-Time Analytics

Real-time analytics is the ingesting, processing, and analyzing of the output from real-time data sources such as Internet of Things (IoT) sensors or event streams.


Reference Data

Reference data categorizes information and defines the ranges of permissible values to ensure consistency in use across business processes and between systems.


Risk Management

Risk management is the process of identifying, assessing, analyzing, prioritizing, mitigating, controlling, and monitoring potential exposures to busi...


Role-based Access Control

Role-based access control is a system of fine-grained access privileges granted to authorized users to perform a defined set of tasks.
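A minimal sketch, with invented role and permission names: permissions attach to roles, and a user’s access is resolved entirely through role membership.

```python
# Hypothetical RBAC tables: roles own permissions; users own roles.
ROLE_PERMISSIONS = {
    "analyst": {"read_reports"},
    "admin": {"read_reports", "manage_users"},
}
USER_ROLES = {"dana": ["analyst"]}

def can(user, permission):
    """A user holds a permission only via one of their roles."""
    return any(permission in ROLE_PERMISSIONS[role]
               for role in USER_ROLES.get(user, []))

allowed = can("dana", "read_reports")
denied = can("dana", "manage_users")
```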


Schema Discovery

Schema discovery is a data engineering practice for finding and documenting the structure of data sources within a repository, such as a relational database.


Schema on Read

Schema-on-read approaches only apply a schema when a query accesses a table. Any required transformations happen at runtime.
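A small sketch of the pattern (the field names are invented): raw records land untyped, and the schema, here simple type coercion, is applied only at read time.

```python
import json

# Raw records land as-is -- no schema enforced on write.
raw_records = ['{"id": "1", "amount": "9.50"}', '{"id": "2", "amount": "3"}']

def read_with_schema(record):
    """Apply the schema at read time: parse and coerce field types."""
    doc = json.loads(record)
    return {"id": int(doc["id"]), "amount": float(doc["amount"])}

typed = [read_with_schema(r) for r in raw_records]
```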


Security Lake

A security lake is a centralized repository for large volumes of raw security data from multiple sources that simplifies security analytics and optimizes data infrastructure.


Semantic Layer

A semantic layer is an interface sitting between data consumers and enterprise data sources, abstracting the underlying data architecture.


Single Source of Truth (SSOT)

A single source of truth (SSOT) is a centralized location of master data for an organization’s decision-making processes. Theoretically, a data warehouse can serve as an organization’s SSOT.


Star Schema

Star schema is a popular architecture for organizing data in the context of data warehousing. It is characterized by a central fact table that is directly joined to a set of surrounding dimension tables.
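A minimal star-schema sketch in SQLite (the table and column names are invented): a central fact table joins directly to a dimension table, and analytical queries aggregate facts grouped by dimension attributes.

```python
import sqlite3

# Fact table (sales events) plus one dimension table (products).
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (sale_id INTEGER, product_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO fact_sales VALUES (10, 1, 5.0), (11, 1, 7.5), (12, 2, 2.0);
""")

# Typical star-schema query: aggregate facts by a dimension attribute.
revenue = dict(db.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY p.name
"""))
```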


Starburst

Starburst is the data company, not the candy company. Our data lakehouse platform combines the best of data lakes, data warehouses, and data virtualization.


Streaming Data

Streaming data is the continuous dataflow generated by transactional systems, activity logs, Internet of Things (IoT) devices, and other real-time data sources.


Structured Query Language (SQL)

Structured query language or SQL is a powerful language that plays a vital role in managing and analyzing data in relational databases.
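As a small, hypothetical example using Python’s built-in sqlite3 driver (the table and rows are invented): SQL declares what data to retrieve, and the database engine figures out how to retrieve it.

```python
import sqlite3

# Create a table, insert rows, and query them back with a filter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, population INTEGER)")
conn.executemany("INSERT INTO cities VALUES (?, ?)",
                 [("Oslo", 1), ("Tokyo", 37), ("Delhi", 32)])

# Declarative query: cities with population over 5 million, largest first.
big = [row[0] for row in conn.execute(
    "SELECT name FROM cities WHERE population > 5 ORDER BY population DESC")]
```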


Unstructured Data

Unstructured data is not conformed to any preset schema or format. Traditionally, unstructured data was rare, but this has evolved due to the rise of digital content such as text, images, audio, and video.

