
Data Glossary

ACID transactions are methods for ensuring database integrity.


AI data analytics is the application of artificial intelligence and machine learning technologies to traditional analytics.


Both data strategy and AI strategy are integral to an organization’s success in the modern technological landscape, and yet they serve distinct purpos... Learn More ›


Anti-money laundering consists of the regulations and practices used to prevent the abuse of the financial system in support of terrorism and other cr... Learn More ›


Apache Airflow is a widely adopted orchestration engine that allows you to schedule and run complex data pipelines. Airflow provides many plug-and-pla... Learn More ›


Apache Hadoop is an open-source framework for distributed storage and processing of large datasets.


Apache Hive is a data warehouse system built on top of Hadoop’s distributed storage architecture.


Apache Hudi (pronounced “hoodie”) is a transactional data lake platform first developed by Uber to bring data warehouse-like analytics capabilities to... Learn More ›


Apache Iceberg is an open-source table format that adds data warehouse-level capabilities to a traditional data lake.


Apache Impala is an SQL query engine for Hadoop-based data architectures.


The Apache Parquet file format is a way to bring columnar storage to Hadoop-based data lakes. Parquet supports efficient compression and encoding sche... Learn More ›


Apache Spark is an analytics engine built for processing massive datasets. Spark’s ability to process vast quantities of data within Apache’s big data... Learn More ›


Attribute-based access control (ABAC) is a method for dynamically applying access policies based on specific attributes of the user, the data or system... Learn More ›


Although businesses have always crunched the numbers, “business analytics” refers to a more rigorous approach that applies statistical analysis and ot... Learn More ›


Centralized data is the long-established practice of gathering all data the company generates into an enterprise database, a data warehouse, or, more... Learn More ›


Cloud data covers any data stored or processed on internet-accessible remote servers, whether company-owned or hosted by third-party cloud services.


Cloud data migration is the process that moves data from legacy systems to cloud platforms.


A cloud data warehouse is a cloud-based version of the traditional on-premises enterprise data warehouse. Given the large amounts of data businesses g... Learn More ›


A cloud-native approach to software development takes full advantage of the cloud’s scalability, elasticity, resiliency, and efficiency.


Cloud computing makes extensive use of object storage. This has many advantages, including cost, speed, and scalability.


Customer 360 is a strategic priority that requires the entire organization to create unified, end-to-end customer experiences. However, harnessing all... Learn More ›


Dark data is the dormant contents of a company’s data lakes and other repositories.


Data analytics is the process that converts raw data into actionable insights. In data-driven organizations, analytics increasingly relies on large da... Learn More ›


A data analytics architecture is a set of policies and standards that guides the organization as it builds analytical processes. More than technical o... Learn More ›


A data application (or data app) processes and analyzes big data to rapidly deliver insights or take autonomous action.


Data architecture is a framework that guides how to collect, store, manage, and use data in ways that support an organization’s business goals.


Data blending is the process of combining data sets from different data sources to generate actionable insights that answer specific business question... Learn More ›


Data catalogs are data source inventories. They collect metadata about the source’s various assets.


Data classification is a framework for organizing data in ways that improve data management, information security, and risk management.


Data complexity is an emergent property of enterprise data shaped by volume, velocity, variety, veracity, value, and vigilance — the V’s of big data.


Data compliance consists of the governance processes for meeting the requirements of internal, industry, and regulatory standards for data security an... Learn More ›


Data democratization is the goal for organizations and employees to quickly and securely access data so that they can analyze it and make data-driven... Learn More ›


Data discovery is a technique for gathering data, evaluating it for potential insights, and performing advanced analytics to create actionable insight... Learn More ›


Data engineering emerged as a specialization of software engineering in response to exploding data volumes.


Data exploration is an essential preliminary step to analyzing large datasets. Analysts use visualization and statistical methods to understand the qu... Learn More ›


A data fabric is a data management architecture that uses artificial intelligence and machine learning algorithms to automate data ingestion best prac... Learn More ›


Data federation involves the creation of a virtual database that maps an enterprise’s many different sources and makes them accessible through a singl... Learn More ›


Data governance is a concept within the discipline of data management that takes a holistic approach to an organization’s data and its lifecycle: data... Learn More ›


Data integration is a series of data management procedures for bringing datasets from different sources into data lakes, data warehouses, or other dat... Learn More ›


A data lake is a single store of data that can include structured data from relational databases, semi-structured data and unstructured data.


Data lake storage houses a wide variety of data types, including structured, semi-structured, and unstructured data. Each of these data types serves... Learn More ›


Combining data lakes and data warehouses, a data lakehouse is a centralized data repository that uses cost-effective data storage, usually in the clo... Learn More ›


Data lineage refers to the process and tools used to track the origin, movement, characteristics, and transformations of data as it flows through the... Learn More ›


A data mart is a repository of data curated to support the needs of a specific department, line of business, or business function.


Data mesh, a concept introduced by Zhamak Dehghani, refers to a decentralized, distributed approach to enterprise data management. It is a holistic c... Learn More ›


Data modernization is the process of moving data from the legacy systems of a fragmented, siloed infrastructure to an interconnected ecosystem of mode... Learn More ›


Data observability is the set of practices that help organizations understand data health and performance across the enterprise.


A data pipeline moves data from raw state to another location by executing a series of processing steps. This allows the data to be used by data consu... Learn More ›


A data platform is a technology stack or single solution for managing enterprise data. This system ingests and prepares data at scale for operational... Learn More ›


Data preparation is the process that turns raw data from disparate internal and external sources into usable datasets.


Data privacy comprises the rights of consumers to control when and how organizations may collect and use their personally identifiable information (PI... Learn More ›


Data products are curated collections of datasets and business-approved metadata designed to solve specific, targeted questions.


Data quality is the state of the data, reflected in its accuracy, completeness, reliability, relevance, and timeliness.


A data security strategy protects digital information from the consequences of human error, unauthorized access, and cyberattacks. These consequences... Learn More ›


Data sharing gives multiple users or applications simultaneous, consistent, and high-fidelity access to the same datasets.


Data silos are partially or wholly inaccessible data sets that result from a combination of technical and cultural forces. Proprietary databases and l... Learn More ›


Data sovereignty is a legal concept defining jurisdiction over data. Specifically, sovereignty establishes the principle that any data collected or st... Learn More ›


A data swamp is the inevitable outcome of a company’s misunderstanding of how data lakes work. Without a clear and well-supported big data strategy, l... Learn More ›


Data transformation is the process of converting and cleaning raw data from one data source to meet the requirements of its new location. Also called... Learn More ›


Data virtualization is a solution that creates intermediate layers between data consumers and disparate data source systems. These systems give consum... Learn More ›


A data warehouse is a central repository for structured enterprise data. These systems ingest raw data from various data sources through extract, tran... Learn More ›


A data warehouse architecture refers to how data gets loaded from source systems into data warehouses and how it is accessed by data consumers. In the... Learn More ›


A database is a large collection of data organized for rapid search and retrieval by a computer.


A database management system (DBMS) is software that manages a database and enables users to create, read, update, delete, and secure the data within it.


Decentralized data architectures decouple the operational plane — where and how data is stored — from the analytical plane — how the business uses dat... Learn More ›


Delta Lake is an open-source data platform architecture that addresses the weaknesses of data warehouses and data lakes in modern big data analytics... Learn More ›


Distributed data is a practice that stores data where it lives, empowering business analysis through a single point of access.


ETL pipelines are automated data migration techniques for the ingestion of data from various sources into a target system.


Fault tolerance is the degree to which failures in a subsystem do not cause the overall system to stop operating. In the context of enterprise analyti... Learn More ›


Hybrid cloud is an architecture that manages storage, networking, and compute resources across different environments. This structure may include on-p... Learn More ›


Hypothesis-driven development (HDD), also known as hypothesis-driven product development, is an approach used in software development and product mana... Learn More ›


According to the National Institute of Standards and Technology (NIST), incident response is the reaction to violations of computer security policies... Learn More ›


Massively parallel processing is an architecture for distributing workloads across hundreds or thousands of separate processors. Although parallel com... Learn More ›


A multi-cloud infrastructure uses cloud services from two or more vendors.


Object storage is an alternative to traditional file systems for storing large amounts of unstructured data in scalable, cost-efficient, and performan... Learn More ›


An open data lakehouse is a data analytics architecture that combines a data lake’s cost-effective storage with a data warehouse’s robust analytics.


An open data warehouse is an open-source alternative to monolithic, proprietary applications like Teradata or Snowflake.


An open file format is a specification for the way data gets written to storage.


Open table formats are designed to provide enhanced performance and compliance capabilities for data lakes using cloud-based object storage.


PostgreSQL is an open-source relational database management system (RDBMS) with a rich feature set, reliability, and performance that competes with a... Learn More ›


Presto SQL query engine (formerly PrestoDB) and Trino (formerly PrestoSQL) are both SQL query engines. They are both designed for high-performance SQL... Learn More ›


Query acceleration is a set of techniques for minimizing data processing workloads when analyzing a large amount of data.


A query engine takes a request for data, translates it from human to machine language, and then fulfills the request by retrieving specific data.


Risk management is the process of identifying, assessing, analyzing, prioritizing, mitigating, controlling, and monitoring potential exposures to busi... Learn More ›


Role-based access control is a system of fine-grained access privileges granted to authorized users to perform a defined set of tasks.


Schema discovery is a data engineering practice for finding and documenting the structure of data sources within a repository, such as a relational da... Learn More ›


A data lake is a centralized repository for large volumes of raw data from multiple sources that simplifies big data analytics and optimizes data infr... Learn More ›


SQL stands for structured query language. SQL is a powerful language that plays a vital role in managing and analyzing data in relational databases, m... Learn More ›


In the context of data warehousing, the star schema is a popular architecture for organizing data. It is characterized by a central fact table that is... Learn More ›


Starburst is the data company, not the candy company. Our data lakehouse platform combines the best of data lakes, data warehouses and data virtualiza... Learn More ›


Streaming data is the continuous dataflow generated by transactional systems, activity logs, Internet of Things (IoT) devices, and other real-time dat... Learn More ›


Trino is an open-source distributed SQL query engine built in Java, designed to run fast analytic queries against various data sources ranging in size... Learn More ›


Unstructured data does not conform to any preset schema or format. Traditionally, unstructured data was rare, but this has evolved due to the rise of... Learn More ›

