Apache Hive

Facebook created Hive in 2008 to address some limitations of working with the Hadoop Distributed File System. The framework provides an easier way to query large datasets using an SQL-like interface.

How does Apache Hive handle big data?

Apache Hive, built on top of the Hadoop framework, gives businesses the ability to distribute large computing jobs across a cluster using parallel processing. Together with the advent of cloud-based object storage, this put a technological revolution under way. But there was a problem: Hadoop was complex, especially for analytical tasks.

Hive abstracts MapReduce behind an SQL-like interface that masks its complexity. It does this through the Hive Query Language (HiveQL), which combines standard SQL syntax with Hive-specific enhancements and is much easier to learn than MapReduce.
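
For illustration, a HiveQL query reads much like standard SQL; the web_logs table and its columns below are hypothetical.

```sql
-- Count requests per HTTP status for a single day.
-- Hive compiles this statement into MapReduce work behind the scenes.
SELECT status, COUNT(*) AS request_count
FROM web_logs
WHERE request_date = '2023-01-15'
GROUP BY status
ORDER BY request_count DESC;
```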

The Hive runtime uses four components to execute HiveQL queries. A driver receives the HiveQL statements, starts execution, and collects any generated metadata and results. A compiler converts the HiveQL query into an execution plan consisting of the necessary MapReduce steps. An optimizer transforms the execution plan to improve performance and scalability. Finally, the execution engine runs the plan within Hadoop.
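
To inspect the plan the compiler and optimizer produce, Hive provides an EXPLAIN statement; the query below reuses the hypothetical web_logs table from the earlier example.

```sql
-- Show the stages Hive plans for this query without running it.
EXPLAIN
SELECT status, COUNT(*) AS request_count
FROM web_logs
GROUP BY status;
```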

Hive and data lakes

Hive enables Hadoop data lakes to be queried using a SQL-like interface. This allows data analysts familiar with SQL to easily query large datasets, even when the underlying data itself is not structured relationally.

To achieve this, the data remains in HDFS while Hive provides the query layer. The data can be further segmented using partitioning and bucketing where appropriate. In this way, very large datasets are divided into smaller batches that Hive and Hadoop can process in parallel.
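
A minimal sketch of partitioning and bucketing in HiveQL, assuming a hypothetical web_logs schema; appropriate partition keys and bucket counts depend on the data and the query patterns.

```sql
-- Partition by date so queries can skip irrelevant directories,
-- and bucket by user_id to split each partition into smaller files.
CREATE TABLE web_logs (
  user_id BIGINT,
  url     STRING,
  status  INT
)
PARTITIONED BY (request_date STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;
```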

Hive architecture

Users interact with Hive through a command line interface (CLI), which lets them submit queries and monitor job status. The Hive Thrift server lets networked clients interact with Hive through standard interfaces such as JDBC and ODBC.

Hive’s core architecture consists of three components: the Hive metastore, table formats, and file formats.

Metastore

The metastore (formally the Hive Metastore Service) lets Hive locate data stored in a Hadoop data warehouse. It stores table metadata, including each table’s schema, location, and partitions, in a relational database management system (RDBMS). The Hive runtime uses the metastore to map SQL tables to the files held in HDFS.
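
One way to see what the metastore records for a table is DESCRIBE FORMATTED, which returns the schema along with the table’s storage location and other properties (again using the hypothetical web_logs table).

```sql
-- Ask the metastore for the table's schema, storage format,
-- and the directory where its files live.
DESCRIBE FORMATTED web_logs;
```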

Hive’s metastore, designed for data warehousing, imposes limitations when scaling data processing on a data lake. Unlike more modern table formats such as Apache Iceberg or Delta Lake, the metastore describes a table’s metadata without providing any information about the individual files within the table.

Table format

Hive tables organize files in a hierarchical directory structure with folder names based on partition keys. Partitioning metadata gives queries useful information that lets them skip irrelevant files. Bucketing files within each partition provides further performance improvements.
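
As a sketch, partition values appear directly in directory names, and partitions can be listed or registered explicitly; the table name and paths here are hypothetical.

```sql
-- Each partition maps to a directory such as:
--   /warehouse/web_logs/request_date=2023-01-15/
SHOW PARTITIONS web_logs;

-- Register an existing directory of files as a new partition.
ALTER TABLE web_logs ADD PARTITION (request_date = '2023-01-16')
LOCATION '/warehouse/web_logs/request_date=2023-01-16';
```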

File format

Hive supports several file formats. In addition to columnar file formats like ORC and Parquet that streamline analytical queries, Hive also supports text formats like JSON and CSV and row-based file formats like Avro.
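
The file format is chosen per table in the DDL. Two hedged examples with hypothetical schemas:

```sql
-- Columnar format suited to analytical scans.
CREATE TABLE sales_orc (
  order_id BIGINT,
  amount   DOUBLE
)
STORED AS ORC;

-- Plain-text CSV, useful for raw ingested data.
CREATE TABLE sales_raw (
  order_id BIGINT,
  amount   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```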

Apache Hive at a glance:

Transaction support (ACID): Limited
File format: Parquet, ORC, Avro, and more
Schema evolution: Partial
Partition evolution: No
Data versioning: No
Time travel queries: No
Concurrency control: Pessimistic locking
Object store cost optimization: No
Community and ecosystem: Established

Apache Hive vs Apache Spark

Although Hive simplified query creation on Hadoop data warehouses, it did so by adding extra layers of processing. The multiple steps a user’s SQL query must go through to become executable MapReduce code take time and compute resources. This makes HiveQL more suitable for batch processes and ETL pipelines than for interactive exploration or ad hoc queries. In addition, HiveQL and MapReduce run queries from storage rather than in memory, which adds further delays.
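
For instance, a typical Hive workload looks more like the batch-style statement below than an interactive query; the tables (web_logs, daily_status_counts) and partition layout are hypothetical.

```sql
-- Rebuild one day's aggregates as a scheduled batch job.
INSERT OVERWRITE TABLE daily_status_counts PARTITION (request_date = '2023-01-15')
SELECT status, COUNT(*) AS request_count
FROM web_logs
WHERE request_date = '2023-01-15'
GROUP BY status;
```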

Rather than making MapReduce easier to use, Apache Spark replaces it altogether to avoid the penalties of HiveQL’s multi-step execution. Spark’s ability to process operations in memory rather than on disk can speed up some workloads a hundredfold. As a result, Spark has gained popularity in the enterprise for use cases like streaming data analysis and machine learning that involve enormous volumes of small processing tasks.

Just as Spark emerged to address the limitations of HiveQL and MapReduce, the development of the open-source Trino project was driven by frustrations with Hive’s slow query turnaround. Trino now offers an alternative means for querying HDFS data lakes.

What is the difference between Hadoop and Apache Hive?

Hadoop emerged in the mid-2000s, inspired by Google’s published work on MapReduce and the Google File System, as a framework for managing large datasets across a distributed storage system. Its core features consist of a storage element called the Hadoop Distributed File System (HDFS) and a processing element called MapReduce.

HDFS is a distributed file system that splits data into blocks and spreads them across multiple machines. Although it was initially developed to run on clusters of commodity hardware, enterprises can also pair Hadoop deployments with cloud object storage services like Amazon S3.

MapReduce is a two-stage programming model for processing massive amounts of data in parallel. The map stage divides the input data into small chunks for simultaneous processing. The reduce stage aggregates the values for common keys.

However, MapReduce is challenging to work with. Even though Hadoop’s MapReduce framework is written in Java, knowing the language is not enough to build MapReduce jobs. Few data scientists, much less business analysts, have the specialized skills MapReduce requires. That was fine when data teams could operate in relative isolation, but business culture changed, and the need for broader data accessibility turned MapReduce into a significant limitation.

Employees at Facebook tried to address these limitations by developing Hive, an interface that converts SQL-like commands into MapReduce code. Hive doesn’t replace Hadoop’s file system or MapReduce. Instead, it is a usability enhancement to make Hadoop data warehouses more accessible.
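
As a rough illustration, the classic word-count example can be written as a single HiveQL statement that Hive then compiles into map and reduce stages; the documents table and its line column are hypothetical.

```sql
-- Word count expressed in HiveQL instead of hand-written MapReduce code.
-- Map stage: split each line into words and emit (word, 1) pairs.
-- Reduce stage: sum the counts for each word key.
SELECT word, COUNT(*) AS occurrences
FROM (
  SELECT explode(split(line, ' ')) AS word
  FROM documents
) exploded
GROUP BY word;
```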

Related reading: Cloud object storage vs HDFS

3 Apache Hive challenges

Like the Hadoop system it is built upon, Hive has had a massive impact on the adoption of data lakes over the last decade, and it remains widely used in data lake solutions. Despite this popularity, Hive presents certain challenges.

#1 Complexity: Hive provides a large number of configuration options

Although these options are intended to extend user capabilities, navigating them is often confusing and complex for many users.

#2 Compared to some newer technologies, querying a data lake using Hive is slow

This slowness arises from the need to compile SQL queries into Java-based MapReduce jobs. Although HiveQL lets users write queries in a familiar language, translating that language into executable MapReduce code adds unavoidable processing overhead to every Hive query. This makes Hive particularly unsuitable for ad hoc queries, which many organizations rely on.

#3 Hive table format and processing overhead

The Hive table format does not allow query engines to read specific files without also listing all of the other files in the same folder. The same is true when querying object stores: although you can query specific objects using Hive, finding them requires listing all of the objects in the store. This limitation, inherited from Hadoop, creates significant processing overhead because listing a large object store can take a long time.

Related reading: Hive vs Iceberg: How to migrate your Hive tables to Apache Iceberg

Starburst presents an alternative approach to HiveQL

Trino is a massively parallel SQL query engine that accelerates big data analytics at a fraction of the cost of traditional approaches. ANSI SQL compliant, Trino makes data accessible to anyone with SQL skills, as well as to users of traditional business intelligence tools like Tableau.

Starburst’s data lakehouse analytics platform is based on Trino and integrates with Hadoop data lakes through Trino’s Hive connector. This integration replaces the Hive runtime with Trino, which uses the Hive metastore to find and access stored files.

This simplifies the transition from Hive to Trino since data engineering teams do not need to manage separate connectors to S3 or GCS storage services. Moreover, Trino lets data consumers run efficient, performant SQL queries against their company’s Hadoop ecosystem.

Trino delivers further benefits thanks to connectors that unify multiple data sources behind a single interface. Users can run federated queries that are not limited to the Hadoop data lake, returning more robust results and enabling deeper business insights.
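
As a hedged sketch, a federated Trino query might join lake data exposed through the Hive connector with an operational database; the catalog, schema, and table names below (hive.sales.orders, postgresql.crm.customers) are hypothetical and depend on how the catalogs are configured.

```sql
-- Join data in the Hadoop data lake (via the Hive connector)
-- with a table in an operational PostgreSQL database.
SELECT c.region, SUM(o.amount) AS total_sales
FROM hive.sales.orders AS o
JOIN postgresql.crm.customers AS c
  ON o.customer_id = c.customer_id
GROUP BY c.region;
```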

How to use Starburst

Combining Starburst with Hive lets companies leverage capabilities beyond Trino’s core features, including:

Starburst connectors: In addition to Trino’s open-source connectors, Starburst-exclusive connectors let companies connect more data sources and tap into enhanced performance and security features.

Storage federation: With over forty enterprise connectors, Starburst seamlessly unifies data sources beyond the Hive data warehouse into a single federated resource.

Integrations: Starburst integrates with existing technology stacks. For example, rather than using our internal role-based and attribute-based access controls, companies can connect Starburst Enterprise with Ranger to manage and enforce access control policies.

Streamlined ETL workloads: Starburst’s API lets data teams develop more performant, efficient, and reliable ETL pipelines for batch processing and near real-time data products.

Advanced data analysis: Starburst queries can handle large datasets with different data types. Data science increasingly depends on combining structured data and unstructured data to generate novel insights, and Starburst lets data scientists do that directly.