Advanced Data Management: Trino, Hadoop, and AWS for a Robust Lakehouse
Cindy Ng
Sr. Manager, Content
Starburst
Apache Hadoop revolutionized enterprise data management by offering an open-source alternative to expensive proprietary data systems. Companies could process massive datasets using the commodity hardware in their existing data centers. In the subsequent two decades, an ecosystem of open source frameworks extended and enhanced Hadoop for other big data applications.
But Hadoop has not aged well. Its core assumptions throttle performance, undermine data democracy, and pressure data management budgets. Even worse, Hadoop’s architecture doesn’t fit with the cloud-based, real-time nature of modern enterprise data.
This post will explore why companies are migrating away from Hadoop data systems in favor of a modern data lakehouse architecture running on Amazon Web Services (AWS).
Why modernize your Hadoop environment?
Hadoop’s limitations, from performance bottlenecks to architectural complexity to user inaccessibility, increase friction within data-driven business cultures. Analyzing a company’s vast data stores takes more time and effort, delaying the insights decision-makers need to move the business forward. Although not without its challenges, modernizing from Hadoop to a data lakehouse positions companies for success.
Benefits of transitioning from Hadoop to a data lakehouse
Hadoop’s weaknesses are inherent to its design. The Hadoop Distributed File System (HDFS) poorly handles the high volume of small files generated by today’s streaming sources. MapReduce’s Java programming model is so cumbersome that the open-source community developed Hive to give it a SQL-like skin. Hive layered even more latency on top of MapReduce, so the community developed alternatives like Apache Spark that replaced MapReduce entirely.
Migrating to a data lakehouse architecture lets companies benefit from a cloud-native data architecture that is more scalable, cost-effective, and performant. Lakehouses store data on cloud object storage services like Amazon S3 and Microsoft Azure Blob Storage, which scale dynamically with demand and offer high durability. Modern file and table formats provide the metadata that distributed query engines need to return results quickly and economically.
Challenges and considerations when modernizing Hadoop
Most companies built their Hadoop-based systems organically, adding other open-source frameworks to the core platform to create a unique data architecture. As a result, there’s no universal migration solution. This kind of transformation requires careful planning to avoid the pitfalls of such a large undertaking.
Hierarchical to flat, rigid to flexible
Object storage’s flat topology is a fundamentally different way to store data than HDFS’ hierarchical directory structure. The legacy system’s schema-on-write paradigm makes Hadoop changes challenging. On the other hand, lakehouses use schema-on-read to make data useful for unanticipated applications. Data teams must consider how moving to the new paradigm will change policies, processes, and use cases.
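To make the schema-on-read shift concrete, here is a minimal PySpark sketch: the schema is inferred when the data is read, not enforced when it is written. The bucket, path, and `event_type` field are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Raw JSON events land in object storage with no schema declared up front.
events = spark.read.json("s3://example-lake/raw/clickstream/2024/")

# The schema is derived at read time, so newly added fields simply appear.
events.printSchema()

events.createOrReplaceTempView("clickstream")
spark.sql(
    "SELECT count(*) AS purchases FROM clickstream WHERE event_type = 'purchase'"
).show()
```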
Modeling the migration
Enterprise data is complex, with hidden dependencies that could easily break during a migration. Engineers must model the move thoroughly to preserve data lineage and quality going into the lakehouse.
Business continuity
Downtime during the migration could significantly impact the business by disrupting inventory systems or corrupting dashboards. To minimize these risks, migration teams must phase in the migration as much as possible.
AWS Services for Building a Data Lakehouse
As the market leader in cloud computing, AWS is a safe choice for building a data lakehouse. Here are some AWS services to consider.
Leveraging Amazon S3 as a Data Lake
Amazon S3 offers affordable, globally available object storage at scale. Companies can use S3 to design resilient architectures that accommodate the fragmented nature of international data privacy regulation. Data teams can systematically manage their storage costs thanks to S3’s multiple storage classes, from high-performance, low-latency tiers to low-cost archival tiers.
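One common way to manage those tiers is an S3 lifecycle policy. The sketch below uses boto3 to archive a raw zone after 90 days and expire a temp prefix; the bucket name and prefixes are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Transition raw data to an archival tier after 90 days and expire temp files after 7.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-lakehouse-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-zone",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            },
            {
                "ID": "expire-tmp",
                "Status": "Enabled",
                "Filter": {"Prefix": "tmp/"},
                "Expiration": {"Days": 7},
            },
        ]
    },
)
```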
AWS Glue and Apache Spark for ETL processes
AWS Glue is an essential service for ingesting data from legacy Hadoop systems into a data lakehouse running on S3. Its Data Catalog centralizes metadata and shares it across the AWS services in a company’s account. AWS Glue also lets engineers build Spark-based ETL pipelines in Python.
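A minimal Glue job script might look like the following, assuming a catalog database and table (here called `legacy_hadoop.orders`) have already been crawled from the legacy system; the S3 path is hypothetical.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table registered in the Glue Data Catalog (e.g., crawled from the legacy Hadoop export).
orders = glue_context.create_dynamic_frame.from_catalog(
    database="legacy_hadoop", table_name="orders"
)

# Write the data into the S3 lakehouse zone as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={"path": "s3://example-lakehouse-bucket/curated/orders/"},
    format="parquet",
)

job.commit()
```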
Amazon EMR and managed Hadoop services
Moving an on-premises Hadoop infrastructure into Amazon EMR (formerly Elastic MapReduce) is one way companies reduce the risk of cloud migrations. EMR is a cloud-based managed Hadoop service that lets companies more reliably integrate legacy systems with their AWS cloud infrastructure. EMR clusters run on Amazon EC2 instances and can store data as S3 objects rather than on cluster-local HDFS. EMR provides an easier path for cloud migrations but retains most of Hadoop’s limitations.
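For illustration, a transient EMR cluster can be launched with boto3 to run a single Spark step and then terminate; the cluster name, release label, roles, and S3 paths below are placeholders.

```python
import boto3

emr = boto3.client("emr")

# Launch a transient cluster that reads from and writes to S3 instead of long-lived HDFS.
response = emr.run_job_flow(
    Name="nightly-spark-etl",
    ReleaseLabel="emr-7.1.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    LogUri="s3://example-lakehouse-bucket/emr-logs/",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate once the steps finish
    },
    Steps=[
        {
            "Name": "spark-etl-step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://example-lakehouse-bucket/jobs/etl.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```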
Data Storage and Management
Lakehouses streamline the storage and management of large amounts of data. With modern file and table formats, lakehouses are more suitable for the semi-structured and unstructured datasets generated by streaming sources.
File Formats: Parquet, Avro, and ORC
Modern file formats, such as the row-based Avro and the columnar formats Parquet and ORC, include big data management features. For example, these formats lay data out on disk to optimize read performance and apply compression algorithms to reduce storage consumption.
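Writing to these formats is a one-line change in most engines. The PySpark sketch below converts a hypothetical CSV export into compressed Parquet and ORC; the S3 paths are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-format-demo").getOrCreate()

# Hypothetical raw CSV export from a legacy system.
df = spark.read.option("header", True).csv("s3://example-lake/raw/orders_csv/")

# Columnar Parquet with compression: smaller files, faster scans of selected columns.
df.write.mode("overwrite").parquet(
    "s3://example-lake/curated/orders_parquet/", compression="snappy"
)

# ORC is a comparable columnar alternative.
df.write.mode("overwrite").orc(
    "s3://example-lake/curated/orders_orc/", compression="zlib"
)
```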
Table formats: Iceberg and Delta Lake on AWS
Modern table formats like Iceberg and Delta Lake further accelerate query performance by exposing table-level metadata and statistics that let query engines prune partitions and files without scanning the underlying data. These table formats also support ACID-compliant transactions for applications that require guaranteed data integrity.
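Here is a brief Iceberg sketch in Spark SQL, assuming a Spark session configured with an Iceberg catalog named `lake` and an existing `order_updates` table; both names are illustrative. The MERGE is transactional: it either commits fully or leaves the table unchanged.

```python
from pyspark.sql import SparkSession

# Assumes spark.sql.catalog.lake is configured as an Iceberg catalog with an S3 warehouse.
spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.sales.orders (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DECIMAL(10, 2),
        order_ts    TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(order_ts))
""")

# ACID upsert from a staging table of changed rows.
spark.sql("""
    MERGE INTO lake.sales.orders AS t
    USING lake.sales.order_updates AS s
        ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```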
Data Processing and Query Engines: Apache Spark, EMR, Athena, Trino
Hadoop MapReduce’s complexity and performance issues quickly led the user community to develop alternatives. Hive patched MapReduce to improve accessibility at the expense of latency.
Apache Spark
The open-source community bypassed MapReduce entirely to create Apache Spark, an analytics engine optimized for data science and machine learning workloads. Spark addresses the source of Hadoop’s performance issues by keeping data processing in memory rather than writing intermediate results to disk between stages, as MapReduce does. Spark is also more accessible than either MapReduce or Hive because Spark SQL aligns more closely with the ANSI standard.
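A short example of both points, with a hypothetical Parquet dataset: the table is cached in memory so repeated queries avoid re-reading object storage, and the query itself is plain ANSI-style SQL.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")  # stricter ANSI SQL semantics

# Cache the dataset in memory so repeated queries skip the round trip to S3.
orders = spark.read.parquet("s3://example-lake/curated/orders_parquet/").cache()
orders.createOrReplaceTempView("orders")

spark.sql("""
    SELECT customer_id, sum(amount) AS total_spend
    FROM orders
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""").show()
```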
EMR + Athena + Glue
Companies running Amazon EMR’s version of Hadoop can manage data using a combination of AWS Glue and Amazon Athena. Glue lets companies build ETL pipelines for Amazon EMR repositories. Athena integrates with the Glue Data Catalog to support interactive queries using standard SQL.
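An Athena query can be submitted programmatically against a Glue-cataloged table. The database name, table, and result bucket below are placeholders.

```python
import time

import boto3

athena = boto3.client("athena")

# Run an interactive query against a table registered in the Glue Data Catalog.
execution = athena.start_query_execution(
    QueryString="SELECT customer_id, sum(amount) AS total FROM orders GROUP BY customer_id",
    QueryExecutionContext={"Database": "curated_lakehouse"},
    ResultConfiguration={"OutputLocation": "s3://example-lakehouse-bucket/athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```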
Trino and Starburst
When Facebook needed a better way to run interactive queries on the company’s Hadoop data warehouse, a team of developers created what would become the Trino open-source project. This massively distributed SQL query engine goes beyond running interactive queries on Hadoop by federating over fifty enterprise data sources.
Trino unifies data architectures — streaming sources, transactional databases, legacy Hadoop repositories, and more — within a virtual access layer that leaves data at the source.
The enterprise version of Trino, Starburst, extends the open-source project with performance and cost optimizations to accelerate time to insight.
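Federation in practice looks like a single SQL statement spanning catalogs. This sketch uses the Trino Python client; the hostname, user, and the `hive` and `postgresql` catalog and schema names are assumptions about how a given cluster is configured, and a real deployment would also require authentication.

```python
import trino  # pip install trino

# Placeholder connection details for a Starburst/Trino cluster.
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=443,
    user="analyst",
    http_scheme="https",
)
cur = conn.cursor()

# One query joins a legacy Hadoop/Hive table with a Postgres table, leaving data at the source.
cur.execute("""
    SELECT c.region, sum(o.amount) AS revenue
    FROM hive.legacy.orders AS o
    JOIN postgresql.crm.customers AS c
        ON o.customer_id = c.customer_id
    GROUP BY c.region
    ORDER BY revenue DESC
""")
for region, revenue in cur.fetchall():
    print(region, revenue)
```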
Data Governance and Security
Gravity is Starburst’s universal search, discovery, and governance service. While projects like Apache Ranger address security limitations within the Hadoop ecosystem, Gravity centralizes security across all federated data sources.
Gravity’s role-based and attribute-based rules let administrators create fine-grained permissions down to the row and column levels. Companies can use Starburst to determine who may access data products and UI screens or who may see raw vs. aggregated data.
Starburst’s APIs integrate with existing security stacks, from commercial identity and access management (IAM) providers like Okta and Ping to open-source projects like Apache Ranger and its commercial implementation, Privacera. Immuta integration creates an invisible runtime enforcement layer that simplifies governance and speeds up authorized data access.
Hadoop case studies
More than two hundred organizations use Starburst to accelerate their data-driven decision-making cultures. Many turned to Starburst to solve the data management challenges of Hadoop infrastructures.
Improving regulatory compliance
A global investment bank turned to Starburst to modernize its Hadoop stack. Anti-money laundering (AML) regulations require the company to monitor hundreds of millions of daily transactions. At the same time, the bank operates within a global web of data sovereignty and privacy rules that limit data access and storage.
Starburst’s federated access leaves data where it lives while providing the granular controls the bank needs to enforce international data rules. With quick SQL access to data, the bank’s data scientists now develop machine learning models that support near real-time monitoring for AML violations.
Learn more about the bank’s cost and efficiency improvements.
Improving pipeline and query performance
Israel’s Bank Hapoalim was migrating a legacy system to a Hadoop-based data lake, and its Big Data team wanted a more performant alternative to Hive. Using Starburst allowed the bank to migrate data transparently into the data lake.
Starburst also promoted data-driven decision-making within Bank Hapoalim. Trino’s ANSI-standard SQL allowed a broader range of users to perform ad hoc and interactive queries, as did Starburst’s integration with SAS and Qlik.
Learn more about how Starburst accelerates Bank Hapoalim’s time-to-insight.
Modernizing Hadoop into a federated Icehouse
Starburst’s latest ebook, Modernize Hadoop to Icehouse, explains how Starburst streamlines the journey from legacy Hadoop data processing to modern Icehouse analytics.