The Art of Abstraction: the continuing separation of compute and storage for data analytics
We recently invited 451 Research VP Matt Aslett to share his thoughts and observations on the practice of separating the storage and computation of analytic data. That separation is fundamental to Presto, and it is what allows Presto users to scale their analytic platforms economically. Here’s what Matt had to say about the practice as a trend…
One of the key trends in the data processing and analytics sector in recent years has been the ongoing separation of compute and storage.
Traditionally, business intelligence and analytic projects have been based on the concept of the enterprise data warehouse, which combined compute and storage in a monolithic platform to deliver the performance that analytic workloads required. More recently, the trend has been toward data lakes, but these were similarly based on a monolithic approach of putting all data in a single environment – initially Hadoop – for storage and analysis.
As a growing volume of data originates in the cloud, however, enterprises are increasingly taking an alternative approach of leaving the data where it resides – which increasingly means cloud storage services such as AWS’s Amazon S3, Microsoft’s Azure Data Lake Store or Google Cloud Storage – and spinning up a separate layer of compute engine products and services to analyze the data as and when required.
This approach of separating compute and storage was pioneered by Google with its Dremel research project, which inspired the emergence of distributed SQL query engines that enabled enterprises to bring their existing SQL analysis skills and tools to data stored in Hadoop and cloud storage.
This separation of compute and storage enables what might be considered an abstracted data lake architecture. It is based on a storage layer that makes use of multiple storage engines – including HDFS itself, but also Hadoop Compatible File System (HCFS) implementations such as Amazon S3 or Azure Data Lake Store, as well as relational and non-relational databases – and a separate compute layer that takes advantage of multiple distributed SQL query engines (such as Hive, Presto, Spark, Flink or Impala).
Importantly, independent compute engines such as Presto, originally created at Facebook and commercially supported by Starburst Data, can be used to query data not only in Hadoop and object storage, but also in relational and non-relational databases (including NoSQL and Kafka). This provides greater flexibility by enabling federated querying of multiple data sources, in multiple locations, at the same time.
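To make the idea of federated querying concrete, here is a minimal sketch in Python. It is not Presto's implementation or API: the `Connector` protocol, `InMemoryConnector` and `federated_join` are hypothetical names invented for illustration, and the two in-memory sources merely stand in for an object store and a relational database. The point is only that a compute layer can join data from independent sources without first copying it into one system.

```python
from typing import Dict, Iterable, List, Protocol

Row = Dict[str, object]


class Connector(Protocol):
    """Hypothetical interface: anything that can stream the rows of a table."""

    def scan(self, table: str) -> Iterable[Row]: ...


class InMemoryConnector:
    """Illustrative stand-in for a real source (object store, database, ...)."""

    def __init__(self, tables: Dict[str, List[Row]]) -> None:
        self._tables = tables

    def scan(self, table: str) -> Iterable[Row]:
        return iter(self._tables[table])


def federated_join(left: Connector, left_table: str,
                   right: Connector, right_table: str,
                   key: str) -> List[Row]:
    """Hash-join two tables that live in two different sources."""
    # Build an in-memory index over the right-hand source.
    index: Dict[object, List[Row]] = {}
    for row in right.scan(right_table):
        index.setdefault(row[key], []).append(row)

    # Stream the left-hand source and emit one merged row per match.
    joined: List[Row] = []
    for row in left.scan(left_table):
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Assumed sample data: click events in "object storage", users in a "database".
object_store = InMemoryConnector({"clicks": [
    {"user_id": 1, "page": "/home"},
    {"user_id": 2, "page": "/pricing"},
    {"user_id": 1, "page": "/docs"},
]})
database = InMemoryConnector({"users": [
    {"user_id": 1, "name": "Ada"},
    {"user_id": 2, "name": "Grace"},
]})

result = federated_join(object_store, "clicks", database, "users", "user_id")
for row in result:
    print(row)
```

In this sketch the join logic knows nothing about where each table is stored. That is the property that lets a single engine sit above many storage systems.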
In addition to Google, other companies including Facebook, Netflix and Airbnb are enjoying the benefits of this separation of compute and storage. These benefits include the ability to scale compute and storage resources independently, which enables more cost-effective use of resources, as well as the ability to share data for multiple analytics use cases (high- and low-latency, as well as long-running and transient), using a choice of multiple compute engines, while avoiding data duplication and competition for compute resources.
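The cost argument for independent scaling can be illustrated with a toy sizing model. Everything in the sketch below is an assumption made for illustration: the per-node capacities, the `Workload` type and the example workload are invented, not figures from 451 Research or any vendor. It shows only that when storage and compute are bundled per node, a storage-heavy workload forces the purchase of compute that then sits idle.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Workload:
    storage_tb: float     # data that must be retained
    compute_units: float  # peak query-processing capacity required


# Hypothetical capacity of a single node (illustrative numbers only).
NODE_STORAGE_TB = 10.0
NODE_COMPUTE_UNITS = 4.0


def coupled_nodes(w: Workload) -> int:
    """Nodes needed when every node bundles storage and compute together."""
    by_storage = math.ceil(w.storage_tb / NODE_STORAGE_TB)
    by_compute = math.ceil(w.compute_units / NODE_COMPUTE_UNITS)
    # The larger of the two requirements dictates the cluster size.
    return max(by_storage, by_compute)


def idle_compute_units(w: Workload) -> float:
    """Compute capacity bought in the coupled design but not needed."""
    return coupled_nodes(w) * NODE_COMPUTE_UNITS - w.compute_units


def decoupled_resources(w: Workload) -> Tuple[float, int]:
    """Storage (TB) and compute nodes when each layer is sized on its own."""
    compute_nodes = math.ceil(w.compute_units / NODE_COMPUTE_UNITS)
    return w.storage_tb, compute_nodes


# Assumed example: a large archive that is queried only lightly.
archive = Workload(storage_tb=500.0, compute_units=8.0)

print(coupled_nodes(archive))        # 50 nodes, driven entirely by storage
print(idle_compute_units(archive))   # 192.0 compute units left unused
print(decoupled_resources(archive))  # (500.0, 2): 500 TB plus 2 compute nodes
```

Under these assumed numbers the coupled cluster needs 50 nodes to hold the data even though two nodes' worth of compute would serve the queries. Sizing each layer separately removes that mismatch.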
Significantly, this separation of compute and storage is not limited to cloud services but is also supported in on-premises environments using object storage. This is already happening thanks to the addition of HDFS and Amazon S3 compatibility to on-premises open source object stores such as Ceph, as well as commercial offerings from the likes of IBM and Dell EMC. It is likely to become a default option thanks to the development of the S3-compatible Apache Hadoop Ozone object file system, which is scheduled to become a core Hadoop project as of Hadoop 3.2.
Apache Hadoop Ozone – along with support for the containerization of Hadoop data processing and analytics compute services, as well as the orchestration of these services via Kubernetes – is effectively an attempt to replicate on-premises the architectural benefits of the separation of compute and storage that is available in the cloud.
Given the flexibility and cost advantages of separating compute and storage, 451 Research believes that this abstracted data architecture is the logical continuation of the core technology trends we have seen in recent years. As such, 451 Research predicts accelerated adoption of abstracted data architecture supported by the separation of compute and storage, with enterprises increasingly looking to take advantage of abstracted data discovery, processing and analytics services to generate value from investments in both on-premises and cloud storage, as well as relational and NoSQL data sources.
About the Author
Matt Aslett is a Research Vice President with responsibility for 451 Research’s Data, AI and Analytics Channel – including operational and analytic databases, Hadoop, grid/cache, stream processing, data integration, data governance, and data management, as well as data science and analytics, machine learning and AI. Matt’s own primary area of focus currently includes distributed data management, data catalogs, business intelligence and analytics, data science management, and enterprise knowledge graphs.
Matt is a regular speaker at client and industry events, and has delivered keynotes and moderated panels at Strata + Hadoop World, DataWorks Summit, Google Cloud Next, Big Data LDN, Percona Live, GraphConnect, In-Memory Computing Summit and NoSQL Roadshow. Matt has been named by AnalyticsWeek as being among the top 200 Thought Leaders in the field of Big Data and Analytics.
Prior to joining 451 Research in 2007, Matt was deputy editor of monthly magazine Computer Business Review and ComputerWire’s daily news service. Matt started in IT journalism in 1997 as a staff writer with monthly magazine Unix and NT News, where he became editor in 2000.
Matt holds a BA in multimedia journalism from Bournemouth University in the UK.