Query Federation Made Simple at Comcast
The media and telecommunications provider now known as Comcast began as a regional operator with just five channels and 12,000 customers. Today, Comcast has grown to become the largest home Internet service provider in the U.S., the parent company of NBC Universal, and more. Over time, that growth led to increasingly complex and varied data storage infrastructure across the company in which data was siloed on different platforms.
At Datanova 2021, we were lucky enough to host a session featuring Bryan Aller, a director of engineering at Comcast, and his colleague Pavan Madhineni, a principal engineer on Aller’s team. Aller and Madhineni detailed how their team developed a powerful hybrid analytics platform that lets Comcast extract insights from data no matter where it resides and prepares the company for a more cloud-centric future. This post offers an overview of their session, but we encourage you to watch their talk and the associated technical deep dive.
There Has to be a Better Way
Aller began by talking about how data within large organizations ends up distributed across a variety of platforms, from traditional relational databases like Oracle, MSSQL, and MySQL to data warehouses, NoSQL databases such as Cassandra and MongoDB, and data lakes. At Comcast, more and more people started looking to do ad hoc analytics across many or all of these data stores. The traditional approach to solving this problem would have been to set up a large ETL job to get the desired data onto one platform. But this approach doesn’t scale.
So, in 2016, Aller emailed his VP of Big Data Engineering and suggested that the company needed a better way to address these issues. Ultimately, they wanted to figure out how to get multiple tools, platforms, and data storage systems to speak the same language. And they needed to find a way to do so while maintaining enterprise access policies, identity management, security standards, strong performance, and more.
On-Prem & Cloud Query Engines
The team built its first on-prem query engine using Trino (formerly known as PrestoSQL) to give its end users the ability to read data in Teradata and Hadoop without having to learn different query languages or request an ETL project. Three main advantages emerged:
- The end user experience improved, as it was easier to query distributed data
- Comcast avoided the data duplication problem that can result from ETL
- The company’s storage costs were lower because it didn’t need to move data
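To make the federation idea concrete, here is a minimal sketch of what a cross-source query can look like. Trino addresses tables by fully qualified `catalog.schema.table` names, so one SQL statement can join tables from two different systems. The catalog, schema, table, and column names below (`teradata.billing.accounts`, `hive.clickstream.events`, `account_id`) are hypothetical and are not taken from the Comcast talk. The helper only builds the SQL text; it does not connect to a cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRef:
    """A fully qualified Trino table name: catalog.schema.table."""
    catalog: str
    schema: str
    table: str

    def qualified(self) -> str:
        return f"{self.catalog}.{self.schema}.{self.table}"


def federated_join_sql(left: TableRef, right: TableRef, key: str) -> str:
    """Build one SQL statement joining tables that live in two catalogs."""
    return (
        f"SELECT l.{key}, count(*) AS events\n"
        f"FROM {left.qualified()} AS l\n"
        f"JOIN {right.qualified()} AS r ON l.{key} = r.{key}\n"
        f"GROUP BY l.{key}"
    )


# Hypothetical names, for illustration only.
accounts = TableRef("teradata", "billing", "accounts")
events = TableRef("hive", "clickstream", "events")
sql = federated_join_sql(accounts, events, "account_id")
print(sql)
```

The point of the sketch is that the end user writes a single statement in one dialect, and the query engine takes care of reading from each source in place, with no intermediate copy.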
The project was so successful internally that more teams learned about what Aller and his group had done – and wanted their own data sources to be included. Suddenly Aller and his team had to find a way to provide the same enterprise-class access to an even larger variety of data sources.
Comcast expanded its on-prem query engine so it could connect to all of these new data sources, and the new end users had the same experience as the initial ones. They didn’t need to learn a new query language or copy any data. They simply applied their SQL skills to the new data sources while Trino did the work in the background. By 2018, Comcast had also deployed a cloud query engine on EC2, further expanding access for its users, all without changing the basic experience.
Trino-to-Trino Connector & Starburst Data
This put the firm in a unique situation, as it now had both a cloud query engine and an on-prem query engine. As we all know, however, the Trino community is nothing if not resourceful, so Aller, Madhineni, and their team built a hybrid platform with a JDBC-based Trino-to-Trino connector at its core. This way, an instance in the cloud can talk to an instance on-prem, and vice versa. Identity, enterprise security, and access control are maintained throughout, and end users can access datasets without knowing whether the data resides on-prem or in the cloud. When users log in, all they see are the catalogs available to them.
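The behavior described above can be sketched as a toy model. This is not Comcast's implementation or the connector's actual API. It only illustrates two ideas from the talk: users see just the catalogs their groups are entitled to, and whether a catalog is served locally or by the peer cluster is decided behind the scenes. The `Catalog` class, the group names, and the `visible_catalogs` and `route` helpers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Catalog:
    name: str
    location: str              # "on-prem" or "cloud"; never shown to end users
    allowed_groups: frozenset  # groups entitled to query this catalog


def visible_catalogs(catalogs, user_groups):
    """Return the sorted names of catalogs the user may query, hiding location."""
    groups = set(user_groups)
    return sorted(c.name for c in catalogs if c.allowed_groups & groups)


def route(catalogs, catalog_name, local_location):
    """Report whether a catalog is served by this cluster or by the peer cluster."""
    by_name = {c.name: c for c in catalogs}
    target = by_name[catalog_name]
    return "local" if target.location == local_location else "remote"


# Hypothetical catalog registry, for illustration only.
CATALOGS = [
    Catalog("teradata", "on-prem", frozenset({"finance"})),
    Catalog("hive", "on-prem", frozenset({"finance", "analytics"})),
    Catalog("s3_lake", "cloud", frozenset({"analytics"})),
]

print(visible_catalogs(CATALOGS, ["analytics"]))  # ['hive', 's3_lake']
print(route(CATALOGS, "s3_lake", "on-prem"))      # remote
```

In this model an analyst connected to the on-prem cluster sees `hive` and `s3_lake` side by side. A query against `s3_lake` would be forwarded to the cloud cluster, but nothing in the catalog list reveals that.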
Once again, though, word of their ingenuity got around. More users started adopting the platform. Naturally, that meant more teams wanted to access more datasets, too, including Delta Lake, Snowflake, Elastic, and others. That’s when Comcast decided to partner with Starburst. With Starburst, Comcast strengthened its hybrid analytics platform by adding:
- Enterprise-ready, high-performance connectors to more data sources, including Delta Lake, Snowflake, Elastic, Teradata, and others
- 24/7/365 support from the Trino experts at Starburst Data
- A level of scalability that helps Comcast pull 250-300 TB daily into the platform
Today, this hybrid query fabric platform, which includes both Trino and Starburst implementations, provides hundreds of Comcast users with access to thousands of datasets. Yet there is no data duplication or ETL. The data resides in its source system – a kind of data mesh approach. The end storage platforms have been abstracted away. Users can access everything through standard SQL.
Plus, as Aller put it in his talk: “This doesn’t just work. It’s fast.”
Here are a few of the highlights of the new hybrid analytics platform:
- Identity is managed in a single, consistent way
- Data access policies are uniformly managed
- The platform is networked to look and feel like a single, seamless solution
- Hadoop jobs run 10 to 20 times faster
- Cross-platform jobs that used to take hours now complete in minutes
- Similar jobs that took minutes now need only a few seconds
- The platform performs at scale – these are queries involving millions of rows
- Analysts and data scientists don’t have to wait months for ETL jobs – they can explore and generate results in real time.
- The ramp speed for the general user is faster, since they’re using SQL, not learning a new platform-specific language.
Much of the underlying data storage architecture remains, but the complexity has been abstracted away. Comcast now has one universal solution that spans all of these platforms and systems and can evolve and adapt to new use cases, such as the increasing use of cloud. It’s a fascinating story, and we’re happy that our technology continues to play a critical role. Given the ingenuity they’ve demonstrated these past few years, we’re also excited to see what Aller and his Comcast team accomplish next.
Hybrid Cloud: Ground to Cloud & Cloud to Cloud at Comcast