Accelerate AI with a data lake analytics platform

As AI has advanced and teams of data scientists have become more widespread, certain aspects of data science and AI have started to become commoditized.

It isn’t as difficult to find algorithms or technologies to do what you want as it was a few years ago, and there has been an explosion of tools to assist in the field. For example, open source languages like R and Python, with their available packages and vibrant communities, along with an increasing number of vendor technologies, are all aimed at improving the efficiency of expensive data science and AI resources.

Don’t get me wrong, we are a long way from mastering the discipline. But how often are the answers from AI used effectively? The challenges with AI today are about scaling applications and making the output consumable. AI and data science processes are complicated, but the answers we get from them should not be.

Data products for your AI and data science teams

Project-specific demands and the differing needs of departments mean that data engineers and architects often have to support a variety of user tools. That can be challenging because the output of AI has been isolated to specific groups.

That brings me to data products. The trend I see is AI engines and teams building reusable data products. Both the output and the data used to train models are shared between groups. That sharing happens in the form of a data product. “Data products are curated datasets packaged to create value for downstream consumers”. For an excellent introduction to data products, see Vishal and Monica’s blog post.

Data products logically span systems, be they legacy or modern. Ask the questions: What data do my consumers need for their projects? What reusable data sets can I create that can feed my models and business domains? Then work backward and create governed data products that meet those needs.

Data products should be created in a consumption layer, or abstraction layer, as opposed to directly in a data source, client or modelling tool. The reasons are flexibility, reusability and security. You want to pull aggregation rules, such as materialized and logical views, out of data sources and client tools and put them in the middle consumption layer. That allows you to use those rules reproducibly across AI solutions and data sources.
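To make the idea concrete, here is a minimal sketch of an aggregation rule that lives in a consumption layer rather than in a client tool. It assumes an in-memory SQLite database as a stand-in for the layer; the `orders` table, the `revenue_by_region` view and the two consumers are hypothetical names chosen for illustration, not part of any particular platform.

```python
import sqlite3

# An in-memory SQLite database stands in for the consumption layer (assumption).
layer = sqlite3.connect(":memory:")
layer.executescript("""
    CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'EMEA', 120.0), (2, 'EMEA', 80.0), (3, 'APAC', 50.0);

    -- The aggregation rule is defined once, as a logical view in the layer.
    CREATE VIEW revenue_by_region AS
        SELECT region, SUM(amount) AS total_revenue, COUNT(*) AS order_count
        FROM orders
        GROUP BY region;
""")


def revenue_by_region(conn):
    """Read the data product; callers never re-implement the aggregation."""
    rows = conn.execute(
        "SELECT region, total_revenue, order_count "
        "FROM revenue_by_region ORDER BY region"
    )
    return {region: (total, count) for region, total, count in rows}


# Two hypothetical consumers reuse the same rule and get identical numbers.
dashboard_numbers = revenue_by_region(layer)
model_features = revenue_by_region(layer)
print(dashboard_numbers)
```

Because the rule is defined once in the layer, changing how revenue is calculated means changing one view, and every consumer picks up the change.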

It also makes it possible to create a single point of enforcement for access controls to data. This allows engineering teams to bridge legacy with modern infrastructure and removes the need for end users to understand where the data lives.

Scaling AI across the organization

When thinking about how to scale out data and analytics across the organization, dealing with legacy environments is a common challenge, especially for large, established enterprises and government agencies.

What is the best way to feed an AI engine when data exists across modern and legacy systems? Although the best answer depends on where you started, most organizations at this point would agree that trying to migrate all of your data into one warehouse isn’t realistic.

ChatGPT isn’t powerful solely because of its lines of code but because it has access to enormous amounts of data. What if ChatGPT had to wait until everyone in the world centralized their data in a warehouse? What year do you think it would be? That isn’t a feasible approach at the ChatGPT level, the organizational level or the department level. The challenges and the issues of scale are about scaling applications across complex data ecosystems. The race right now is to the data.

The value of AI is dependent upon the quality and timeliness of data. 

Today’s environments tend to be complex ecosystems composed of multiple technologies. The resulting challenge is providing the right data, to the right people, in the right form, on demand.

To meet these demands, innovations that support data lake and lakehouse analytics, along with distributed architectures, have gained popularity and momentum. Data lakes and lakehouses built on top of inexpensive object storage such as S3, ADLS and GCS make it possible to scale to large volumes of data for analytics economically.

In a distributed architecture, the data lake is the center of data gravity, while access to enriching data from other systems is provided through a federation engine as needed. Data mesh has become a powerful strategic approach to solving that problem. With a data mesh, the responsibility of owning and serving data is decentralized across business domains and domain owners.
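As a rough illustration of federation, the sketch below joins data from two separate stores in a single query. It is a toy under stated assumptions: two SQLite databases attached to one connection stand in for a data lake and a legacy system, and the `clickstream` and `customers` tables are hypothetical. A real federation engine would reach across different technologies, but the shape of the query is similar.

```python
import sqlite3

# The main database plays the data lake; a second attached database plays
# a legacy system. Both are in-memory stand-ins (assumption).
engine = sqlite3.connect(":memory:")
engine.execute("ATTACH DATABASE ':memory:' AS legacy")

engine.executescript("""
    CREATE TABLE main.clickstream (customer_id INTEGER, page TEXT);
    INSERT INTO main.clickstream VALUES (1, 'pricing'), (1, 'docs'), (2, 'pricing');

    CREATE TABLE legacy.customers (customer_id INTEGER, segment TEXT);
    INSERT INTO legacy.customers VALUES (1, 'enterprise'), (2, 'smb');
""")

# One query enriches lake data with legacy data; the caller does not need
# to copy either table into a central warehouse first.
page_views_by_segment = engine.execute("""
    SELECT c.segment, COUNT(*) AS page_views
    FROM main.clickstream AS e
    JOIN legacy.customers AS c ON c.customer_id = e.customer_id
    GROUP BY c.segment
    ORDER BY c.segment
""").fetchall()

print(page_views_by_segment)
```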

Read Defining Data Mesh to learn more. 

Organizing access through data products and domains is central to a Data Mesh

A data lake analytics platform is needed to bridge the gap between what can be a large number of analytical AI tools and the data lakes, lakehouses, legacy systems and other technologies in the ecosystem.

Data scientists shouldn’t have to worry about where the data lives or if it is in a legacy system. So how do engineering and architecture teams achieve that for their customers? Leverage a consumption layer that decouples end users from the data to provide fast access capable of bridging legacy with modern systems. 

Security: centralized governance with federated access

There is a natural tension between security and providing self-service access to data. Data, however, doesn’t have to be centralized in a data warehouse in order for it to be secure. We have to break that mindset and embrace an approach of centralized governance and controls with distributed access.

Security policies and authorization and authentication rules should be managed centrally. Those policies, however, should be applied in a distributed fashion across data lakes and data silos. Once again we leverage our consumption layer. The approach is to pull authorization policies out of client tools and data sources and enforce them in the consumption layer. That gives you one location for governing and maintaining security policies, applied in a single consumption layer, which distributes those rules across each of your data sets and clients.
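One way to picture centralized governance with distributed enforcement is the sketch below. It assumes a hypothetical policy table keyed by role, with a column allowlist and a row filter per role; the role names, fields and sample rows are invented for illustration and do not reflect any specific product's policy model.

```python
# Hypothetical central policy store: the single place where access rules live.
POLICIES = {
    "data_scientist": {
        "columns": {"customer_id", "segment", "spend"},
        "row_filter": lambda row: True,
    },
    "regional_analyst": {
        "columns": {"segment", "spend"},
        "row_filter": lambda row: row["region"] == "EMEA",
    },
}


def enforce(role, rows):
    """Apply the role's row filter and column allowlist to rows from any source."""
    policy = POLICIES.get(role)
    if policy is None:
        raise PermissionError(f"no policy defined for role {role!r}")
    return [
        {key: value for key, value in row.items() if key in policy["columns"]}
        for row in rows
        if policy["row_filter"](row)
    ]


# Rows from two different sources pass through the same enforcement function.
lake_rows = [
    {"customer_id": 1, "segment": "enterprise", "spend": 900.0, "region": "APAC"},
]
legacy_rows = [
    {"customer_id": 2, "segment": "smb", "spend": 40.0, "region": "EMEA"},
]

analyst_view = enforce("regional_analyst", lake_rows + legacy_rows)
scientist_view = enforce("data_scientist", lake_rows + legacy_rows)
print(analyst_view)
```

The policies are defined once, yet they apply identically whether a row came from the lake or from a legacy system, which is the point of enforcing them in the consumption layer instead of in each source or client.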

Scaling AI with the right infrastructure

There are a couple of key concepts to keep in mind when setting up the right infrastructure to support access and security while ensuring the flexibility to take advantage of new technologies to support AI as they emerge.

1. Leverage a data lake as the center of gravity

Data lakes have arisen as a solution to the challenges of centralizing data in a data warehouse. They allow for more economical storage of data, either in raw form as ingested from source systems or transformed according to business rules, in a typical zoned data lake approach. Data lakes address the warehouse’s challenges of scale, openness, interoperability with other technologies and data ownership.

2. Use open data formats

Open data formats go hand in hand with data lakes and lakehouses. Storing data in open formats such as ORC, Parquet and Avro eliminates vendor lock-in while making it possible to achieve excellent performance for analytical workloads.

3. Future proof your architecture with a data lake analytics platform

Leverage a consumption layer to bring reusable logic out of client tools and data sources so that it can be reused and applied broadly. Package that reusable logic into data products as noted above. Embrace separation of compute from storage to take advantage of cloud elasticity.