Why Data Virtualization is Easy with a Feature Store

By: Laura Komkov

The Challenge: 

Data is not valuable if it cannot be accessed efficiently. According to IDC’s “Data Age 2025” report, worldwide data is anticipated to grow from 33 zettabytes to 175 zettabytes by 2025, a compound annual growth rate commonly cited as 61%, with an even split between the cloud and on-premises. Within enterprise organizations, that data is often divided among hundreds of disparate data sources, such as data warehouses, data lakes, and data marts, and there is often no single source of “data truth.” This division of data makes real-time decisions almost impossible.

What is Data Virtualization?

Data virtualization is used as a bridge between multiple data sources without requiring the creation of an entirely new data platform. It is aimed at producing quick, actionable insights from multiple sources without embarking on a major, costly data project. Data virtualization gives users access to up-to-the-second data that is easy to use and understand, so any data set they might need is readily available. This builds trust in your data and means your team or customers spend more time analyzing data and less time struggling to find accurate, relevant data sets.

What is a Feature Store?

A feature store is an overlay to conventional big data systems that automatically extracts features, rather than the data itself, from each of the underlying data sources or data lakes and stores them in one centralized location. The feature store maintains up-to-the-millisecond data updates with little to no upfront data preparation.
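
As a rough illustration of what "features, not data" can mean, the sketch below turns records from two sources into a shared index that maps each feature (a field and value pair) to the set of record IDs that carry it. This is only a minimal sketch under stated assumptions, not Molecula's implementation: the sources, the field names, and the `extract_features` helper are all hypothetical.

```python
from collections import defaultdict


def extract_features(records, fields):
    """Map each (field, value) pair to the set of record IDs that have it."""
    index = defaultdict(set)
    for record in records:
        for field in fields:
            if field in record:
                index[(field, record[field])].add(record["id"])
    return index


# Two hypothetical source systems holding different facts about the same IDs.
crm = [
    {"id": 1, "region": "EU", "plan": "pro"},
    {"id": 2, "region": "US", "plan": "free"},
]
billing = [
    {"id": 1, "late": True},
    {"id": 3, "late": True},
]

# The centralized store keeps only the extracted features, not the source rows.
store = defaultdict(set)
for source, fields in ((crm, ["region", "plan"]), (billing, ["late"])):
    for feature, ids in extract_features(source, fields).items():
        store[feature] |= ids

# A question that spans both sources becomes a set intersection.
eu_late = store[("region", "EU")] & store[("late", True)]
```

In this toy version, a question that spans both sources is answered from the index alone, without going back to either source system at query time.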

The Tipping Point:

We’ve reached a tipping point in data access. The effort required to make big data accessible now often exceeds the business value that data can create. We call this the Data Death Spiral. Despite all of the data that exists, organizations find themselves making decisions with only about 1% of their data while losing four to six weeks and hundreds of millions of dollars just to fulfill basic data requests.

AI, analytics, and ML projects are failing at an unnecessarily high rate. Last year, data engineering jobs grew by 122%, compared with just 40% for data scientist jobs. The crux of the problem? A decade of copy-based data access techniques that require duplicating data in order to analyze it. Each duplication compounds costs and security risks and leaves a more complex, less clean data set. Organizations are drowning in unusable, inefficient data.

Traditional Data Access:

Data virtualization was developed to offer a single access point for analysts and data scientists to query and access enterprise data, regardless of its location. Traditionally, there have been three primary approaches to data virtualization: query federation, data aggregation, and hardware acceleration.

  • Query Federation: Query federation is the method most commonly associated with data virtualization. This approach offers the user a single interface for querying data, while in the background it pushes the workload down to the underlying systems and stitches the results together. However, when a query spans multiple source systems, performance is limited by the slowest underlying system.
  • Data Aggregation: Because of the challenges query federation poses, some tools have added the ability to cache and aggregate results locally within the data virtualization tool. This produces faster queries, with results returned in a few seconds, and takes load off the underlying systems. However, caching data is arguably one of the worst compromises an organization requiring real-time data can make, because cached results are only as current as their last refresh.
  • Hardware Acceleration: Solving data access problems with additional hardware is another common approach to improving performance. Adding hundreds or thousands of nodes enables the processing of truly staggering amounts of data, but no matter how many nodes are thrown at a given data set, there is always a floor on the achievable latency because of the architecture of the system and how the data is stored. Unfortunately, this solution also adds complexity and greatly increases the cost per query.
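
The first two trade-offs above can be sketched in a few lines of Python. This is a simplified simulation, not a model of any particular product: the source names, the latencies, and the `ResultCache` class are hypothetical stand-ins. It shows that a federated query finishes only when its slowest source responds, and that a cached result is only as fresh as its last refresh.

```python
import asyncio
import time

# Hypothetical per-source latencies in seconds (illustrative, not measured).
SOURCE_LATENCY = {"warehouse": 0.05, "data_lake": 0.20, "data_mart": 0.02}


async def query_source(name: str, latency: float) -> list[str]:
    """Stand-in for pushing a query down to one underlying system."""
    await asyncio.sleep(latency)
    return [f"{name}-row"]


async def federated_query(sources: dict[str, float]) -> list[str]:
    """Fan the query out to every source, then stitch the results together."""
    parts = await asyncio.gather(
        *(query_source(name, latency) for name, latency in sources.items())
    )
    return [row for part in parts for row in part]


def run_federated(sources: dict[str, float]) -> tuple[list[str], float]:
    """Run the federated query and report how long the whole thing took."""
    start = time.perf_counter()
    rows = asyncio.run(federated_query(sources))
    return rows, time.perf_counter() - start


class ResultCache:
    """Aggregation-style cache: fast to read, but only as fresh as its last refresh."""

    def __init__(self) -> None:
        self._rows: list[str] = []
        self._loaded_at = 0.0

    def refresh(self, rows: list[str], now: float) -> None:
        self._rows, self._loaded_at = list(rows), now

    def read(self, now: float) -> tuple[list[str], float]:
        """Return the cached rows and their age in seconds."""
        return self._rows, now - self._loaded_at
```

Calling `run_federated(SOURCE_LATENCY)` takes roughly as long as the slowest source, even though the sources are queried concurrently. Reading from `ResultCache` is immediate, but the age it reports is exactly the staleness a real-time workload has to tolerate.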

The Solution:

Molecula’s enterprise feature store is a new technology that simplifies, accelerates, and controls access to large, disparate enterprise data, while optimizing for advanced analytics, machine learning and IoT. Molecula allows organizations to securely access, query and govern data at unprecedented speeds, from disparate sources, without the need to pre-aggregate, federate, copy, cache or move the original data. Creating an overlay feature store across silos of data, organizations, or ecosystems enables an enterprise-wide, centralized access point that is nimble, secure, cost-effective, and expedient.


In Summary:

For most organizations, data is a mess: it is a known need, but it is not creating the value that it could (and should). Companies are investing huge amounts of money in data infrastructure, but that data ends up spread across numerous platforms and is often unusable in a world where real-time insights are a necessity. In the past, solutions to the problem of disparate, siloed data have often made it worse by adding complexity, slowing data access, and creating unnecessary costs and security risks.

By contrast, Molecula’s enterprise feature store overlays your existing data, giving you the ability to see immediate value. It creates a single access point for any data source, regardless of size, volume, or geography, without copying, moving, or caching the data. This enables insights to flow at the speed of thought across all of your data with no compromise on security, compliance, speed, size, data type, location, or format.

For a more in-depth look at how Molecula can simplify your data virtualization, check out our Next Generation Data Virtualization whitepaper.