Why Molecula’s Feature-First Approach to Big Data Access is the Future for AI
By: H.O. Maycotte
Automated feature extraction is at the core of what makes Molecula unique. Extracting features from data is not a new concept; it has been the first step in preparing data for AI for decades. Molecula’s FeatureBase automatically extracts features from an organization’s entire portfolio of data, regardless of location, size, or format. FeatureBase then continuously routes the extracted features, as they change in real time, to a feature storage platform: a rapid-access repository that serves as the nerve center for all real-time analytics and machine learning.
Molecula’s feature-first approach to big data access is a radical advance from conventional approaches such as federation and aggregation. This blog post sets out to provide a layperson-friendly explanation of how Molecula’s feature extraction actually works and what makes it so unique.
FIRST, A LITTLE CONTEXT BEHIND THE PROBLEM
Most big data today can be categorized as either historical or streaming and is stored in databases, data lakes, and data warehouses. These are located on premises, in the cloud, and at the edge. Important data about customers, patients, supply chains, etc. is managed by dozens and sometimes hundreds of systems. This fragmentation makes it extremely difficult to perform real-time analyses across the enterprise.
For an oversimplified example of how data is typically accessed for analysis, imagine an executive who wants to know how many patients visited three or more clinics in the last five years, broken down by geographic region and outcome. Traditionally, the appropriate queries are written by an analyst and pushed down to the individual databases to return results or, in the case of data lakes or data warehouses, a centralized copy of the data is queried. At best, an operation like this will take hours or days to prepare. The query latency alone will not support real-time results. Perhaps that executive doesn’t mind waiting hours or days, but what if he or she needs to see a continuously updated prediction of hospital bed capacity to make critical staffing decisions? Conventional approaches to fulfilling a predictive, real-time need like this can quickly turn into a major organizational undertaking.
Given the required human involvement, the unavoidable latency factors, and high costs associated with conventional data access methods, there is always a point at which it is technically impractical—and sometimes impossible—to access data at the scale and speed it’s needed.
So, the next chapter in the standard data access playbook is to create another database layer that stores only the answers to frequent queries. Common names for such approaches include indexes, OLAP cubes, columnar databases, caches, data lake engines, cloud data warehouses, lake houses, and reflections. This solves one problem but adds a new layer of complexity: more copies of the original data are created, stored, and managed. With today’s relatively inexpensive cloud infrastructure, simply storing additional copies of data is not necessarily a big problem. However, the time spent making the copies, the bandwidth used transporting them across networks, the need to provision, secure, and manage them, and the human resources required for architectural design, infrastructure deployment, management, and optimization can outweigh the value actually realized. This might be acceptable for human-scale BI projects, but not for machine-scale analytics, IoT, and other real-time applications.
In some data science scenarios, the data is transformed and cleansed on the front end (ETL), so whenever the queries or the data itself need to be updated or revised, the data engineering process must be repeated, adding days, weeks, or even months of delay.
Moving data to the cloud can provide some relief in terms of resource management, but it ultimately just transfers these same problems to a new location. And, as the data scales up, so do the problems.
FeatureBase is based on a fundamentally different approach to storing data for analysis. Instead of pre-aggregating and storing all the data you need, FeatureBase extracts features from each of the underlying data sources or data lakes and stores them in a centralized feature storage platform, in a highly efficient format built for advanced analytics and machine learning. FeatureBase keeps the features current to the millisecond, with no upfront data preparation necessary. This is achieved by reducing the dimensionality of the original data, effectively collapsing conventional data models (such as relational or star schemas) into a lower-dimensional, highly optimized format that is natively suited to machine computation. Historically, feature extraction techniques have most commonly been used by machine learning practitioners because of the massive workloads they face.
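To make the idea of a machine-friendly feature format more concrete, here is a minimal Python sketch. It is not FeatureBase's actual implementation, which this post does not describe. It assumes one plausible encoding, a bitmap per attribute value, and the function name extract_features and the patient records are hypothetical.

```python
from collections import defaultdict


def extract_features(rows, id_field):
    """Turn row-oriented records into one bitmap per (field, value) pair.

    Assumed encoding (an illustration, not FeatureBase's documented format):
    bit n of a bitmap is set when the record whose id is n has that value.
    """
    features = defaultdict(int)
    for row in rows:
        record_id = row[id_field]
        for field, value in row.items():
            if field == id_field:
                continue
            # Set the bit for this record in the bitmap for (field, value).
            features[(field, value)] |= 1 << record_id
    return dict(features)


# Hypothetical patient records, as they might sit in a source table.
patients = [
    {"id": 0, "region": "north", "outcome": "recovered"},
    {"id": 1, "region": "south", "outcome": "recovered"},
    {"id": 2, "region": "north", "outcome": "readmitted"},
]

features = extract_features(patients, "id")

# Patients 0 and 2 are in the north region, so bits 0 and 2 are set.
print(format(features[("region", "north")], "03b"))  # 101
```

In this sketch each attribute value becomes a single integer that can be combined with others using ordinary machine instructions, which is one way to read "natively suited to machine computation."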
Molecula’s feature extraction technology reframes the scope of big data analytics as we know it by eliminating the need to copy, move, or federate the data. In fact, all operations across machine learning and analytics projects can be executed in the feature store without accessing the original data. Moreover, as the original data increases in size, the feature store does not grow at the same rate. In other words, as the data inevitably grows, the benefits grow with it.
In Molecula’s feature-oriented format, complex dynamic JOINs and analyses are reduced to bitwise logical computations. As a result, a typical FeatureBase query returns orders of magnitude faster than a conventional data query while maintaining 100% fidelity to the complete data set. Because FeatureBase is so accessible, application developers, analysts, and data scientists can decide on and execute JOINs and otherwise tedious ETL at query time, making the entire process flexible and instantly adaptable to changing business needs. Most importantly, FeatureBase natively prepares all data for machine learning, AI, and today’s most demanding predictive, prescriptive, and proactive analytic applications.
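For readers who want to see what "reduced to bitwise logical computations" can look like, the sketch below answers a question like the earlier patient example using only AND, OR, and NOT on integers. It is an illustration under assumptions, not FeatureBase's query engine: the bitmaps, their names, and the helper functions are all hypothetical.

```python
def popcount(bitmap):
    """Count the set bits, i.e. the number of matching records."""
    return bin(bitmap).count("1")


def bit_positions(bitmap):
    """Return the record ids (bit positions) that are set in the bitmap."""
    ids = []
    position = 0
    while bitmap:
        if bitmap & 1:
            ids.append(position)
        bitmap >>= 1
        position += 1
    return ids


# Hypothetical bitmaps: bit n is set when patient n has the attribute.
region_north = 0b101101       # patients 0, 2, 3, 5
outcome_recovered = 0b100111  # patients 0, 1, 2, 5
visited_3_plus = 0b001110     # patients 1, 2, 3

# "North AND recovered" is a single bitwise AND, with no row-by-row join.
north_recovered = region_north & outcome_recovered
print(bit_positions(north_recovered))  # [0, 2, 5]

# Adding a third condition is one more AND.
all_three = north_recovered & visited_3_plus
print(popcount(all_three))  # 1

# NOT and OR work the same way.
north_not_recovered = region_north & ~outcome_recovered
print(bit_positions(north_not_recovered))  # [3]
print(popcount(region_north | outcome_recovered))  # 5
```

Each filter or combination is one pass over compact bitmaps rather than a scan of the source tables, which is the intuition behind the speedups claimed above. Actual performance depends on details this sketch leaves out, such as compression and how the bitmaps are distributed.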
INTEGRATES WITH EXISTING ARCHITECTURES
Whether the data is structured, semi-structured, historical, streaming, or all of the above, FeatureBase will make it instantly accessible for real-time operations. The data can remain in the format and systems where it presently resides, or not. FeatureBase’s fully functional representation of the data is at least an order of magnitude smaller than the original data, so any process that uses the data is less taxing by all measures. Wherever or however the data is stored, moved, or updated, it will require fewer resources to manage and access, freeing data experts to focus on extracting value from data with advancements and breakthroughs that are only possible with truly large-scale, real-time data analysis.
Molecula’s unique feature extraction technology enables real-time use cases never before possible across internal and external applications. Implementing a feature extraction and storage solution is one of the most important ways to prepare an organization for the future. Every department will benefit from having instantly accessible data at scale. Data scientists can shorten the time from data to business outcome with instant, continuous analysis of all their data, and IT and Security can gain better control over data access, compliance risk, and the cost of data infrastructure. From HR and marketing to R&D, product, and corporate, organizations are now able to unlock the value that’s been hiding in their data for too long.