Introducing Molecula’s Cloud-Based Feature Store
By: Laura Komkov
What If Making Real-Time Data Machine-Ready Were Easy?
Molecula’s Feature Store Simplifies the Journey to ML/AI
For years now, organizations have been drowning in data and struggling to prove business value. New technologies have emerged with the promise of making your data accessible and operational via transfer to the cloud, but these platforms often end up moving your problems instead of fixing them. Each new addition to the advanced analytics and ML/AI stack has created one more step in the journey towards achieving positive business outcomes, and each new step requires heavy resources (investment, team, etc.). Instead of inventing something next-gen, these platforms have attempted to optimize legacy technologies, but even optimized legacy technologies cannot keep up with current, machine-scale innovation.
Molecula was born out of this experience. Our team originally formed within another company, a marketing and analytics segmentation platform for media, sports and entertainment companies. With each new customer we signed, we needed to ingest millions of customer profiles, each with millions of data points, from hundreds of sources. We rapidly broke our stack: current technologies could not support our needs. We regrouped to determine if there was a new approach that could handle the volumes of data we were ingesting without sacrificing speed, latency or data quality. The new data format that we invented (and hold several patents on) is what we now refer to as a “feature-based format.” We soon realized that our feature-based format was a new way of representing and storing data that automated its preparation for AI and machine learning. We knew we had to share this with the world, so we created Pilosa, the open-source version of our feature-first storage format. The Pilosa community has now grown to over 2,100 users.
“What if you could have a new technology that automated the process of preparing your data for machine-scale analytics and AI, while allowing your raw data to remain at the source, and then leverage it across all use cases, teams and functions?”
Data preparation is essential to any successful machine-scale endeavor. In the past, preparing data for machine learning has been an arduous process: fighting with IT to get access to data, exploring those datasets, outlining and reviewing the expectations of a machine learning algorithm, selecting the specific data needed for the desired outcome, and deciding on the most appropriate data preparation techniques to transform that data into a machine-ready format, all based on the single task at hand. This is slow and expensive, especially when you have to wait for IT to provision the data you need to start your project.
The first step in the feature engineering process is feature extraction. Feature extraction prepares your data for the machines by abstracting complex schemas and their data into basic objects and attributes, distilling a highly computable representation optimized for machine-scale analytics and applications. Once the data is in this format, it is far less expensive to work with, requires fewer resources to process, is extremely fast to query, and opens up new opportunities for organizations to take advantage of all of their data.
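To make the idea of reducing records to objects and attributes concrete, here is a minimal sketch in Python. It is an illustration only, not Molecula's patented format: the function names, the sample records, and the choice of plain Python integers as bitmaps are all assumptions made for this example. Each (attribute, value) pair becomes one feature, stored as a bitmap in which bit i is set when object i has that feature.

```python
from collections import defaultdict


def extract_features(records):
    """Turn row-oriented records into one bitmap per (attribute, value) feature.

    Bit i of a bitmap is set when the record at position i has that feature.
    Bitmaps are plain Python ints here (an assumption made for brevity).
    """
    bitmaps = defaultdict(int)
    for position, record in enumerate(records):
        for attribute, value in record.items():
            bitmaps[(attribute, value)] |= 1 << position
    return dict(bitmaps)


def matching_positions(bitmap):
    """List the record positions whose bits are set in a bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical customer profiles, invented for illustration.
records = [
    {"sport": "soccer", "tier": "gold"},
    {"sport": "tennis", "tier": "gold"},
    {"sport": "soccer", "tier": "basic"},
]

features = extract_features(records)

# "Soccer fans in the gold tier" becomes a single bitwise AND of two features.
soccer_gold = features[("sport", "soccer")] & features[("tier", "gold")]
print(matching_positions(soccer_gold))
```

In this toy layout a segmentation query never touches the original rows; it only combines bitmaps, which is one reason a feature-oriented representation can be cheap to compute over.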
All other feature stores on the market today are built on reference architectures and are focused mostly on feature re-use. This creates additional complexity, latency, speed and data gravity issues, and places limitations on how and where features can be used. These feature stores are difficult to implement, and while they can help with MLOps problems around feature re-use, they often end up creating one more silo in an organization’s already-siloed architecture. Most importantly, they still require the agonizing battle between data engineering and data science to get the data.
Molecula’s Feature Store
Molecula’s feature store is different. It bridges the entire spectrum of data readiness, all the way from data sources to MLOps, making your most important data instantly computable. Molecula leaves data at its source and continuously extracts and updates only features into a centralized feature store. This process eliminates the need to copy, move, or pre-aggregate data, reduces the data footprint by 60-90 percent, and provides a secure data format for sharing. All of an organization’s data can be converted to reusable features and analyzed with full fidelity, regardless of format or source location, across any cloud, for immediate, millisecond analytics performance.
Molecula’s feature store is not built on any existing architecture. It is an entirely original technology, based on a new data format that can scale in never-before-seen ways without sacrificing speed or latency, while reducing costs and data footprint. It does not create another silo, but instead eliminates existing silos, unifying access to all data for all teams. It is truly an “easy button” for accelerating machine-scale analytics and AI within enterprise organizations, including those within the life sciences, technology, financial services, and healthcare industries. With Molecula, these industries are able to personalize customer experiences, predict anomalies and fraud, diagnose patients more accurately, and predict staffing needs, among many other uses.
Our feature store provides a single pane of glass for sharing all available features and securely powering all of your projects. Insights can be driven with millisecond updates from raw data sources. With Molecula’s feature store, you can nail data access and data readiness and create feature sets, re-use features, and optimize machine learning lifecycles.
Our binarized format powers our feature store and stores the relationship between the attribute and the object. The feature store then serves feature vectors that contain whatever features you have selected for training and production purposes and allows for the re-use and sharing of features inside and outside of an organization. In the feature store itself, we maintain a feature map, which is essentially the metadata that we need to translate features in and out of these entities.
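The relationship between a feature map and the feature vectors it serves can be sketched as follows. This is a hypothetical illustration, not Molecula's API: the `FeatureMap` class, its method names, and the sample record are assumptions introduced for this example. The stored form of an object is just a list of integer column indices; the map is the metadata needed to translate those indices to and from readable features.

```python
class FeatureMap:
    """Hypothetical metadata mapping readable features to column indices."""

    def __init__(self):
        self._index = {}  # (attribute, value) -> column index
        self._names = []  # column index -> (attribute, value)

    def column(self, attribute, value):
        """Return the column index for a feature, assigning one if new."""
        key = (attribute, value)
        if key not in self._index:
            self._index[key] = len(self._names)
            self._names.append(key)
        return self._index[key]

    def encode(self, record):
        """Return the sorted column indices that are set for one object."""
        return sorted(self.column(a, v) for a, v in record.items())

    def vector(self, columns, selected):
        """Serve a dense 0/1 vector restricted to the selected features."""
        present = set(columns)
        return [1 if self._index.get(f) in present else 0 for f in selected]

    def decode(self, columns):
        """Translate stored column indices back into attribute/value pairs."""
        return {self._names[c][0]: self._names[c][1] for c in columns}


fmap = FeatureMap()

# Stored form of one object: only integer column indices, no raw values.
stored = fmap.encode({"sport": "soccer", "tier": "gold"})

# A feature vector over whichever features a training job selected.
selected = [("tier", "gold"), ("tier", "basic"), ("sport", "soccer")]
print(fmap.vector(stored, selected))
print(fmap.decode(stored))
```

In this sketch the stored indices carry no readable meaning without the map, and the same stored object can be served as different feature vectors depending on which features a project selects.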
We apply homomorphic compression to the data. Because we’re storing features, not values, the nature of the compression means your feature vectors are meaningless to anyone without the feature map, as long as you keep that map secure. This secure and confidential nature makes features well suited to hybrid or cloud environments.
With a simpler implementation process, Molecula helps businesses see value almost immediately. That value is not isolated to a single team, but instead serves all relevant departments, unlocking new opportunities to capitalize on real-time, compute-ready data.