Introduction

Molecula is an enterprise feature store that simplifies, accelerates, and improves control over data to power machine-scale analytics and AI. The Molecula feature store continuously extracts and updates features to provide data engineers, data scientists, and application developers a single access point to graduate from reporting and explaining with human-scale data to predicting and prescribing real-time business outcomes with all data.

At Molecula, we believe that companies are stuck in a Data Death Spiral. Rather than following an overarching, coordinated data access strategy, each big-data-driven project is too narrowly focused, causing further data fragmentation as data gets copied and moved from system to system. With every attempt to solve our data access woes, we inadvertently make the problem worse because the solution requires us to copy, pre-process, and move data. Molecula is focused on helping industries escape their Data Death Spirals and access their most important data for transformative AI, ML, and other big data analytical applications. The ultimate goal is to unlock human potential through data.

To dive deeper into Molecula’s approach versus traditional approaches, read our white paper “Breaking the Latency Floor”.

Core Technology

What is a feature?

Feature extraction is the process of reducing the dimensionality of data to more efficiently represent its values. A feature, or column, represents a measurable piece of data that can be used for analysis: Name, Age, Sex, Fare, and so on. Features are also sometimes referred to as “variables” or “attributes.” This technique was pioneered by data scientists who needed to prepare data for demanding machine learning and AI workloads.

What is a feature store?

A feature store is an overlay to conventional big data systems that automatically extracts features, not data, from each of the underlying data sources or data lakes and stores them in one centralized location. The feature store maintains up-to-the-millisecond data updates with little to no upfront data preparation. This is achieved by reducing the dimensionality of the original data, effectively collapsing conventional data models (such as relational or star schemas) into a highly optimized format that is natively suited to machine-scale analytics and AI. The feature store then serves feature vectors for training and production purposes and allows for the re-use and sharing of features inside and outside of an organization. Feature stores are typically implemented and managed by data engineers and provide data scientists, ML researchers, and application developers a single access point to derive insights, predictions, and real-time decisions from big data. Implementing a feature store allows companies to graduate from reporting and explaining with human-scale data to predicting and prescribing real-time business outcomes on all data.

Why use a feature store for all machine-scale analytics and AI projects?

Capabilities:

  • FEATURE EXTRACTION
    ‘AI Ready’ feature store that automates feature extraction and real-time updates at the source
  • SINGLE POINT OF ACCESS
    Centralized, ultra low latency access to all of your data
  • SUPPORTS ML WORKLOADS
    High concurrency queries for machine-scale analytics and ML
  • ELIMINATES PRE-PROCESSING
    Performant Joins at query time, with no pre-aggregation or pre-processing
  • FOOTPRINT REDUCTION
    Lossless reduction in data footprint, up to 85%, without copying or moving data
  • TIME
    Track and filter time at a feature level
  • CELL-LEVEL CONTROL
    Granular access control of feature sharing down to the cell level
  • OVERLAY
    Extension framework enabling seamless integration into existing environment

How does Molecula work with your existing big data environment?

Molecula integrates into existing environments through a Data Tap ecosystem for ingesting, consuming, and monitoring data. Data Taps integrate natively into your stack. We think it is important to have one persistent data store to collect data, but we can also extract features and maintain real-time updates from any data source, at the source. We ingest data from the sources of your choice, extract features from it into a centralized feature store, and then provide taps that let machine learning and AI tools consume or query the data and features.

Which integrations are available today?

Molecula currently offers two types of ‘Data Taps’: ingest Data Taps that populate data into Molecula, and consumption Data Taps that allow end users to interact with data in their native tools.

Today we have Data Taps for the most requested underlying systems, formats, and data pipelines, including:

Formats:

Storage:

Data Pipelines and CDC:

Our Consumption Plugins include:

Data Science:

Visualization and Business Intelligence:

  • Tableau
  • Power BI
  • Redash.io

What is the effort and time to deploy Molecula? 

Molecula installation is straightforward and usually takes about 4 weeks to production depending on any custom development or new Data Taps required. The implementation today consists of just a few binaries and service files to be installed along with a minimal set of dependencies. We can also deploy via Docker container or via pre-built Ansible scripts. It can be installed in an on-prem environment, or inside a Linux Virtual Machine with any cloud provider. In the future we will offer a full SaaS solution.

Our Implementation team will work with you to define your data model around your source data and use cases, and help you set up and optimize your ingestion scheme. Depending on the complexity of your use cases and source data, this could take just a few hours, but many projects will run longer to account for use case expansion as new possibilities are uncovered!

Getting Started

Not at this time. We find Pilosa users who want enterprise support often see value in the differentiated features that come with the Molecula platform, along with the Enterprise Support included in Molecula’s licensing. The typical Pilosa use case is to execute extremely performant queries on a single index. Pilosa users tend to convert to Molecula because they need to securely run real-time advanced analytics at scale, across multiple data sources, silos, use cases, lines of business, etc.

A major differentiating factor for those customers has been the ability to execute across multiple sources, which allows for SQL JOIN functionality. Molecula also allows for more granular access control, optimized memory utilization, faster data ingest, and more.

Molecula has an Enterprise License model that allows you to buy t-shirt size packages that accommodate the number of sources and applications ingesting and consuming data from your feature store. Our open source technology Pilosa is under the Apache 2 license.

On the query side the effort is very low due to SQL support. Getting data into Molecula is facilitated by a number of pre-built Data Taps which connect to popular data stores and external tools. There are also client libraries which make it easy to build bespoke integrations if needed.

You can start with a 1-to-1 mapping to relational tables and then refine to take advantage of Molecula’s unique features. This might mean collapsing multiple tables into one VDS using Molecula’s super-performant set and time fields, removing fields that don’t need to be virtualized, or even denormalizing to reduce query complexity. While Molecula’s defaults are very powerful, there is often more than one way to map data into VDSs—it can pay big dividends to spend some effort tuning this for your particular workload.

We have a variety of ingest plugins, including bulk SQL loaders, Kafka connectors (supporting Avro and the Confluent Schema Registry), and change data capture (CDC) plugins. We are constantly adding new ingest plugins; those that will be in production soon include Spark and Parquet. We have a team dedicated to building ingest plugins, and it usually takes 3-4 weeks to build one from scratch if we don’t already have one for the data source you need to ingest from.

Environment

Molecula’s capabilities around access control stem directly from its core data format. In short, all data is broken down by feature, and then by value, and each value is represented as a bitmap. This means that there is no added performance overhead for granting access to particular features or even particular values within the data. Granting or denying access to a particular subset of records means applying a bit mask to each query, which is the most basic internal operation and is extremely optimized.
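The bit-mask idea can be sketched in a few lines. This is a minimal illustration only, assuming Python integers as stand-in bitmaps (bit i is set when record i has the value); the names `bitmaps`, `allowed_mask`, and `count_matching` are hypothetical and are not part of Molecula's API.

```python
# Hypothetical illustration: Python ints stand in for bitmaps.
# Bit i of a bitmap is 1 when record i has that feature value.
bitmaps = {
    ("country", "US"): 0b101101,  # records 0, 2, 3, 5
    ("country", "DE"): 0b010010,  # records 1, 4
    ("plan", "pro"):   0b001111,  # records 0, 1, 2, 3
}

def count_matching(feature, value, allowed_mask):
    """Count records with feature == value that the caller is allowed to see."""
    # A single bitwise AND applies the access rule to the query.
    visible = bitmaps[(feature, value)] & allowed_mask
    return bin(visible).count("1")

# Assumed policy for this example: the caller may only see records 0 to 2.
allowed_mask = 0b000111
us_visible = count_matching("country", "US", allowed_mask)
```

Here four records have country US, but only two of them fall inside the caller's mask, so `us_visible` is 2. The access check costs one extra AND per bitmap touched by the query.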

Furthermore, Molecula allows users to do something fairly unique: separating access to a feature into a feature map (row and column “keys”) and the features themselves. That is, because the data is broken out by feature, it’s possible to share the records that each discrete feature has without sharing the data itself (or vice versa). This is a form of anonymization that can happen completely automatically with no overhead, because you’re just choosing not to expose certain parts of the data that are already stored separately.

Molecula is primarily focused on opening up new use cases for our clients by shattering the latency floor compared to legacy systems. However, IT departments using Molecula often find ways to replace OLAP cubes, analytical data lakes, and other redundant systems with Molecula. When this happens, cost savings can range from 10x to 100x compared to the systems being replaced. This applies both to the hardware footprint and to the data movement and network costs typically associated with information-era systems.

For example, in the situations where Molecula replaces Elasticsearch in high data volume analytics use cases, we have seen upwards of 10x reduction in footprint, many orders of magnitude improvement in performance and the ability to do all of this without the typical pre-aggregation or pre-processing. 

Not currently, but this is on the roadmap for 2021.

Cloud Service options with Molecula

Yes. Molecula can run on any cloud, including Azure, AWS, Google Cloud, and Oracle Cloud. It can run on a Linux virtual machine or in a container.

Yes.

Mac and Linux. Windows is not directly supported other than running in a Linux Virtual Machine.

System

When Molecula ingests data it splits the features and the feature map apart, but, crucially, it has both of them, so it can respond to queries while also being able to recreate the original data set from the information that Molecula stores.
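As a rough sketch of that split, assuming a toy in-memory layout rather than Molecula's actual storage format: the hypothetical `ingest` function below separates record keys (the feature map) from per-value bitmaps, and `reconstruct` shows that holding both parts is enough to rebuild the original records.

```python
def ingest(records):
    """Split records into a feature map (key -> bit position) and per-value bitmaps."""
    feature_map = {}  # record key -> bit position
    bitmaps = {}      # (feature, value) -> int used as a bitmap
    for position, (key, row) in enumerate(records.items()):
        feature_map[key] = position
        for feature, value in row.items():
            bitmaps[(feature, value)] = bitmaps.get((feature, value), 0) | (1 << position)
    return feature_map, bitmaps

def reconstruct(feature_map, bitmaps):
    """Rebuild the original records from the feature map and the bitmaps."""
    records = {key: {} for key in feature_map}
    for (feature, value), bits in bitmaps.items():
        for key, position in feature_map.items():
            if (bits >> position) & 1:
                records[key][feature] = value
    return records

# Hypothetical source data for the example.
source = {
    "alice": {"country": "US", "plan": "pro"},
    "bob":   {"country": "DE", "plan": "free"},
    "carol": {"country": "US", "plan": "free"},
}
feature_map, bitmaps = ingest(source)
rebuilt = reconstruct(feature_map, bitmaps)
```

With both parts, `rebuilt` equals `source`. With the bitmaps alone, a query can still count how many records have country US, but it cannot say which keys those records belong to.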

If the feature map is kept out of the feature store, de-identified, or both, then there are essentially no values stored in the feature store. Because you aren’t making a direct copy of the data nor transferring it over the network, you are representing it in the best possible format for analysis and compliance.

Molecula extracts features at the original data source and then compresses them for transmission and storage in the feature store. For encryption, we rely on filesystem or disk-level encryption, though we are considering options in the homomorphic encryption space.

People often think of data virtualization as being a layer on top of existing data access technologies — federating queries down to source systems, caching results, and providing a unified schema and language for the access of data. Molecula is a new take on Data Virtualization. Molecula’s feature store answers queries without federating them down to source systems.

It stores the data in a fundamentally different way, which is naturally compressed compared to the source representation and highly efficient for machine-scale analytical workloads. We consider our modern approach to data virtualization to more closely mimic the virtualization that we’ve seen elsewhere in the industry, such as compute virtualization (VMware).

The computational demand needed to abstract your source data into Molecula’s format isn’t negligible, but it has been carefully optimized. For example, a 300GB data set of CSV files (over 1B records) takes about 20 minutes to load using a single 32-core VM. Once the initial data load is complete, it takes a small fraction of the compute resources to process data and schema changes in the original data source and apply them to the feature store. The lag behind the original data source is usually between 5 and 2000ms.
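For a sense of scale, the quoted load figures can be turned into approximate throughput numbers. This is back-of-the-envelope arithmetic only; it assumes 1 GB = 10^9 bytes and uses exactly 1B records, which the text gives as a lower bound.

```python
# Figures quoted above; GB is taken as 10**9 bytes (an assumption).
dataset_bytes = 300 * 10**9
records = 1_000_000_000  # "over 1B", so this is a lower bound
seconds = 20 * 60
cores = 32

mb_per_second = dataset_bytes / seconds / 10**6
records_per_second = records / seconds
records_per_core_second = records_per_second / cores

print(f"{mb_per_second:.0f} MB/s, {records_per_second:,.0f} records/s, "
      f"{records_per_core_second:,.0f} records/s per core")
```

Under those assumptions the initial load works out to roughly 250 MB/s, or about 833,000 records per second across the VM and about 26,000 records per second per core.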

Usage

Analytical workloads have been evolving for quite some time and Molecula is building on two major shifts:

  1. Shifting from databases to data formats (e.g. columnar databases to Parquet and ORC) which can be handled flexibly and take advantage of serverless offerings more easily.
  2. Shifting from serialized data formats to in-memory formats (Parquet to Arrow). This is more nascent but will continue to have a massive impact on performance and flexibility—not needing to serialize and deserialize saves huge amounts of compute, decreases latency, and makes it far less painful to move data around a distributed system.

Molecula builds on these two shifts and takes the data format itself a step further for analytics and machine learning with the distributed bitmap/vector model (Pilosa). Instead of representing data record-by-record or column-by-column, we break it out even further to value-by-value. With Molecula’s approach, you get more compression, less I/O per query, benefits around access control and versioning, a very GPU-friendly data format, and the opportunity for easier homomorphic encryption given how simple the basic operations are. As a result, any type of query that operates on much or most of a dataset is almost guaranteed to be far faster given the right algorithms, because there’s so much less data to move from disk->memory or memory->CPU.
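The difference between the three layouts can be shown on a toy data set. This is a simplified sketch that uses Python integers as bitmaps; it is not Pilosa's actual on-disk format, and the field names are made up for the example.

```python
# Record-by-record: one dict per row.
rows = [
    {"color": "red",  "size": "S"},
    {"color": "blue", "size": "M"},
    {"color": "red",  "size": "M"},
    {"color": "red",  "size": "S"},
]

# Column-by-column: one list of values per field.
columns = {field: [row[field] for row in rows] for field in rows[0]}

# Value-by-value: one bitmap per (field, value); bit i is set when row i has that value.
value_bitmaps = {}
for i, row in enumerate(rows):
    for field, value in row.items():
        value_bitmaps[(field, value)] = value_bitmaps.get((field, value), 0) | (1 << i)

# "How many rows are red AND size S?" reads two bitmaps and nothing else.
red_and_small = value_bitmaps[("color", "red")] & value_bitmaps[("size", "S")]
match_count = bin(red_and_small).count("1")
```

The query never scans the rows or the full columns: it intersects two small bitmaps and counts the set bits, which is where the reduced I/O per query comes from.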

Molecula is for anyone who has very large historical data sets or large volumes of streaming data stored across multiple silos and geographies, and who is struggling to analyze and ask questions of that data. Molecula is a fundamental advancement in low latency queries because of the way the data is stored and processed. Many systems have solved the problem of scale, but Molecula lowers the latency floor to the point that completely new use cases are now possible: real-time analysis and data at the speed of thought™.

In addition, Molecula users are primarily software engineers, data engineers, and machine learning engineers who are tasked with delivering data access to people or applications that need to query, segment, analyze, and make decisions on data in real-time. Often these engineers live in either IT or directly in business units. IT commonly acts as an administrator of Molecula to enforce data access standards across the enterprise, where they can easily apply compliance best practices and other regulatory requirements to their feature store.

The beneficiaries of Molecula are typically data scientists, business analysts, or end-user software and applications that need to process queries to make a particular business decision and do so with extremely low latency. Our customers include some of the largest, most advanced technology companies in the world, and we have accelerated their hardest query times from days and hours down to fractions of a second.

Molecula was initially designed to solve ad hoc machine-scale analytics and ML use cases that queried high cardinality data and allowed users to drill down into and predict granular audience attributes in real-time. However, today we have also added support for highly performant queries on dense and mixed density datasets. The workloads where we add the greatest value are the complex analytical ones where source data is fragmented across silos and where a user or machine wants to apply a number of filters or criteria to a query that returns a subset of that data to take a business action on. Molecula is not designed to be a transactional system, a system of record, or a way to fulfill single record queries (e.g. show me Tom’s record), as other database types are optimized to persist and return these queries effectively.

Molecula is primarily queried through SQL, even if you are using our API, a Data Tap, or one of our client libraries. Initially, we used a custom query language built around the core storage format, but we’ve expanded the query capabilities to the point where a significant subset of SQL is now supported, with more being added every day. Our SQL support currently encompasses a variety of WHERE clauses, GROUP BY, JOIN, ORDER BY, SUM, COUNT, and NOT queries. Molecula also has a Python CLI, with the ability to support Go and Java.
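To make the listed SQL features concrete, here is one query that combines JOIN, WHERE with NOT, GROUP BY, ORDER BY, SUM, and COUNT. The standard-library `sqlite3` module is used purely as a stand-in engine so the example runs locally; the `customers` and `orders` tables are hypothetical, and nothing here shows how to connect to Molecula itself.

```python
import sqlite3

# sqlite3 is only a stand-in engine so the query can run locally;
# the tables and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT, active INTEGER);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'east', 1), (2, 'west', 1), (3, 'east', 0);
    INSERT INTO orders VALUES (1, 20.0), (1, 5.0), (2, 12.5), (3, 99.0);
""")

# Order count and revenue per region, excluding inactive customers.
query = """
    SELECT c.region, COUNT(*) AS order_count, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    WHERE NOT (c.active = 0)
    GROUP BY c.region
    ORDER BY total DESC
"""
result = conn.execute(query).fetchall()
conn.close()
```

With the sample rows, `result` holds one row per region with its order count and total, sorted by total in descending order. Whether a given statement falls inside the supported SQL subset should be checked against the product documentation.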

Molecula is best with large, fragmented, disparate data sets that have complex analytical or computational requirements or the need to combine streaming data with historical. Here are some common use cases: 

    1. Customer 360 Segmentation
    2. Accelerating Analytics
    3. Machine Learning
    4. IoT and Remote Decisioning
    5. Anomaly Detection
    6. Migration to the cloud to run analytics there

There are several stages in the machine learning life cycle where data scientists are using Molecula today.

  1. Most critically, assuming they have the proper permissions, data scientists can use a feature store to immediately and centrally access continuously updated records about the most important data in an organization. This data might include customers, patients, merchants and devices and originate from dozens or even hundreds of systems. They can now do this without having to have IT architect, deploy or manage infrastructure for each and every project.
  2. Real-time, iterative data exploration that reduces or, often, completely eliminates the long information request cycles between the data scientist and data engineer or IT.
  3. Molecula eliminates the category-to-integer phase of data preparation because the core data format does this natively.
  4. While data scientists can work on feature stores directly with Jupyter notebooks using our Python Client Library, they also still export from the feature store into Pandas dataframes to leverage libraries like scikit-learn and imblearn. Using a feature store to create Pandas dataframes allows data scientists to use a much larger sample size.
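For readers unfamiliar with the category-to-integer phase mentioned in step 3, the sketch below shows what that manual preparation step typically looks like. It is plain Python with a hypothetical `encode_categories` helper; libraries such as pandas offer equivalents (for example `pandas.factorize`).

```python
def encode_categories(values):
    """Map each distinct category to an integer code, in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)  # next unused integer
        encoded.append(codes[value])
    return encoded, codes

# Hypothetical categorical column.
encoded, codes = encode_categories(["gold", "silver", "gold", "bronze"])
```

Here `encoded` becomes `[0, 1, 0, 2]` and `codes` records which integer stands for which category. This is the bookkeeping that a format which already stores each value separately makes unnecessary.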

Molecula has an ecosystem of Data Taps that allow end users to work directly in their existing systems without having to worry about the underlying system. Additionally, we have implemented the PostgreSQL wire protocol, so any BI software that can connect to Postgres can also connect directly to Molecula. Today, our customers use Molecula to power real-time visualization and BI tools like Tableau, Power BI, and Excel.

 

Complex WHERE clauses, counts, sorts, top-n, multi-field GROUP BY, JOIN, and any combination of these. Its limitations include transaction processing and slower query times when accessing single records.