Introduction

Molecula is an enterprise feature store that simplifies, accelerates, and improves control over data to power machine-scale analytics and AI. The Molecula feature store continuously extracts and updates features to provide data engineers, data scientists, and application developers a single access point to graduate from reporting and explaining with human-scale data to predicting and prescribing real-time business outcomes with all data.

At Molecula, we believe that companies are stuck in a Data Death Spiral. Rather than following an overarching, coordinated data access strategy, each big-data-driven project is too narrowly focused, causing further data fragmentation as data gets copied and moved from system to system. With every attempt to solve our data access woes, we inadvertently make the problem worse because the solution requires us to copy, pre-process, and move data. Molecula is focused on helping industries escape their Data Death Spirals and access their most important data for transformative AI, ML, and other big data analytical applications. The ultimate goal is to unlock human potential through data.

To dive deeper into Molecula’s approach versus traditional approaches, read our white paper “Breaking the Latency Floor”.

Simplify big data infrastructure with no compromise

  • Access to 100% of your data
  • Simplified schema and simpler setup
  • No pre-aggregation
  • No copying or caching of data

Accelerate time from data to decision

  • Instant, continuous insights
  • 1000x faster BI/ML
  • Reduce data delivery cycles
  • Reduce time integrating disparate datasets

Control data access, compliance risks and cost

  • Securely share data
  • Meet compliance requirements
  • Reduce data sprawl
  • Reduce data footprint and cost 10-100x

Molecula integrates into existing environments through a robust ecosystem of plugins for ingest, consumption, and monitoring that integrate natively with your stack. We think it is important to have one persistent data store to collect data, but we can also virtualize any data source at the source. We ingest data from the sources of your choice, abstract it into a representation of that data, and then provide connectors and plugins for your favorite BI, data science, or custom analytic application.

Molecula installation is straightforward, and consists of just a few binaries and service files to be installed along with a minimal set of dependencies. We can also deploy via Docker container or via pre-built Ansible scripts. It can be installed in an on-prem environment, or inside a Linux Virtual Machine with any cloud provider.

Our Implementation team will work with you to define your data model around your source data and use cases, and help you set up and optimize your ingestion scheme. Depending on the complexity of your use cases and source data, this could take just a few hours, but many projects run longer to account for use case expansion as new possibilities are uncovered.

Getting Started

We have a variety of ingest plugins, including bulk SQL loaders, Kafka connectors (supporting Avro and the Confluent Schema Registry), and change data capture (CDC) plugins. We are constantly adding new ingest plugins; some that will be in production soon include Spark and Parquet. We have a team dedicated to building ingestion plugins, and it usually takes 3-4 weeks to build one from scratch if we don’t already have one for the data source you need to ingest from.

You can start with a 1-to-1 mapping to relational tables and then refine to take advantage of Molecula’s unique features. This might mean collapsing multiple tables into one VDS using Molecula’s super-performant set and time fields, removing fields that don’t need to be virtualized, or even denormalizing to reduce query complexity. While Molecula’s defaults are very powerful, there is often more than one way to map data into VDSs—it can pay big dividends to spend some effort tuning this for your particular workload.

On the query side the effort is very low due to SQL support. Getting data into Molecula is facilitated by a number of pre-built integration plugins which connect to popular data stores and tools. There are also client libraries which make it easy to build bespoke integrations if needed.

Molecula has an Enterprise License model that allows you to buy t-shirt-sized packages of VDSs; each package comes with one VDS manager and implementation services. If a data source exceeds a certain size, it will count against your inventory of VDSs. You can then buy additional packages of VDSs. Our open source technology, Pilosa, is under the Apache 2.0 license.

Not at this time. We find Pilosa users who want enterprise support often see value in the differentiated features that come with the Molecula platform, along with the Enterprise Support included in Molecula’s licensing. The typical Pilosa use case is to execute extremely performant queries on a single index. Pilosa users tend to convert to Molecula because they need to securely run real-time advanced analytics at scale, across multiple data sources, silos, use cases, lines of business, etc.

A major differentiating factor for those customers has been the ability to execute across multiple sources, which allows for SQL JOIN functionality. Molecula also allows for more granular access control, optimized memory utilization, faster data ingest, and more.

Environment

Molecula’s capabilities around access control stem directly from its core data format. In short, all data is broken down by field, and then by value, and each value is represented as a bitmap. This means that there is no added performance overhead for granting access to particular fields or even particular values within the data. Granting or denying access to a particular subset of records means applying a bit mask to each query, which is the most basic internal operation and is heavily optimized.
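To make the field-then-value breakdown concrete, here is a minimal Python sketch of a bitmap index with a per-query access mask. It is an illustration only, not Molecula's or Pilosa's actual implementation: the `BitmapIndex` class, the `query` function, and the sample records are all hypothetical, and plain Python integers stand in for compressed bitmaps.

```python
from collections import defaultdict


class BitmapIndex:
    """Toy index: field -> value -> bitmap. Bit i is set when record i has that value."""

    def __init__(self):
        self.fields = defaultdict(lambda: defaultdict(int))
        self.num_records = 0

    def add(self, record):
        """Store one record (a dict of field -> value) and return its record ID."""
        record_id = self.num_records
        for field, value in record.items():
            self.fields[field][value] |= 1 << record_id
        self.num_records += 1
        return record_id

    def rows(self, field, value):
        """Return the bitmap for field == value (0 if the value is unknown)."""
        return self.fields.get(field, {}).get(value, 0)


def query(index, field, value, access_mask):
    """Record IDs where field == value, limited to records the caller may see."""
    bits = index.rows(field, value) & access_mask  # access control is one AND
    return [i for i in range(index.num_records) if (bits >> i) & 1]


index = BitmapIndex()
for rec in [
    {"region": "east", "plan": "pro"},
    {"region": "west", "plan": "pro"},
    {"region": "east", "plan": "free"},
    {"region": "east", "plan": "pro"},
]:
    index.add(rec)

# Hypothetical policy: this caller may only see records in the "east" region.
east_only = index.rows("region", "east")
everything = (1 << index.num_records) - 1

restricted = query(index, "plan", "pro", east_only)
unrestricted = query(index, "plan", "pro", everything)
print(restricted, unrestricted)
```

In this sketch the restricted caller never sees record 1, and the only extra work compared with the unrestricted query is a single bitwise AND.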

Furthermore, Molecula allows users to do something fairly unique which is separating the access to a field into “keys” and “relationships”. That is, because the data is broken out by value, it’s possible to share the records that each value has without sharing the values themselves (or vice-versa). This is a form of anonymization that can happen completely automatically with no overhead because you’re just choosing not to expose certain parts of the data—it’s already stored separately.

As a motivating example, you might have a huge dataset on-premises that you want to put through some computationally intensive algorithmic analysis (think clustering). Perhaps you don’t have access to a huge amount of elastic compute on-prem, so you want to use temporary cloud resources. A clustering analysis doesn’t need to know what the actual values are; it only needs to know about the relationships and have some way to refer to the values when it returns the results. With Molecula you can export just the relationships (and save on bandwidth cost and time in the process), run your analysis on that data, and translate the results back to the actual values on-prem. You have significantly decreased your exposure from a security perspective and used far less bandwidth, because you have granular control over what data you share and how you share it.
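The split between keys and relationships can be sketched in a few lines of Python. This is a hedged illustration under simple assumptions, not Molecula's export format: `split_field` and `most_overlapping_pair` are hypothetical names, the sample columns are made up, and a simple overlap count stands in for a real clustering algorithm.

```python
def split_field(values):
    """Split one column into translation keys and relationships.

    keys:          opaque integer ID -> original value (stays on-prem)
    relationships: opaque integer ID -> set of record IDs holding that value
    """
    ids, keys, relationships = {}, {}, {}
    for record_id, value in enumerate(values):
        if value not in ids:
            ids[value] = len(ids)
            keys[ids[value]] = value
            relationships[ids[value]] = set()
        relationships[ids[value]].add(record_id)
    return keys, relationships


def most_overlapping_pair(rel_a, rel_b):
    """Stand-in for the remote analysis: it only ever sees opaque IDs and record sets."""
    return max(
        ((a, b) for a in rel_a for b in rel_b),
        key=lambda pair: len(rel_a[pair[0]] & rel_b[pair[1]]),
    )


# Hypothetical source columns, one entry per record.
diagnosis = ["flu", "cold", "flu", "flu", "cold"]
clinic = ["north", "south", "north", "north", "south"]

diag_keys, diag_rel = split_field(diagnosis)
clinic_keys, clinic_rel = split_field(clinic)

# Only the relationships leave the premises; the keys do not.
a, b = most_overlapping_pair(diag_rel, clinic_rel)

# Back on-prem, translate the opaque result into real values.
result = (diag_keys[a], clinic_keys[b])
print(result)
```

The remote side works entirely with integers and sets of record IDs; the mapping from IDs to values is applied only after the result comes back.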

Molecula is primarily focused on opening up new use cases for our clients by shattering the latency floor compared to legacy systems. However, IT departments using Molecula often find ways to replace OLAP cubes, analytical data lakes, and other redundant systems with Molecula. When this happens, cost savings can be between 10x and 100x compared to the systems being replaced. This is true both for the reduction of hardware footprint and for the data movement and network costs that are typically associated with information era systems.

For example, in situations where Molecula replaces Elasticsearch in high data volume analytics use cases, we have seen up to a 100x reduction in footprint, at least a 1000x improvement in performance, and the ability to do all of this without the typical pre-aggregation or pre-processing.

Not currently, but this is on the roadmap for 2020.

Yes. Any service that provides a Linux Virtual Machine will support Molecula.

Yes.

Mac and Linux. Windows is not directly supported other than running in a Linux Virtual Machine.

System

When Molecula ingests data, it splits the values and the relationships apart but, crucially, keeps both, so it can respond to queries while also being able to recreate the original data set from the information it stores. If the Translation Keys are kept out of the VDS, or if the keys are de-identified, there are essentially no values stored in the VDS.

Technically, however, you can reconstruct the data stored in the relationships using the keys, so it’s not zero-copy in the absolute sense of the word. Still, you aren’t making a direct copy of the data, nor are you transferring it over the network; you are representing it in the best possible format for analysis and compliance.

Molecula stores data in a format that translates the original data source into an abstraction and then compresses it. For encryption, we rely on the filesystem or disk-level encryption, though we are considering options in the homomorphic encryption space.

People often think of data virtualization as being a layer on top of existing data access technologies — federating queries down to source systems, caching results, and providing a unified schema and language for the access of data. Molecula is a new take on Data Virtualization. Molecula answers queries without federating them down to source systems. It stores the data in a fundamentally different way which is naturally compressed compared to the source representation and highly efficient for analytical workloads.

We consider our modern approach to data virtualization to more closely mimic the virtualization that we’ve seen in the industry, such as compute virtualization (VM), storage virtualization (SDS), and network virtualization (SDN). The final pillar is data virtualization, and we are determined to establish a software-defined data (SDD) standard.

The computational demand needed to abstract your source data into Molecula’s format isn’t negligible, but it has been carefully optimized. For example, loading a 300GB data set of CSV files (over 1B records) takes about 20 minutes using a single 32-core VM. Once the initial data load is complete, it takes a small fraction of those compute resources to process data and schema changes in the original data source and apply them to the VDS. A VDS usually lags the original data source by between 5 and 2,000 ms.
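The load figures quoted above can be turned into rough throughput numbers. The short Python calculation below is back-of-the-envelope arithmetic only; it assumes that GB means 10**9 bytes, that the record count is exactly one billion (the text says "over 1B", so record rates are a lower bound), and that the load takes exactly 20 minutes.

```python
# Figures from the text, with the assumptions noted in the lead-in.
DATASET_BYTES = 300 * 10**9     # 300 GB, assuming decimal gigabytes
RECORDS = 1_000_000_000         # "over 1B" records, taken as exactly 1e9
LOAD_SECONDS = 20 * 60          # about 20 minutes
CORES = 32                      # single 32-core VM

bytes_per_second = DATASET_BYTES / LOAD_SECONDS
records_per_second = RECORDS / LOAD_SECONDS
avg_record_bytes = DATASET_BYTES / RECORDS

print(f"Overall ingest rate:  {bytes_per_second / 1e6:.0f} MB/s")
print(f"Per-core ingest rate: {bytes_per_second / CORES / 1e6:.1f} MB/s")
print(f"Records per second:   {records_per_second:,.0f}")
print(f"Records/s per core:   {records_per_second / CORES:,.0f}")
print(f"Average record size:  {avg_record_bytes:.0f} bytes")
```

Under those assumptions the example works out to roughly 250 MB/s and about 833,000 records per second across the VM, with an average record of about 300 bytes.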

Usage

Analytical workloads have been evolving for quite some time and Molecula is building on two major shifts:

  1. Shifting from databases to data formats (e.g. columnar databases to Parquet and ORC) which can be handled flexibly and take advantage of serverless offerings more easily.
  2. Shifting from serialized data formats to in-memory formats (Parquet to Arrow). This is more nascent but will continue to have a massive impact on performance and flexibility—not needing to serialize and deserialize saves huge amounts of compute, decreases latency, and makes it far less painful to move data around a distributed system.

Molecula builds on these two shifts and takes the data format itself a step further for analytics and machine learning with the distributed bitmap/vector model (Pilosa). Instead of representing data record-by-record or column-by-column, we break it out even further to value-by-value. With Molecula’s approach, you get more compression, less I/O per query, and benefits around access control and versioning, along with a very GPU-friendly data format and the opportunity for easier homomorphic encryption given how simple the basic operations are. As a result, any type of query that operates on much or most of a dataset is almost guaranteed to be far faster given the right algorithms, because there is so much less data to move from disk to memory or from memory to CPU.

Anyone who has very large historical data sets or large volumes of streaming data stored across multiple silos and geographies, and who is struggling to analyze and ask questions of that data. Molecula is a fundamental advancement in low latency queries because of the way the data is stored and processed. Many systems have solved the problem of scale, but Molecula lowers the latency floor to the point that completely new use cases are now possible: real-time analysis and data at the speed of thought™.

In addition, Molecula users are primarily software engineers, data engineers, and machine learning engineers who are tasked with delivering data access to people or applications that need to query, segment, analyze, and make decisions on data in real-time. Often these engineers live in either IT or directly in business units. IT commonly acts as an administrator of Molecula to enforce data access standards across the enterprise, where they can easily apply compliance best practices and other regulatory requirements to their VDSs.

The beneficiaries of Molecula are typically data scientists, business analysts, or end user software and applications that need to process queries to make a particular business decision and do so with extremely low latency. Our customers include some of the largest, most advanced technology companies in the world, and we have accelerated their hardest query times from days and hours down to fractions of a second.

Molecula was initially designed to solve an ad hoc analytics use case that queried high cardinality data and allowed users to drill down into granular audience attributes in real-time. Today, we have also added support for highly performant queries on dense and mixed density datasets. The workloads where we add the greatest value are analytical ones where a user or machine wants to apply a number of filters or criteria to a query that returns a subset of the data to take a business action on. Molecula is not designed to be a transactional system or a system of record, or to fulfill single-record queries (e.g. show me Tom’s record); other database types are optimized to persist and return that kind of data effectively.

Molecula is primarily queried through SQL, even if you are using our API, a consumption plug-in, or one of our client libraries. Initially, we used a custom query language built around the core storage format, but we’ve expanded the query capabilities to the point where a significant subset of SQL is now supported, with more being added every day. Our SQL support currently encompasses a variety of WHERE clauses, GROUP BY, JOIN, ORDER BY, SUM, COUNT, and NOT queries. Molecula also has a Python CLI, with the ability to support Go and Java.

Molecula is best with large, fragmented, disparate data sets that have complex analytical or computational requirements, or where streaming data needs to be combined with historical data. Here are some common use cases:

  1. Customer 360 Segmentation
  2. Accelerating Analytics
  3. Machine Learning
  4. IoT and Remote Decisioning
  5. Anomaly Detection
  6. Cloud migration to run analytics in the cloud

There are four stages in the machine learning life cycle where data scientists are using Molecula today. 

  1. Real-time, iterative data exploration that reduces or, often, completely eliminates the long information request cycles between the data scientist and data engineer or IT.
  2. Data preparation without the category-to-integer phase, which Molecula eliminates because the core data format does this natively.
  3. While data scientists work on VDSs directly with Jupyter notebooks using our Python Client Library, they also still export VDSs into Pandas dataframes to leverage libraries like scikit-learn and imblearn. Using VDSs to create Pandas dataframes allows data scientists to use a much larger sample size.
  4. Finally, we have opened an interface to inject arbitrary code directly on the core data format of the VDS. We have had machine learning engineers use this to apply unsupervised algorithms like bicliques to surface clustered insights on VDSs. More forward-looking, we have customers who are experimenting with using VDSs to track data versioning for retraining production models and batch inferencing. 
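To illustrate the second stage in the list above, the Python sketch below shows why a value-by-value bitmap layout already contains the information that a category-to-integer step would otherwise produce. It is a toy model under stated assumptions, not Molecula's Python Client Library: `encode`, `integer_codes`, and `one_hot` are hypothetical helper names, and plain integers stand in for bitmaps.

```python
def encode(column):
    """Build a value -> integer code dictionary and one bitmap per code.

    Bit `row` of bitmaps[code] is set when that row holds the value with that code.
    """
    codes = {}
    for value in column:
        codes.setdefault(value, len(codes))
    bitmaps = [0] * len(codes)
    for row, value in enumerate(column):
        bitmaps[codes[value]] |= 1 << row
    return codes, bitmaps


def integer_codes(bitmaps, n_rows):
    """Label encoding read straight off the bitmaps (one integer per row)."""
    out = [None] * n_rows
    for code, bits in enumerate(bitmaps):
        for row in range(n_rows):
            if (bits >> row) & 1:
                out[row] = code
    return out


def one_hot(bitmaps, n_rows):
    """One-hot matrix: each bitmap is already one indicator column."""
    return [[(bits >> row) & 1 for bits in bitmaps] for row in range(n_rows)]


colors = ["red", "green", "red", "blue"]
codes, bitmaps = encode(colors)

labels = integer_codes(bitmaps, len(colors))
matrix = one_hot(bitmaps, len(colors))
print(codes, labels, matrix)
```

Both the integer labels and the one-hot matrix fall out of the stored structure, so no separate encoding pass over the raw categorical values is needed in this model.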

Molecula has an ecosystem of consumption plug-ins that allow end users to work directly in their existing systems without having to worry about the underlying system. Additionally, we have implemented the PostgreSQL wire protocol, so any BI software that can connect to Postgres can also connect directly to Molecula. Today, our customers use Molecula to power real-time visualization and BI tools like Tableau, Power BI, and Excel.

Complex WHERE clauses, counts, sorts, top-n, multi-field GROUP BY, JOIN, and any combination of these. It’s bad at processing transactions and doing anything that needs to access a single record rather than exposing data about sets of records or whole data sets.
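As a rough illustration of how a filtered GROUP BY and a top-n query might be answered on a value-by-value bitmap layout, here is a minimal Python sketch. It is not Molecula's query engine or API: the function names and sample data are hypothetical, and plain integers stand in for compressed bitmaps.

```python
def to_bitmaps(column):
    """Value-by-value layout: one integer bitmap per distinct value in the column."""
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps


def group_by_count(group_bits, filter_bits):
    """COUNT(*) per group value, restricted to rows set in filter_bits."""
    counts = {}
    for value, bits in group_bits.items():
        n = bin(bits & filter_bits).count("1")  # intersect, then count set bits
        if n:
            counts[value] = n
    return counts


def top_n(counts, n):
    """The n largest groups; ties are broken alphabetically for determinism."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Hypothetical columns, one entry per record.
city = ["NYC", "LA", "NYC", "SF", "LA", "NYC", "NYC"]
plan = ["pro", "pro", "free", "pro", "free", "pro", "pro"]

city_bits = to_bitmaps(city)
plan_bits = to_bitmaps(plan)

# Roughly: SELECT city, COUNT(*) ... WHERE plan = 'pro' GROUP BY city
counts = group_by_count(city_bits, plan_bits["pro"])
# ... ORDER BY COUNT(*) DESC LIMIT 2
best = top_n(counts, 2)
print(counts, best)
```

Every step here operates on whole sets of records at once, which is also why this layout is a poor fit for fetching or updating a single record.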