Is Molecula Too Good to Be True?

Analytical workloads have been evolving for quite some time, and Molecula is building on two major shifts:

  1. Shifting from databases to data formats (e.g. from columnar databases to Parquet and ORC), which can be handled more flexibly and can take advantage of serverless offerings more easily.
  2. Shifting from serialized data formats to in-memory formats (e.g. from Parquet to Arrow). This shift is more nascent but will continue to have a massive impact on performance and flexibility: not needing to serialize and deserialize saves huge amounts of compute, decreases latency, and makes it far less painful to move data around a distributed system.

Molecula builds on both of these shifts but takes the data format itself to the next level for analytics and machine learning: the distributed bitmap/vector model (Pilosa). Instead of representing data record by record or column by column, we break it out even further, value by value. With Molecula’s approach, you get more compression and less I/O per query, along with benefits around access control and versioning, a very GPU-friendly data format, and the opportunity for easier homomorphic encryption, given how simple the basic operations are. As a result, any query that operates on much or most of a dataset is almost guaranteed to be far faster, given the right algorithms, because there is so much less data to move from disk to memory or from memory to CPU.
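
To make "value by value" concrete, here is a minimal sketch of a bitmap index in plain Python. It is an illustration only, not Pilosa's or Molecula's actual implementation or API: the table, the field names, and the helper functions `build_bitmap_index` and `record_ids` are all hypothetical, and Python integers stand in for the compressed, distributed bitmaps a real system would use. Each distinct value gets one bitmap, where bit i is set if record i holds that value, so filters reduce to bitwise operations.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when record i holds it."""
    index = defaultdict(int)
    for record_id, value in enumerate(values):
        index[value] |= 1 << record_id
    return dict(index)


def record_ids(bitmap):
    """Expand a bitmap back into the sorted list of record ids it contains."""
    ids = []
    position = 0
    while bitmap:
        if bitmap & 1:
            ids.append(position)
        bitmap >>= 1
        position += 1
    return ids


# Hypothetical toy table: six records, two fields.
color = ["red", "blue", "red", "green", "blue", "red"]
size = ["S", "M", "M", "S", "M", "L"]

color_index = build_bitmap_index(color)
size_index = build_bitmap_index(size)

# WHERE color = 'red' AND size = 'M' becomes a single bitwise AND.
red_and_medium = color_index["red"] & size_index["M"]

# WHERE color = 'blue' OR size = 'S' becomes a single bitwise OR.
blue_or_small = color_index["blue"] | size_index["S"]

# COUNT(*) WHERE color = 'red' becomes a population count.
red_count = bin(color_index["red"]).count("1")

print(record_ids(red_and_medium))
print(record_ids(blue_or_small))
print(red_count)
```

The query only touches the bitmaps for the values it names ("red", "M", and so on) and never reads the rest of the records, which is where the reduced I/O comes from. The operations are nothing more than AND, OR, and bit counting, which is the simplicity that makes the format a natural fit for GPUs.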