Is Molecula Too Good to Be True?
Analytical workloads have been evolving for quite some time, and Molecula builds on two major shifts:
- Shifting from databases to data formats (e.g., columnar databases to Parquet and ORC), which can be handled more flexibly and take advantage of serverless offerings more easily.
- Shifting from serialized data formats to in-memory formats (Parquet to Arrow). This shift is more nascent but will continue to have a massive impact on performance and flexibility: not needing to serialize and deserialize saves large amounts of compute, decreases latency, and makes it far less painful to move data around a distributed system.
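To make the second shift concrete, here is a minimal sketch of the difference between a serialized path and an in-memory path. It uses only Python standard-library stand-ins (JSON for a serialized format, a typed `array` plus `memoryview` for an in-memory buffer), not Parquet or Arrow themselves, and the `values` column is hypothetical.

```python
import json
from array import array

# Hypothetical column of 64-bit integers (illustrative data only).
values = list(range(1_000))

# Serialized path: the consumer has to parse the bytes back into objects
# before it can do any work on them.
wire = json.dumps(values).encode("utf-8")
decoded = json.loads(wire)
total_serialized = sum(decoded)

# In-memory path: a contiguous typed buffer that a consumer can read
# through a view, with no parsing step and no copy of the data.
column = array("q", values)
view = memoryview(column)  # shares the buffer owned by `column`
total_in_memory = sum(view)
```

Both paths compute the same total, but only the first pays for an encode and a decode. The in-memory formats apply the same idea across processes and languages.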
Molecula builds on both shifts and takes the data format itself a step further for analytics and machine learning with a distributed bitmap/vector model (Pilosa). Instead of representing data record-by-record or column-by-column, we break it out even further, value-by-value. With Molecula’s approach, you get more compression and less I/O per query, along with benefits around access control and versioning, a very GPU-friendly data format, and an opening for easier homomorphic encryption, given how simple the basic operations are. As a result, any query that operates on much or most of a dataset is almost guaranteed to be far faster, given the right algorithms, because there is so much less data to move from disk to memory or from memory to CPU.
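As a rough illustration of what "value-by-value" means, the sketch below builds one bitmap per distinct value in a column and answers filters with bitwise operations. It is a toy under stated assumptions, not Molecula's or Pilosa's implementation: plain Python integers stand in for bitmaps, the helpers `build_bitmaps` and `rows` are hypothetical, and the sample table is made up. Real systems would use compressed, distributed bitmaps.

```python
from collections import defaultdict


def build_bitmaps(column):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    bitmaps = defaultdict(int)
    for row, value in enumerate(column):
        bitmaps[value] |= 1 << row
    return dict(bitmaps)


def rows(bitmap):
    """Return the row ids whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical toy table: two columns, six rows.
color = ["red", "blue", "red", "green", "blue", "red"]
size = ["S", "M", "M", "S", "M", "L"]

color_bits = build_bitmaps(color)
size_bits = build_bitmaps(size)

# WHERE color = 'red' AND size = 'M' becomes a single bitwise AND.
red_and_m = color_bits["red"] & size_bits["M"]

# WHERE color = 'blue' OR size = 'S' becomes a single bitwise OR.
blue_or_s = color_bits["blue"] | size_bits["S"]

# COUNT(*) WHERE color = 'red' becomes a population count.
red_count = bin(color_bits["red"]).count("1")
```

Each query touches only the bitmaps for the values it names, which is the intuition behind the reduced I/O described above: the rest of the table is never read.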