Is FeatureBase Too Good to be True?

Analytical workloads have been evolving for quite some time, and FeatureBase builds on two major shifts:

  1. The shift from databases to open data formats (e.g. from columnar databases to Parquet and ORC), which can be handled more flexibly and take better advantage of serverless offerings.
  2. The shift from serialized data formats to in-memory formats (Parquet to Arrow). This shift is more nascent but will continue to have a massive impact on performance and flexibility—skipping serialization and deserialization saves huge amounts of compute, decreases latency, and makes it far less painful to move data around a distributed system.

FeatureBase builds on both of these shifts, but takes the data format itself to the next level for analytics and machine learning: the distributed bitmap/vector model. Instead of representing data record-by-record or column-by-column, we break it out even further, value-by-value. With FeatureBase’s approach, you get more compression and less I/O per query. As a result, any query that operates on much or most of a dataset is almost guaranteed to be far faster, given the right algorithms, because there is so much less data to move from disk to memory, or from memory to CPU.
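The value-by-value idea can be sketched with a toy equality index: each distinct value in a column owns a bitmap of the record IDs that hold it, and a conjunctive filter becomes a single bitwise intersection. This is only an illustrative sketch using plain Python ints as bitsets—not FeatureBase's actual storage layout, which uses compressed, distributed bitmaps (Roaring-style containers) rather than uncompressed ones.

```python
# Toy value-by-value index: one bitmap (a Python int used as a bitset)
# per distinct column value. Illustrative only, not FeatureBase's layout.
from collections import defaultdict

def build_bitmaps(column):
    """Map each distinct value to a bitmap of the row IDs that hold it."""
    bitmaps = defaultdict(int)
    for row_id, value in enumerate(column):
        bitmaps[value] |= 1 << row_id   # set bit row_id in that value's bitmap
    return bitmaps

colors = ["red", "blue", "red", "green", "red", "blue"]
sizes  = ["L",   "L",    "S",   "L",     "L",   "S"]

color_bm = build_bitmaps(colors)
size_bm  = build_bitmaps(sizes)

# "WHERE color = 'red' AND size = 'L'" becomes one bitwise AND: the
# engine touches only the two relevant bitmaps, never the other rows.
match = color_bm["red"] & size_bm["L"]
rows = [i for i in range(len(colors)) if (match >> i) & 1]
print(rows)  # → [0, 4]
```

Because each bitmap covers the whole dataset but holds only one bit per record, a full-table predicate reads a fraction of the bytes a row- or column-oriented scan would, which is where the compression and I/O savings in the paragraph above come from.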