Is FeatureBase Too Good to be True?
Analytical workloads have been evolving for quite some time, and FeatureBase builds on two major shifts:
- Shifting from databases to open data formats (e.g., from columnar databases to Parquet and ORC), which can be handled more flexibly and take better advantage of serverless offerings.
- Shifting from serialized data formats to in-memory formats (Parquet to Arrow). This shift is more nascent but will continue to have a massive impact on performance and flexibility: skipping serialization and deserialization saves huge amounts of compute, decreases latency, and makes it far less painful to move data around a distributed system (see the sketch after this list).
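To make the second shift concrete, here is a minimal sketch using pyarrow (this is illustrative, not FeatureBase code): a table is written to Arrow's IPC stream format and read back as views over the same buffer, so there is no per-value decode step on the read path.

```python
import pyarrow as pa

# Build a small in-memory table.
table = pa.table({"user_id": [1, 2, 3], "score": [0.5, 0.9, 0.1]})

# Write it to the Arrow IPC stream format in an in-memory buffer.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# Reading back is effectively zero-copy: the restored columns are
# views over the buffer, with no per-value deserialization step.
reader = pa.ipc.open_stream(buf)
restored = reader.read_all()
assert restored.equals(table)
```

A Parquet round-trip, by contrast, has to decode every value on read; with Arrow the wire format and the in-memory format are the same, which is the whole point.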
FeatureBase builds on both of these shifts, but takes the data format itself to the next level for analytics and machine learning: the distributed bitmap/vector model. Instead of representing data record-by-record or column-by-column, we break it out even further, value-by-value. With FeatureBase’s approach, you get more compression and less I/O per query. As a result, any query that operates on much or most of a dataset is almost guaranteed to be far faster, given the right algorithms, because there is so much less data to move from disk to memory or from memory to the CPU.
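To illustrate the value-by-value idea, here is a toy sketch (not FeatureBase’s actual storage engine, which uses compressed bitmaps in practice): each distinct value in a column gets its own bitmap of row positions, so an equality filter across columns becomes a single bitwise AND.

```python
# A toy bitmap index: one bitmap per distinct value, with row positions
# as set bits. Plain Python ints serve as arbitrary-length bitmaps here.
from collections import defaultdict

def build_index(column):
    """Map each distinct value to a bitmap of the rows containing it."""
    index = defaultdict(int)
    for row, value in enumerate(column):
        index[value] |= 1 << row
    return index

# Two columns over the same five rows.
city = ["nyc", "sf", "nyc", "la", "sf"]
plan = ["pro", "pro", "free", "pro", "free"]

city_idx = build_index(city)
plan_idx = build_index(plan)

# "city = nyc AND plan = pro" is a single bitwise AND of two bitmaps;
# the query never touches rows that can't match.
matches = city_idx["nyc"] & plan_idx["pro"]
rows = [r for r in range(len(city)) if matches >> r & 1]
print(rows)  # [0]
```

Counting matches is just a popcount (`matches.bit_count()` on Python 3.10+), which is why aggregations that touch most of a dataset stay fast: the work is over compact bitmaps rather than over the raw records.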