Rockset vs. FeatureBase

Rockset for Real-Time Analytics

Rockset is a cloud-native, serverless analytics solution that leverages RocksDB, an embeddable persistent key-value store, to serve real-time analytics to online applications and dashboards.

Rockset uses a distributed architecture to spread query execution across multiple servers, and it can ingest, query, and update streaming data. It builds a row, columnar, and search index on every field, a combination it calls a Converged Index™, to enable analytical filtering and aggregation queries. Rockset decouples the resources needed for ingestion, storage, and querying using an Aggregator Leaf Tailer (ALT) architecture, so each can scale independently based on workload.

Potential Pitfalls of Rockset

  • Requires costly memory and compute overhead to store and query three or more indices, which can double the data footprint.
  • Query latency can surpass 1 second when throughput requirements are over 12,000 events/second.
  • Unpredictable query performance may be unacceptable for some industries or use cases.
  • The leaf-based architecture is designed to make multiple copies of data, and adding new indices may lead to a higher data volume than expected.

Another option is available to help you achieve your goals without making costly trade-offs.

Molecula FeatureBase

FeatureBase is a feature-oriented database platform that makes an organization’s freshest data immediately accessible, actionable, and reusable. FeatureBase powers real-time analytics and machine learning applications by executing low-latency, high-throughput, and highly concurrent workloads simultaneously. 

FeatureBase ingests data continuously to execute on computationally intensive analytical workloads in real-time for the front lines of your business. It allows you to ingest millions of events per second with ACID transactions while simultaneously analyzing, transforming, and aggregating billions of rows of data and maintaining efficiency.

Rockset vs. FeatureBase:

Rockset and FeatureBase have key technical differences in data ingestion, query capabilities, data format, and data modeling. Let’s look at each.

Real-Time Data Ingestion

Rockset supports schemaless ingestion of up to one billion events per day from multiple streaming, cloud storage, and database sources. Built-in connectors allow for an initial load and continuous syncing of changes. Rockset is mutable, allowing for inserts, updates, and deletes of existing records. Field-level updates use a Patch API to reindex only the specific fields in the request. Additional CPU resources must be provisioned to support higher ingest rates while meeting latency requirements, particularly when those requirements include sub-second latency.
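To make the idea of a field-level update concrete, here is a minimal sketch of a document store that touches index entries only for the fields named in a patch. It is an illustration of the general technique, not Rockset's Patch API: the `insert` and `patch` functions, the in-memory dictionaries, and the sample fields are all hypothetical.

```python
from collections import defaultdict

documents = {}                      # doc id -> document (row store)
inverted_index = defaultdict(set)   # (field, value) -> ids of matching docs
reindexed_fields = []               # log of index work, kept for illustration


def insert(doc_id, doc):
    """Store a new document and index every one of its fields."""
    documents[doc_id] = dict(doc)
    for field, value in doc.items():
        inverted_index[(field, value)].add(doc_id)
        reindexed_fields.append(field)


def patch(doc_id, changes):
    """Apply a field-level update, reindexing only the changed fields."""
    doc = documents[doc_id]
    for field, new_value in changes.items():
        if field in doc:
            # Remove the stale index entry for the old value.
            inverted_index[(field, doc[field])].discard(doc_id)
        doc[field] = new_value
        inverted_index[(field, new_value)].add(doc_id)
        reindexed_fields.append(field)


insert(1, {"user": "ada", "status": "trial", "country": "US"})
reindexed_fields.clear()            # ignore the work done by the initial insert
patch(1, {"status": "paid"})
```

After the patch, `reindexed_fields` contains only `"status"`; the `user` and `country` entries were never touched.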

FeatureBase handles the ingestion of massive-scale streaming data, over one million records per second, while simultaneously allowing real-time inserts and updates to existing data schemas. While FeatureBase is not explicitly optimized for writes, it can scale out horizontally and employs several optimizations (like write-ahead logs) to support the required throughput. Additionally, FeatureBase can do much of its preprocessing on the client side, so users can not only scale out the database servers themselves but also offload much of the computation to ingest servers. These ingest servers can be ephemeral and exist only while there’s load.

Query Capabilities

Rockset is a document database intended for analytical workloads. It relies on multiple indices to speed up queries, but it will inherently reach CPU limits when attempting to scale. At some point, you start to pay for more than you get in return. Rockset automatically stores three indices (the Converged Index™), with the option to store others for time series and geospatial analytics. Query latency may be volatile due to the background compaction that Rockset uses to reduce its data footprint. Storing and searching through multiple indices also requires a complex architecture.

FeatureBase also excels at analytical workloads, but it is built around a feature-oriented format that allows for step-function improvements in performance reliability while also reducing hardware footprint by up to 90x. As a result, it is extremely good at supporting live updates while maintaining low-latency queries. FeatureBase can collapse multiple tables into single entities and allow multiple values within a single field. This eliminates the need for data preaggregation, allowing organizations to operate on their freshest data while maintaining ultra-low latency.

FeatureBase’s novel approach to data minimizes I/O on queries by allowing the database engine to read and write exactly the data it needs and intelligently compress that data in memory. 

Data Format

Rockset automatically creates three indices for each field in a stored document: an inverted index for point lookups, a columnar index for aggregations, and a row index for data retrieval. Users have the option to create a range index or specialized geo-indexes for time series and geospatial analytics. Each index is built to reduce query latency.
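As a rough illustration of what indexing every field three ways means, the sketch below builds a toy row index, columnar index, and inverted index over a handful of in-memory documents. This is not Rockset's implementation (which stores its indices in RocksDB); the documents, field names, and dictionary layouts are assumptions made up for the example.

```python
from collections import defaultdict

# Toy documents; the field names and values are invented for illustration.
docs = {
    1: {"name": "Ada", "shirt": "green", "age": 36},
    2: {"name": "Bo", "shirt": "blue", "age": 41},
    3: {"name": "Cy", "shirt": "green", "age": 29},
}

# Row index: doc id -> full document, for retrieving whole records.
row_index = dict(docs)

# Columnar index: field -> {doc id: value}, for aggregations over one field.
columnar_index = defaultdict(dict)

# Inverted index: (field, value) -> set of doc ids, for point lookups.
inverted_index = defaultdict(set)

for doc_id, doc in docs.items():
    for field, value in doc.items():
        columnar_index[field][doc_id] = value
        inverted_index[(field, value)].add(doc_id)

# Point lookup: which documents have shirt == "green"?
green_ids = sorted(inverted_index[("shirt", "green")])

# Aggregation: average age, reading only the "age" column.
ages = columnar_index["age"].values()
average_age = sum(ages) / len(ages)

# Retrieval: fetch the full record for the first match.
first_green = row_index[green_ids[0]]
```

Each query type reads from the structure that suits it, but every field value is now held in three places, which is the source of the storage overhead discussed in this section.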

Storing all of these indices means that your data storage size is larger than the initial data size that is loaded. After the three initial indices are created and compressed in RocksDB, the stored data is typically double (or more!) the size of the source data. When data is stored in RocksDB, Rockset uses multiple compression technologies to combat storage amplification that can occur from having three or more copies of each field. These compression techniques can make write and query performance unpredictable.

FeatureBase is built entirely on our proprietary feature-oriented format. The beauty of the feature-oriented format is that it takes all of the benefits that column-oriented formats provide over row-oriented formats and optimizes them even further, resulting in up to a 100x price-performance improvement over column-oriented databases. For example, while column-oriented databases allow only the relevant columns to be scanned to answer a query (instead of scanning every single row), FeatureBase takes that even further, requiring only the specific value to be scanned, without the need to store multiple indices for cross-reference.

FeatureBase uses homomorphic compression, which allows us to read from and write to compressed data without having to decompress it. Further, our approach automatically optimizes compression by detecting how sparse or dense the data is in different parts of the dataset.
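The general idea of computing on compressed data can be sketched with a simple run-length encoding, used here as a stand-in because the details of FeatureBase's compression are not described in this article. In the sketch, a bitmap is stored as (start, length) runs of set bits, and both counting and intersection work on the runs directly. The function names and sample data are hypothetical.

```python
def to_runs(positions):
    """Compress a sorted list of set-bit positions into (start, length) runs."""
    runs = []
    for pos in positions:
        if runs and runs[-1][0] + runs[-1][1] == pos:
            start, length = runs[-1]
            runs[-1] = (start, length + 1)   # extend the current run
        else:
            runs.append((pos, 1))            # start a new run
    return runs


def count(runs):
    """Count set bits by summing run lengths; no bits are materialized."""
    return sum(length for _, length in runs)


def intersect(a, b):
    """Intersect two run lists directly, never expanding them to bits."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        a_start, a_end = a[i][0], a[i][0] + a[i][1]
        b_start, b_end = b[j][0], b[j][0] + b[j][1]
        lo, hi = max(a_start, b_start), min(a_end, b_end)
        if lo < hi:
            out.append((lo, hi - lo))
        # Advance whichever run finishes first.
        if a_end <= b_end:
            i += 1
        else:
            j += 1
    return out


green_shirt = to_runs([0, 1, 2, 3, 10, 11])
over_thirty = to_runs([2, 3, 4, 11, 12])
both = intersect(green_shirt, over_thirty)
```

Here `count(both)` answers "how many people match both features" while the data stays in its compressed form throughout. A run-length layout suits dense stretches of data; sparse stretches would favor a different layout, which is the kind of choice the paragraph above describes being made automatically.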

If this is a bit confusing (it’s a new concept, after all!), it can be easiest to understand our feature-oriented format with an example. Let’s say we’re trying to count the number of people wearing green shirts. 

  • In a row-oriented database, the database would have to scan the entire “People” table row-by-row to find everywhere “Shirt Color” occurs and count each occurrence of “Green.” 
  • In a column-oriented database, it would have to scan the entire “Shirt Color” column to count all of the instances of “Green,” but it would be able to ignore everything else in the table. 
  • In FeatureBase, the database can go directly to the feature “Green Shirt Color” to count the bits set in a compressed bitmap for that feature. This means the database only has to deal with the data that tells it whether a person is wearing a green shirt or not. It can ignore everything else (including other shirt color options!).
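The three approaches in the list above can be sketched side by side. This is a simplified illustration, not FeatureBase code: the sample people are invented, and a plain Python integer stands in for the compressed bitmap a real engine would use.

```python
people = [
    {"name": "Ada", "shirt_color": "green", "city": "Austin"},
    {"name": "Bo", "shirt_color": "blue", "city": "Boston"},
    {"name": "Cy", "shirt_color": "green", "city": "Chicago"},
    {"name": "Di", "shirt_color": "red", "city": "Denver"},
]

# Row-oriented: visit every row, loading all of its fields along the way.
row_count = sum(1 for person in people if person["shirt_color"] == "green")

# Column-oriented: visit only the shirt_color column, but every value in it.
shirt_color_column = [person["shirt_color"] for person in people]
column_count = sum(1 for color in shirt_color_column if color == "green")

# Feature-oriented: one bitmap per (field, value) pair, where bit i is set
# when person i has that value. Built once, ahead of query time.
bitmaps = {}
for i, person in enumerate(people):
    key = ("shirt_color", person["shirt_color"])
    bitmaps[key] = bitmaps.get(key, 0) | (1 << i)

# The query reads a single bitmap and counts its set bits.
green_bitmap = bitmaps[("shirt_color", "green")]
bitmap_count = bin(green_bitmap).count("1")
```

All three counts agree, but the last query never looks at the blue or red values, or at any other field.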

Data Modeling

Rockset uses a schemaless, object-based data model. This type of data model is flexible and requires no upfront data mapping. Following schemaless ingest, Rockset uses Leaf-Tailer aggregation techniques to index data. To combine historical and streaming data, multiple data collections can be preaggregated. 

FeatureBase models data in a novel way. Tables are typically modeled around entities (customers, patients, unique IDs, etc.) or events (transactions, etc.). In addition, tables can have multiple sources (batch and streaming) and update records or add new fields in real-time.

Mapping relational tables to FeatureBase can be as simple as a one-to-one mapping, but significant performance improvements can be made through the mapping and feature table structure depending on:

  • the expected query workload
  • the type, size, and cardinality of the data
  • the cost requirement

FeatureBase’s data model, combined with the performance benefits of the feature-oriented format, eliminates the need for data preaggregation, so organizations can operate on their freshest data while maintaining ultra-low latency.

When to Choose FeatureBase Over Rockset

Molecula’s FeatureBase is a feature-oriented database platform purpose-built for real-time analytics and machine learning. FeatureBase continuously extracts and updates features from streaming technologies like Kafka and from other data sources. It can maintain sub-second latency on complex queries at scale, with dramatically reduced computational costs and without the need for multiple indices or copies of data. This is a crucial differentiator when deciding whether your organization would be better suited to Rockset or FeatureBase.

If your organization is looking for consistent, sub-second query performance without sacrificing concurrency or throughput, you may struggle with Rockset. Additionally, if your organization is facing challenges with an ever-expanding data footprint and difficulty managing data copies, Rockset will compound your problem.

 
