How FeatureBase Powers Blazing-Fast Queries

(Hint: It’s Not Caching)

 

If you’re reading this, you’re likely familiar with the evolution of the database. As a quick refresher: we started with row-oriented databases, which are well suited to OLTP (Online Transaction Processing) workloads and are optimized for retrieving the complete information for a record as quickly as possible. Here are a few examples of OLTP-type queries:

SELECT PURCHASEAMOUNT AS AMT FROM TABLE WHERE ID=123;

SELECT FIRST_NAME, LAST_NAME FROM TABLE WHERE TRXN='POS-123';

OLTP is great, but as businesses started asking new questions of their data, the database evolved, and column-oriented databases were created for OLAP (Online Analytical Processing) workloads. Here are a few examples of OLAP-type queries:

SELECT SUM(SALARY) FROM TABLE;

SELECT AVG(SALARY) FROM TABLE;

SELECT COUNT(*) FROM TABLE WHERE SALARY>5000 AND SALARY<10000;

But even column-oriented databases require the CPU to scan each relevant column of data in order to answer a question. At Molecula, our engineers looked at column-oriented databases and the I/O they incur, then analyzed how data could be restructured to minimize that I/O as far as possible. This exploration resulted in FeatureBase and the feature-oriented data format on which it is built.

So why is FeatureBase even faster than column-oriented databases when it comes to certain query types? Follow along for a few reasons and an exploration into a couple of specific query operations that FeatureBase excels at (plus why it excels at them).

Let’s Start with a COUNT…

Starting with a simple COUNT query, we can immediately begin to see why execution within FeatureBase is extremely fast, and why it is fast without relying on caching, preprocessing, or any other preemptive strategies.

It’s all about Compute Efficiency 

Three components within FeatureBase’s design make it very easy to return millisecond query results, even across billions (or trillions!) of records: 

  1. Storage Format 
  2. In-memory (Fast Fetching) 
  3. Automatic Decentralized Sharding

Storage Format 

Most data practitioners are familiar with the column-oriented format of storing data that we discussed earlier (used by Redshift, BigQuery, ClickHouse, etc.).

Let’s imagine that in the color-coded data example below (representing a column-oriented format), we want to count the number of blue rectangles in the dataset. In a column-oriented database, the values of a column are physically grouped together to speed up access. However, you still need to scan through each value within the grouped data. When stored, the column would look like this: “Blue”, “Yellow”, “Blue”, “Yellow”, “Blue”, “Green”, “Blue” (NB: in the figure, the actual strings are replaced by color codes for simplicity).

A COUNT of Blue requires loading the entire column and then counting only the “Blue” elements.

columnar type table

Now, let’s say we want to COUNT only the color blue within the same dataset in FeatureBase. Thanks to the efficiency of FeatureBase’s format, we do not need to go through every element in that column and encode each one before it moves to computation. Looking at the figure below, you can see that the data has been decoupled: the unique values (e.g. blue, green, yellow) are stored separately from the presence of those values in a record (e.g. yes/1, no/0), allowing for fast comparisons regardless of content size.

This format typically provides a 5-10x reduction in storage space, and while that is not the focus of this article, the compression helps FeatureBase perform these queries quickly; more on this in a bit (ha…get it?). In this example, we can simply count the bits set for the color blue, ignoring all other possible colors. Additionally, once the data reaches the CPU, most of the processing takes the form of bitwise operations on compressed bitmaps, which are highly vectorizable. Every CPU operation can therefore take advantage of wider instructions and process more data at once.

FeatureBase type table

In other words, the data is structured so that FeatureBase uses a minimal amount of I/O to process analytical queries: it can granularly address the individual elements of interest, and the data moves into compute readily, without waste.
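To make the contrast concrete, here is a minimal Python sketch of the two approaches, using the seven example values from above. It is an illustration under simplifying assumptions, not FeatureBase's implementation: plain Python integers stand in for compressed bitmaps, and the function names are made up for this example.

```python
# Illustrative only: Python ints stand in for compressed bitmaps.
column = ["Blue", "Yellow", "Blue", "Yellow", "Blue", "Green", "Blue"]

def count_by_scan(values, target):
    """Column-oriented style: touch every value in the column."""
    return sum(1 for v in values if v == target)

def build_bitmaps(values):
    """Feature-oriented style: one bitmap per distinct value.
    Bit i is set when record i holds that value."""
    bitmaps = {}
    for i, v in enumerate(values):
        bitmaps[v] = bitmaps.get(v, 0) | (1 << i)
    return bitmaps

def count_by_bitmap(bitmaps, target):
    """Count the set bits of one bitmap; the other values are never read."""
    return bin(bitmaps.get(target, 0)).count("1")

bitmaps = build_bitmaps(column)
print(count_by_scan(column, "Blue"))     # 4
print(count_by_bitmap(bitmaps, "Blue"))  # 4
```

Both functions return the same answer, but the second one only ever looks at the bitmap for “Blue”. The work of separating values from their presence is done once, when the data is loaded.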

In-Memory = Fast 

In-memory representations of data have a distinct advantage: at query time, data can be fetched from RAM into the CPU extremely quickly, shaving processing time and lowering latency. This is not a new concept, but it is one more element in removing bottlenecks between query submission and response. The alternative, reading from disk, can slow response times considerably. Keeping data in memory does limit how much “hot” data fits in RAM, but remember from the previous section that FeatureBase generally compresses data on the order of 5-10x during ingestion. Additionally, FeatureBase uses memory-mapped files, allowing writes to be persisted to disk and facilitating extremely large datasets (trillions of records). Compared to columnar data stores, these two factors allow FeatureBase to operate on much larger in-memory tables with a reduced hardware footprint, all while keeping latency low.
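For readers who have not worked with memory-mapped files, the general idea can be sketched with Python's standard `mmap` module. This is only a sketch of the technique: the three-byte uncompressed bitmap and the temporary file are assumptions for illustration and say nothing about FeatureBase's actual file layout.

```python
import mmap
import os
import tempfile

# Illustrative only: a raw, uncompressed bitmap with 5 bits set.
bitmap = bytes([0b00110010, 0b00000000, 0b10000001])

fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(bitmap)

    with open(path, "r+b") as f:
        # Map the file into memory: reads and writes look like memory access,
        # and the operating system moves pages to and from disk.
        with mmap.mmap(f.fileno(), 0) as mm:
            mm[1:2] = bytes([0b00000100])  # set one more bit through the mapping
            mm.flush()                     # ask the OS to persist the change
            mapped_count = sum(bin(b).count("1") for b in mm[:])

    with open(path, "rb") as f:
        on_disk = f.read()                 # the write is visible in the file
finally:
    os.remove(path)

print(mapped_count)  # 6
print(on_disk[1])    # 4
```

The program treats the data as if it were in memory, while the file on disk remains the durable copy.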

Query Execution 

As you have seen with other attributes of FeatureBase, the underlying factor behind its performance is efficiency. When data is ingested into FeatureBase, it is automatically sharded, or split into pieces, and distributed among the nodes in a cluster. This has many benefits, a primary one being easy parallelization that capitalizes on the available CPUs (vCPUs).
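As a rough sketch of what "split into pieces" can mean, the snippet below groups record IDs into fixed-width shards. The shard width of 2**20 and the helper names `shard_of` and `split_into_shards` are assumptions chosen for illustration. How shards are then placed on nodes is not covered here.

```python
from collections import defaultdict

SHARD_WIDTH = 2**20  # assumed width: 1,048,576 record IDs per shard

def shard_of(record_id):
    """Return the index of the shard that holds this record ID."""
    return record_id // SHARD_WIDTH

def split_into_shards(record_ids):
    """Group record IDs by shard so each group can be processed independently."""
    shards = defaultdict(list)
    for rid in record_ids:
        # Store the position of the record inside its shard.
        shards[shard_of(rid)].append(rid % SHARD_WIDTH)
    return dict(shards)

ids = [5, 1_048_575, 1_048_576, 3_000_000]
print(split_into_shards(ids))
# {0: [5, 1048575], 1: [0], 2: [902848]}
```

Because a record's shard depends only on its ID, no coordination is needed to decide where a record belongs, and each shard can be handed to a different worker.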

When a query is submitted, rather than processing all the data in a contiguous block, FeatureBase uses a Map/Reduce approach to split up the workload and ensure CPU saturation. An intelligent worker/queue system allows this process to run smoothly and has a dramatic effect on query latency, especially within a cluster where many CPUs are available. This sharding or partitioning strategy can be found in other solutions; however, FeatureBase does this automatically and ensures balance across all nodes to maximize performance.

Wait…But What About COUNT?

Don’t worry, we haven’t forgotten the original topic! Let’s put this all together and revisit the COUNT. For this example, we’ll assume the data is ingested from a common format, let’s say JSON:

[ { "ID": [1, 4, 5], "Color": "Blue" } ]

This small example is placed in FeatureBase’s feature-oriented format: 

*Note: During normal operation, each shard is just over 1 million bits wide, but for this example we’re using just 7!

FeatureBase Shard

Now, a query is submitted to count all the “Blue” records. Because the data is already in FeatureBase’s feature-oriented format (built on top of bitmaps) and readily available in memory, it’s now a matter of maximizing CPU usage. The bitmap has already been split across 3 shards as shown, so parallelizing is taken care of by sending each shard to its own thread. Also, since one of the shards, Shard 3, has no bits set, it’s excluded entirely, which is yet more FeatureBase efficiency. Once each CPU’s results are ready, they are “reduced” into the final result: a count of 3.
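The walkthrough above can be sketched in a few lines of Python. Treat this as a toy model under stated assumptions: the shard width of 3, the 1-based record IDs, and the total of 7 records are chosen so that the three IDs spread across three shards with the last one empty, and Python integers stand in for bitmaps. The thread pool shows the shape of the map and reduce steps rather than real CPU-level parallelism.

```python
import json
from concurrent.futures import ThreadPoolExecutor

SHARD_WIDTH = 3  # assumption for illustration; real shards are far wider
NUM_RECORDS = 7  # assumption: records 1..7, which gives three shards

doc = json.loads('[{"ID": [1, 4, 5], "Color": "Blue"}]')

def build_shards(ids):
    """Return one bitmap (as an int) per shard; bit p marks position p in the shard."""
    num_shards = -(-NUM_RECORDS // SHARD_WIDTH)  # ceiling division
    shards = [0] * num_shards
    for rid in ids:
        shard, pos = divmod(rid - 1, SHARD_WIDTH)  # IDs are treated as 1-based
        shards[shard] |= 1 << pos
    return shards

def popcount(bitmap):
    """Map step: count the set bits in one shard."""
    return bin(bitmap).count("1")

blue = build_shards(doc[0]["ID"])
non_empty = [s for s in blue if s]  # shards with no bits set are skipped

with ThreadPoolExecutor() as pool:
    partial_counts = list(pool.map(popcount, non_empty))

total = sum(partial_counts)  # reduce step
print(blue, partial_counts, total)  # [1, 3, 0] [1, 2] 3
```

The third shard never reaches a worker, and the two partial counts are summed into the final answer of 3.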

Scaling this up to real-world numbers, this operation can take place across billions of records with a response time under 15 milliseconds, without breaking the bank on expensive compute. The data can also include very sparse relationships among the records. That shape is typically difficult to work with in columnar and relational databases, but it does not have a detrimental impact on FeatureBase’s latency.

Applying the above concepts, plus some other fancy efficiency work we’ll dive into later, FeatureBase is able to perform extremely fast summations (SUM) and other common arithmetic and range operations, such as MAX and MIN calls. All of these return with the same millisecond response times, as the format and compute efficiency make quick work of these particular query patterns. The largest benefits are seen in datasets with sparse associations and billions of records. This cumulative efficiency and design lead to much smaller footprint requirements than other databases.

Want to see it for yourself? Give FeatureBase a go (no credit card required)!
