Leveraging Bitmaps to Create a Feature-Oriented Data Format
Molecula FeatureBase: An Overview
FeatureBase is a feature-oriented database platform that makes an organization’s freshest data immediately accessible, actionable, and reusable. FeatureBase powers real-time analytics and machine learning applications by executing low-latency, high-throughput, and highly concurrent workloads simultaneously.
FeatureBase ingests data continuously and executes computationally intensive analytical workloads in real time for the front lines of your business. It allows you to ingest millions of events per second with ACID transactions while simultaneously analyzing, transforming, and aggregating billions of rows of data efficiently.
FeatureBase stores data in a highly efficient way that’s quite different from how most databases store data. When you think of a traditional database table, you probably imagine many rows representing records and many columns representing fields. Let’s take a table of information about animals as an example. Figure 1 shows how this data might look in a traditional database table.
FeatureBase Difference #1: Data is Transposed
Now let’s look at that same data shown in the FeatureBase feature-oriented format (Figure 2). Here, each column represents an animal, and each row holds a value that describes an attribute of that animal. One column in a traditional database table may be stored as multiple rows in FeatureBase because each FeatureBase row represents one possible value from that column.
FeatureBase shards records using a 64-bit key for each record, which lets it both scale and distribute records evenly, so there are no concerns about how tables will scale horizontally. FeatureBase stores each record and its associated feature values together on disk. With 64-bit keys, this layout allows FeatureBase to address over 18 quintillion records (2^64, or roughly 1.8 x 10^19).
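To make the arithmetic concrete, here is a minimal Python sketch of how a 64-bit record key could map to a shard. The shard width of 2^20 records and the contiguous-range assignment are assumptions chosen for illustration, and the names SHARD_WIDTH and locate are hypothetical; this is not a description of FeatureBase's actual sharding code.

```python
# Illustrative sketch only: SHARD_WIDTH and locate() are hypothetical names,
# and the shard width of 2**20 records is an assumption, not a documented value.
SHARD_WIDTH = 2 ** 20

# A 64-bit key space can address 2**64 distinct records.
MAX_RECORDS = 2 ** 64


def locate(record_id: int) -> tuple:
    """Return (shard number, position within that shard) for a 64-bit record ID."""
    if not 0 <= record_id < MAX_RECORDS:
        raise ValueError("record_id must fit in an unsigned 64-bit integer")
    return divmod(record_id, SHARD_WIDTH)


if __name__ == "__main__":
    print(locate(SHARD_WIDTH + 5))  # second shard, sixth position
    print(MAX_RECORDS)              # a little over 18 quintillion
```

Because every shard covers the same number of record IDs in this sketch, adding records simply adds shards, which can then be spread across nodes.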
FeatureBase Difference #2: Our Feature-Oriented Format Stores Only Binary Values
The second big difference is that FeatureBase rows store only binary values using our feature-oriented format, and these binary values describe whether a relationship between two entities does or does not exist. For example, in our table about animals, we might represent the relationship between an animal and having wings. A value of “1” would be set if an animal has wings, and no value would be set if it does not. Our format represents this relationship between each attribute (has wings) and each record (animal). The “Winged” row in Figure 2 above represents this as a feature.
Let’s consider a portion of the table from Figure 1 in a traditional database: we have a column, or field, called “Primary_Movement” that records whether the animal primarily moves by flying, swimming, or walking. Each row of data represents an animal, and each row will have one of the three string values populated for “Primary_Movement” (Figure 3). We begin to see how inefficient this can be, as one of those three strings is stored over and over again for every animal.
In FeatureBase, the “Primary_Movement” field would be composed of three bitmaps, one for each of the three values (Figure 4). Each record would only have to store a binary indicator of which relationship the animal has instead of the entire string. FeatureBase only has to set a bit of information if the relationship is true (“1”) and sets no bit if the relationship is false (“0”).
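The idea can be sketched in a few lines of Python. The example below uses plain Python integers as bitmaps, where bit i is set when record i has a given value. The sample animals and the build_bitmaps helper are hypothetical and exist only for illustration; this is not how FeatureBase itself is implemented.

```python
from collections import defaultdict

# Hypothetical sample data: (record ID, Primary_Movement value).
animals = [
    (0, "flying"),
    (1, "swimming"),
    (2, "walking"),
    (3, "flying"),
    (4, "walking"),
]


def build_bitmaps(records):
    """Build one bitmap per distinct value; bit i is set if record i has that value."""
    bitmaps = defaultdict(int)
    for record_id, value in records:
        bitmaps[value] |= 1 << record_id  # set a bit only when the relationship is true
    return dict(bitmaps)


movement = build_bitmaps(animals)

if __name__ == "__main__":
    for value, bitmap in movement.items():
        # Print each bitmap with record 0 as the rightmost bit.
        print(value, format(bitmap, "05b"))
```

Each string is stored once as the name of a bitmap, and every record contributes a single bit rather than a repeated copy of the string.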
Across a large number of records, the FeatureBase feature-oriented format naturally results in massive storage savings. It also makes computations on top of the data much more efficient, because every relationship is stored as a Boolean. Boolean (AND, OR, NOT, etc.) and aggregate (COUNT, SUM, etc.) operations are very performant on our feature-oriented format.
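As a rough illustration of why these operations are cheap, the Python sketch below runs AND, OR, NOT, and COUNT over small bitmaps held in ordinary integers. The bitmap contents, the five-record table size, and the count and ids helpers are all hypothetical; a production engine would use compressed bitmaps rather than Python integers.

```python
NUM_RECORDS = 5
ALL_RECORDS = (1 << NUM_RECORDS) - 1  # mask with one bit per record in this toy table

# Hypothetical bitmaps; bit i corresponds to record i.
flying = 0b01001   # records 0 and 3
walking = 0b10100  # records 2 and 4
winged = 0b01011   # records 0, 1, and 3


def count(bitmap: int) -> int:
    """COUNT: the number of set bits, i.e. the number of matching records."""
    return bin(bitmap).count("1")


def ids(bitmap: int) -> list:
    """List the record IDs whose bits are set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


winged_and_flying = winged & flying   # AND: intersection of two features
flying_or_walking = flying | walking  # OR: union of two features
not_winged = ALL_RECORDS & ~winged    # NOT: complement within the table
winged_not_flying = winged & ~flying  # combined filter

if __name__ == "__main__":
    print(ids(winged_and_flying), count(winged_and_flying))
    print(ids(flying_or_walking), count(flying_or_walking))
    print(ids(not_winged), ids(winged_not_flying))
```

Each filter is a single bitwise instruction over many records at once, and COUNT reduces to counting set bits, with no string comparisons involved.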
FeatureBase uses a Roaring B-tree format that compresses and optimizes computation even further, resulting in up to 100x more efficient computational workloads. Implementing these techniques means that aggregation and filter queries across hundreds of millions or even billions of records return with sub-second latency.
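For readers unfamiliar with Roaring bitmaps, here is a simplified Python sketch of the general container idea: each value is split into a high 16-bit key and a low 16-bit offset, and each key's offsets are stored either as a sorted array (when sparse) or as a bitmap (when dense). The 4,096-element threshold is the one used by standard Roaring implementations. The to_containers function is a hypothetical name, run containers are omitted, and this sketch does not describe FeatureBase's actual on-disk B-tree layout.

```python
ARRAY_LIMIT = 4096  # standard Roaring threshold between array and bitmap containers


def to_containers(values):
    """Group values by their high 16 bits and pick a container type per group."""
    groups = {}
    for v in sorted(set(values)):
        # High bits select the container; low 16 bits are stored inside it.
        groups.setdefault(v >> 16, []).append(v & 0xFFFF)

    containers = {}
    for key, lows in groups.items():
        if len(lows) <= ARRAY_LIMIT:
            # Sparse: a sorted list of 16-bit offsets is smaller than a full bitmap.
            containers[key] = ("array", lows)
        else:
            # Dense: a 65,536-bit bitmap (held here in a Python int) is smaller.
            bits = 0
            for low in lows:
                bits |= 1 << low
            containers[key] = ("bitmap", bits)
    return containers


if __name__ == "__main__":
    sparse = to_containers([1, 5, 70000])
    dense = to_containers(range(5000))
    print(sparse)
    print(dense[0][0])
```

Choosing the representation per container keeps sparse regions small while still allowing fast bitwise operations on dense regions.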
In addition, a much smaller footprint means far less data has to be written to disk. This reduced footprint is one reason FeatureBase can maintain a high ingest rate without sacrificing data freshness or sub-second query latency for a large number of concurrent users. The FeatureBase feature-oriented format allows users to ingest upward of a million records per second that are ready for consumption immediately.
FeatureBase: A More Efficient Database for Real-Time Analytics
FeatureBase’s feature-oriented format provides a lean and efficient solution that gives users the ability to generate insights on their data the second it enters their system without making costly tradeoffs for throughput, latency, or concurrency.
Our next blog in this series will discuss how FeatureBase’s feature-oriented format applies to integers and creates even further efficiencies for your data.
Interested in learning more about how FeatureBase can power your large-scale, real-time analytical workloads? Contact us now.