Modeling Data In Molecula’s Feature Store to Improve Efficiencies
The Molecula feature store is optimized for both analytical workloads and statistical computation. This multi-part optimization requires a set of considerations that a traditional relational database might not necessitate, but are able to create deeply enhanced performance. Mapping relational tables to Molecula feature tables can be as simple as a one-to-one mapping. Major performance improvements can be made through the mapping and feature table structure depending on:
- the expected query workload
- the type, size, and cardinality of the data
- the cost requirements
The most simple starting point is a one-to-one mapping. There are some basic rules we can apply to reduce the number of feature tables while enhancing query capabilities. As an example, let’s say part of our schema consists of four tables that look like the below:
Both “Customers” and “Suppliers” have a one-to-many relationship with “Nation.” Assuming that, for analytical purposes, “Nation” and “Region” do not change often, it makes sense to denormalize them into the feature tables for “Customers” and “Suppliers.” Once we do this, the feature tables now look like so:
Now, let’s dive deeper into these denormalized feature tables. “Phone” is likely not a field we care about for analytical purposes (i.e. we would rarely need to ask “show the phone numbers that have the most customers associated with them”), so it does not make sense to include. On the other hand, “show me the nations with the most customers” will provide interesting insights.
The “Comment” field may or may not be something useful we want to query on. There are three ways to handle this:
- We could omit this field altogether
- If it’s a well-structured field with low to medium cardinality, it can be indexed directly
- We might do some preprocessing to extract features we care about before ingesting it. For example, maybe supplier comments include “history of delivering early” which we want to include as a boolean field that we can query on.
After removing “Phone” and “Comment,” and adding a custom field (“Delivers Early”), our schema now looks like this:
Similar to “Phone,” for this specific instance, we can leave “Name” or store the field as an attribute. In Molecula, an attribute is something that can be returned at the end of a query with the results, but can not be queried on directly (e.g. counting the number of customers with a certain name). As an example, email addresses are commonly stored as an attribute to be returned with query results.
“Address” is the most interesting field to approach – what will likely be most useful is to parse it out and store “City,” “State,” and “Zip Code” as separate fields. The street name and number are less relevant for analytics purposes and, therefore, can be dropped.
Our feature tables now look like:
This is starting to look quite good for solving analytics and statistical computation use cases, but let’s add some more complexity to our schema.
What if there is a table of “Parts,” and suppliers have a many-to-many relationship with parts where each supplier might have a different quantity and cost for each part. In a relational database, that might be represented as follows:
In Molecula, if “PartSupplier” were a simple many-to-many instance that said which suppliers had which parts, we could represent it solely as a multi-valued set field called “Parts” on the “Suppliers” feature table like so:
Set fields are a special power of Molecula. We essentially allow a field to be multi-valued. Through this, one could easily query for all suppliers who could supply a certain set of parts without querying the “Parts” or “PartSupplier” feature table. If “PartSupplier” were a simple many-to-many (or if we were not concerned with “AvailQty,” “SupplyCost,” or “Comment” in the above example), we’re able to drop it entirely.
Similarly, one might add a “Suppliers” set field to the Parts feature table in order to be able to easily determine which parts were available from a subset of suppliers. If we do care about the extra data in the “PartSupplier” feature table, we could drop the “Name” and “Comment” fields (similar to what we did with “Customers” and “Suppliers” in the previous example flow we walked through) — our new set of feature tables would look like this:
Still with me? Finally, let’s add in some higher cardinality data to round out the use case. If you’re not familiar, this schema that we’re using is actually from the TPC-H benchmark which is an industry-standard benchmark for analytical and business decision support workloads. We’ll add the final two tables from that schema, which are “Orders” and “LineItem” and the result is the below:
Likely, these two tables would remain largely the same and translate directly to feature tables, with the now-familiar step of removing “Comment”, and perhaps “ShipInstruct” (assuming it’s free-form text).
Now, of course we made assumptions for this exercise — there is always more detail to uncover, and each customer use case varies. The main takeaway, though, is that Molecula allows customers to begin with a one-to-one mapping using relational tables and refining as needed in order to achieve target cost, recency, and query performance. Molecula’s high-performance feature storage solution offers unique capabilities with multi-valued fields that can result in unbelievable performance for otherwise difficult queries.
Interested in learning more about how Molecula can power your use case? Schedule a demo now.