JOINs With No Pre-Aggregation

See Molecula’s FeatureBase in action executing a highly performant JOIN query on multiple massive datasets running on commodity hardware.

Explore the data yourself – Request a demo



This is Sarah King and I am Director of Product at Molecula. Today, I am going to share our extraordinary performance when executing JOINs across virtual data sources that represent disparate data sets. This is a follow-on to our previous video, where Matt showed off millisecond exploratory and analytical queries across datasets ranging in size from 100 million records to 1.5 billion records. I’m in our Molecula environment here, interfacing with the virtual data source manager (VDSM) using the built-in PostgreSQL client tool to query our virtual data sources. We’ll be using the same VDSs that Matt was working with previously: the customers and items VDSs.

So to grab a quick COUNT of those: the customers VDS has 100 million records in it and the items VDS has 1.5 billion. I’m going to execute some JOINs across these two data sets, and for context, our customers tell us JOIN queries like these on this amount of data can take upwards of 12 hours to execute in their existing systems. The first query I am going to run is a basic JOIN across the items and customers VDSs to get a count of all of the customers that are in the U.S. region and have product #3094.
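The shape of that first query can be sketched with a toy in-memory database. This is only an illustration of the query pattern, not Molecula's implementation; the table and column names (`region`, `product_id`, `customer_id`) are assumptions, since the actual VDS schemas are not shown in the video.

```python
import sqlite3

# Toy in-memory stand-in for the customers and items virtual data sources.
# Schema names here are assumptions for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE items (item_id INTEGER PRIMARY KEY,
                    customer_id INTEGER, product_id INTEGER);
INSERT INTO customers VALUES (1, 'US'), (2, 'US'), (3, 'EU');
INSERT INTO items VALUES (10, 1, 3094), (11, 2, 5000), (12, 3, 3094);
""")

# Count customers in the U.S. region holding product #3094 -- the same
# JOIN-plus-filter pattern run against the VDSM's PostgreSQL endpoint.
row = conn.execute("""
    SELECT COUNT(DISTINCT c.customer_id)
    FROM customers c
    JOIN items i ON i.customer_id = c.customer_id
    WHERE c.region = 'US' AND i.product_id = 3094
""").fetchone()
print(row[0])  # 1 matching customer in this toy data
```

The `COUNT(DISTINCT …)` guards against double-counting a customer who holds the product more than once.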

So there are 1,848 customers who have product #3094 and are in the U.S. region, and that query executed in 307 milliseconds. The next query I’ll run does something similar, but we’ll add a GROUP BY to make it a bit more complex: we’re going to GROUP BY region and sales channel on the customers VDS, filter to only customers who have purchased product #3094, and then do an aggregation on the customer size field. That query executed in 923 milliseconds. So we just took queries that typically take over 12 hours and reduced them to sub-second latencies, and Molecula does not use any pre-aggregation or pre-processing. This is all on a single node with 32 cores and 208 gigabytes of memory. Let’s just see… it looks like only 74 gigabytes of memory is actually in use right now.
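The second query adds a GROUP BY and an aggregation on top of the same JOIN. Again a minimal sketch, with assumed column names (`sales_channel`, `customer_size`) standing in for the real VDS fields:

```python
import sqlite3

# Toy stand-in for the second query: GROUP BY region and sales channel,
# filtered to customers holding product #3094, with a SUM over an assumed
# customer_size field. All schema names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT,
                        sales_channel TEXT, customer_size INTEGER);
CREATE TABLE items (customer_id INTEGER, product_id INTEGER);
INSERT INTO customers VALUES
  (1, 'US', 'online', 50), (2, 'US', 'retail', 20),
  (3, 'US', 'online', 30), (4, 'EU', 'online', 10);
INSERT INTO items VALUES (1, 3094), (3, 3094), (4, 3094), (2, 5000);
""")

rows = conn.execute("""
    SELECT c.region, c.sales_channel, SUM(c.customer_size)
    FROM customers c
    JOIN items i ON i.customer_id = c.customer_id
    WHERE i.product_id = 3094
    GROUP BY c.region, c.sales_channel
    ORDER BY c.region, c.sales_channel
""").fetchall()
for region, channel, total in rows:
    print(region, channel, total)
```

The filter on `product_id` is applied before grouping, so only purchasers of #3094 contribute to each (region, sales channel) aggregate.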

Thanks so much!