See Molecula’s FeatureBase in action performing millisecond analytical queries on massive datasets running on commodity hardware. Nested queries, aggregations, complex filters, sorted lists, Top-Ns, etc.
This is Matt Jaffee and I’m the Director of Engineering at Molecula. Today, I’m going to share the real-time query performance that enterprises are taking advantage of to break through latency barriers and enable new use cases. For this demonstration, I’ll be focusing on basic connectivity and data exploration. We have our Molecula environment here and I’m going to be interfacing with the virtual data source manager, the VDSM, which is effectively the API layer to Molecula. We use it to query the underlying virtual data sources, the VDSs. Our VDSM speaks the Postgres wire protocol, which allows anything with Postgres connectivity to interact with Molecula. For this demo, I’ll be using psql, which is just the standard Postgres command-line client that you get when you install PostgreSQL. We’re using it not to connect to Postgres but to connect to Molecula’s VDSM.
The first thing I’m going to do is use a psql command that turns on timing for all queries. Let’s take a look at the data that we have in this instance of Molecula. The SHOW TABLES command will list out all the VDSs, and this shows some synthetic data that we’ve generated that mimics many of our customer environments. We have data about customers, items they may have purchased, warranties they have bought on those items, and claims they may have made on those warranties, as well as marketing data. Let’s first take a look at the customers VDS. We can SELECT * FROM customers just to get an idea of what fields are there and what values they might have. We can see a region, a sales channel, and a size field on the customers VDS. How many customers are there? A basic SQL query will tell us there are 100 million customers in this synthetic dataset. If we wanted to do something a little bit more complex to get a feel for how the data is laid out and what it looks like, we could maybe dig into the region field and do a Top-N query. Top-N is just a shortcut for doing a GROUP BY ordered by COUNT. You can see it returns instantly and gives us a breakdown of the region field which we could use to build a histogram. The equivalent SQL is here and also returns instantly. We can do other types of things like maybe getting the maximum size from the customers VDS: so SELECTing a maximum across 100 million customers.
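As a rough illustration of the "GROUP BY ordered by COUNT" shortcut described above, here is a minimal Python sketch that runs the equivalent plain SQL against an in-memory SQLite table. SQLite is only a local stand-in so the example runs anywhere; it is not Molecula, and the column names (`region`, `sales_channel`, `size`), the sample rows, and the `top_n` helper are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical sample rows standing in for the customers VDS:
# (region, sales_channel, size). Not real demo data.
rows = [
    ("west", "online", 10),
    ("west", "retail", 250),
    ("east", "online", 40),
    ("west", "online", 5),
    ("south", "partner", 900),
    ("east", "retail", 75),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (region TEXT, sales_channel TEXT, size INTEGER)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)


def top_n(conn, column, n):
    """Return the n most frequent values of `column` with their counts.

    Hypothetical helper: this is the GROUP BY / ORDER BY COUNT form,
    written out in standard SQL.
    """
    # The column name is interpolated into the SQL text, so only
    # allow known column names.
    if column not in {"region", "sales_channel"}:
        raise ValueError(f"unsupported column: {column}")
    sql = (
        f"SELECT {column}, COUNT(*) AS cnt FROM customers "
        f"GROUP BY {column} ORDER BY cnt DESC, {column} LIMIT ?"
    )
    return conn.execute(sql, (n,)).fetchall()


top_regions = top_n(conn, "region", 3)
total = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
max_size = conn.execute("SELECT MAX(size) FROM customers").fetchone()[0]

print(top_regions)  # value/count pairs, ready to plot as a histogram
print(total, max_size)
```

The secondary sort on the column name is only there to make ties deterministic in this small example; it is not something the demo states.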
Moving on to the items VDS, we can check out its schema. Each item has a cost, a customer ID, a product line, a product type, and a ship date: so items actually has more fields than customers. Let’s see how many records there are. There are actually 1.5 billion records in the items VDS, so about 15 times as many as in customers. Let’s do a Top-N query on items and see how that performs. The Top-N query on the product line field of items shows there are 51 different product lines, and that took 53 milliseconds. You can start to see where you can do some nice data exploration, and really in real time. Let’s do one more query that’s a little more complicated: it will SELECT region and sales channel, get the COUNT of each combination of region and sales channel, and also SUM the size field on the customers VDS, but it will do so only for customers that have a size less than 1000, and we’ll just get the first 100 results.
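The last query described above, a filtered two-column grouping with a COUNT and a SUM, can be sketched as follows. As before, this uses an in-memory SQLite table purely as a runnable stand-in for the customers VDS; the column names and sample rows are assumptions, and the ORDER BY clause is added here only to make the small example's output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (region TEXT, sales_channel TEXT, size INTEGER)"
)
# Hypothetical rows; two of them fail the size < 1000 filter.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("west", "online", 10),
        ("west", "online", 20),
        ("west", "retail", 1500),  # excluded: size is not below 1000
        ("east", "online", 40),
        ("east", "retail", 999),
        ("east", "retail", 1000),  # excluded: the filter is strict
    ],
)

# Count and total size per (region, sales_channel) combination,
# restricted to customers with size < 1000, first 100 groups only.
QUERY = """
SELECT region, sales_channel, COUNT(*) AS customer_count, SUM(size) AS total_size
FROM customers
WHERE size < 1000
GROUP BY region, sales_channel
ORDER BY region, sales_channel
LIMIT 100
"""

results = conn.execute(QUERY).fetchall()
for region, channel, customer_count, total_size in results:
    print(region, channel, customer_count, total_size)
```

Because the filter is applied before grouping, a combination whose customers all have a size of 1000 or more does not appear in the output at all.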
Again, you can see those calculations happen incredibly fast. You may at this point be questioning whether things have been pre-computed, and I can assure you that they have not. This is all running on a single VM in Google Cloud with 32 cores and about 200 gigs of RAM. It looks like about 66 gigabytes are in use right now by the Molecula infrastructure.
That’s all for today. Thank you!