In this quick demo, watch how Solutions Engineering Lead Garrett Raska uses FeatureBase, the real-time database at any scale, to ingest and segment millions of customer records in milliseconds.
Hello, my name is Garrett Raska, and I lead our solutions engineering team here at Molecula. Today we’ll be working with FeatureBase in a Jupyter Notebook on a common use case we see, which is customer segmentation. So let’s go ahead and dive right in.
The first thing we want to do is set up the notebook to communicate with FeatureBase, which is running on an EC2 instance in AWS. The node has 32 cores and 128 GB of RAM. First we import our Python library and connect through our gRPC API, which is hosted on port 20101, as you see here. We’ll be working on a table that was already ingested, in this case the csec table, which is short for customer segmentation, and I’ll show you some of the data it contains in a moment. Let’s go ahead and run this cell to set the notebook up and do a quick test query.
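The actual FeatureBase client calls aren’t shown on screen, so as a minimal stand-in, here is the shape of that test query run against an in-memory SQLite table. The table name `csec` matches the demo; the columns and rows are invented here purely for illustration:

```python
import sqlite3

# Stand-in for the FeatureBase connection: an in-memory SQLite database
# with a tiny "csec" (customer segmentation) table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE csec (id INTEGER, age INTEGER, income INTEGER, hobbies TEXT)")
conn.executemany(
    "INSERT INTO csec VALUES (?, ?, ?, ?)",
    [(1, 34, 98000, "teaching"),
     (2, 45, 102000, "watchmaking"),
     (3, 34, 98000, "slacklining")],
)

# The same quick sanity-check query from the demo.
count = conn.execute("SELECT COUNT(*) FROM csec").fetchone()[0]
print(count)  # 3
```

In the demo the same `SELECT COUNT(*)` runs over roughly a billion records in about 14 ms; the SQLite version only illustrates the query, not the performance characteristics.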
So as you can see, that query took around 14 milliseconds and did a quick SELECT COUNT(*) from the customer segmentation table, producing a count of 1,023,000,000 and some change. That is the set we’ll be working with today. Now let’s take a closer look at one of the fields we have. We’ve got around 26 available fields, which you can think of as attributes or features, and each may contain a high cardinality of strings, integers, and so on. In this case, we did a SELECT DISTINCT on the hobbies field, and you can see we’ve got around 200 or so strings that can be represented on any number of the records, just to give you an idea. Now let’s go ahead and pull out 10 of those records with a SELECT *, and load them into a pandas DataFrame. As you can see, we have general demographics information: age; a bool field, to show how booleans are represented in FeatureBase; city is a string; education is an array; hobbies is an array; income is an integer; and so on.
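The two exploration queries above, a DISTINCT on one field and a small SELECT * sample, can be sketched the same way. Note that in FeatureBase, hobbies is an array (set) field; SQLite has no set type, so a scalar text column stands in here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE csec (id INTEGER, city TEXT, hobbies TEXT)")
conn.executemany(
    "INSERT INTO csec VALUES (?, ?, ?)",
    [(1, "Austin", "teaching"),
     (2, "Dallas", "teaching"),
     (3, "Austin", "watchmaking")],
)

# Distinct values of a field, as in the SELECT DISTINCT on hobbies.
hobbies = [r[0] for r in conn.execute("SELECT DISTINCT hobbies FROM csec")]
print(sorted(hobbies))  # ['teaching', 'watchmaking']

# Pull a handful of full records, like the SELECT * ... 10 records in the demo.
sample = conn.execute("SELECT * FROM csec LIMIT 10").fetchall()
```

In the notebook, rows like `sample` are what get loaded into a pandas DataFrame for inspection.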
Now let’s get to some of the use case. Oftentimes we see customers who do very complex queries on large data sets, using AND, OR, and NOT clauses and looking for results that return in the hundreds of thousands. So again, on this small node, we’re going to do a SELECT COUNT(*) from the customer segmentation table, searching for a particular age and a specific income, and giving it the ability to match either the skill MS Office or the skill Excel. Let’s go ahead and execute this. The first thing you’ll notice is that it took around 215 milliseconds, a little high for us, and returned 204,000 records that meet these specific conditions. We can continue to add clauses to this and the speed will remain the same.
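The shape of that AND/OR segmentation query looks like the following, again against a toy SQLite stand-in (the exact age and income values in the demo aren’t legible, so the literals here are placeholders):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE csec (age INTEGER, income INTEGER, skills TEXT)")
conn.executemany(
    "INSERT INTO csec VALUES (?, ?, ?)",
    [(45, 90000, "MS Office"),
     (45, 90000, "Excel"),
     (45, 90000, "Welding"),
     (30, 90000, "Excel")],
)

# A specific age, a specific income, and either of two skills.
n = conn.execute(
    "SELECT COUNT(*) FROM csec "
    "WHERE age = 45 AND income = 90000 "
    "AND (skills = 'MS Office' OR skills = 'Excel')"
).fetchone()[0]
print(n)  # 2
```

In FeatureBase, skills is a set field, so the membership test would use the set semantics of its SQL dialect rather than a scalar equality as shown here.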
Let’s go ahead and move on to some larger-scale aggregations. We typically focus here because aggregations are a sore spot in some of these use cases, given the sheer number of records that are already present and continuing to flow in. We often see the need for intermediary tables in the form of roll-ups or summaries, which can take anywhere from hours to days, even weeks, to produce and make available for end users, in this case maybe BI tools such as visualizations, or even ML and data science teams. So let’s go ahead and run a SELECT SUM on the income field, which, remember, contains all of the income integers present in the 1,023,000,000 records. Running this, you can see that instead of just picking a few of them, I summed the entirety of the table, so all 1,023,000,000 records were summed in around 23 milliseconds, and you can see that the sum here is a rather large number. You can do the same thing with AVG as well, just to give you an idea; here’s the average, and it seems like our table is doing quite well at around a $100,000 average income.
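Those two whole-table aggregations are ordinary SQL. A minimal sketch of both, with invented incomes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE csec (income INTEGER)")
conn.executemany("INSERT INTO csec VALUES (?)",
                 [(90000,), (100000,), (110000,)])

# Full-table SUM and AVG over the income field, as in the demo.
total = conn.execute("SELECT SUM(income) FROM csec").fetchone()[0]
avg = conn.execute("SELECT AVG(income) FROM csec").fetchone()[0]
print(total, avg)  # 300000 100000.0
```

The point of the demo is that these scans cover all 1,023,000,000 rows in tens of milliseconds, removing the need for pre-computed roll-up tables.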
Now let’s up the stakes a bit. Let me flip over here. I’m logged in on the node, and what I’m going to do is run a load script that will generate records at around 60,000 every few milliseconds or so and push those into the table live, and we’re going to run those same queries. So let me kick that off. You can see we are creating, translating, and pushing those into the datastore live. Just to show you, we can go back up to our first query, the count of all records, and we should see this number jump. Yep, you can see it’s climbing by around a million records or so every so often. There we go. And as you can see, that didn’t slow the query down. We’ll do the same thing with our complex condition. So again, load on this box is coming in; we’re ingesting at around a million records every one and a half seconds, and we can still tackle those large-scale aggregations without any delay or any added latency. So one thing we provide for our customers is throughput: the ability to ingest these records and run these queries simultaneously.
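The load script itself isn’t shown, but its loop is conceptually simple: generate a batch, push it, and keep querying between batches. A toy single-process sketch (batch size and loop count are arbitrary here; the real script pushes tens of thousands of records per batch over the ingest API):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE csec (id INTEGER, income INTEGER)")

# Stand-in for the load script: insert batches while counting between
# batches, mimicking simultaneous ingest and query.
next_id = 0
for _ in range(5):
    batch = [(next_id + i, random.randrange(50_000, 150_000))
             for i in range(1000)]
    next_id += len(batch)
    conn.executemany("INSERT INTO csec VALUES (?, ?)", batch)
    # The same COUNT(*) query keeps returning while data flows in.
    count = conn.execute("SELECT COUNT(*) FROM csec").fetchone()[0]
print(count)  # 5000
```

In the demo, ingest and queries run as genuinely concurrent clients against the same node, which is what makes the unchanged query latency notable.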
Let’s take a quick look: our script is still running, records are still going in, so let’s just continue on. Another area where we help our customers is JOINs at scale. The ability to do an INNER JOIN from a billion-record table to any other table can be quite difficult, especially with records flowing in. So let’s go ahead and run this while records are still ingesting. What we’re doing here is a SELECT COUNT(*) starting from our customer segmentation table, and then an INNER JOIN to our skills table, which has about 25,000 records in it. We’re going to look for specific conditions: available-for-hire on the skills table, and hobbies on t1. So we’re going to look for anyone in the skills table that’s available for hire, and we’re going to search through all of our 1 billion records for those in the teaching field. As you can see, the query still took around 26 milliseconds, and remember we are live-ingesting at around a million and a half records per second, and you can see the count returned.
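The join’s structure, a big table inner-joined to a small lookup table with a filter on each side, can be sketched as follows. The join key `id` and the `available_for_hire` column name are assumptions; the on-screen SQL isn’t fully legible:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE csec (id INTEGER, hobbies TEXT)")
conn.execute("CREATE TABLE skills (id INTEGER, available_for_hire INTEGER)")
conn.executemany("INSERT INTO csec VALUES (?, ?)",
                 [(1, "teaching"), (2, "teaching"), (3, "watchmaking")])
conn.executemany("INSERT INTO skills VALUES (?, ?)",
                 [(1, 1), (2, 0), (3, 1)])

# INNER JOIN the big table (t1) to the skills table (t2),
# filtering on a condition from each side.
n = conn.execute(
    "SELECT COUNT(*) FROM csec t1 "
    "INNER JOIN skills t2 ON t1.id = t2.id "
    "WHERE t2.available_for_hire = 1 AND t1.hobbies = 'teaching'"
).fetchone()[0]
print(n)  # 1
```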
Let’s continue moving down. We also see the need for grouping with large segmentations, what would be considered a GROUP BY. So let’s SELECT hobbies and COUNT(*) again from the customer segmentation table, and we’re really interested in counts above 2 million. This comes back in about 30 milliseconds, and our top hobbies, watchmaking, slacklining, watching movies, come out to over 42 million. Again, GROUP BYs return just as fast as our aggregations.
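That grouping with a count cutoff is a GROUP BY plus HAVING. A minimal stand-in, with a threshold of 2 in place of the demo’s 2 million:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE csec (hobbies TEXT)")
conn.executemany("INSERT INTO csec VALUES (?)",
                 [("watchmaking",)] * 3
                 + [("slacklining",)] * 2
                 + [("chess",)])

# GROUP BY hobbies, keeping only groups whose count exceeds a threshold.
rows = conn.execute(
    "SELECT hobbies, COUNT(*) FROM csec "
    "GROUP BY hobbies HAVING COUNT(*) > 2"
).fetchall()
print(rows)  # [('watchmaking', 3)]
```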
The last piece: we see the need to load this information into visualizations. So let’s go ahead and run that. We’ll take that count from hobbies and skills and represent it, in this case, as a top-N chart. Here are the highest skills we see across our entire table: data entry, phones, billing, etc. Thank you for your time.
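The data feeding such a top-N chart is just a GROUP BY ordered by count with a limit. A small sketch (skill names and counts invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE csec (skills TEXT)")
conn.executemany("INSERT INTO csec VALUES (?)",
                 [("billing",)] * 4 + [("phones",)] * 3 + [("welding",)])

# Top-N skills by count: the rows a bar chart or BI tool would plot.
top = conn.execute(
    "SELECT skills, COUNT(*) AS n FROM csec "
    "GROUP BY skills ORDER BY n DESC LIMIT 2"
).fetchall()
print(top)  # [('billing', 4), ('phones', 3)]
```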