Welcome to the FeatureBase Sandbox!

Let’s Get Started…

Follow the guide outlined below for a hands-on demonstration of low-latency queries at scale using our FeatureBase SaaS platform. As you work through the guide, please note any questions or feedback that you may have for the Molecula team. We’re always looking for ways to improve the experience! 

In this demonstration you will: 

  1. Sign up to access FeatureBase
  2. Create a new FeatureBase deployment
  3. Restore two large-scale datasets into the deployment 
  4. Run a set of analytics queries

If you run into any roadblocks or have questions throughout the demonstration, please reach out to [email protected]

 

Sign-Up Overview

Please click on the invitation to sign up link in your email. You’ll be taken to the sign-up page. You will see your email pre-filled in the application:

 

fbrunbook1

 

You will be instructed to create a password with instant feedback based on password requirements:

 

fbrunbook2

 

After creating a password, a verify email screen will appear. A confirmation code will be sent to your email. Please enter that code to complete sign up. If you don’t complete this step, this screen will also be shown again the next time you attempt to log in.

 

Verify Email

 

Now that you are signed up, you will have access to FeatureBase for seven days. After seven days, you will no longer be able to log in with your credentials. If you encounter any problems during this process or would like to reactivate your account contact [email protected].

 

welcome screen

 

Configuring your environment

In order to use our application, you’ll need data. In a real-life situation, the Molecula team will provide guided onboarding and data modeling for our organization’s data. In this exercise, we’ll be working with curated demo data to showcase the low-latency capabilities of FeatureBase. Instructions to run the script that will deploy FeatureBase and load the demo data are laid out below.

Instructions for MAC:

  1. Download the “Molecula_Sandbox.sh” script attached in your email or found here to the “Downloads” folder in your local environment.
  2. Open the “Terminal” application that exists on your mac.
    • You can search for it by clicking the magnifying glass in the top right corner and  typing “terminal.” It looks like this. fbrunbook4
  3. Complete steps 4-7 by copying commands and pasting them into the terminal window.
  4. Navigate to where the script was downloaded by typing or pasting this command in Terminal and pressing ‘ENTER’ 
    • cd Downloads
  5. Verify you can see the Molecula_Sandbox.sh file by typing the command below and pressing enter:
    • ls -lrt
  6. In order to run the script, you must first type the command below in Terminal and press enter:
    • chmod +x Molecula_Sandbox.sh
  7. Run the script by typing the command below in Terminal and press enter:
    • zsh Molecula_Sandbox.sh
    • Enter Username (email)
    • Enter Password (created during UI sign up)
      • NOTE: this will not show up visibly in the terminal, but it’s there! This to protect your information.
  8. Navigate back to the web application (https://app.molecula.cloud/) to check on your deployment
  9. The script is now configuring your environment and loading data, which will take approximately 30 minutes total.

This is a great time to grab a cup of coffee or reply to all those waiting Slack messages! A new deployment is spinning up and over 1B records are loading.

While the deployment is spinning up, you will see messages in Terminal and in Cloud Manager as the status progresses. The status will show as ‘CREATING’ while the deployment is spinning up and  ‘RUNNING’ when the process is done.

deployment status

 

Deployment Status RUNNING

 

Once the deployment is ‘RUNNING’ data will start loading into the deployment. After about 30 minutes, check the “Tables” section of the application to see the two tables that have been restored into the deployment. One table is called ‘cseg’, short for customer segmentation, and the other is called ‘skills’. A description of each table is also listed. In the next section, we will perform a variety of common analytical queries on both datasets.

 

Tables restored

 

Introduction to SaaS Environment and Test Dataset

In a real-world situation, a Molecula expert will work with your organization to appropriately model data for your use cases in FeatureBase. To give you an idea of the process, this tutorial will walk you through how to interact with complex data in FeatureBase using data from the Customer Segmentation (cseg) dataset. 

Data Exploration of Customer Segmentation Feature Table

It’s always a good idea to understand what the dataset you’re working with contains before you get started. To do this, click on the ‘Tables’ section in the application to display the names of the tables in your deployment.

 

Tables restored

 

Click on a table to show its contents. FeatureBase can ingest and represent a wide range of field types. One that may not be familiar is the SET field. SET fields are multi-valued and allow FeatureBase to collapse traditional data models, like the star schema, by efficiently storing multiple values for a single field.

 

tables

 

To understand the shape of the data contained in the customer dataset that was preloaded into this environment. First, navigate to the Query page by clicking “Query” on the left navigation bar.

Let’s start by running a simple SQL statement to extract 10 records to explore. 

SELECT * FROM cseg LIMIT 10;

Viewing this tabular output we can see each record contains several fields (attributes) and data types. Scroll left and right in the application to explore the full list of fields. For example, names and cities are captured in STRING fields. Income is captured in an INT field that will allow for range queries. You can also see that “education” is a SET field with multiple values in a single field.

 

Select Top 10

 

Next, let’s check the full scale of this dataset by using another familiar SQL statement, the COUNT

SELECT COUNT(*) FROM cseg;

This query will return the COUNT of records in the entire table and demonstrates we are working with a dataset of 1 billion records. Each record in the cseg table has 16 attributes.

 

Select Count Star

 

Performing Large Aggregations

Aggregation workflows often require the ability to SUM large amounts of individual INT elements. This could be transaction amounts such as dollars (decimals), whole integers (counts, bandwidth), or any variation requiring a SUM across many records. In this example, we will SUM the income field across all 1 billion records. 

SELECT SUM(income) FROM cseg;

 

Select Sum Income

 

It is unlikely to need to SUM in this manner across all records. It is much more common to introduce complex conditions to SUM a segment of records that meets specified criteria.

Here we introduce comparative and logical operators including GREATER THAN, AND, and OR

SELECT SUM(income) FROM cseg WHERE income > 5000 AND age = 45 AND skills='Ms Office' OR skills='Excel';

As you can see, the latency is in the sub-second time frame even when using complex searching criteria through 1 billion records.

 

select sum income with filters

 

When aggregating over a SET field, values for a record will be included in multiple groups if not excluded in the query. For example, when SUM(income) is used with a GROUP BY of “education”, income for a record with both ‘Bachelor’s degree’ and ‘High school diploma or GED’ will be included in both groups. It is common for a single person to have multiple values for a field that may seem contradictory or redundant, like “education”. This may be due to differences in status over time as data are collected and aggregated. A person may be categorized as having “education” status of “Some college” and later be categorized as having a “Bachelor’s degree”. When those two data sources are matched up, the person may have multiple values associated with them.

Additionally, aggregations may include the AVERAGE argument. 

SELECT AVG(income) FROM cseg;

 

Select avg income

 

INNER JOINs at Scale 

FeatureBase can merge at ingest and eliminate preprocessing in cases where performant INNER JOINs are required. Data from two separate tables or sources can be merged into a single normalized table by matching on a unique key in each dataset. Since FeatureBase can execute queries very quickly, workflows requiring INNER JOINs can be simplified with FeatureBase by merging disparate datasets at ingest. In the following example, we are combining many of the queries in this guide and adding the INNER JOIN

The INNER JOIN is facilitating a COUNT of records, or people, are “available for hire” as indicated by  that have a string field true for “available_for_hire” located in the skills table, and have the string field true for “Teaching”. In other words, we would like to count the number of people who are teachers and also available for hire. The latency on this type of INNER JOIN at the billion records scale is still sub-second allowing for several interesting data models.

SELECT count(*) AS count FROM cseg AS t1 INNER JOIN skills AS t2 ON t1._id = t2.id WHERE t2.bools = 'available_for_hire' and t1.hobbies = 'Teaching';

 

Inner join at scale

 

Included in the stock dataset is a table known as skills, please use the discovery queries to take a look at it!

Grouping with Complex Conditions and Aggregating 

Another query commonly seen in aggregation-related use cases is the GROUP BY. For example, let’s group by the hobbies counting only those with ultimate count above 200,000,000.

SELECT hobbies, COUNT(*) FROM cseg GROUP BY hobbies HAVING count> 200000000;

 

Select hobbies count star having

 

Another useful facet of GROUP BY is the ability to add an aggregate argument and utilize the low latency aggregation in another capacity. 

SELECT education, SUM(income) FROM cseg WHERE age=18 GROUP BY education;

 

Select education sum income where

 

TopK – A FeatureBase Superpower

Ranking queries are notorious for being computationally intensive – aka slow. Some solutions will use statistics to speed up a ranking query by approximating the true results, but that’s not always a desirable option. In addition to SQL, FeatureBase has a native language called Pilosa Query Language (PQL). In PQL, TopK queries can be run to return exact results in milliseconds. 

This query returns the top five hobbies across all customers from the cseg table, sifting through a billion records in 117.2ms. 

[cseg]TopK(hobbies, k=5)

 

Topk 5 hobbies

 

More complex, the next query returns the top ten hobbies among females who also like scuba diving from the cseg table in milliseconds. Even when adding complex filtering, the TopK queries can be run for exact results at scale without impacting query latency. 

[cseg]TopK(hobbies, k=10, filter=Intersect(Row(sex=Female),Row(hobbies='Scuba Diving')))

 

Topk 10 hobbies scuba diving

 

At this point, we encourage you to mix and match segmentation criteria to experience low-latency queries even as complex conditions are added.

Note that we don’t currently support full SQL, but are working toward expanding SQL functionality. For example, the AVERAGE function is not currently supported in GROUP BY queries and will be added soon.

If you have issues with your queries, please contact Molecula Representative. FeatureBase has a native language, called PQL, and we can help you translate your SQL queries to get the desired results.

What’s Next?

We hope that this hands-on experience has further demonstrated the power of FeatureBase to power real-time analytics workflows at scale. While this example focused on a customer segmentation use case, the same type of workflows are often used in anomaly detection or business process optimization use cases and continues to perform as workloads grow to trillions of records. Additionally, FeatureBase excels at combining streaming and historical data in real-time, allowing you to analyze data as soon at is available in FeatureBase with no need for time-consuming preprocessing or preaggregation. From here, partner with your Molecula representative to better understand how FeatureBase will work for your organization’s specific needs.

 

Queries

Data Exploration 

SELECT * FROM cseg LIMIT 10;

SELECT COUNT(*) FROM cseg;

Complex Segmentation 

SELECT COUNT(*) FROM cseg WHERE age = 45 AND income>50000 AND skills='Ms Office' OR skills='Excel';

Aggregations

SELECT SUM(income) FROM cseg;

SELECT SUM(income) FROM cseg where income > 5000;

SELECT AVG(income) FROM cseg;

JOINS

SELECT COUNT(*) as count from cseg AS t1 INNER JOIN skills AS t2 ON t1._id = t2.id WHERE t2.bools = 'available_for_hire' and t1.hobbies = 'Teaching';

Grouping with Complex Conditions

SELECT hobbies, count(*) from cseg group by hobbies having count > 200000000;

TOPK 

[cseg]TopK(hobbies, k=5)

[cseg]TopK(hobbies, k=10, filter=Intersect(Row(sex=Female),Row(hobbies='Scuba Diving')))

 

Spinning Down a Deployment

When you have completed the trial, please take a few minutes to spin down your deployment. If you do not spin it down, it will be done after 7 days by the Molecula team. 

First, delete the tables in the deployment in the ‘Tables’ tab. Click the three dots and select ‘Delete’

 

delete tables

 

Confirm the deletion by typing ‘DELETE’ into the interface. Repeat for each table in the deployment.

 

delete tables 2

 

Repeat the process in the Cloud Manager for each deployment that you want to spin down. It takes a few minutes longer to delete a deployment than it does for a typical table.

 

delete deployment 1

delete deployment 2

 

Appendix A: Sign Up Troubleshooting

You may request a new code be sent if you deleted or can’t find that email.

 

Verify email request code

 

Once you have created your account, the invite link will no longer be valid, but will allow you to go to the sign-in page:

 

Invite Link Used

 

If you don’t sign up before your invitation expires, you will see the following screen. Please contact [email protected].

 

invite expired

 

Please contact [email protected] if you see the screen below at any time.

 

oops