Enhance Snowflake for Real-Time Analytics with FeatureBase

What is Snowflake?

Snowflake is a cloud data warehouse that can run on Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure. Fundamentally, Snowflake was born out of the need to migrate legacy data warehouses to the cloud for reduced data storage costs. Snowflake’s breakthrough in the data warehousing market was driven by the concept of separating storage and compute, which allows for greater pricing flexibility and pay-per-second billing. This pay-per-use model is a paradigm shift in how organizations typically interact with data warehouse vendors. 

More than a cloud data warehouse, Snowflake has evolved into an intelligent analytics platform that powers a variety of use cases. However, even with the numerous benefits Snowflake offers, delivering on real-time use cases remains out of reach.



Why Enable Real-Time Capabilities with Snowflake?

The definition of real time can vary depending on the business outcome or use case. This malleable definition can create problems if supporting data is not fresh, low-latency, and actionable. For example, when implementing solutions to optimize real-time customer experience, real-time means serving results within seconds, and more often milliseconds, to mission-critical applications with high concurrency. 

As organizations progress in analytical maturity, from reporting trends to predicting and prescribing real-time business decisions, they typically see Snowflake consumption costs increase. In addition, achieving real-time decisions may require other technologies and tools, many of which only operate on the data stored within Snowflake. These limitations often leave organizations unable to perform impactful real-time analytics because of Snowflake’s traditional RDBMS data structure and the resulting workarounds, such as preaggregation, which achieve low latency at the cost of data freshness.

Reasons to consider enhancing Snowflake: 

  • Latency hinders analytics. Complex queries, such as those with unions or multiple JOINs, and many concurrent users running queries simultaneously can bog down performance. Lag between a query and its results may reach several minutes, or the query may time out and return no results at all. To work around system limitations and reduce this lag, compute resources may need to be increased, or preaggregation tables may be developed and refreshed as new data becomes available. Unfortunately, this process can add minutes, hours, or even days between when data is generated and when it is available for analysis.
  • Cost is impacted by the scale of data, the complexity of queries, performance requirements, and the increased storage volume and processing required for preaggregated copies. In addition to the computing costs on large volumes of data, scaling up performance or adding technologies to overcome freshness and latency issues may further increase costs.
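The freshness penalty of preaggregation can be made concrete with a little arithmetic. The following is a minimal sketch, not a description of any Snowflake feature: the `RefreshSchedule` type and the example numbers are hypothetical, and the model assumes refreshes start on a fixed interval and each takes a fixed time to complete.

```python
from dataclasses import dataclass


@dataclass
class RefreshSchedule:
    """Hypothetical refresh schedule for one preaggregated table (minutes)."""
    interval_min: float  # time between the starts of consecutive refreshes
    duration_min: float  # time a single refresh takes to complete


def worst_case_visibility_delay(schedule: RefreshSchedule) -> float:
    """Longest wait, in minutes, before a new record shows up in the rollup.

    A record that arrives just after a refresh has started is missed by that
    refresh. It is picked up when the next refresh starts (interval_min later)
    and becomes queryable when that refresh finishes (duration_min later).
    """
    return schedule.interval_min + schedule.duration_min


# Assumed example: an hourly refresh that takes 15 minutes to run.
hourly = RefreshSchedule(interval_min=60, duration_min=15)
print(worst_case_visibility_delay(hourly))  # 75.0 minutes under this model
```

Under this simple model, shortening the interval reduces the delay but runs the refresh more often, which is where the extra compute cost described above comes from.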

FeatureBase + Snowflake:

Snowflake is an extremely powerful tool for enabling business intelligence initiatives, but it tends to slow down, and may benefit from specific enhancements, when two or more of the following are true:

  • Queries become too complex
  • Queries are highly concurrent
  • Data ingestion volume is large
  • Queries compute on billions of records

Molecula’s FeatureBase addresses these challenges. It is a feature-oriented database service purpose-built for real-time analytics and machine learning. FeatureBase continuously extracts and updates features from Snowflake and other data sources without the need for staging or preaggregation. This capability allows Snowflake to serve real-time applications and analytics more efficiently than it could alone.

How they work together: 

  • Snowflake: Ingest all data to maintain a Single Source of Truth (SSoT) and power analytics where low latency is unnecessary.
  • FeatureBase: Ingest large volumes of structured and semi-structured data very quickly, like streaming sources and IoT device logs, via SQL or Change Data Capture (CDC), to serve real-time or low-latency applications.

FeatureBase + Snowflake enables organizations to graduate beyond human-scale queries and into real-time, high-concurrency queries for machine-scale analytics. 
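The division of labor described above can be sketched as a simple routing rule. This is an illustrative sketch only: the `Workload` type, the `choose_engine` function, and both thresholds are hypothetical and do not correspond to any Snowflake or FeatureBase API.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of an analytics workload."""
    name: str
    latency_budget_ms: int  # how quickly results must come back
    concurrent_users: int   # how many queries run at the same time


# Assumed cut-offs for illustration; real values depend on the use case.
LOW_LATENCY_MS = 1_000
HIGH_CONCURRENCY = 100


def choose_engine(workload: Workload) -> str:
    """Send low-latency or high-concurrency work to FeatureBase, the rest to Snowflake."""
    needs_low_latency = workload.latency_budget_ms <= LOW_LATENCY_MS
    is_high_concurrency = workload.concurrent_users >= HIGH_CONCURRENCY
    if needs_low_latency or is_high_concurrency:
        return "FeatureBase"
    return "Snowflake"


weekly_report = Workload("weekly trend report", latency_budget_ms=600_000, concurrent_users=5)
live_dashboard = Workload("customer-facing dashboard", latency_budget_ms=200, concurrent_users=500)
print(choose_engine(weekly_report))   # Snowflake
print(choose_engine(live_dashboard))  # FeatureBase
```

In both cases Snowflake remains the system of record; the rule only decides which engine answers the query.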

FeatureBase + Snowflake Benefits:

  • Access the data you need at the moment you need it 
  • Improve data freshness by operating on your most up-to-date and relevant data, then store it within the Snowflake SSoT
  • Reduce latency and costs by computing on data in its most optimized format

 

Architecture:

Diagram illustrating the benefits of Snowflake and FeatureBase together

Fig. 1

Customer Reference:

We worked with our customer, AnSi Solutions, to compare benchmarks on three system configurations using de-identified data from one of their customers. At the time of benchmarking, our customer was using Snowflake to preaggregate billions of records for use with a custom application. The application was an interactive dashboard that allowed for role-based access to summarized, de-identified medical claims to monitor benefit plan performance. The customer aimed to achieve low-latency query performance that was not cumbersome or frustrating to the end-user. The large volume and complexity of the source data resulted in extra storage, compute, and labor costs to create and maintain preaggregated tables.

The customer tested sets of five commonly used queries on three different system configurations to estimate time savings (related to computing costs) and query accuracy:

  • Snowflake with ‘light’ preaggregation
  • Snowflake with ‘heavy’ preaggregation
  • Snowflake + FeatureBase with no preaggregation

Snowflake without FeatureBase on lightly preaggregated data took minutes to return results and was limited to the three agreed-upon reporting periods. If an ad hoc time slice was needed for reporting – e.g., one week of results instead of the preaggregated 6-week, 13-week, and 6-month periods – a new preaggregated rollup table had to be created, which added 15-20 minutes to the initial query time.
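The limitation of fixed reporting periods can be illustrated with a toy example. This is a sketch under stated assumptions, not the customer's actual pipeline: the daily totals are made-up placeholder data, the window lengths in days (including 182 days for "6-month") are assumed, and `total_over` is a hypothetical helper.

```python
from datetime import date, timedelta

# Placeholder source data: one made-up total per day for the last 200 days.
TODAY = date(2024, 1, 31)
raw_daily_totals = {TODAY - timedelta(days=i): 10 for i in range(200)}

# The three agreed-upon reporting periods, with assumed lengths in days.
FIXED_WINDOWS_DAYS = {"6-week": 42, "13-week": 91, "6-month": 182}


def total_over(days: int) -> int:
    """Sum the raw daily totals for the most recent `days` days."""
    start = TODAY - timedelta(days=days - 1)
    return sum(value for day, value in raw_daily_totals.items() if day >= start)


# Preaggregation: each agreed window is computed once, ahead of time.
rollups = {name: total_over(days) for name, days in FIXED_WINDOWS_DAYS.items()}

print(rollups["6-week"])     # 420, served instantly from the rollup
print("1-week" in rollups)   # False: no rollup exists for an ad hoc slice
print(total_over(7))         # 70, but only by going back to the source data
```

A one-week slice is not derivable from the three stored rollups, so it must either be added as a new rollup or computed directly on the source data, which is the trade-off the benchmark explores.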

Snowflake without FeatureBase on heavily preaggregated data returned results in tens of seconds – much better than the lightly preaggregated data, but still a frustrating user experience for a dashboard. It also required significant investment to create and maintain thousands of lines of code and did not allow for flexibility in ad hoc reporting. Any new reporting elements required changes to the preaggregation script, which could take days, weeks, or longer to be requested, developed, tested, and deployed to production.

The Molecula team tested FeatureBase using commodity hardware with minimal tuning as a proof-of-concept for the customer. FeatureBase was demonstrated to be the most efficient option tested. Query results returned in seconds on source data while allowing for flexibility in calculating the reporting period and other ad hoc requests. Further, this was achieved with minimal data preparation and hardware usage for computing. Even faster query speeds on even bigger data are easy with FeatureBase. 

Benchmarks featuring Snowflake with preaggregated data vs. FeatureBase

Fig. 2

Learn more about how FeatureBase enables real-time analytics by speaking with one of our tech experts.
