The 4 Key Requirements of Real-Time Analytics: Latency, Fresh Data, Throughput, and Concurrency
Despite advances in big data and analytics, organizations still struggle with the complexity currently involved in building large-scale, real-time data applications. A typical big data architecture for a single use case might involve 6 to 20+ tools to cover ingestion, storage, transformation, observability, and analytics – and the process is not repeatable. These tools were designed to deal with structured reporting schedules that informed subsequent actions, and many are built for a batch-oriented world.
The rise of event streaming has brought real-time analytics one step closer to fruition, but streaming data requires new technology to take full advantage of its capabilities. In order to power machine learning and AI workloads, streaming data must be combined with historical data in real-time, and data from source applications should not require preaggregation or copying to achieve latency and concurrency. The modern data stack is primed for innovation as streaming data quickly becomes the status quo.
Developers of high-performance applications strive to deliver real-time analytics on massive data sets, but to do this, they must remain efficient while balancing four requirements:
- High Throughput: instantly ingest massive volumes of data
- Fresh Data: immediately act on data as it’s ingested
- Low Latency: millisecond queries
- High Concurrency: thousands of simultaneous queries
Now, let’s explore each of the four requirements of real-time analytics and how FeatureBase makes it possible to achieve all four without compromise.
Why are even the most modern data stacks still struggling to deliver real time?
Most tools on the market excel at one or two of these four requirements, maybe even three, but covering all four is incredibly challenging without spending a LOT of money. This is one reason implementing real-time analytics at scale is proving so difficult. Let’s take a closer look at each requirement and why it is needed to deliver real-time data at scale (Figure 1).
Key Requirement #1: High Throughput
When we discuss throughput, we’re referring to data ingest.
In 2021, it was estimated that 1.145 trillion MB of data was created every day. As a result, high throughput is necessary to keep up with the ever-increasing amount of continuously generated data and with the new technologies that stream it in real time (e.g., Apache Kafka). While no single entity needs to ingest and explore all of the data in the world, the figure is indicative of the exponential data growth that organizations hope to capitalize on and must inevitably confront.
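To put that estimate in per-second terms, a quick back-of-the-envelope conversion helps. The sketch below only divides the quoted daily figure by the number of seconds in a day; the variable names are illustrative, and the decimal unit conversion (1 TB = 1,000,000 MB) is an assumption.

```python
# Convert the quoted daily estimate into a sustained per-second rate.
MB_PER_DAY = 1.145e12            # 1.145 trillion MB per day (figure quoted above)
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds

mb_per_second = MB_PER_DAY / SECONDS_PER_DAY
tb_per_second = mb_per_second / 1_000_000  # assumes decimal units: 1 TB = 1,000,000 MB

print(f"{mb_per_second:,.0f} MB/s (about {tb_per_second:.2f} TB/s)")
```

No single system ingests all of this, but the same conversion applied to your own daily volume gives the sustained ingest rate your stack has to absorb.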
How FeatureBase Achieves High Throughput:
- During ingestion, the representation and key translation occur on the client side, so you can offload much of the computation to dedicated ingest servers, which can be ephemeral (i.e., they exist only while there is load).
- FeatureBase horizontally scales ingest resources separately from compute resources to maintain performance of both simultaneously.
- FeatureBase is fundamentally good at updates because it has to store a lot less data due to our feature-oriented format, making it faster.
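The client-side key translation mentioned in the first bullet can be pictured roughly as follows. This is a minimal sketch, not FeatureBase's client API: the `KeyTranslator` class, the `prepare_batch` function, and the idea of mapping string keys to dense integer IDs are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple


class KeyTranslator:
    """Hypothetical client-side translator: string keys -> dense integer IDs."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}

    def translate(self, key: str) -> int:
        # Assign the next unused ID the first time a key is seen.
        if key not in self._ids:
            self._ids[key] = len(self._ids)
        return self._ids[key]


def prepare_batch(
    events: Iterable[Tuple[str, str]],
    users: KeyTranslator,
    features: KeyTranslator,
) -> List[Tuple[int, int]]:
    """Turn (user_key, feature_key) events into integer pairs before sending."""
    return [(users.translate(u), features.translate(f)) for u, f in events]


users, features = KeyTranslator(), KeyTranslator()
batch = prepare_batch(
    [("alice", "clicked_ad"), ("bob", "clicked_ad"), ("alice", "purchased")],
    users,
    features,
)
print(batch)  # [(0, 0), (1, 0), (0, 1)]
```

Doing this work on ingest machines means the database receives compact integer pairs rather than raw strings. A real deployment with several ingesters would also need to coordinate ID assignment between them, which this sketch leaves out.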
Key Requirement #2: Fresh Data
Fresh data is data that is current and immediately usable. If you have to preaggregate data before using it, it will not be fresh, and if your data is not fresh, it is not accurate. Let’s use inventory as an example: if your inventory counts are not up-to-date to the millisecond, you could run into issues where you sell two items though you only have one in stock. Currently, the industry relies on preaggregation to improve query latency, but this will not be a feasible workaround as we move towards real-time predictions, AI, and ML.
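The inventory example can be made concrete with a small simulation. The sketch below is illustrative only: the `sell` function and the dictionaries are hypothetical, and the "snapshot" stands in for any preaggregated copy that is refreshed less often than sales occur.

```python
# Hypothetical inventory: one unit in stock, two purchase attempts.
live_stock = {"widget": 1}
snapshot = dict(live_stock)  # preaggregated copy, refreshed only periodically


def sell(counts: dict, item: str) -> bool:
    """Approve a sale if `counts` shows stock, then decrement the live count."""
    if counts[item] > 0:
        live_stock[item] -= 1
        return True
    return False


# Checking the stale snapshot approves both sales, so the item is oversold.
stale_results = [sell(snapshot, "widget"), sell(snapshot, "widget")]
oversold_by = -live_stock["widget"]

# Checking the live count approves only the first sale.
live_stock["widget"] = 1
fresh_results = [sell(live_stock, "widget"), sell(live_stock, "widget")]

print(stale_results, oversold_by)  # [True, True] 1
print(fresh_results)               # [True, False]
```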
How Traditional Databases Achieve Data Freshness:
- Data comes from source systems in a mix of raw data, like device logs, and curated exports designed for analytics. Storing these data together with matching keys allows for co-locating objects with actions, or, for example, users, their transactions, and information about the items purchased. However, analytical databases are optimized for giving summary statistics about users, not for returning individual actions a user has taken. To reach the latency required by the end-application, data must be rolled up to different granularities. Data may be limited to certain time ranges due to data storage or computation budget.
- Enterprise organizations have built their data infrastructures with the help of aggregates. Their use improves query response times but relies on duplicating datasets, significantly increasing the overhead and complexity of data pipelines and data governance. In addition to being frustrating and expensive, the data is often out of date by the time query results come in.
How FeatureBase Achieves Data Freshness:
- Molecula FeatureBase has an enhanced data model compared to relational or columnar databases. FeatureBase is optimized to let the individual actions a user has taken flow directly into user tables. This allows queries on unique users and events in the same table, rather than JOINing across multiple tables.
- FeatureBase eliminates preaggregation steps in customer pipelines that tend to cause long delays between when data initially comes into a database and when it’s available to query.
Key Requirement #3: Low Latency
Latency is the delay between a user’s action and a response to that action. For our purposes, we’re focusing on query latency – or the delay between when a user hits “run query” and when they receive a result. Traditional relational databases are infamous for being slow to process complex queries on large datasets, particularly those with high cardinality fields. Typically, you see customers with multiple data sources populating a normalized data model. As such, you must do complex JOINs to analyze data across these separate tables, ultimately increasing latency and making queries slow.
One strategy for reducing latency is to denormalize the data or preaggregate it by performing these JOINs ahead of time. This workaround will minimize latency, but preaggregation jobs can take minutes, hours, or days. As a result, the data you are using for analytics and to make decisions about your business is possibly out of date by the time you use it. In addition, these processes are inflexible, and when –not if– you need to add or change features, it can take days to weeks to modify a production environment.
How FeatureBase Achieves Low Latency:
- FeatureBase can process every shard of data in parallel. Because it has to do fewer JOINs in practice, FeatureBase can take further advantage of the parallelization (not as many aggregation steps during query execution).
- FeatureBase’s feature-oriented format structures data so that it does the minimum amount of input/output (I/O) to process analytical queries. As a result, it can granularly address only the particular value within a column that is necessary vs. a whole column or table. This cache-friendly approach allows for linear scans.
- FeatureBase’s feature-oriented format is highly efficient. CPUs try to predict what data you’ll be loading from memory. They predict well when you read memory straight through in order, as in a linear scan; with random access, they have trouble predicting, which adds a ton of latency.
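The shard-level parallelism in the first bullet follows a familiar pattern: compute a partial result per shard, then merge. The sketch below is a minimal illustration under stated assumptions, not FeatureBase's internals: the shard layout (a mapping from feature name to a set of record IDs), the feature names, and `count_in_shard` are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Set

# Hypothetical layout: per shard, each feature maps to the record IDs that have it.
Shard = Dict[str, Set[int]]

shards: List[Shard] = [
    {"sports_fan": {1, 2, 3}, "owns_car": {2, 3}},
    {"sports_fan": {10, 11}, "owns_car": {11, 12}},
    {"sports_fan": {20}, "owns_car": set()},
]


def count_in_shard(shard: Shard) -> int:
    """Count records in one shard that have both features (set intersection)."""
    return len(shard["sports_fan"] & shard["owns_car"])


# Each shard is processed independently; the only merge step is a sum.
with ThreadPoolExecutor() as pool:
    total = sum(pool.map(count_in_shard, shards))

print(total)  # 3
```

Because no shard needs data from another, the merge is a single addition rather than a JOIN. In CPython the thread pool shows the structure of the computation rather than a real CPU speedup.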
Key Requirement #4: High Concurrency
Query concurrency is the number of queries that are actively executing in parallel. In modern machine learning applications, hundreds or thousands of queries often need to run simultaneously at any given moment. Or, in large organizations, you might have thousands of users simultaneously exploring data. But every database has a breaking point, and after a certain number of concurrent queries, the latency skyrockets or queries are queued. This increase in latency is because finite compute resources have to be divided and shared, so queries have to compete or wait for these resources. Ultimately the CPU can do only so many operations at once, and they have to be spent wisely on both ingest and queries. If you have poor concurrency, the database becomes the bottleneck of your entire infrastructure.
One common workaround is to partition resources based on use case or prioritization of queries, but this is inflexible and reduces the efficiency you can get from your hardware. Hence, a modern real-time analytics data infrastructure requires high concurrency without sacrificing performance.
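A deliberately simple queueing model shows both effects: latency climbing once concurrent queries outnumber the available execution slots, and partitioning making it worse when load is uneven. The function name, the worker counts, and the 5 ms service time are illustrative assumptions, not measurements.

```python
import math


def worst_case_latency_ms(queries: int, workers: int, service_ms: float) -> float:
    """Latency of the last query when `queries` arrive together and each
    occupies one of `workers` slots for `service_ms` milliseconds."""
    waves = math.ceil(queries / workers)
    return waves * service_ms


# Shared pool of 16 workers: latency grows once queries exceed the slots.
for n in (8, 64, 1000):
    print(n, worst_case_latency_ms(n, workers=16, service_ms=5.0))

# Partitioned into two pools of 8: if all 1000 queries hit one partition,
# the other 8 workers sit idle and the worst case roughly doubles.
shared = worst_case_latency_ms(1000, workers=16, service_ms=5.0)
partitioned = worst_case_latency_ms(1000, workers=8, service_ms=5.0)
print(shared, partitioned)  # 315.0 625.0
```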
How FeatureBase Achieves High Concurrency:
- Reads are never blocked by other reads or writes.
- Memory is used efficiently, which allows space to be reclaimed.
- Hardware can be scaled to meet concurrency requirements.
Many other database systems incur a minimum overhead every time you add another node or create another instance. Because FeatureBase is CPU- and memory-optimized, its lightweight, feature-oriented platform uses fewer resources per query and can therefore serve more clients. FeatureBase also guarantees that data stays readable: even while a write is in progress, the previous version of the data can still be read, so queries are never blocked waiting on a write. This is especially beneficial for analytics organizations with multiple users querying a database or for machine-to-machine workflows, like those used in programmatic advertising.
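One general way to let reads proceed while a write is in flight is copy-on-write: writers build a new version and publish it, while readers keep whatever version they already hold. The sketch below illustrates that technique only; `SnapshotStore` is a hypothetical class, and nothing here is taken from FeatureBase's implementation.

```python
import threading
from types import MappingProxyType
from typing import Mapping


class SnapshotStore:
    """Hypothetical copy-on-write store: readers never take a lock."""

    def __init__(self) -> None:
        self._current: Mapping[str, int] = MappingProxyType({})
        self._write_lock = threading.Lock()  # serializes writers only

    def snapshot(self) -> Mapping[str, int]:
        # A reader grabs the current reference; that version stays unchanged
        # even if a writer publishes a newer one afterwards.
        return self._current

    def write(self, key: str, value: int) -> None:
        with self._write_lock:
            updated = dict(self._current)  # copy the previous version
            updated[key] = value
            self._current = MappingProxyType(updated)  # publish the new version


store = SnapshotStore()
store.write("stock", 1)
before = store.snapshot()   # reader holds the old version
store.write("stock", 0)     # writer publishes a new one
print(before["stock"], store.snapshot()["stock"])  # 1 0
```

Copying the whole mapping on every write is a simplification; a production system would share unchanged data between versions.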
The Most Common Compromise: Brute Force
We would be remiss not to call out that all four of these requirements are technically achievable, even with legacy systems as a starting point, but it requires excessive resource burn. You can scale up your hardware (e.g., using GPUs), scale out workloads to hundreds or thousands of machines, or even hire engineers to build a new solution from the ground up that works specifically for your needs. However, most organizations cannot throw efficiency out the window to achieve performance requirements. They have to find a balance between performance and cost.
The Real-Time Database for Any Scale
A feature-oriented format, which is what underlies Molecula’s product, FeatureBase, is the key to achieving the four requirements of the intelligent analytical stack with massive-scale datasets and without the tradeoffs typically made today. FeatureBase is a real-time database that sits on top of conventional data models (such as relational or star schemas), data lakes, and streaming tools like Kafka. It can maintain up-to-the-millisecond data updates with practically no data preparation. FeatureBase lets you power workloads using the freshest data by eliminating the need for data aggregation steps. It also offers 10x compression over traditional columnar databases and dramatically reduces the compute necessary to power workloads.
Feature-Oriented Format Reduces 100 Servers to 9 Servers at Major Ad Tech Company
One of our customers is an ad tech company with billions of records (people) and billions of attributes. The traditional column-oriented analytical database the customer was using could not meet their needs. FeatureBase required more than 10x less hardware than the competing technology they were considering.
With FeatureBase and the feature-oriented format, they were able to achieve the following simultaneously:
- Serve ultra low latency queries, millisecond results
- Ingest 1M events per second, 4x initial target
- Update more than 30B events per day across 5B records, millisecond updates
- Power 200 concurrent queries, scalable concurrency
But perhaps the most exciting part of this is that the customer estimated it would require 100 servers to do this on a competing technology, but with FeatureBase, they accomplished this with just nine servers.
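The figures above can be converted into comparable units with a few lines of arithmetic. The sketch uses only the numbers quoted in this section; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

updates_per_second = 30e9 / SECONDS_PER_DAY   # 30B updated events per day
initial_ingest_target = 1_000_000 / 4          # 1M events/s was 4x the target
server_ratio = 100 / 9                         # estimated servers vs. servers used
server_reduction = 1 - 9 / 100                 # share of servers avoided

print(f"{updates_per_second:,.0f} updates/s on average")
print(f"{initial_ingest_target:,.0f} events/s initial ingest target")
print(f"{server_ratio:.1f}x fewer servers ({server_reduction:.0%} reduction)")
```

Run as written, this works out to roughly 347,000 updates per second on average, an initial ingest target of 250,000 events per second, and about 11x fewer servers.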