The Value of “Real Time” in Data Engineering

Implementing a true real-time architecture has benefits beyond just speed

“Real time” is a common term that sounds intuitive enough, but for those of us in the analytics and machine learning space, it is loaded with nuances that aren’t always clearly defined. In everyday language, when someone calls something “real time,” they typically imagine an interface that reacts instantly. For example, if you click “refresh” and your bank balance updates, that feels instant. But something can feel real time without being real time: unbeknownst to the user, the result may be based on stale data. This article unpacks what we call true real time and explains how making real-time data the foundation of your architecture can improve your product, lower costs, and set you up for AI success.

What Is Real-Time Data Analytics?

In contrast to the layperson’s definition, the data engineering perspective of “real time” doesn’t refer to an amount of time, but to a methodology in which data is not stored or held before it is put to use. That on its own doesn’t necessarily mean the data will feel instant to the end user. At Molecula, we assert that an application must meet two requirements to be true real time: 1) the data being accessed must be fresh, and 2) queries on that data must return with ultra-low latency (see Figure 1).

Figure 1: Difference between freshness and latency in machine learning
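To make the two requirements concrete, here is a minimal Python sketch that measures them separately for a single query. The class name, the field names, and the threshold values (1 second of freshness lag, 100 ms of latency) are all hypothetical choices for illustration; they are not Molecula’s definitions or part of any FeatureBase API.

```python
from dataclasses import dataclass


@dataclass
class QueryObservation:
    """Timestamps, in seconds, for one query. All names here are hypothetical."""
    newest_record_time: float  # when the newest record visible to the query was produced
    query_start_time: float    # when the query was issued
    query_end_time: float      # when the result came back

    @property
    def freshness_lag(self) -> float:
        # Age of the newest data the query could see at the moment it was issued.
        return self.query_start_time - self.newest_record_time

    @property
    def latency(self) -> float:
        # Time the caller waited for the answer.
        return self.query_end_time - self.query_start_time


def is_true_real_time(obs: QueryObservation,
                      max_lag_s: float = 1.0,
                      max_latency_s: float = 0.1) -> bool:
    """Both conditions must hold; the default thresholds are assumed, not prescribed."""
    return obs.freshness_lag <= max_lag_s and obs.latency <= max_latency_s


# Fast answer on fresh data: meets both requirements.
fresh_and_fast = QueryObservation(100.0, 100.5, 100.55)

# Fast answer on hour-old data: feels instant, but the data is stale.
stale_but_fast = QueryObservation(0.0, 3600.0, 3600.01)

# Fresh data, but the query takes five seconds to return.
fresh_but_slow = QueryObservation(100.0, 100.5, 105.5)
```

The second case is the bank balance example from the introduction: the response arrives quickly, so it feels real time, yet it fails the freshness requirement.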

Most challenges in implementing AI and high-performance analytics applications arise in one or both of these areas, and most attempted solutions use a different technology to address each one. However, if you are willing to imagine a solution that doesn’t iterate on existing technologies, there is a fundamentally superior approach to data storage that solves both data freshness and query speed issues. Molecula’s FeatureBase is the single solution that enables true real-time processing at enterprise scale.

How Molecula FeatureBase Enables True Real Time

Most applications don’t operate on raw data. Typically, data undergoes some sort of processing before it is actually usable. The process looks something like this:

Figure: Traditional data flow

While storage is relatively cheap and making copies doesn’t sound so terrible, every copy costs time, and each one must then be managed for security, freshness, and infrastructure. With 85% of the world’s stored data being copies, this methodology has become a scourge on real-time data engineering.

How to Achieve Real-Time Data Without Copies or Caching

The premise of FeatureBase is that you don’t need to make copies of your data in order to make it instantly computable and accessible. If you can speed up data access enough to perform complex JOINs and queries in real time, you can cut out the inefficient “middlemen” of the data access process. The FeatureBase innovation is a technique that reduces data to its most fundamental bits so that computers can operate directly on the values of the source data. The process with FeatureBase looks like this:

Figure: Data flow with FeatureBase

FeatureBase automatically normalizes complex schemas and encodes data into the most computationally efficient format possible. The result is a fully computable but lighter and faster feature format, purposely designed for machine processing. A human would not be able to look at a FeatureBase extraction and make any sense of it; it’s all 1s and 0s representing every possible combination of the data in an infinite state table. However, for real-time applications, machines are the consumers of the data, not humans. Automatically converting all data into features at the beginning of your process is what we call a feature-first approach. If you can continuously automate the conversion of your data into machine-friendly (i.e., fast and lightweight) features that are instantly accessible to all applications, you have solved both the freshness and the query speed issues of real-time delivery at once. Access to features is so fast that queries can be made against the freshest data on the fly, with results returned in milliseconds.
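The idea of reducing data to 1s and 0s can be illustrated with a toy bitmap index in Python. The article does not describe FeatureBase’s internal format, so this is a generic sketch of the technique under that assumption, not FeatureBase’s implementation. The `BitmapIndex` class, its methods, and the sample records are all hypothetical.

```python
from collections import defaultdict


class BitmapIndex:
    """Toy bitmap index: one integer per (field, value) pair.

    Bit i of a bitmap is 1 when record i has that value for that field.
    """

    def __init__(self):
        self.bitmaps = defaultdict(int)

    def set(self, record_id, field, value):
        # Turn on the bit for this record in the matching bitmap.
        self.bitmaps[(field, value)] |= 1 << record_id

    def row(self, field, value):
        # Unknown values yield an empty bitmap (0) without creating an entry.
        return self.bitmaps.get((field, value), 0)


def record_ids(bitmap):
    """Decode a bitmap back into the list of record ids whose bit is set."""
    ids = []
    position = 0
    while bitmap:
        if bitmap & 1:
            ids.append(position)
        bitmap >>= 1
        position += 1
    return ids


# Hypothetical source records, encoded once as they arrive.
records = [
    {"plan": "pro", "region": "us"},
    {"plan": "free", "region": "us"},
    {"plan": "pro", "region": "eu"},
    {"plan": "pro", "region": "us"},
]

index = BitmapIndex()
for record_id, record in enumerate(records):
    for field, value in record.items():
        index.set(record_id, field, value)

# "plan = pro AND region = us" becomes a single bitwise AND on two integers.
pro_in_us = index.row("plan", "pro") & index.row("region", "us")
matching_ids = record_ids(pro_in_us)
```

In this encoding a filter is a bitwise operation rather than a scan over rows, which is why a machine can answer it quickly even though the stored bits mean nothing to a human reader.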

Having data continuously available for any use case in a format that machines can instantly compute without pre-aggregation, pre-processing, or copying is the holy grail of real-time architecture: true real time.

But Wait, There’s More

Being able to implement a true real-time architecture is an obvious benefit of FeatureBase. But once you have embraced the feature-first approach to big data architecture, you’ll discover a cascade of additional benefits to this new way of thinking about data storage and access.

For example, remember how 85% of data is made up of copies? Imagine the savings in time, cost, and hassle from permanently reducing your data footprint once those copies are no longer needed. Another major benefit is in the ML lifecycle. Imagine if data scientists could reuse features from project to project across departments. The time saved by not having to make individual, predetermined requests to IT can be put to better use developing and testing models. Further, when all the data is accessible through automated feature extraction and storage, data scientists gain flexibility and creativity because they don’t have to determine in advance which features they need pre-computed; the computations can be made ad hoc in real time. The feature-first approach also simplifies the process of going to production and eliminates online vs. offline skew, since you’re able to train models with the actual production data.

Real-Time Recap

In summary, the need and opportunity for real-time applications are increasing. With traditional methodologies, real time is becoming impossible to achieve as the data death spiral only makes things harder to unravel. Adopting a feature-first mindset enables true real-time access and latencies so low that you can perform transformations, calculations, and data prep directly in your model, your query, or your code. Molecula’s FeatureBase automatically and continuously extracts features from your raw data that represent all of the trends and patterns that data analysis allows you to discover, but without the weight of the data itself. From freshness to latency to reuse, FeatureBase is an investment with tangible returns across your stack and across your organization.
