The Value of “Real Time” in Data Engineering
Implementing a true real-time architecture has benefits beyond just speed
“Real time” is a common term that sounds intuitive enough, but for those of us in the analytics and machine learning space, it carries nuances that aren’t always clearly defined. In everyday language, when someone calls something “real time,” they typically imagine an interface that responds instantly to their actions. For example, if you click “refresh” and your bank balance updates, that feels instant. But something can feel real time without actually being real time, because the user has no way of knowing that the response is based on stale data. This article unpacks what we call true real time and explains how making real-time data the foundation of your architecture can improve your product, lower costs, and set you up for AI success.
What is Real Time?
In contrast to the layperson’s definition, the data engineering perspective of “real time” doesn’t refer to an amount of time, but to a methodology whereby data is not stored or held before it is put to use. That on its own doesn’t necessarily mean the data will feel instant to the end user. At Molecula, we assert that there are two requirements for an application to be truly real time: 1) the data being accessed must be fresh, and 2) queries against that data must return with ultra-low latency (see Figure 1).
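To make the two requirements concrete, here is a minimal Python sketch that treats a response as true real time only when it passes both a freshness check and a latency check. The `QueryResult` type, the `is_true_real_time` function, and the threshold values are hypothetical names and numbers chosen for illustration; they are not part of any Molecula product or API.

```python
from dataclasses import dataclass


@dataclass
class QueryResult:
    value: object
    data_timestamp: float  # when the newest underlying record was produced (epoch seconds)
    latency_s: float       # how long the query took to return, in seconds


def is_true_real_time(result, now, max_staleness_s=1.0, max_latency_s=0.05):
    """Return True only if BOTH requirements hold: fresh data AND a fast query.

    The default thresholds are illustrative assumptions, not published limits.
    """
    fresh = (now - result.data_timestamp) <= max_staleness_s
    fast = result.latency_s <= max_latency_s
    return fresh and fast


now = 1_000.0  # fixed "current time" so the example is deterministic

# A cached dashboard: answers in 5 ms, but the data is an hour old.
fast_but_stale = QueryResult(value=42, data_timestamp=now - 3600, latency_s=0.005)

# A query on live data that takes 2 seconds to come back.
fresh_but_slow = QueryResult(value=42, data_timestamp=now - 0.2, latency_s=2.0)

# Fresh data and a millisecond-scale response.
fresh_and_fast = QueryResult(value=42, data_timestamp=now - 0.2, latency_s=0.005)

print(is_true_real_time(fast_but_stale, now))  # False
print(is_true_real_time(fresh_but_slow, now))  # False
print(is_true_real_time(fresh_and_fast, now))  # True
```

The first case is the bank-balance scenario from the introduction: it feels instant, but fails the freshness requirement.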
Most challenges in implementing AI and high-performance analytics applications arise in one or both of these areas, and most attempted solutions use a different technology to address each one. However, if you are willing to imagine a solution that doesn’t iterate on existing technologies, there is a fundamentally different approach to data storage that addresses both data freshness and query speed. Molecula’s FeatureBase is the single solution that enables true real-time processing at enterprise scale.
How Molecula’s FeatureBase Enables True Real Time
Most applications don’t operate on raw data. Typically, data undergoes some sort of processing before it is actually usable. The process looks something like this:
While storage is relatively cheap and making copies doesn’t sound so terrible, you have to keep in mind that every time you make a copy of any data, you lose time and you are faced with managing each copy in terms of security, freshness, and infrastructure. With 85% of the world’s stored data being copies, this methodology has become a scourge on real-time data engineering.
Achieve Real Time Without Copies or Caching
The premise of FeatureBase is that you don’t need to make copies of your data in order to make it instantly computable and accessible. If you can make data access fast enough to perform complex JOINs and queries in real time, you can cut out the inefficient “middlemen” of the data access process. The FeatureBase innovation is a technique that reduces data to its most fundamental bits so that computers can operate directly on the values of the source data. The process with FeatureBase looks like this:
FeatureBase automatically normalizes complex schemas and encodes data into the most computationally efficient format possible. The result is a fully computable, but lighter and faster, feature format purposely designed for machine processing. A human would not be able to look at a FeatureBase extraction and make any sense of it; it’s all 1s and 0s representing every possible combination of the data in an infinite state table. However, for real-time applications, machines are the consumers of the data, not humans. Automatically converting all data into features as the first step in your process is what we call a feature-first approach. If you can continuously automate the conversion of your data into machine-friendly (i.e., fast and lightweight) features that are instantly accessible to all applications, you have solved both the freshness and query speed issues of real-time delivery at once. Access to features is fast enough that queries can be made against the freshest data on the fly, with results returned in milliseconds.
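As a rough illustration of why a format made of 1s and 0s can be so fast to query, the sketch below builds a simple bitmap index in Python: one bitmap per field/value pair, with bit i set when record i has that value. This is a generic textbook technique shown under assumptions, not FeatureBase’s actual storage format or API; the `encode` and `matching_ids` helpers and the sample records are hypothetical.

```python
from collections import defaultdict


def encode(records):
    """Build one bitmap per (field, value) pair.

    Each bitmap is a Python int used as a bitset: bit i is 1 when
    record i has that value for that field.
    """
    bitmaps = defaultdict(int)
    for i, record in enumerate(records):
        for field, value in record.items():
            bitmaps[(field, value)] |= 1 << i
    return dict(bitmaps)


def matching_ids(bitmap):
    """Decode a bitmap back into the positions of the records whose bit is set."""
    ids = []
    i = 0
    while bitmap:
        if bitmap & 1:
            ids.append(i)
        bitmap >>= 1
        i += 1
    return ids


# Hypothetical sample data.
records = [
    {"plan": "pro", "region": "us"},
    {"plan": "free", "region": "eu"},
    {"plan": "pro", "region": "eu"},
    {"plan": "pro", "region": "us"},
]
bitmaps = encode(records)

# "plan = pro AND region = us" becomes a single bitwise AND,
# with no scan over the original rows.
pro_us = bitmaps[("plan", "pro")] & bitmaps[("region", "us")]

print(matching_ids(pro_us))      # [0, 3]
print(bin(pro_us).count("1"))    # 2 matching records
```

In this toy encoding, filters and counts reduce to bitwise operations on integers, which is the kind of work machines do extremely quickly, even though the bitmaps themselves are unreadable to a person.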
Having data continuously available for any use case, in a format that machines can instantly compute without pre-aggregation, pre-processing, or copying, is the holy grail of real-time architecture: true real time.
But Wait, There’s More
Being able to implement a true real-time architecture is an obvious benefit of FeatureBase. But once you have embraced the feature-first approach to big data architecture, you’ll discover a cascade of additional benefits to this new way of thinking about data storage and access.
For example, remember how 85% of data is made up of copies? Imagine the savings in time, cost, and hassle from permanently reducing your data footprint once those copies are no longer needed. Another major benefit is in the ML lifecycle. Imagine if data scientists could reuse features from project to project and across departments. The time saved by not having to make individual, predetermined requests to IT can be put to better use developing and testing models. Further, when all the data is accessible through automated feature extraction and storage, data scientists gain flexibility and creative freedom because they don’t have to decide in advance which features they need pre-computed; the computations can be made ad hoc in real time. The feature-first approach also simplifies the process of going to production and eliminates online vs. offline skew, since you’re able to train models with the actual production data.
In summary, the need and opportunity for real-time applications are increasing. With traditional methodologies, real time is becoming impossible to achieve as the data death spiral only makes things harder to unravel. Adopting a feature-first mindset enables true real-time access, with latencies so low that you can perform transformations, calculations, and data prep directly in your model, your query, or your code. Molecula’s FeatureBase automatically and continuously extracts features from your raw data that represent all of the trends and patterns that data analysis allows you to discover, but without the weight of the data itself. From freshness, to latency, to reuse, FeatureBase is an investment with tangible returns across your stack and across your organization.