A Feature-First Primer for Better AI

By: Molecula

TL;DR

  • In data science, a feature is a type of measurable data upon which a decision can be made. Features enable machines to make decisions instantly. 
  • During the critical feature selection step of any ML project, data scientists must decide what features they need to engineer from the raw data to power models that ultimately drive business outcomes.
  • Today’s standard ML process takes a “request-and-wait” approach to features: raw data requests can take days or weeks to deliver, the data is not fresh, and the features are not easy to share or reuse across projects.
  • Feature-first is different. By automatically converting your data to a machine-native, feature-first format from the start, you keep it computable and ready for feature selection, model training, and production at all times.

A More Detailed Look at Feature-First:

Molecula is an operational AI company that enables businesses to deploy real-time analytics and AI in their applications (without pre-processing) through the adoption of a feature-first mindset. We often get the question: What does feature-first mean?

Let’s explain.

When developing AI, data scientists build models that look for and learn from patterns in data, and then use these patterns to make decisions and predict future outcomes. Models do not work without large quantities of high-quality data, and you can’t just throw all your raw data into them. The data that models actually use are called features.

What is a Feature in Data Science?

In simplest terms, a feature is a data point that is of interest to a data scientist for building a model. But it’s not just any data: a feature is a type of measurable data upon which a decision can be made. For example, “animal” is not an actionable feature, but the value “is an animal” would be a useful feature. If you know whether or not something is an animal, you can make a decision based on that information.

This is important because many machine learning models work by chaining together large numbers of simple decisions, much as a decision tree does. Features enable machines to make decisions. Billions of them. Instantly.

As an oversimplified example, suppose you are looking at data about animal traits and feed the following features into your model: “is a pig” and “has wings.” The actual feature values would be either “yes” or “no,” that is, “1” or “0.” This format enables the model to make a fast decision based on the feature trait. If your model then processed millions of records, checking whether an animal is a pig AND has wings, it would learn that pigs do not have wings. Or, from the computer’s binary perspective, those two features are never both “1.”
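
The pig-and-wings check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names (`is_pig`, `has_wings`), and the helper `count_both` are all made up for this example and do not come from any particular product.

```python
# Hypothetical animal records; each feature is 1 ("yes") or 0 ("no").
records = [
    {"name": "pig", "is_pig": 1, "has_wings": 0},
    {"name": "sparrow", "is_pig": 0, "has_wings": 1},
    {"name": "pig", "is_pig": 1, "has_wings": 0},
    {"name": "bat", "is_pig": 0, "has_wings": 1},
    {"name": "dog", "is_pig": 0, "has_wings": 0},
]


def count_both(records, feature_a, feature_b):
    """Count records where both binary features are 1 (a logical AND)."""
    return sum(r[feature_a] & r[feature_b] for r in records)


# In this sample, "is a pig" and "has wings" are never both 1.
pigs_with_wings = count_both(records, "is_pig", "has_wings")
print(pigs_with_wings)  # 0
```

Because every value is already a 1 or a 0, the question “is this a pig AND does it have wings?” needs no parsing or conversion, only a bitwise AND per record.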

Features are the MVPs of the ML Process

Feature selection is one of the most critical steps in a successful ML project. Data scientists must decide what features they need to engineer from the raw data to power models that ultimately achieve the desired business outcomes. This is both an art and a science because it’s not always clear in advance which features will be valuable in a model.

Think of a data scientist as a master chef. Now imagine that chef were developing a new recipe and didn’t have sugar, salt, butter, or spices in their kitchen. Features are the ingredients for models. In today’s standard ML process, data scientists must ask IT for each ingredient in advance and wait days or weeks for access to only those precise ingredients. If the data science chef wants to try adding a little butter to thicken the sauce, they have to go to the farm, make a request for butter, wait for the cows to get milked, the cream to be churned, and so on.

Today’s standard ML process is not feature-first. There are significant downsides to the current “request-and-wait” approach to features:

  • Raw data requests can take days or weeks to deliver
  • The data is delivered as a static snapshot in time, so it is not fresh
  • Sometimes features are computed from other data, so it isn’t easy to trace their lineage or understand the context in which they were created (which means they lose any hope of being reusable)
  • Since features are selected and engineered by the data scientist after the data dump (often on a laptop), they are not easy to share or reuse across projects or departments. Data scientists and data engineers end up going back to the raw data and repeating the process any time features are needed (resulting in duplicated efforts)
  • When a model goes into production, there is nearly always a discrepancy between the training data and the production data that must be ironed out

The Benefits of a Feature-First Approach

Feature-first is different. Feature-first means automatically converting your data to a machine-native format so that it is computable and ready for feature selection and model training at all times. It’s the equivalent of stocking the data scientist’s kitchen with every possible ingredient within arm’s reach before, during, and after recipe development. A feature-first format is written in a model’s native language, allowing a model to recognize patterns in the data as quickly as possible.

At Molecula, we developed a way to automatically and continuously extract features from data before the data scientists even begin to build models. Our flagship product, FeatureBase, monitors all sources of data and translates them into a radically efficient “1” and “0” format. The result is extremely fast computation and machine-based decision making on all data, not just a select few predetermined features.
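
To give a feel for what a “1” and “0” format can look like, here is a minimal sketch that turns ordinary rows into one bitmap per column value and answers a question with a single bitwise AND. It is an assumption-laden illustration, not FeatureBase’s actual storage format or API: the function `encode_bitmaps`, the sample rows, and the column names are hypothetical.

```python
from collections import defaultdict


def encode_bitmaps(rows, columns):
    """Turn each (column, value) pair into a bitmap stored as a Python int.

    Bit i of a bitmap is 1 when row i has that value in that column.
    """
    bitmaps = defaultdict(int)
    for i, row in enumerate(rows):
        for col in columns:
            bitmaps[(col, row[col])] |= 1 << i
    return dict(bitmaps)


# Hypothetical raw rows, as they might arrive from a source system.
rows = [
    {"species": "pig", "habitat": "farm"},
    {"species": "sparrow", "habitat": "forest"},
    {"species": "pig", "habitat": "forest"},
    {"species": "cow", "habitat": "farm"},
]

bitmaps = encode_bitmaps(rows, ["species", "habitat"])

# "is a pig AND lives on a farm" becomes one bitwise AND over two bitmaps.
match = bitmaps[("species", "pig")] & bitmaps[("habitat", "farm")]
matching_rows = [i for i in range(len(rows)) if (match >> i) & 1]
print(matching_rows)  # [0]
```

The point of the sketch is that every distinct value becomes its own yes/no feature up front, so no one has to decide in advance which features are worth extracting.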

When turning data into features is the first step in the ML process, the result is a faster, more efficient, more flexible, more secure, and reusable data investment. Some of the benefits include:

  • Data scientists don’t have to request access to (or a copy of) the raw data. Not only does this save time, but it’s a good security practice.
  • Data scientists have full access to every feature that can be derived from a dataset, and those features are continuously updated to reflect the source data in real time.
  • Features can be selected for use in models at any time, because all the data is already in the decision-ready “yes” or “no” format.
  • Features can be computed at any time, so things like averages or sums can be computed on the fly, within code or models, and the source values can still be accessed.
  • The data the models are trained on is the same data that you put into production.
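
As a rough sketch of the “computed on the fly” point in the list above, the snippet below averages a numeric source column over the rows selected by a binary feature, without modifying the source values. The data, the names `weights_kg` and `is_pig`, and the helper `mean_where` are hypothetical and chosen only for illustration.

```python
# Hypothetical source values, kept alongside a binary feature.
weights_kg = [110.0, 0.03, 95.0, 600.0]
is_pig = [1, 0, 1, 0]


def mean_where(values, mask):
    """Average the values whose mask bit is 1; None if nothing is selected."""
    selected = [v for v, m in zip(values, mask) if m]
    return sum(selected) / len(selected) if selected else None


# The aggregate is computed when it is asked for, not stored in advance.
avg_pig_weight = mean_where(weights_kg, is_pig)
print(avg_pig_weight)  # 102.5
```

The original weights remain available after the calculation, so a different aggregate (a sum, a maximum) can be computed later from the same source values.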

Furthermore, when all of your data is converted to the machine-native format of 1s and 0s from the outset, everything you do with the data is faster and more efficient, not just ML modeling. That’s why we believe the feature-first paradigm is so revolutionary to the ML, AI, and big data industries.

Human-scale data challenges have effectively been solved (through storage clouds like Snowflake). The goal of adopting a feature-first mindset is to focus on solving the machine-scale challenges that we have only just begun to face, or even to comprehend.

Applying a feature-first approach to analytics and AI initiatives enables truly real-time use cases through the elimination of pre-processing, unlocking a wealth of opportunity for companies of all sizes to transform their investment in big data into tangible business outcomes.