[ fee-cher ] n. A feature is an individual, measurable property, attribute, or characteristic of an observed phenomenon that serves as a computationally efficient input variable for a given system or model.
The Origin of Features
The technique of using features was pioneered by data scientists who needed to prepare data for demanding machine learning and AI workloads. Features have historically been extracted from source data in a process called feature extraction, one of the initial steps in the overall feature engineering lifecycle. Feature extraction has typically been a manual process that a data scientist performs after receiving the source data, which data engineers usually export from IT databases in an ad hoc process. The data scientist then almost always moves the data to a laptop for processing, which consists of several steps including data preparation, feature ranking, feature selection, feature transformation, and feature reuse.
A feature, in its purest form, is information that can serve as an input in a model. It is an attribute or unique variable. A feature represents the presence of a particular attribute for a given record with 100% fidelity to the underlying dataset. A feature, which is conceptually interchangeable with a “column” in tabular data, represents a measurable piece of data that can be used for analysis: Name, Age, Sex, Fare, and so on.
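The column view of features described above can be sketched in a few lines of Python. This is only an illustration: the sample records, the helper names `column` and `presence`, and the idea of mapping each value to the set of records that carry it are assumptions made for this example, not a description of how any particular product stores features.

```python
# Hypothetical sample records; the field names follow the examples in the text.
records = [
    {"Name": "Alice", "Age": 29, "Sex": "female", "Fare": 71.28},
    {"Name": "Bob", "Age": 41, "Sex": "male", "Fare": 8.05},
    {"Name": "Carol", "Age": 35, "Sex": "female", "Fare": 53.10},
]


def column(rows, name):
    """Return one feature (column) as a list of values, one per record."""
    return [row[name] for row in rows]


def presence(rows, name):
    """Map each distinct value of a feature to the set of record indexes that have it."""
    index = {}
    for i, row in enumerate(rows):
        index.setdefault(row[name], set()).add(i)
    return index


ages = column(records, "Age")
sex_presence = presence(records, "Sex")
print(ages)
print(sex_presence)
```

Here `ages` is the "Age" feature read as a plain column, while `sex_presence` records, for each value of "Sex", which records have that attribute. Both views are derived directly from the records, so nothing from the underlying dataset is lost.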
Because generating features has required considerable manual effort, only a small subset of the original data is usually extracted into features.
How Features Work in Molecula
At Molecula, we have invented a technology that converts semi-structured and fully structured data into features. Once extracted, features are stored in a feature-oriented format that is managed inside Molecula’s FeatureBase. This process is especially effective on datasets that are terabyte-scale or that generate millions or billions of events per day.
Features stored in and retrieved from FeatureBase can securely serve analytical, high-concurrency workloads in milliseconds, with a footprint 60-90% smaller than the data they represent. This performance, which traditional information-era systems cannot achieve, allows transformations and joins to happen directly in your model, either in training or in production.