Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data. It is the process of preparing the proper input, compatible with machine learning algorithm requirements, and ideally improving machine learning performance.
Good feature engineering is dependent on a number of factors:
- The performance measures you’ve chosen (RMSE? AUC?)
- The framing of the problem (classification? regression?)
- The predictive models you’re using (SVM?)
- The raw data you have selected and prepared (samples? formatting? cleaning?)
At its most basic level, feature engineering is asking yourself — “What should the X inputs be?”
Since each dataset is different, the feature engineering required is also different, but there are established procedures, and with practice, one learns which methods and practices will work best based on the dataset received.
Traditionally, feature engineering has been an iterative process, heavily dependent on data selection, data availability, and model evaluation, repeated over and over. Having a well-defined problem is key to knowing when to stop feature engineering (on an individual project). Without a well-defined problem, the feedback loop can keep you iterating for a very long time.
Feature engineering techniques include:
- Imputation: When you’re facing missing values, the simplest answer is to drop the whole column that includes them, but this is rarely the best choice for model performance. An alternative is imputation, which preserves the data size by replacing missing values with a default value or a statistic such as the column median.
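A minimal pandas sketch of both strategies, on hypothetical toy data (column names and fill values are illustrative, not from the source):

```python
import pandas as pd

# Hypothetical toy data with missing values (illustrative only)
df = pd.DataFrame({"age": [25.0, None, 40.0, None, 33.0],
                   "city": ["NY", "LA", None, "NY", "LA"]})

# Numerical column: replace missing values with the column median
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: replace missing values with a default value
df["city"] = df["city"].fillna("Unknown")
```

Both columns keep all five rows, which is the point of imputation over dropping.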
- Handling outliers: Outliers can be detected visually (often the most reliable approach), with a standard-deviation threshold, or with percentiles. Once discovered, outliers can be dropped, or alternatively you might choose to cap them instead of dropping them, which preserves your data size.
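A sketch of the standard-deviation and percentile approaches, assuming toy data with one extreme value (the 2-sigma threshold and the 5th/95th percentiles are illustrative choices, not prescriptions):

```python
import pandas as pd

# Hypothetical data with one extreme value (illustrative only)
s = pd.Series([10, 12, 11, 13, 12, 11, 300])

# Detect outliers: values more than 2 standard deviations from the mean
mean, std = s.mean(), s.std()
outliers = s[(s - mean).abs() > 2 * std]

# Cap (winsorize) at the 5th and 95th percentiles instead of dropping
capped = s.clip(lower=s.quantile(0.05), upper=s.quantile(0.95))
```

Capping keeps all seven rows while pulling the extreme value back toward the bulk of the data.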
- Binning: Binning is the process of grouping values to make your data more regularized. Typically binning is used to make models more robust and to prevent overfitting, but it sacrifices information, which can carry a performance cost. The key focus of the binning process is to strike a balance between performance and overfitting. As an example of when binning might be appropriate: if your dataset has 100,000 rows, it might be a good option to unite categorical labels with a count of less than 100 into a new category like “Other”.
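Both categorical and numerical binning can be sketched in pandas; the labels, counts, and cut points below are made up for illustration:

```python
import pandas as pd

# Hypothetical categorical column with two rare labels (illustrative only)
s = pd.Series(["red"] * 50 + ["blue"] * 45 + ["teal"] * 3 + ["mauve"] * 2)

# Unite labels whose count falls below a threshold into "Other"
counts = s.value_counts()
rare = counts[counts < 10].index
binned = s.where(~s.isin(rare), other="Other")

# Numerical binning: cut a continuous feature into labeled ranges
ages = pd.Series([5, 17, 25, 42, 67])
age_bins = pd.cut(ages, bins=[0, 18, 65, 100],
                  labels=["child", "adult", "senior"])
```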
- Log transform: Log transform is one of the most commonly used mathematical transformations in feature engineering. It helps handle skewed data, bringing the distribution closer to normal. It also normalizes magnitude differences across the range of the data: the difference between a 15 year old and a 20 year old is not equivalent to the difference between a 65 year old and a 70 year old, since 5 years at younger ages carries a larger relative difference, and a log transform equalizes such relative differences within a dataset (note: this also tends to reduce the effect of outliers).
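A minimal sketch on hypothetical right-skewed data (the income figures are invented for illustration). `np.log1p` computes log(1 + x), which safely handles zeros; negative values would need shifting first:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data, e.g. incomes (illustrative only)
s = pd.Series([20_000, 35_000, 40_000, 55_000, 1_000_000])

# log1p = log(1 + x): compresses large values, tolerates zeros
log_s = np.log1p(s)
```

After the transform, the largest value is only a small multiple of the smallest instead of 50x, and the skew of the distribution drops.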
- One-hot encoding: One-hot encoding is one of the most common encoding methods in machine learning. This method spreads the values in a column across multiple flag columns and assigns 0 or 1 to them; these binary values express membership in each category. Categorical data is difficult for many algorithms to consume directly, and one-hot encoding converts it into a numerical format, enabling you to group your categorical data without losing any information.
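A sketch using pandas' `get_dummies`, on an invented categorical column (city names are illustrative):

```python
import pandas as pd

# Hypothetical categorical column (illustrative only)
df = pd.DataFrame({"city": ["NY", "LA", "SF", "NY"]})

# Spread the column into one binary flag column per category
encoded = pd.get_dummies(df, columns=["city"], dtype=int)
```

Each row has exactly one 1 across the flag columns, so no information is lost.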
- Grouping operations: Often in machine learning algorithms, every instance is represented by a row in the training dataset, where every column shows a different feature of the instance. However, certain types of data, such as transactional data, rarely fit this standard because an instance spans multiple rows. In these cases, it’s common to group the data by instance and represent each instance as one row. The main point of group-by operations is to choose appropriate aggregation functions for each feature.
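A sketch collapsing hypothetical transactional data to one row per customer (column names and aggregation choices are illustrative):

```python
import pandas as pd

# Hypothetical transactional data: several rows per customer (illustrative only)
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 30.0, 5.0, 5.0, 20.0],
})

# Collapse to one row per instance, with an aggregation function per feature
features = tx.groupby("customer_id").agg(
    total_spent=("amount", "sum"),
    avg_amount=("amount", "mean"),
    n_transactions=("amount", "count"),
).reset_index()
```

Five transaction rows become two instance rows, one per customer, ready for a standard one-row-per-instance model.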
- Feature split: Splitting features extracts the usable parts of a column into multiple new features. This enables machine learning algorithms to comprehend them, makes it possible to bin and group them, and can improve model performance. An example would be splitting an overarching “Name” column into separate “First Name” and “Last Name” features.
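The Name example from above can be sketched with a string split (the names and the split-on-first-space rule are illustrative; real names often need more care):

```python
import pandas as pd

# Hypothetical "Name" column (illustrative only)
df = pd.DataFrame({"Name": ["Ada Lovelace", "Alan Turing"]})

# Split on the first space into two new features
df[["First Name", "Last Name"]] = df["Name"].str.split(" ", n=1, expand=True)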
- Scaling: In most cases, the numerical features of a dataset do not share a common range, and ranges often differ from column to column. In real life, it would be nonsense to expect age and income columns to have the same range. But from a machine learning point of view, how can these two columns be compared? Scaling solves this problem: after scaling, continuous features become comparable in range.
- Normalization: Scales all values into a fixed range, typically between 0 and 1. This process does not change the shape of the distribution, but because it depends only on the minimum and maximum, outliers can squeeze the remaining values into a narrow band.
- Standardization: Scales the values taking the standard deviation into account, so features with different standard deviations end up with different scaled ranges. This reduces the effect of outliers compared with min-max normalization.
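A minimal sketch of both scaling approaches on toy age and income columns (values are invented; in practice you would fit the scaling parameters on training data only):

```python
import pandas as pd

# Hypothetical columns with very different ranges (illustrative only)
df = pd.DataFrame({"age": [22, 35, 58, 41],
                   "income": [30_000, 90_000, 120_000, 60_000]})

# Normalization (min-max): squeeze each column into [0, 1]
normalized = (df - df.min()) / (df.max() - df.min())

# Standardization (z-score): zero mean, unit standard deviation
standardized = (df - df.mean()) / df.std()
```

After either transform, age and income live on comparable scales and can be fed to distance- or gradient-based models together.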
- Extracting date: Dates often appear in datasets in varying formats, which is challenging for machine learning models unless you normalize them into a single format or break them into multiple columns (year, month, day of week, and so on).
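A sketch of parsing date strings and decomposing them into numeric columns (the column name and dates are illustrative):

```python
import pandas as pd

# Hypothetical date column stored as strings (illustrative only)
df = pd.DataFrame({"signup": ["2021-03-15", "2022-11-02"]})
df["signup"] = pd.to_datetime(df["signup"])

# Break the date into multiple numeric columns
df["year"] = df["signup"].dt.year
df["month"] = df["signup"].dt.month
df["dayofweek"] = df["signup"].dt.dayofweek  # Monday=0, Sunday=6
```

The derived columns are plain integers, so they can be binned, grouped, or fed to a model directly.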