Today, I’ll discuss an often underrated topic in the process of building machine learning models: advanced feature engineering.
Many of us have probably heard of this concept. It’s usually covered during studies or in popular courses. However, in my experience, when we build practical ML projects, we often don’t give enough attention to properly preparing features for the model. We tend to be more eager to experiment with hyperparameters or try different architectures than to develop high-quality features.
Yet, it’s often the intelligent, thoughtful selection of features that allows models to achieve high accuracy and determines the success or failure of the entire project.
Therefore, in this post, I’ll present a classification of feature engineering methods that I’ve derived from observing various projects. Depending on the task you’re working on, I encourage you to use some or all of these feature engineering methods to maximize the potential of your data and train the best possible model.
I will focus mainly on so-called structured data (tabular data) — assuming our data is in a table with a defined number of rows and columns. The result of feature engineering will be a new table — with the same number of rows (where each row represents a specific element of our dataset), but potentially with a different number and content of columns.
We can distinguish three basic operations in feature engineering:
– Feature transformation, which involves modifying the content of a specific column;
– Feature selection — simply removing some less significant columns;
– Generating new features, which involves adding columns based on the content of one or more existing columns.
Why do we even perform feature engineering?
To make the model’s prediction task easier. Remember, a model is simply a function that transforms input data into output. Even if it is very advanced, it still has limitations that we can mitigate by appropriately modifying the input data.
Feature Transformation
In this approach, we transform a specific feature (column) of the data into a new column or a larger number of columns. Several possibilities exist here:
Variable Scaling
Standardization
This operation is performed by the `StandardScaler` transformer from the Scikit-learn library. It transforms each feature so that it has a mean of 0 and a variance of 1. This is necessary for some models (regression, SVM, neural networks) and is almost always helpful, as it removes scale differences between features so the model treats them on an equal footing instead of favoring those with larger numeric ranges.
However, remember that standardization makes the most sense when the feature distribution is close to normal. If the distribution has heavy (“fat”) tails, meaning many extreme values that distort the mean and variance, standardization may not be appropriate, especially if outliers haven’t been removed from the dataset.
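A minimal sketch with Scikit-learn (the data and column meanings here are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features: age in years and yearly income.
X = np.array([
    [25, 40_000],
    [37, 62_000],
    [52, 120_000],
    [29, 48_000],
], dtype=float)

scaler = StandardScaler()
X_std = scaler.fit_transform(X)  # each column now has mean ~0 and std ~1

# Fit the scaler on the training set only and reuse it later, e.g.:
# X_test_std = scaler.transform(X_test)
```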
Normalization
This is a linear transformation of a feature to a specific range, e.g., [0, 1], handled by the `MinMaxScaler` transformer. It’s a good approach when we want to guarantee that features fall within a specific range (e.g., because our model doesn’t handle negative values or values greater than 1). As with standardization, outliers can significantly distort this transformation.
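A corresponding sketch for normalization (again with made-up data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[25, 40_000], [37, 62_000], [52, 120_000], [29, 48_000]], dtype=float)

# Rescale each column linearly into [0, 1]: the column minimum maps to 0
# and the maximum to 1. A single extreme outlier would squeeze all other
# values into a narrow band, which is why outliers distort this transformation.
minmax = MinMaxScaler(feature_range=(0, 1))
X_norm = minmax.fit_transform(X)
```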
Categorical Encoding
This transformation is necessary if your table contains so-called categorical variables (e.g., occupation, city, product) and your model requires numerical input (for example, a classic neural network). There are several ways to encode categories as numbers (a short sketch of the first two follows the list):
- One-hot encoding — creates columns for each category, filled with 0s and 1s, indicating the presence of a particular category in the original column. If, for example, we have five possible values in the “occupation” column, we replace it with five new columns.
- Frequency and target encoding — each category receives a single number: either how often the category occurs (frequency encoding) or a statistic describing how the category relates to the value we want to predict (target encoding). A variant of target encoding, ordered target statistics, is how categories are handled in the CatBoost algorithm.
- Embeddings — used especially with a large number of categories. Instead of vectors with a length equal to the number of categories (as in one-hot encoding), we create shorter vectors that encode more complex relationships between categories. This method can encode not only categories but also unstructured data like text or images.
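A minimal sketch of the first two approaches with Pandas (the “occupation” column and the `bought` target are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "occupation": ["teacher", "nurse", "teacher", "engineer", "nurse"],
    "bought":     [0, 1, 1, 0, 1],  # hypothetical binary target
})

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df["occupation"], prefix="occupation")

# Frequency encoding: how often each category occurs.
freq = df["occupation"].map(df["occupation"].value_counts(normalize=True))

# Naive target (mean) encoding: average target value per category.
# In practice, compute these statistics on training folds only,
# so the target does not leak into the features.
target_mean = df["occupation"].map(df.groupby("occupation")["bought"].mean())
```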
Feature Selection
Feature selection is essential when you have too many features relative to the number of observations, which makes it easy for the model to overfit or latch onto noise (a common rule of thumb is to have at least several dozen times more rows than columns). This can be done in several ways (a short sketch of the first two follows the list):
– Variance analysis — calculating the variance of each column and selecting those with the highest variance, as they usually carry more information.
– Correlation analysis — checking correlations between features and the target, as well as among the features themselves. We remove features with low correlation with the target and those that are highly correlated with each other.
– Advanced algorithms, e.g., the Boruta algorithm, which uses random forests to assess the importance of features, eliminating those that are not more important than random features.
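A minimal sketch of the first two approaches, using a small made-up feature table `X` and target `y`:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical features: one nearly constant, two strongly correlated.
X = pd.DataFrame({
    "almost_constant": [1.0, 1.0, 1.0, 1.0, 1.01],
    "income":          [40_000, 62_000, 120_000, 48_000, 75_000],
    "income_copy":     [40_100, 61_900, 119_500, 48_200, 75_300],
})
y = pd.Series([0, 1, 1, 0, 1])

# 1) Variance analysis: drop features whose variance falls below a threshold
#    (note that raw variance depends on the feature's scale).
X_var = VarianceThreshold(threshold=0.01).fit_transform(X)

# 2) Correlation analysis: columns with low correlation to the target are
#    candidates for removal, as is one column of each highly correlated pair.
target_corr = X.corrwith(y).abs()   # low values -> weak relation to the target
feature_corr = X.corr().abs()       # off-diagonal values near 1 -> redundancy
```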
Generating New Features
Generating new features often determines the success of a model, as it allows capturing relationships that the model cannot discover on its own (for example, linear regression cannot represent products of features unless they are provided explicitly). Here it often helps to consult domain experts, who have knowledge or intuition about which features might be important. Several groups of new features can be distinguished:
Feature Aggregates
We aggregate a single column using some group operation, such as the sum, average, or maximum value (the Pandas library ships with a long list of ready-to-use aggregation functions).
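A minimal sketch with a Pandas `groupby` aggregation (the transaction table is made up for illustration):

```python
import pandas as pd

# Hypothetical transaction log.
transactions = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "amount":  [10.0, 25.0, 5.0, 7.5, 12.0],
})

# Per-user aggregates: sum, average and maximum transaction amount.
user_aggregates = (
    transactions.groupby("user_id")["amount"]
                .agg(["sum", "mean", "max"])
                .add_prefix("amount_")
                .reset_index()
)
# The resulting columns can be joined back to the main table on user_id.
```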
Relational Features
Based on mathematical relationships between two or more features. The simplest examples are ratios of two variables. If, for instance, we have information about a country’s GDP and its population, we can create a new feature — GDP per capita.
It’s worth noting that such an operation can be crucial for predictive algorithms based on decision trees, which, if not explicitly given the ratio of two columns, can only approximate this relationship by creating many nodes in the tree.
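A minimal sketch of the GDP-per-capita example (values and column names assumed):

```python
import pandas as pd

countries = pd.DataFrame({
    "country":    ["A", "B", "C"],
    "gdp":        [2_000_000, 500_000, 150_000],  # hypothetical units
    "population": [50_000, 20_000, 1_000],
})

# Relational feature: the ratio of two existing columns.
countries["gdp_per_capita"] = countries["gdp"] / countries["population"]
```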
Temporal Variables
These can be created wherever the data has a time dimension. In that case, we build a new feature that takes the time aspect into account. If we have information about movies rated by users, a new feature could be, for example, the number of movies rated by a user in the past week. Here, we can apply different time windows (e.g., last day, week, 2 weeks, month) and aggregation functions (sum, maximum, etc.).
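A minimal sketch of the “ratings in the past week” feature, assuming a ratings log with a `user_id` and a `rated_at` timestamp:

```python
import pandas as pd

# Hypothetical ratings log: one row per rating event.
ratings = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "rated_at": pd.to_datetime([
        "2024-01-01", "2024-01-03", "2024-01-10",
        "2024-01-02", "2024-01-05",
    ]),
})

# For each rating, count how many movies the same user rated
# in the 7 days up to and including that moment.
ratings = ratings.sort_values("rated_at")
window_counts = (
    ratings.set_index("rated_at")
           .groupby("user_id")["user_id"]
           .rolling("7D")
           .count()
           .rename("ratings_last_7d")
           .reset_index()
)
ratings = ratings.merge(window_counts, on=["user_id", "rated_at"], how="left")
```

Other windows (a day, a month) and aggregations (sum, maximum) follow the same pattern by changing the rolling window and the aggregation function.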
Combined Features
This is a combination of relational features, aggregates, and temporal variables. An example of such an advanced feature could be the average number of views of a particular movie category by a user in a given month, divided by the average number of views of that category by all users.
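A minimal sketch of such a combined feature, assuming a viewing log with `user_id`, `category`, and `month` columns:

```python
import pandas as pd

# Hypothetical viewing log: one row per view event.
views = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 2, 3],
    "category": ["drama", "drama", "drama", "comedy", "drama", "drama"],
    "month":    ["2024-01"] * 6,
})

# Views per user, category and month.
user_views = (
    views.groupby(["user_id", "category", "month"])
         .size()
         .rename("user_views")
         .reset_index()
)

# Average views of each category in each month across all users.
category_avg = (
    user_views.groupby(["category", "month"])["user_views"]
              .mean()
              .rename("category_avg_views")
              .reset_index()
)

# Combined feature: the user's activity relative to the average user.
combined = user_views.merge(category_avg, on=["category", "month"])
combined["relative_activity"] = combined["user_views"] / combined["category_avg_views"]
```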
It’s worth noting that these operations can increase the number of features by dozens of times, in which case feature selection may be necessary again. Remember that working with data is an iterative process — much like model building. We therefore often apply approaches from different categories (transformation — generation — feature selection) interchangeably until we achieve the best model performance.
It’s also worth combining feature engineering with feature importance analysis (e.g., through the `feature_importances_` attribute of random forests in Scikit-learn, or through additional tools like SHAP). If we see that a new feature has high importance, it’s worth delving deeper and generating different variants of it to further simplify the model’s task.
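A minimal sketch of such an importance check with a random forest (the features and target here are made up):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical engineered features and a binary target.
X = pd.DataFrame({
    "gdp_per_capita":    [40.0, 25.0, 150.0, 48.0, 75.0, 33.0],
    "ratings_last_7d":   [2, 0, 5, 1, 3, 0],
    "relative_activity": [1.2, 0.4, 2.5, 0.9, 1.1, 0.3],
})
y = [1, 0, 1, 0, 1, 0]

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importance of each feature, highest first.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```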
Summary
Advanced feature engineering is a crucial element in building effective machine learning models. It allows you to maximize the potential of your data and significantly impact prediction results. It is part of the so-called Data-Centric Approach, where the main focus in a project is on improving the data, rather than testing different models and their hyperparameters.