Feature Engineering: The Skill That Separates Good ML from Great ML

A 2023 Kaggle survey found that ML practitioners report spending more time on feature engineering and data preparation than on model training and tuning combined. Not slightly more. Significantly more. And yet most ML tutorials spend about two paragraphs on it before rushing to the model.

That imbalance is exactly why so many models work beautifully in a notebook and fall apart in production. The algorithm isn’t the problem. The features are.

Feature engineering is the process of transforming raw data into inputs that give your model the best possible chance of learning something real. It’s partly systematic, partly informed by domain knowledge, and partly the kind of judgment that only comes from building models that failed in specific, instructive ways.

This post is about what actually matters, what you can skip, and the order it all goes in.

What Feature Engineering Actually Is (and Isn’t)

Definition: Feature engineering is the practice of selecting, creating, transforming, and encoding input variables so that a machine learning model can extract meaningful patterns from them. It’s the layer between raw data and model training that determines what the model even has a chance to learn.

It’s not data cleaning, though the two overlap. Removing duplicate rows and fixing corrupted values is data cleaning. Deciding how to represent a “customer account age” column (raw days? log-transformed? bucketed into tiers?) is feature engineering.

And it’s not the same as feature selection, though people conflate them. Feature selection is choosing which existing features to include. Feature engineering is transforming what you have, or creating new columns entirely, before you make that selection.

The reason it matters so much is that most ML algorithms are fundamentally pattern-matching machines over numerical inputs. They can’t reason about what a feature means. If you feed a model the raw timestamp 2024-03-15 09:42:00, it has no idea that 9am on a Friday is different from 9am on a Sunday. You have to tell it, by engineering features that encode that distinction.
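Here is a minimal sketch of that idea, using only Python’s standard library. The function name timestamp_features and the output keys are illustrative choices, not a fixed convention.

```python
from datetime import datetime


def timestamp_features(ts: str) -> dict:
    """Turn a raw 'YYYY-MM-DD HH:MM:SS' string into model-readable parts."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return {
        "hour": dt.hour,
        "day_of_week": dt.weekday(),           # Monday=0 ... Sunday=6
        "is_weekend": int(dt.weekday() >= 5),  # Saturday or Sunday
    }


features = timestamp_features("2024-03-15 09:42:00")
print(features)
```

With these columns a model can separate a Friday morning from a Sunday morning, which the raw string never exposed.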

The Features Nobody Tells You to Check First

Before any transformation, there are two things you should check on every column.

Missing values. How you handle them matters more than most tutorials admit. Dropping rows with missing data is the laziest option and often the most destructive one. If a feature is missing non-randomly (say, customers who never made a purchase have no purchase history), that missingness itself is informative. Engineer a binary flag: has_purchase_history = 1/0. Then impute the missing values separately. You get both the imputed feature and the signal that data was absent.

Mean imputation for numerical columns is fine as a default, but it shrinks the feature’s variance, and on a skewed distribution the mean gets pulled toward the tail. If a feature has a lot of missing values and the distribution is skewed, median imputation or model-based imputation is worth the extra ten minutes.
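The flag-then-impute pattern can be sketched in a few lines of plain Python. The helper names (fit_median, impute_with_flag) and the sample values are hypothetical; the fill value is learned from training data and then reused unchanged on any other split.

```python
from statistics import median


def fit_median(train_values):
    """Learn the fill value from observed (non-missing) training values only."""
    return median(v for v in train_values if v is not None)


def impute_with_flag(values, fill_value):
    """Return (imputed_values, presence_flags); a flag is 1 where data existed."""
    flags = [int(v is not None) for v in values]
    imputed = [fill_value if v is None else v for v in values]
    return imputed, flags


# Hypothetical column: total purchase amount, None for customers with no purchases.
purchase_totals = [120.0, None, 35.0, None, 900.0]

fill = fit_median(purchase_totals)
purchase_total_imputed, has_purchase_history = impute_with_flag(purchase_totals, fill)
```

If you use scikit-learn, SimpleImputer with strategy="median" and add_indicator=True covers the same ground; note that its indicator column marks missing values, the inverse of the presence flag used here.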

Cardinality in categorical columns. A country column with 3 values is a different problem than a product_sku column with 40,000 values. One-hot encoding the second one will explode your feature space and likely hurt your model. I’ve had people hand me datasets with one-hot encoded columns that added 12,000 binary features, most of which were all zeros. The model learned nothing from them and trained in about four times the time it needed to.

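For high-cardinality categoricals, target encoding is usually the right call, as long as the encoding is computed from training data only (ideally out-of-fold), because otherwise the target leaks into the feature. For low-cardinality (under 10-15 values), one-hot encoding works well. For ordinal categories (small, medium, large), ordinal encoding preserves the order that matters.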
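A rough sketch of the three encodings, written without any ML library so the mechanics are visible. The smoothing formula is one common variant of target encoding, not the only one, and every name and sample value below is made up for illustration.

```python
from collections import defaultdict


def one_hot(values, categories):
    """One column per category; values outside the list encode as all zeros."""
    return [[int(v == c) for c in categories] for v in values]


# Ordinal encoding: an explicit mapping keeps the order that matters.
SIZE_ORDER = {"small": 0, "medium": 1, "large": 2}


def fit_target_encoding(train_categories, train_targets, smoothing=10.0):
    """Smoothed per-category mean of the target, learned from training rows only."""
    global_mean = sum(train_targets) / len(train_targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, target in zip(train_categories, train_targets):
        sums[cat] += target
        counts[cat] += 1
    # Rare categories are pulled toward the global mean by the smoothing term.
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, global_mean


def apply_target_encoding(categories, mapping, global_mean):
    """Categories never seen in training fall back to the global training mean."""
    return [mapping.get(cat, global_mean) for cat in categories]


country_encoded = one_hot(["US", "CA", "US"], ["CA", "MX", "US"])
size_encoded = [SIZE_ORDER[s] for s in ["small", "large", "medium"]]

mapping, global_mean = fit_target_encoding(
    ["A", "A", "B", "C"], [1, 0, 1, 0], smoothing=2.0
)
sku_encoded = apply_target_encoding(["A", "B", "Z"], mapping, global_mean)
```

The target-encoded column stays a single numeric feature no matter how many SKUs exist, which is the whole point for a 40,000-value column.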

Normalization and Scaling: Which Algorithms Actually Need It

Here’s the thing nobody tells you directly: tree-based models don’t care about feature scaling at all.

Decision trees, random forests, and gradient boosting all split on feature values, not distances. Whether income is in raw dollars (40000) or normalized (0.4) makes zero difference to the splits. If you’re training an XGBoost model, you can skip this section entirely and it won’t cost you anything.

But for algorithms that use distance or gradients, scaling is not optional.

K-nearest neighbors: distances between points are meaningless if features are on wildly different scales. A feature ranging from 0 to 100,000 will dominate a feature ranging from 0 to 1.
Support vector machines: same problem, same reason.
Neural networks: gradient descent converges much faster and more reliably when inputs are on a consistent scale. Most practitioners normalize to [0, 1] or standardize to mean 0, standard deviation 1.
PCA and other dimensionality reduction methods: standardization before PCA is not optional. PCA maximizes variance, so features with larger raw scales will dominate the principal components regardless of actual importance. Sebastian Raschka’s detailed breakdown of standardization vs. normalization for PCA is the clearest treatment of this I’ve found.
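To make the distance problem concrete, here is a small illustration with made-up points: one feature on a 0 to 100,000 scale and one on a 0 to 1 scale. The assumed feature ranges passed to min_max_scale are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_max_scale(point, lows, highs):
    return tuple((v - lo) / (hi - lo) for v, lo, hi in zip(point, lows, highs))


# Each point is (income in dollars, conversion rate between 0 and 1).
query = (40_000.0, 0.10)
neighbor_a = (40_500.0, 0.90)  # close on income, very different on rate
neighbor_b = (42_000.0, 0.10)  # a bit further on income, identical on rate

raw_a = euclidean(query, neighbor_a)
raw_b = euclidean(query, neighbor_b)

# Assumed feature ranges, used to bring both features onto [0, 1].
lows, highs = (0.0, 0.0), (100_000.0, 1.0)
scaled = [min_max_scale(p, lows, highs) for p in (query, neighbor_a, neighbor_b)]
scaled_a = euclidean(scaled[0], scaled[1])
scaled_b = euclidean(scaled[0], scaled[2])

print(raw_a < raw_b)        # on raw values, neighbor_a looks closer
print(scaled_a > scaled_b)  # after scaling, neighbor_b is closer
```

On raw values the income column decides the nearest neighbor almost entirely by itself; after scaling, the rate column gets a say.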

Standardization (z-score, mean 0, std 1) is usually the safer default. Min-max normalization scales to a fixed range and is sensitive to outliers. If your feature has real outliers that you want to preserve, standardization handles them more gracefully.
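Both scalers are a couple of lines each. The sketch below uses hypothetical income values and fits the statistics on a training list only, then reuses them on a test list.

```python
from statistics import mean, pstdev


def fit_standardizer(train):
    """Mean and (population) standard deviation of the training values."""
    return mean(train), pstdev(train)


def standardize(values, mu, sigma):
    return [(v - mu) / sigma for v in values]


def fit_min_max(train):
    return min(train), max(train)


def min_max(values, lo, hi):
    return [(v - lo) / (hi - lo) for v in values]


train_income = [30_000.0, 40_000.0, 50_000.0]
test_income = [45_000.0, 80_000.0]

mu, sigma = fit_standardizer(train_income)
lo, hi = fit_min_max(train_income)

train_z = standardize(train_income, mu, sigma)  # mean 0, std 1 on the training set
test_z = standardize(test_income, mu, sigma)    # same mu and sigma, not refit
test_mm = min_max(test_income, lo, hi)          # same lo and hi, not refit
```

Note that the 80,000 test value lands at 2.5 under min-max scaling: the [0, 1] range only holds for values inside the range seen during fitting.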

Interaction Features and When to Actually Create Them

An interaction feature is one you engineer by combining two existing features. The classic example: you have price and quantity, and you engineer revenue = price * quantity. Your model could theoretically learn this relationship from the raw features, but handing it the interaction directly makes learning faster and often more stable.

So the question becomes: how do you decide which interactions to create?

Domain knowledge is the honest answer, and it’s underrated. If you know that a ratio of two features has a real-world meaning (debt-to-income ratio, click-through rate), that interaction feature is almost always worth adding. Your model won’t discover the ratio on its own, at least not reliably.
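As a small illustration of the revenue and ratio examples above, assuming each record is a plain dict with hypothetical field names:

```python
def add_revenue(order):
    """Product interaction: revenue = price * quantity."""
    out = dict(order)
    out["revenue"] = order["price"] * order["quantity"]
    return out


def add_debt_to_income(applicant):
    """Ratio feature; left as None when income is zero or missing."""
    out = dict(applicant)
    income = applicant.get("income")
    out["debt_to_income"] = applicant["debt"] / income if income else None
    return out


order = add_revenue({"price": 19.5, "quantity": 4})
applicant = add_debt_to_income({"debt": 12_000.0, "income": 48_000.0})
no_income = add_debt_to_income({"debt": 5_000.0, "income": 0.0})
```

The zero-income guard matters in practice: ratio features are where division by zero and infinities tend to sneak into a dataset.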

Automated interaction feature generation is tempting. Libraries exist that will generate polynomial combinations of all your features. Be careful. If you have 50 input features and you generate all pairwise products, you’re adding 1,225 features, most of them meaningless, all of them increasing your risk of overfitting. A few targeted interactions built on domain knowledge are worth a hundred automated ones.
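The 1,225 figure is just the number of unordered pairs among 50 features, which you can confirm directly. The feature names here are placeholders.

```python
from itertools import combinations
from math import comb

feature_names = [f"f{i}" for i in range(50)]  # 50 placeholder feature names

# Every unordered pair (a, b) with a != b gives one product feature a*b.
pairwise_products = [f"{a}*{b}" for a, b in combinations(feature_names, 2)]

n_pairwise = len(pairwise_products)
n_with_squares = n_pairwise + len(feature_names)  # if each feature squared is added too

print(n_pairwise, comb(50, 2), n_with_squares)
```

The count grows quadratically, so doubling the inputs to 100 features roughly quadruples the number of generated products.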

The test I use: would a person who knows this domain understand why this feature should matter? If yes, add it. If you’re doing it because a feature importance score looked promising, validate carefully before committing.

Dimensionality Reduction: When It Helps, When It Doesn’t

When your feature space is genuinely high-dimensional (hundreds or thousands of features), dimensionality reduction can meaningfully improve training speed, reduce overfitting, and sometimes improve performance by removing correlated or noisy features.

PCA is the standard starting point. It creates new features (principal components) that are linear combinations of your original features, ordered by how much variance they capture. The trade-off is interpretability: a principal component isn’t a column you can explain to a stakeholder, it’s a mathematical combination. I cover when that trade-off is and isn’t worth making in the PCA explainer.

A few things are worth knowing before you use PCA. First, it requires standardization before you run it, as mentioned above. Second, it assumes linear relationships. Third, it can hurt performance if your problem is actually low-dimensional and well-behaved, because you’re adding complexity for no gain. Start by looking at your feature correlation matrix. If you have 40 features but most of them have near-zero correlation with your target and high correlation with each other, PCA is probably worth trying. If your features are reasonably independent and all informative, it’s probably not.
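A quick way to run that correlation check without any library is sketched below. The 0.9 threshold and the sample columns are arbitrary choices for illustration, and the pearson helper assumes no column is constant.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation; assumes neither column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def highly_correlated_pairs(columns, threshold=0.9):
    """columns maps name -> list of values; returns pairs with |r| >= threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "shoe_color_code": [3.0, 1.0, 2.0, 1.0, 3.0],
}
redundant = highly_correlated_pairs(columns)
print(redundant)
```

If a list like redundant covers a large share of your columns, that is a reasonable signal that PCA (or simply dropping duplicates) is worth a try.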

For classification problems specifically, Linear Discriminant Analysis (LDA) is sometimes a better choice than PCA. LDA maximizes class separability rather than total variance, which is often what you actually want. But that’s a longer topic.

Feature Selection: Cutting What Doesn’t Help

After you’ve done transformation and creation, feature selection is often worth a pass. The practical question is: are you using all these features, or are some of them noise that will slow training and potentially hurt generalization?

The fastest approach is feature importance from a trained tree-based model. Train a quick random forest, look at which features have near-zero importance, and consider dropping them. This takes about five minutes and often removes 20-30% of features without touching model performance.

The more careful approach is permutation importance: train your model, then randomly shuffle one feature at a time and measure how much performance drops. If shuffling a feature makes no difference, the model isn’t using it. This is more computationally expensive but gives a cleaner signal.
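The shuffle-and-measure loop is short enough to write by hand. In this sketch, predict is a stand-in for an already trained model and the validation rows are invented; scikit-learn ships a library version as sklearn.inspection.permutation_importance.

```python
import random


def accuracy(predict, X, y):
    return sum(predict(row) == target for row, target in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by its shuffled values.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Stand-in for a trained model: it only looks at feature 0.
def predict(row):
    return int(row[0] > 0.5)


X_val = [[0.1, 5.0], [0.9, 3.0], [0.2, 8.0], [0.8, 1.0],
         [0.3, 7.0], [0.7, 2.0], [0.4, 6.0], [0.6, 4.0]]
y_val = [0, 1, 0, 1, 0, 1, 0, 1]

importances = permutation_importance(predict, X_val, y_val)
print(importances)
```

Feature 1 scores exactly zero because the stand-in model never reads it, which is the "shuffling makes no difference" case described above. Run this on validation data, not training data, so the drop reflects generalization.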

What I’d avoid for most production use cases: recursive feature elimination and automated wrapper methods that retrain the model dozens of times. They can find a better feature subset, but the computational cost is high and the gains over a good quick pass are usually marginal.

The Order That Actually Matters

Most tutorials present feature engineering as a bag of techniques with no particular sequence. But order matters, and getting it wrong causes bugs that are hard to trace.

The sequence I use:

1. Handle missing values (including missingness indicators)
2. Encode categorical variables
3. Create interaction and domain-knowledge features
4. Scale and normalize numerical features
5. Dimensionality reduction if needed
6. Feature selection

Why does order matter? Because you shouldn’t normalize before you encode. You shouldn’t create interaction features from un-imputed nulls. And critically: split first, then fit your imputers, scalers, and encoders on the training set only, and apply them to the validation and test sets. If you fit on the full dataset before splitting, you’ve leaked information about the test distribution into your training pipeline and your evaluation metrics will be optimistic. It’s a rougher version of the same problem I described when we talked about overfitting.
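Here is one way the first four steps of that sequence could look as code, with every statistic learned from the training rows and reused on the test rows. The column names (age, plan, spend, visits) are hypothetical, and dimensionality reduction and feature selection are left out to keep the sketch short.

```python
from statistics import mean, median, pstdev

NUMERIC = ["age", "spend_per_visit"]


def _prepare(row, state):
    out = {}
    # 1. Missing values, with a missingness indicator.
    out["age_missing"] = int(row["age"] is None)
    out["age"] = state["age_fill"] if row["age"] is None else row["age"]
    # 2. Encode categoricals: one-hot over the categories seen in training.
    for plan in state["plans"]:
        out[f"plan_{plan}"] = int(row["plan"] == plan)
    # 3. Domain feature, created after imputation and before scaling.
    out["spend_per_visit"] = row["spend"] / max(row["visits"], 1)
    return out


def fit(train_rows):
    """Learn every statistic from the training rows only."""
    state = {
        "age_fill": median(r["age"] for r in train_rows if r["age"] is not None),
        "plans": sorted({r["plan"] for r in train_rows}),
        "scale": {},
    }
    prepared = [_prepare(r, state) for r in train_rows]
    for name in NUMERIC:
        col = [p[name] for p in prepared]
        state["scale"][name] = (mean(col), pstdev(col) or 1.0)
    return state


def transform(rows, state):
    result = []
    for row in rows:
        out = _prepare(row, state)
        # 4. Scale numeric features with training-set statistics.
        for name in NUMERIC:
            mu, sigma = state["scale"][name]
            out[name] = (out[name] - mu) / sigma
        result.append(out)
    return result


train = [
    {"age": 30, "plan": "basic", "spend": 100.0, "visits": 4},
    {"age": None, "plan": "pro", "spend": 300.0, "visits": 3},
    {"age": 50, "plan": "basic", "spend": 50.0, "visits": 5},
]
test = [{"age": None, "plan": "enterprise", "spend": 80.0, "visits": 0}]

state = fit(train)             # fit on the training split only
train_out = transform(train, state)
test_out = transform(test, state)  # the test split never influences state
```

The test row has a missing age and a plan that never appeared in training; both are handled using only what was learned from the training rows, so nothing about the test distribution leaks backward.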

What This Post Didn’t Cover

I didn’t cover time-series-specific features (lag features, rolling windows, seasonality encoding). That’s a big enough topic to warrant its own post, and the rules are different enough from standard tabular data that mixing them here would muddy both.

I also didn’t go deep on automated feature engineering tools like Featuretools. They can generate useful features but also generate enormous amounts of garbage. Worth knowing they exist; not worth relying on without careful validation.

FAQ

What’s the difference between feature engineering and feature selection?

Feature engineering is transforming or creating input variables to give your model better inputs. Feature selection is choosing which of those inputs to actually include. Engineering comes first, then selection. You can’t select a feature that doesn’t exist yet, and selection is more meaningful once your features are already in a useful form.

Does feature engineering matter less for deep learning?

Yes and no. Neural networks can learn some feature representations automatically, which is part of why they work well on raw images or text. But for tabular data, the kind most business ML problems involve, feature engineering still has a large impact. Deep learning on raw tabular data typically underperforms well-engineered XGBoost. The “just feed it raw data” approach works for images. It usually doesn’t work for rows in a database.

How do I know if a new feature is actually helping?

Add it, retrain, evaluate on your validation set. If validation performance improves, keep it. If it doesn’t, drop it. Don’t evaluate on training performance, because you’ll be measuring memorization, not improvement. And don’t add ten features at once and then evaluate, because you won’t know which one did what.

The most reliable pattern I’ve seen across different ML problems: teams that invest seriously in feature engineering routinely outperform teams that invest in model architecture. A well-engineered feature set with a simple model beats a poorly engineered feature set with a complex model more often than it has any right to. The algorithm optimizes what you give it. Give it better inputs and you’ve done most of the work before training even starts.