XGBoost, LightGBM, CatBoost: Which Should You Actually Use?
The usual framing of this comparison is wrong.
Most posts present XGBoost, LightGBM, and CatBoost as three flavours of the same thing, then hand you a decision table that says “use LightGBM for large data, CatBoost for categoricals, XGBoost for everything else.” That’s not useless, but it treats these libraries as arbitrary choices rather than deliberate engineering decisions.
Here’s what most comparisons miss: each of these three libraries was built specifically to fix a problem that the previous one didn’t solve. XGBoost fixed the scalability and regularization weaknesses of classical gradient boosting. LightGBM fixed XGBoost’s training speed on high-dimensional data. CatBoost fixed a statistical bias problem that XGBoost and LightGBM both still have. Understanding what each library actually solved changes how you choose between them.
All three are implementations of gradient boosting, a technique where you build an ensemble of weak learners — typically shallow decision trees — sequentially, with each tree trained to correct the errors of the previous ones. The ensemble gets stronger with each addition because every tree is guided by the residual gradient of the loss function. That’s the shared foundation. Where they diverge is in how they find the best split points, handle certain types of data, and manage the bias introduced during training.
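To make the shared foundation concrete, here is a minimal sketch of gradient boosting for squared error in plain Python. It is a toy, not how any of the three libraries is implemented: the weak learner is a one-split "stump" on a single feature, and the names `fit_stump`, `fit_boosted`, and `predict` are made up for this illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # the largest value would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosted(xs, ys, n_trees=50, learning_rate=0.1):
    base = sum(ys) / len(ys)            # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # For the loss 0.5 * (y - p)^2 the negative gradient is simply the residual y - p.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1]

model = fit_boosted(xs, ys, n_trees=50, learning_rate=0.1)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict(model, x) for x in xs])
```

Each round fits a stump to what the ensemble still gets wrong and adds a scaled-down version of it, so the training error of the boosted model ends up below that of the constant baseline.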
What XGBoost Actually Introduced
When Tianqi Chen and Carlos Guestrin introduced XGBoost at KDD 2016, the system’s defining contribution wasn’t speed in isolation. It was the combination of second-order optimization with regularization built directly into the tree-building objective.
Classical gradient boosting uses only first-order gradient information (the direction of error) to guide each new tree. XGBoost uses both first-order and second-order gradient statistics simultaneously. The second-order term, the Hessian, tells the algorithm not just which direction to move but how large a step makes sense given the curvature of the loss function at that point. In practice this means XGBoost can find better split points with fewer trees, and each leaf weight comes from a closed-form minimizer of the second-order approximation rather than from a separate line search.
More importantly, XGBoost embeds two regularization penalties directly into its learning objective. The first penalizes the number of leaves in each tree (encouraging simpler structures). The second applies L2 regularization to the leaf weight values. When both penalties are zero, the system falls back to standard gradient boosting. When tuned, they prevent the kind of overfitting that standard gradient boosting is notorious for on noisier datasets. This was a genuine advance: prior implementations treated regularization as something you bolted on afterwards. XGBoost made it structurally native.
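The effect of the two penalties is easiest to see in the formulas the XGBoost paper derives for the optimal leaf weight and the split gain. The sketch below writes them out in plain Python. The function names are illustrative, while `reg_lambda` and `gamma` follow the library's parameter names for the L2 penalty and the per-leaf penalty.

```python
def leaf_weight(G, H, reg_lambda):
    """Optimal leaf weight: w* = -G / (H + lambda), with G and H the summed gradients and Hessians."""
    return -G / (H + reg_lambda)


def leaf_score(G, H, reg_lambda):
    """Objective reduction contributed by one leaf: 0.5 * G^2 / (H + lambda)."""
    return 0.5 * G * G / (H + reg_lambda)


def split_gain(G_left, H_left, G_right, H_right, reg_lambda, gamma):
    """Gain of a split: children score minus parent score, minus gamma for the extra leaf."""
    parent = leaf_score(G_left + G_right, H_left + H_right, reg_lambda)
    children = (leaf_score(G_left, H_left, reg_lambda)
                + leaf_score(G_right, H_right, reg_lambda))
    return children - parent - gamma


# Squared error 0.5 * (pred - y)^2: gradient g = pred - y, Hessian h = 1.
ys = [1.0, 1.0, 5.0, 5.0]
preds = [3.0, 3.0, 3.0, 3.0]
grads = [p - y for p, y in zip(preds, ys)]
hess = [1.0] * len(ys)

G_left, H_left = sum(grads[:2]), sum(hess[:2])
G_right, H_right = sum(grads[2:]), sum(hess[2:])

unregularized_weight = leaf_weight(G_left, H_left, reg_lambda=0.0)
shrunk_weight = leaf_weight(G_left, H_left, reg_lambda=2.0)
gain_free = split_gain(G_left, H_left, G_right, H_right, reg_lambda=0.0, gamma=0.0)
gain_penalized = split_gain(G_left, H_left, G_right, H_right, reg_lambda=2.0, gamma=1.0)
```

With both penalties at zero, the left leaf moves its predictions all the way from 3.0 to the target 1.0. With `reg_lambda=2.0` the step is halved, and a positive `gamma` lowers the gain of every candidate split, so weak splits stop being worth making.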
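The paper also introduced a sparsity-aware split-finding algorithm. During feature scanning it visits only the non-missing entries (in sparse input, absent zeros are treated the same way) and learns a default direction at each split for everything it skipped. That made it substantially faster on sparse tabular data without any preprocessing on the user’s part.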
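A simplified sketch of the default-direction idea, in plain Python: scan only the rows where the feature is present, and for each candidate threshold try sending the missing rows left and then right, keeping whichever gives the larger gain. This is an illustration under simplifying assumptions (one feature, `None` as the missing marker, a hypothetical `best_split_with_default` helper), not XGBoost's actual implementation.

```python
def score(G, H, reg_lambda=1.0):
    return G * G / (H + reg_lambda)


def best_split_with_default(values, grads, hess, reg_lambda=1.0):
    """Return (gain, threshold, default_direction) for one feature with missing values."""
    present = sorted((v, g, h) for v, g, h in zip(values, grads, hess) if v is not None)
    G_total, H_total = sum(grads), sum(hess)
    G_missing = G_total - sum(g for _, g, _ in present)
    H_missing = H_total - sum(h for _, _, h in present)

    best = (0.0, None, None)
    G_left = H_left = 0.0
    for i in range(len(present) - 1):       # only non-missing rows are visited
        value, g, h = present[i]
        G_left += g
        H_left += h
        next_value = present[i + 1][0]
        if next_value == value:
            continue                          # no threshold fits between equal values
        threshold = (value + next_value) / 2
        candidates = (
            ("right", G_left, H_left),                           # missing rows go right
            ("left", G_left + G_missing, H_left + H_missing),    # missing rows go left
        )
        for direction, GL, HL in candidates:
            GR, HR = G_total - GL, H_total - HL
            gain = (score(GL, HL, reg_lambda) + score(GR, HR, reg_lambda)
                    - score(G_total, H_total, reg_lambda))
            if gain > best[0]:
                best = (gain, threshold, direction)
    return best


# The two rows with a missing value have gradients that resemble the high-value rows.
values = [1.0, 2.0, None, 8.0, 9.0, None]
grads = [2.0, 2.0, -2.0, -2.0, -2.0, -2.0]
hess = [1.0] * 6

gain, threshold, direction = best_split_with_default(values, grads, hess)
```

Here the best split lands between 2.0 and 8.0, and the learned default sends missing rows to the right, alongside the rows they behave like.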
By Chen and Guestrin’s own report, in 2015, seventeen of the twenty-nine published winning solutions on Kaggle used XGBoost — and eight of those used it exclusively, without combining it with any other model.
    import xgboost as xgb

    # X_train, y_train: your training features and targets
    model = xgb.XGBRegressor(
        n_estimators=500,
        learning_rate=0.05,      # shrinkage: scales newly-added tree weights
        max_depth=6,
        reg_alpha=0.1,           # L1 regularization on leaf weights
        reg_lambda=1.0,          # L2 regularization on leaf weights
        subsample=0.8,           # row subsampling per tree
        colsample_bytree=0.8,    # column (feature) subsampling per tree
        random_state=42,
    )
    model.fit(X_train, y_train)

The learning_rate here deserves a note. XGBoost’s paper calls this “shrinkage”: it scales the contribution of each newly added tree by this factor, reducing the influence of any single tree and explicitly leaving room for future trees to correct remaining errors. At small values (0.01–0.1), more trees are needed but the model generalises better.
What LightGBM Solved That XGBoost Didn’t
Despite XGBoost’s improvements, it still had a computational bottleneck on large, high-dimensional datasets. To find the best split point for any feature, it needed to scan all training examples to estimate information gain across all possible thresholds. On a dataset with millions of examples and hundreds of features, that becomes the dominant cost.
Microsoft Research’s Ke et al. published LightGBM at NIPS 2017 with two innovations designed specifically to break this bottleneck.
Gradient-based One-Side Sampling (GOSS) starts from an observation that not all training examples contribute equally to the information gain calculation. Instances that the model already predicts well — those with small gradients — are by definition already handled; they contribute little additional signal about where to split. Instances with large gradients are the poorly-predicted ones, and they carry most of the useful information. GOSS keeps all large-gradient instances and randomly samples a smaller fraction of small-gradient ones, reweighting the sampled instances to compensate for the dropped data. The paper proves mathematically that this approximation introduces bounded error into the information gain estimate, and the experiments showed that LightGBM could achieve almost identical accuracy to full-data training at a fraction of the cost.
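The sampling step can be sketched in a few lines of plain Python. This follows the description in the paper (keep the top fraction by absolute gradient, sample a fraction of the rest, amplify the sampled rows by (1 - a) / b), but the function name and the default rates are assumptions for illustration, not LightGBM's internals.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {row_index: weight} for the rows kept by one GOSS sampling step."""
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    top = order[:n_top]                       # always keep the large-gradient rows
    rest = order[n_top:]
    sampled = random.Random(seed).sample(rest, n_other)  # random subset of the small-gradient rows

    amplify = (1.0 - top_rate) / other_rate   # compensates for the small-gradient rows that were dropped
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return weights


gradients = [0.01 * i for i in range(100)]    # row 99 has the largest gradient
weights = goss_sample(gradients, top_rate=0.2, other_rate=0.1)
```

With these rates, 30 of the 100 rows survive: the 20 with the largest gradients at weight 1, and 10 of the remaining 80 at weight 8, so the sampled rows stand in for the full small-gradient group when gains are summed.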
Exclusive Feature Bundling (EFB) tackles the feature dimension. In many real-world datasets — particularly those with one-hot encoded categorical variables — most features are sparse and mutually exclusive: two features rarely both have non-zero values for the same example. EFB bundles such features together into a single composite feature, reducing the effective number of features without materially affecting split quality. The optimal bundling problem is NP-hard, but the paper demonstrates a greedy approximation that works well in practice.
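A rough sketch of the greedy idea in plain Python: put each feature into the first bundle it does not conflict with, then merge a bundle into one column by shifting each feature's values into its own range. The paper's algorithm orders features first and tracks conflicts per bundle; this version checks pairs and uses a hypothetical `bin_counts` list for the offsets, so treat it as an illustration only.

```python
def conflicts(col_a, col_b):
    """Number of rows where both features are non-zero."""
    return sum(1 for a, b in zip(col_a, col_b) if a != 0 and b != 0)


def greedy_bundles(columns, max_conflicts=0):
    """Assign each feature to the first bundle whose members it does not conflict with."""
    bundles = []                              # each bundle is a list of column indices
    for idx, column in enumerate(columns):
        for bundle in bundles:
            if all(conflicts(column, columns[other]) <= max_conflicts for other in bundle):
                bundle.append(idx)
                break
        else:
            bundles.append([idx])             # no compatible bundle found: start a new one
    return bundles


def merge_bundle(columns, bundle, bin_counts):
    """Merge exclusive features into one column by offsetting their value ranges."""
    merged = [0] * len(columns[0])
    offset = 0
    for idx in bundle:
        for row, value in enumerate(columns[idx]):
            if value != 0:
                merged[row] = value + offset
        offset += bin_counts[idx]             # next feature gets its own range of values
    return merged


columns = [
    [1, 0, 0, 1, 0, 0],   # colour == red   (one-hot)
    [0, 1, 0, 0, 1, 0],   # colour == green (one-hot)
    [0, 0, 1, 0, 0, 1],   # colour == blue  (one-hot)
    [3, 1, 2, 2, 1, 3],   # a dense numeric feature
]
bundles = greedy_bundles(columns)
merged = merge_bundle(columns, bundles[0], bin_counts=[1, 1, 1, 3])
```

The three one-hot columns collapse into a single column with values 1, 2, and 3, while the dense feature stays on its own because it overlaps with all of them.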
Together, GOSS and EFB allowed LightGBM to speed up training by more than 20 times over conventional gradient boosting in the best case reported in the original paper, while achieving almost identical accuracy.
LightGBM also introduced leaf-wise tree growth as default, in contrast to XGBoost’s level-wise approach. Level-wise grows all leaves at the current depth before going deeper, producing balanced trees. Leaf-wise always splits whichever leaf offers the largest reduction in loss, regardless of balance. Leaf-wise trees tend to achieve lower training loss with fewer splits, but can overfit more aggressively on small datasets. The num_leaves parameter (not max_depth) is the right lever to control this, and it’s often the most important hyperparameter to tune.
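The difference between the two growth strategies shows up even on a toy tree. In the sketch below, `gain_of` is a made-up table of the loss reduction each node would yield if split (children of node `i` are `2i + 1` and `2i + 2`). A real library computes these gains on the fly, so this only illustrates the order in which splits get chosen.

```python
import heapq


def leaf_wise_order(gain_of, num_leaves):
    """Always split the current leaf with the largest gain, until num_leaves is reached."""
    frontier = [(-gain_of[0], 0)]             # max-heap via negated gains
    order, leaves = [], 1
    while frontier and leaves < num_leaves:
        _, node = heapq.heappop(frontier)
        order.append(node)
        leaves += 1                           # one leaf becomes two
        for child in (2 * node + 1, 2 * node + 2):
            if child in gain_of:
                heapq.heappush(frontier, (-gain_of[child], child))
    return order


def level_wise_order(gain_of, max_depth):
    """Split every node at the current depth before going deeper."""
    order, level = [], [0]
    for _ in range(max_depth):
        level = [n for n in level if n in gain_of]
        order.extend(level)
        level = [c for n in level for c in (2 * n + 1, 2 * n + 2)]
    return order


# Hypothetical gains: the left branch of the tree is where most of the signal is.
gain_of = {0: 10.0, 1: 8.0, 2: 1.0, 3: 6.0, 4: 0.5, 5: 0.2, 6: 0.1}

leaf_wise = leaf_wise_order(gain_of, num_leaves=4)
level_wise = level_wise_order(gain_of, max_depth=2)
leaf_wise_gain = sum(gain_of[n] for n in leaf_wise)
level_wise_gain = sum(gain_of[n] for n in level_wise)
```

Both trees end up with four leaves, but the leaf-wise tree spends its third split deeper on the informative branch and collects more total gain. That is the same property that makes it easier to overfit when data is scarce.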
    import lightgbm as lgb

    model = lgb.LGBMRegressor(
        n_estimators=500,
        learning_rate=0.05,
        num_leaves=31,           # key lever for leaf-wise growth; lower = less overfit
        min_child_samples=20,    # minimum data in a leaf
        subsample=0.8,
        subsample_freq=1,        # row subsampling only takes effect when this is > 0
        colsample_bytree=0.8,
        reg_alpha=0.1,
        reg_lambda=1.0,
        random_state=42,
    )
    model.fit(X_train, y_train)

For datasets with millions of rows and hundreds of features, LightGBM is genuinely the right default. It’s not just “faster” in a hand-wavy sense: the speed difference comes from principled algorithmic changes backed by theoretical guarantees. One caveat is that GOSS is opt-in in the library rather than the default sampling strategy, so out-of-the-box speed comes mainly from histogram-based split finding and EFB.
What CatBoost Solved That Both Others Missed
CatBoost came from Yandex, and the paper describing it (Prokhorenkova et al.) appeared at NeurIPS 2018. Its contribution is the most conceptually interesting of the three: it identified and fixed a statistical bias present in all previous gradient boosting implementations, including XGBoost and LightGBM.
The problem is called prediction shift, and it’s a subtle form of target leakage baked into standard gradient boosting training. Here’s the issue. When you train a gradient boosting model, each new tree is fitted to the negative gradients of the loss on the training set. Those gradients are computed using the current ensemble, which was itself trained on that same training set, so the gradient estimate for each example is not independent of that example’s own label. As a result, the distribution of the model’s outputs on training examples is shifted relative to its outputs on unseen examples. Prokhorenkova et al. proved formally that this causes a systematic prediction shift: the model is biased, and the bias shrinks only as the dataset grows large.
Standard preprocessing of categorical features compounds the problem. The popular approach of converting a categorical column to its per-category target mean (target encoding) uses the label of each example in the calculation that includes that same example — another form of leakage that inflates in-sample performance.
CatBoost addresses both problems through an ordering principle. For ordered boosting, instead of computing gradients using the full training history, CatBoost maintains separate model versions trained on increasingly large prefixes of the data (using a random permutation). When computing the gradient for example k, it uses a model that was trained only on examples that preceded k in the permutation — so the gradient estimate for each example is genuinely independent from its own label. For ordered target statistics on categorical features, the same logic applies: the target statistic for each example is computed from the subset of examples that preceded it, avoiding leakage.
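The contrast between naive target encoding and ordered target statistics is easy to show in plain Python. The sketch below uses the smoothed form (sum + a·prior) / (count + a) described in the CatBoost paper. The function names, the `prior_weight` argument, and the single permutation are simplifications for illustration, not CatBoost's internals.

```python
import random


def naive_target_stat(categories, targets):
    """Per-category mean over every row, including the row being encoded."""
    sums, counts = {}, {}
    for c, y in zip(categories, targets):
        sums[c] = sums.get(c, 0.0) + y
        counts[c] = counts.get(c, 0) + 1
    return [sums[c] / counts[c] for c in categories]


def ordered_target_stat(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode each row using only the rows that precede it in a random permutation."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)
    sums, counts = {}, {}
    encoded = [None] * len(categories)
    for i in order:
        c = categories[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded[i] = (s + prior_weight * prior) / (n + prior_weight)
        # Only now does row i's own label become visible to later rows.
        sums[c] = s + targets[i]
        counts[c] = n + 1
    return encoded


categories = ["a", "a", "b", "a", "c"]
targets = [1, 0, 1, 1, 0]
prior = sum(targets) / len(targets)           # global target mean

naive = naive_target_stat(categories, targets)
ordered = ordered_target_stat(categories, targets, prior)
```

For the categories that appear only once ("b" and "c"), the naive encoding is exactly the row's own label, which is pure leakage. The ordered version falls back to the prior for those rows, because no earlier row in the permutation shares the category.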
Separately from the ordering principle, CatBoost uses symmetric (oblivious) decision trees as its base learners: every node at the same depth applies the same split condition. This produces balanced trees that are more constrained than standard CART trees, which makes them faster to apply at prediction time and less prone to overfitting despite being individually weaker.
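Because every level of a symmetric tree shares one split condition, evaluating it reduces to computing one bit per level and using the resulting number as an index into the leaf table. The sketch below shows that idea in plain Python. The data layout (`splits` as a list of `(feature_index, threshold)` pairs) is an assumption for illustration, and CatBoost's own storage format may order the bits differently.

```python
def oblivious_predict(row, splits, leaf_values):
    """Evaluate a symmetric tree: one comparison per depth level, no branching on node identity."""
    leaf_index = 0
    for feature_index, threshold in splits:
        bit = 1 if row[feature_index] > threshold else 0
        leaf_index = (leaf_index << 1) | bit   # append this level's outcome to the index
    return leaf_values[leaf_index]


splits = [(0, 5.0), (1, 0.5)]                  # depth 2, so 2**2 = 4 leaves
leaf_values = [-1.0, -0.5, 0.5, 1.0]           # indexed by the two comparison bits

predictions = [
    oblivious_predict([3.0, 0.2], splits, leaf_values),
    oblivious_predict([3.0, 0.9], splits, leaf_values),
    oblivious_predict([7.0, 0.2], splits, leaf_values),
    oblivious_predict([7.0, 0.9], splits, leaf_values),
]
```

A depth-6 tree therefore needs six comparisons and one table lookup per row, which is a large part of why prediction is fast.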
    from catboost import CatBoostRegressor

    cat_features = [0, 3, 7]  # column indices of categorical features

    model = CatBoostRegressor(
        iterations=500,
        learning_rate=0.05,
        depth=6,
        l2_leaf_reg=3.0,             # L2 regularization
        cat_features=cat_features,   # pass raw categoricals; no encoding needed
        verbose=0,
        random_seed=42,
    )
    model.fit(X_train, y_train)

The ability to pass raw categorical columns directly, without any manual encoding, is genuinely useful, not just a convenience feature. CatBoost’s internal ordered target statistics are statistically more sound than anything you’d build manually with simple label encoding or naive target encoding.
What the Research Actually Says About Accuracy
The natural next question is: given all these differences, which one wins on accuracy?
An independent benchmark study by Florek and Zagdański (2023), published on arXiv, evaluated four gradient boosting implementations (the three libraries plus vanilla GBM) across twelve diverse real-world datasets using both randomized search and Bayesian hyperparameter optimization. Its conclusion, in paraphrase: after tuning, XGBoost, LightGBM, and CatBoost perform very similarly, with LightGBM showing the most consistent results across datasets.
The more interesting finding, which you won’t see in most comparison blog posts: an untuned baseline version of CatBoost frequently outperformed tuned versions of XGBoost and LightGBM. The authors attribute this to CatBoost’s ordered boosting mechanism producing more reliable gradient estimates from the start — the statistical bias reduction means the default model is already closer to optimal without searching the hyperparameter space.
This has a practical implication. If you’re in a situation where hyperparameter tuning time is limited — a common constraint in production work — CatBoost’s strong baseline is a real advantage. If you have time to tune and your dataset is large, LightGBM’s speed lets you explore the hyperparameter space more thoroughly in the same wall-clock time. If interpretability and a large existing ecosystem matter, XGBoost’s community and tooling breadth is hard to beat.
For a full treatment of why ensembles of trees outperform single trees, the random forest post covers the bagging side of this story. For understanding how individual trees make decisions, decision trees explained is the right starting point before diving into any boosting library.
The Hyperparameter Tuning Reality
All three libraries share a common challenge: they’re sensitive to hyperparameter choices. The Florek & Zagdański benchmark used Bayesian optimization (Tree-structured Parzen Estimator) for tuning and found it outperformed randomized search across all four frameworks, particularly on AUC score. This is consistent with practical experience — gradient boosting models have interacting hyperparameters where grid search misses the effective regions.
The parameters that matter most across all three:
Learning rate and number of trees interact directly. Lower learning rate means more trees are needed for the same performance, but generalisation tends to improve. The standard strategy is to set the learning rate low (0.01–0.05), use early stopping against a validation set to find the right number of trees, then freeze both.
Tree complexity is controlled differently across libraries. XGBoost uses max_depth. LightGBM’s primary control is num_leaves (not max_depth), because leaf-wise growth means depth alone doesn’t determine complexity. CatBoost uses depth, which is more like XGBoost’s max_depth given its symmetric tree structure.
Regularization is native to all three, but the parameter names and coverage differ. XGBoost and LightGBM expose both L1 (reg_alpha) and L2 (reg_lambda) penalties on leaf weights. CatBoost exposes only an L2 penalty, l2_leaf_reg, which maps roughly to XGBoost’s reg_lambda.
Subsampling (row sampling per tree) and colsampling (feature sampling per tree) are available in all three. Row subsampling is the stochastic gradient boosting technique from Friedman’s 2002 paper, and column subsampling is borrowed from random forests. Both remain effective at reducing variance even in modern implementations.
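Since the same concept goes by different names in each library, a small lookup table helps when porting a configuration. The parameter names below are the ones used by each library's scikit-learn-style estimator, while the concept keys and the `translate` helper are made up for this sketch. Tree complexity is deliberately left out, because a depth and a leaf count are not interchangeable values.

```python
PARAM_NAMES = {
    "n_trees": {
        "xgboost": "n_estimators", "lightgbm": "n_estimators", "catboost": "iterations",
    },
    "learning_rate": {
        "xgboost": "learning_rate", "lightgbm": "learning_rate", "catboost": "learning_rate",
    },
    "l2": {
        "xgboost": "reg_lambda", "lightgbm": "reg_lambda", "catboost": "l2_leaf_reg",
    },
    "row_sampling": {
        # LightGBM also needs subsample_freq > 0 for row sampling to take effect.
        "xgboost": "subsample", "lightgbm": "subsample", "catboost": "subsample",
    },
    "col_sampling": {
        # CatBoost's rsm is the closest equivalent; it samples features per split selection.
        "xgboost": "colsample_bytree", "lightgbm": "colsample_bytree", "catboost": "rsm",
    },
}


def translate(settings, library):
    """Map concept-level settings to one library's keyword arguments."""
    return {PARAM_NAMES[concept][library]: value for concept, value in settings.items()}


settings = {"n_trees": 500, "learning_rate": 0.05, "l2": 3.0}
catboost_kwargs = translate(settings, "catboost")
xgboost_kwargs = translate(settings, "xgboost")
```

The translated values are only a starting point: an L2 strength of 3.0 does not have exactly the same effect in every library, so each configuration still deserves its own tuning pass.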
    # Shown for XGBoost; the same pattern works for the other two
    # once the parameter names are swapped for that library's own.
    import optuna
    import xgboost as xgb
    from sklearn.model_selection import cross_val_score

    def objective(trial):
        params = {
            "n_estimators": 1000,
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.1, log=True),
            "max_depth": trial.suggest_int("max_depth", 3, 8),
            "subsample": trial.suggest_float("subsample", 0.6, 1.0),
            "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
            "reg_lambda": trial.suggest_float("reg_lambda", 1e-2, 10.0, log=True),
        }
        model = xgb.XGBRegressor(**params, random_state=42, verbosity=0)
        # scikit-learn reports negated RMSE, so higher (closer to zero) is better.
        return cross_val_score(
            model, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error"
        ).mean()

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=100)

The Honest Decision Framework
Based on the research and practical experience:
Start with LightGBM if your dataset has more than ~100,000 rows or more than ~50 features. Its speed advantage from GOSS and EFB means you can do more experiments in the same time, which consistently matters more than small accuracy differences between libraries.
Use CatBoost when your dataset has meaningful categorical features that you’d otherwise have to encode manually, or when tuning time is limited and you need a strong baseline fast. Its untuned performance is genuinely better than XGBoost or LightGBM out of the box, and its internal handling of categoricals avoids the leakage risk that comes with naive target encoding.
Use XGBoost when you need maximum compatibility — it integrates with the widest range of tools, has the most extensive community documentation, and runs on every major deployment platform including edge environments where LightGBM’s newer features may not be available.
On a small-to-medium tabular dataset with no strong categorical signal, you’ll rarely find a meaningful accuracy gap between the three after tuning. Pick one, tune it properly, and don’t spend weeks benchmarking all three unless you have specific evidence the choice matters for your problem.
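For readers who prefer rules to prose, the framework can be written down as a small function. The thresholds come straight from the text above, but the order in which the rules are checked and the function itself are assumptions made for this sketch. Treat it as a starting heuristic, not a verdict.

```python
def suggest_library(n_rows, n_features, has_categoricals=False,
                    limited_tuning_time=False, needs_max_compatibility=False):
    """Encode the rules of thumb from this post; rule precedence is an assumption."""
    if needs_max_compatibility:
        return "xgboost"          # widest tooling and deployment support
    if has_categoricals or limited_tuning_time:
        return "catboost"         # native categoricals, strong untuned baseline
    if n_rows > 100_000 or n_features > 50:
        return "lightgbm"         # training speed lets you run more experiments
    return "any"                  # small-to-medium data: pick one and tune it properly


large_numeric = suggest_library(n_rows=2_000_000, n_features=300)
small_with_categories = suggest_library(n_rows=20_000, n_features=15, has_categoricals=True)
edge_deployment = suggest_library(n_rows=50_000, n_features=30, needs_max_compatibility=True)
small_plain = suggest_library(n_rows=5_000, n_features=10)
```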
FAQ
Is XGBoost better than LightGBM?
In raw accuracy after hyperparameter tuning, the two are comparable. XGBoost has a broader ecosystem and is often the safer choice for production deployments with strict compatibility requirements. LightGBM trains significantly faster on large high-dimensional datasets due to its GOSS and EFB algorithms from the original NIPS 2017 paper. The practical choice depends more on dataset size and training time constraints than on accuracy differences.
When should you use CatBoost over XGBoost?
CatBoost is particularly worth using when you have meaningful categorical features that you’d otherwise process manually, or when you want a strong model with minimal hyperparameter tuning. Research benchmarks show that an untuned CatBoost model frequently outperforms tuned versions of XGBoost and LightGBM, which the benchmark authors attribute to its ordered boosting mechanism avoiding the prediction shift bias present in other implementations.
What is gradient boosting in machine learning?
Gradient boosting is an ensemble technique that builds a predictive model by combining many weak learners — typically shallow decision trees — in sequence. Each tree is trained to correct the errors of the ensemble built so far, with the correction guided by the gradient of the loss function. The final prediction is the sum of all trees’ outputs, scaled by the learning rate. XGBoost, LightGBM, and CatBoost are all implementations of this framework with different engineering and statistical approaches underneath.
What does the learning rate do in gradient boosting?
In gradient boosting, the learning rate (also called shrinkage) scales the contribution of each new tree added to the ensemble. A smaller learning rate means each tree has less influence on the final prediction, which typically requires more trees to achieve the same training loss but produces a model that generalizes better. Chen and Guestrin’s original XGBoost paper explicitly describes this as leaving “space for future trees to improve the model.” In practice, values between 0.01 and 0.1 are most common, used alongside early stopping to find the optimal number of trees.
The core insight to take away is that these libraries didn’t arise from researchers independently re-implementing the same thing. Each one was a deliberate response to a specific limitation. XGBoost introduced second-order optimization and native regularization. LightGBM fixed the scan-all-data bottleneck with GOSS and EFB. CatBoost fixed a statistical bias that neither of the others addressed. Understanding why each exists makes your choice a reasoned one rather than a framework popularity contest.

