ML Common pitfalls

Classification/regression confusion

Do not choose the algorithm just from the name. Logistic regression is mainly a classification method. Lasso and Ridge are usually regression regularisers, but the same penalties can appear in classifiers. First ask: is the target categorical or continuous?

Assuming linear separability

The cheat sheet notes that classification data may be linearly separable, but that is an assumption to test, not a given. A flat hyperplane is a useful mental model for linear SVMs, logistic regression, and LDA, but many real boundaries curve or depend on feature interactions. Inspect decision boundaries, misclassified points, and validation performance before trusting a linear model.
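As a small illustration of a boundary that depends on an interaction, the sketch below uses the XOR pattern. It brute-forces a coarse grid of weights (the grid and the helper name `best_accuracy` are assumptions made for this example, not a training method) and compares the best accuracy a thresholded linear score can reach with and without a product feature.

```python
from itertools import product

# XOR: the classic data set that no single straight boundary can separate.
POINTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
LABELS = [0, 1, 1, 0]

GRID = [i / 2 for i in range(-4, 5)]  # candidate weights: -2.0, -1.5, ..., 2.0


def best_accuracy(features):
    """Best training accuracy of any thresholded linear score on the grid."""
    n_features = len(features[0])
    best = 0.0
    for *weights, bias in product(GRID, repeat=n_features + 1):
        correct = 0
        for row, label in zip(features, LABELS):
            score = sum(w * x for w, x in zip(weights, row)) + bias
            correct += int((score > 0) == bool(label))
        best = max(best, correct / len(LABELS))
    return best


linear_only = best_accuracy(POINTS)
# Same points with an added interaction feature x1 * x2.
with_interaction = best_accuracy([(x1, x2, x1 * x2) for x1, x2 in POINTS])
print(linear_only, with_interaction)
```

The raw features top out at three of four points correct, while the added interaction term lets a linear score classify all four.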

Overfitting complex models

Complex models can memorise training data. Decision trees, random forests, gradient boosting, high-degree polynomial regression, and kernel methods can all overfit if left unconstrained. Use validation data, cross-validation, early stopping, regularisation, pruning, or simpler baselines.
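A minimal sketch of memorisation, using a 1-nearest-neighbour rule as a stand-in for any unconstrained model (it is not one of the methods listed above, just the shortest one to write by hand). The synthetic data, the 20% label noise, and the helper names are assumptions for illustration.

```python
import random

rng = random.Random(0)


def make_data(n):
    """1-D points labelled by x > 0.5, with 20% of labels flipped as noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.2:
            label = 1 - label
        data.append((x, label))
    return data


def predict_1nn(train, x):
    """Return the label of the single closest training point."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def accuracy(train, data):
    return sum(predict_1nn(train, x) == y for x, y in data) / len(data)


train, valid = make_data(200), make_data(200)
train_acc = accuracy(train, train)   # every point is its own nearest neighbour
valid_acc = accuracy(train, valid)   # noise memorised in training does not transfer
print(f"train={train_acc:.2f} valid={valid_acc:.2f}")
```

The gap between the two numbers is the signal to look for: a perfect training score says nothing about held-out performance.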

Outliers in regression

The PDF explicitly flags that outliers can significantly affect regression lines and predictions. Squared error gives large residuals huge influence. Plot residuals, inspect leverage points, and consider robust losses, transformations, or domain-based filtering.
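To make the influence of a single bad point concrete, the sketch below fits a least-squares slope to ten points on the line y = 2x + 1, then corrupts the last point. The Theil-Sen estimator (median of pairwise slopes) is used here as one example of a robust alternative; it is an assumption of this sketch, not something the cheat sheet prescribes.

```python
from itertools import combinations
from statistics import mean, median


def ols_slope(xs, ys):
    """Least-squares slope: cov(x, y) / var(x)."""
    x_bar, y_bar = mean(xs), mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def theil_sen_slope(xs, ys):
    """Median of all pairwise slopes, a simple robust alternative."""
    pts = list(zip(xs, ys))
    return median((y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in combinations(pts, 2))


xs = list(range(10))
clean = [2 * x + 1 for x in xs]      # exact line y = 2x + 1
dirty = clean[:-1] + [100]           # last point corrupted (19 -> 100)

print(ols_slope(xs, clean), ols_slope(xs, dirty), theil_sen_slope(xs, dirty))
```

One corrupted point out of ten roughly triples the least-squares slope, while the median-based estimate stays at 2.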

Blindly assuming linearity

Linear regression and Ridge/Lasso are powerful baselines, but linearity may be wrong. Residuals that curve, fan out, or cluster suggest missing nonlinear terms, interactions, heteroscedastic noise, or a bad feature representation.
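A quick way to see a curved residual pattern without plotting is to look at the residual signs in order of x. The sketch below fits a straight line to a purely quadratic target; the data are synthetic and chosen only to make the pattern obvious.

```python
from statistics import mean

xs = list(range(-5, 6))
ys = [x * x for x in xs]            # truly quadratic target

# Ordinary least-squares line fitted by hand.
x_bar, y_bar = mean(xs), mean(ys)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
intercept = y_bar - slope * x_bar

# Residual sign for each point, ordered by x.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
signs = "".join("+" if r > 0 else "-" for r in residuals)
print(signs)
```

Residuals from a well-specified linear model should change sign irregularly; a long run of one sign bracketed by the other suggests a missing nonlinear term.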

Feature scaling traps

KNN, SVMs, Ridge/Lasso, regularised logistic regression, and models trained by gradient descent are sensitive to feature scale. If one feature is measured in thousands and another in decimals, distances and penalties can be dominated by the large-scale feature. Fit the scaler on the training set only, then apply it unchanged to validation and test data, to avoid leakage.
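The fit-on-train-only rule can be sketched in a few lines. The helper names `fit_scaler` and `transform` and the feature values are hypothetical; a library scaler would normally do the same job.

```python
from statistics import mean, pstdev


def fit_scaler(rows):
    """Learn per-column mean and standard deviation from TRAINING rows only."""
    columns = list(zip(*rows))
    return [mean(c) for c in columns], [pstdev(c) for c in columns]


def transform(rows, means, stds):
    """Apply previously learned statistics; never refit on new data."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical features: income in currency units, and a ratio in [0, 1].
train = [[30_000, 0.10], [52_000, 0.40], [75_000, 0.35], [41_000, 0.90]]
test = [[60_000, 0.20]]

means, stds = fit_scaler(train)
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)   # reuses the training statistics
print(test_scaled)
```

After scaling, both training columns have mean 0 and standard deviation 1, so neither dominates a distance or penalty purely because of its units.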

Data leakage

Information from the test set must not enter training, preprocessing, feature selection, imputation, scaling, or hyperparameter selection, and validation data must not be used to fit any of those steps. Leakage inflates measured performance, and the gap shows up as soon as the model meets new data.

Metric mismatch

Accuracy can be misleading for imbalanced classification. Regression MSE can be dominated by rare huge errors. Match the metric to the real cost: precision/recall/F1 for rare-event classification, calibration for probabilities, MAE/RMSE/residual plots for regression.
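The accuracy trap on imbalanced data is easy to reproduce by hand. The sketch below scores a model that never predicts the rare class; the 5% positive rate and the function names are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn


def report(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1] * 5 + [0] * 95          # 5% positive class
always_negative = [0] * 100          # a model that never flags the rare event
scores = report(y_true, always_negative)
print(scores)
```

The model reaches 95% accuracy while finding none of the positives, which is exactly what recall and F1 expose.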

Misreading probabilities

A classifier’s output score is not automatically a calibrated probability. Logistic regression is often better calibrated than boosted trees or neural networks, but calibration should still be checked if decisions depend on probability values.
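One simple calibration check is to bin predictions by score and compare the mean score in each bin with the observed positive rate. The sketch below does this alongside a Brier score; the scores and labels are made up for illustration, and the bin count is an arbitrary choice.

```python
def reliability_table(probs, labels, n_bins=5):
    """Group predictions into equal-width bins; compare mean score with observed rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    table = []
    for members in bins:
        if members:
            mean_score = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            table.append((round(mean_score, 3), round(observed, 3), len(members)))
    return table


def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


# Hypothetical scores: the model says 0.9 for ten cases, but only six are positive.
probs = [0.9] * 10 + [0.1] * 10
labels = [1] * 6 + [0] * 4 + [0] * 10
table = reliability_table(probs, labels)
print(table)
print(brier_score(probs, labels))
```

Each row reads (mean predicted score, observed positive rate, count). A bin with a mean score of 0.9 but an observed rate of 0.6 is overconfident, even if the ranking of cases is good.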

Naive Bayes independence assumption

Naive Bayes assumes features are conditionally independent given the class. This is rarely exactly true. It can still be a strong baseline, but correlated features can make probabilities overconfident.
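The overconfidence effect follows directly from the arithmetic. In the sketch below, a hypothetical feature that is three times more likely under class 1 is counted once, then counted three times as if the copies were independent; the function name and the numbers are assumptions for illustration.

```python
def posterior_positive(likelihood_ratios, prior=0.5):
    """Naive Bayes posterior P(class=1 | features) from per-feature likelihood ratios.

    Each ratio is P(feature | class=1) / P(feature | class=0); the naive
    assumption multiplies them as if the features were independent.
    """
    odds = prior / (1 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1 + odds)


# Hypothetical evidence: one feature that is 3x more likely under class 1.
one_copy = posterior_positive([3.0])
# The same feature entered three times (perfectly correlated copies).
three_copies = posterior_positive([3.0, 3.0, 3.0])
print(one_copy, three_copies)
```

The evidence is identical in both cases, yet the posterior moves from 0.75 to 27/28 (about 0.96) because the correlated copies are multiplied in as if each were new information.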

KNN scaling and dimensionality

KNN sounds simple: look at neighbours. But in high dimensions, distances become less informative, irrelevant features hurt badly, and prediction can be slow because the training set must be searched.
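The loss of distance contrast can be observed with uniformly random points. The sketch below measures how much farther the farthest point is than the nearest one, relative to the nearest, as the dimension grows; the uniform data, the sample size, and the function name are assumptions for illustration.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)


for dim in (2, 20, 200):
    print(dim, round(relative_contrast(dim), 2))
```

As the dimension rises the contrast shrinks towards zero, meaning the "nearest" neighbour is barely nearer than any other point.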

Tree/forest interpretability overclaim

A single small tree is interpretable. A large random forest is much less so. Impurity-based feature importances can be biased towards high-cardinality or continuous features, and no importance score proves causation.

Gaussian-process cost and kernel dependence

GPR gives uncertainty estimates, which is lovely, but it depends heavily on the kernel and scales badly with training-set size: exact inference costs O(n^3) time and O(n^2) memory in the number of training points. Do not treat the 95% interval as magic truth; it is conditional on the model assumptions.

Bad validation hygiene

Validation guides model choice; the final test set estimates performance after choices are fixed. If you repeatedly tune based on the test set, it becomes another validation set and the reported result is optimistic.
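The separation of roles can be sketched as a single shuffled three-way split. The 60/20/20 proportions, the seed, and the function name are assumptions for illustration.

```python
import random


def three_way_split(items, valid_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then carve out disjoint train / validation / test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_valid = int(len(shuffled) * valid_frac)
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, valid, test


train, valid, test = three_way_split(range(100))
# Tune on `valid` as often as needed; score on `test` once, at the very end.
print(len(train), len(valid), len(test))
```

Keeping the test indices fixed and untouched until model choices are frozen is what makes the final number an honest estimate.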
