ML Common pitfalls
Classification/regression confusion
Do not choose the algorithm just from its name. Despite the name, logistic regression is a classification method. Lasso and Ridge are usually met as regularised regression models, but the same L1 and L2 penalties also appear in classifiers. First ask: is the target categorical or continuous?
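One way to make the "categorical or continuous?" question concrete is a quick look at the target column. The sketch below is a rough heuristic only: the function name `target_kind` and the `max_classes` threshold are hypothetical choices for illustration, and domain knowledge should override the guess.

```python
def target_kind(y, max_classes=10):
    """Heuristic guess at the target type: 'categorical' or 'continuous'."""
    values = set(y)
    # Strings and booleans are labels, not quantities.
    if any(isinstance(v, (str, bool)) for v in values):
        return "categorical"
    # Any non-integer number suggests a measured quantity.
    if any(float(v) != int(v) for v in values):
        return "continuous"
    # A handful of distinct integers usually means class codes;
    # many distinct integers look more like counts or measurements.
    return "categorical" if len(values) <= max_classes else "continuous"


print(target_kind(["spam", "ham", "spam"]))
print(target_kind([1.5, 2.25, 3.0]))
```

Integer targets are the ambiguous case: star ratings from 1 to 5 could reasonably be treated either way, which is why the threshold is only a starting point.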
Assuming linear separability
The cheat sheet notes that classification data may be linearly separable, but that is an assumption, not a guarantee. A flat hyperplane is a useful mental model for linear SVMs, logistic regression, and LDA, but many real boundaries curve or depend on feature interactions. Check decision boundaries, where the misclassified points fall, and validation performance before trusting a linear model.
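XOR is the classic case where no hyperplane works. The sketch below brute-forces linear boundaries over a small weight grid (the grid range and step are arbitrary choices for illustration) and then repeats the search after adding the interaction term x1 * x2 as an extra feature.

```python
from itertools import product

# XOR: the label is 1 when exactly one input is 1.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]


def accuracy(weights, bias, rows):
    """Accuracy of the linear rule: predict 1 when w . x + b > 0."""
    preds = [int(sum(w * v for w, v in zip(weights, row)) + bias > 0) for row in rows]
    return sum(p == t for p, t in zip(preds, y)) / len(y)


grid = [k / 2 for k in range(-6, 7)]  # -3.0 to 3.0 in steps of 0.5

# Best linear boundary on the raw features.
best_linear = max(
    accuracy((w1, w2), b, X) for w1, w2, b in product(grid, repeat=3)
)

# Same search with the interaction x1 * x2 added as a third feature.
X_inter = [(x1, x2, x1 * x2) for x1, x2 in X]
best_inter = max(
    accuracy((w1, w2, w3), b, X_inter) for w1, w2, w3, b in product(grid, repeat=4)
)

print(best_linear)  # 0.75
print(best_inter)   # 1.0
```

The model stays linear in its parameters; only the feature representation changed.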
Overfitting complex models
Complex models can memorise training data. Decision trees, random forests, gradient boosting, high-degree polynomial regression, and kernel methods can all overfit if left unconstrained. Use validation data, cross-validation, early stopping, regularisation, pruning, or simpler baselines.
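Memorisation is easy to see with a 1-nearest-neighbour regressor, which is used here only as a stand-in for any unconstrained model. The sketch assumes synthetic data (a sine curve plus Gaussian noise) and compares training error with validation error for a few values of k.

```python
import math
import random

rng = random.Random(0)


def make_data(n):
    """Noisy samples of sin(x) on [0, 6]."""
    xs = [rng.uniform(0, 6) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


x_train, y_train = make_data(40)
x_val, y_val = make_data(40)


def knn_predict(x, k):
    # Average the targets of the k nearest training points.
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x))[:k]
    return sum(y_train[i] for i in nearest) / k


def mse(xs, ys, k):
    return sum((knn_predict(x, k) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


for k in (1, 5, 15):
    print(k, round(mse(x_train, y_train, k), 3), round(mse(x_val, y_val, k), 3))
```

With k = 1 every training point is its own nearest neighbour, so the training error is exactly zero while the validation error is not. That gap, not the training score, is the thing to watch.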
Outliers in regression
The PDF explicitly flags that outliers can significantly affect regression lines and predictions. Squared error gives large residuals huge influence. Plot residuals, inspect leverage points, and consider robust losses, transformations, or domain-based filtering.
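To see how much one bad point can move a least-squares line, the sketch below corrupts a single reading in otherwise perfect data (y = 2x + 1). For comparison it uses the Theil-Sen estimator (median of pairwise slopes). That is a robust estimator swapped in for illustration, not a robust loss and not something the cheat sheet prescribes.

```python
import statistics
from itertools import combinations


def ols_slope(xs, ys):
    """Ordinary least-squares slope."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def theil_sen_slope(xs, ys):
    """Median of the slopes through every pair of points."""
    pairs = combinations(zip(xs, ys), 2)
    return statistics.median((y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in pairs)


xs = list(range(10))
clean = [2 * x + 1 for x in xs]
dirty = clean[:-1] + [100]  # one corrupted reading at x = 9

print(ols_slope(xs, clean))        # 2.0
print(ols_slope(xs, dirty))        # roughly 6.4
print(theil_sen_slope(xs, dirty))  # 2.0
```

The outlier sits at the edge of the x range, so it has high leverage as well as a large residual, which is the worst combination for squared error.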
Blindly assuming linearity
Linear regression and Ridge/Lasso are powerful baselines, but linearity may be wrong. Residuals that curve, fan out, or cluster suggest missing nonlinear terms, interactions, heteroscedastic noise, or a bad feature representation.
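A curved residual pattern is easiest to recognise on a toy case. The sketch below fits a straight line to data that is exactly quadratic (made-up values, chosen so the arithmetic is easy to follow) and prints the residuals in x order.

```python
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual = observed value minus fitted value.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(residuals)  # [5.0, 0.0, -3.0, -4.0, -3.0, 0.0, 5.0]
```

Positive at both ends and negative in the middle is the signature of a missing squared term. Residuals from a well-specified model should show no such structure.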
Feature scaling traps
KNN, SVMs, Ridge/Lasso, regularised logistic regression, and models trained by gradient descent are sensitive to feature scale. If one feature is measured in thousands and another in decimals, distances and penalties can be dominated by the large-scale feature. Fit the scaler on the training set only, then apply that same fitted transform to validation and test data, to avoid leakage.
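The effect on distances can be shown with three training rows and one query. The numbers below are invented (a salary-like feature in the tens of thousands and a score between 0 and 1). The scaler is a plain standardisation whose mean and standard deviation come from the training rows only.

```python
import math
import statistics

train = [(50000.0, 0.10), (50500.0, 0.90), (52000.0, 0.12)]
query = (50400.0, 0.11)


def nearest(rows, point):
    """Index of the row closest to point in Euclidean distance."""
    return min(range(len(rows)), key=lambda i: math.dist(rows[i], point))


# Fit the scaler on the training rows only.
columns = list(zip(*train))
means = [statistics.mean(c) for c in columns]
stds = [statistics.pstdev(c) for c in columns]


def scale(row):
    # Reuse the training statistics for every row, including new data.
    return tuple((v - m) / s for v, m, s in zip(row, means, stds))


raw_nn = nearest(train, query)
scaled_nn = nearest([scale(r) for r in train], scale(query))
print(raw_nn, scaled_nn)  # 1 0
```

On raw values the nearest neighbour is row 1, whose score is completely different from the query's, because a gap of 100 in the first feature outweighs a gap of 0.79 in the second. After scaling, row 0 wins.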
Data leakage
Information from validation/test data must not enter training, preprocessing, feature selection, imputation, or scaling, and test data must not influence hyperparameter selection either. Leakage produces inflated scores that collapse when the model meets new data.
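Feature selection is a common place for leakage to hide. In the sketch below both the labels and the features are pure coin flips, so no honest procedure should beat 50% on held-out rows. The sizes (40 rows, 200 features, 40 repeats) are arbitrary, and the "model" is simply the single feature that best matches the label.

```python
import random

rng = random.Random(0)
N_ROWS, N_FEATURES, N_REPEATS = 40, 200, 40


def agreement(feature, labels, rows):
    """Fraction of the given rows where the feature equals the label."""
    return sum(feature[i] == labels[i] for i in rows) / len(rows)


leaky_scores, honest_scores = [], []
for _ in range(N_REPEATS):
    labels = [rng.randint(0, 1) for _ in range(N_ROWS)]
    features = [[rng.randint(0, 1) for _ in range(N_ROWS)] for _ in range(N_FEATURES)]
    train_rows, test_rows = range(0, 20), range(20, 40)

    # Leaky: pick the best feature using every row, test rows included.
    leaky = max(features, key=lambda f: agreement(f, labels, range(N_ROWS)))
    # Honest: pick the best feature using training rows only.
    honest = max(features, key=lambda f: agreement(f, labels, train_rows))

    leaky_scores.append(agreement(leaky, labels, test_rows))
    honest_scores.append(agreement(honest, labels, test_rows))

leaky_mean = sum(leaky_scores) / N_REPEATS
honest_mean = sum(honest_scores) / N_REPEATS
print(round(leaky_mean, 2), round(honest_mean, 2))
```

The leaky test accuracy comes out well above chance even though there is nothing to learn. The honest one hovers around 0.5, which is the correct answer for noise.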
Metric mismatch
Accuracy can be misleading for imbalanced classification. Regression MSE can be dominated by rare huge errors. Match the metric to the real cost: precision/recall/F1 for rare-event classification, calibration for probabilities, MAE/RMSE/residual plots for regression.
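The accuracy trap is easy to reproduce by hand. The sketch assumes a made-up dataset with 5 positives in 100 rows and compares a classifier that always predicts the majority class with one that actually finds most of the positives.

```python
def report(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1] * 5 + [0] * 95

always_negative = [0] * 100
# Finds 4 of the 5 positives at the cost of 6 false alarms.
detector = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 89

lazy = report(y_true, always_negative)
useful = report(y_true, detector)
print(lazy)    # accuracy 0.95, recall 0.0
print(useful)  # accuracy 0.93, recall 0.8
```

The useless model wins on accuracy. Whether 6 false alarms are an acceptable price for 4 catches is a cost question, which is exactly what the metric should encode.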
Misreading probabilities
A classifier’s output score is not automatically a calibrated probability. Logistic regression is often better calibrated than boosted trees or neural networks, but calibration should still be checked if decisions depend on probability values.
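A basic calibration check bins the predicted probabilities and compares each bin's average prediction with the observed positive rate. The sketch below is a minimal version of that idea. The function name `reliability_table`, the equal-width bins, and the example scores are all assumptions for illustration.

```python
def reliability_table(probs, labels, n_bins=5):
    """Per non-empty bin: (mean predicted probability, observed positive rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, t in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, t))
    table = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(t for _, t in members) / len(members)
            table.append((mean_p, rate, len(members)))
    return table


# Hypothetical overconfident scores: "0.9" cases are positive only 60% of the time,
# and "0.1" cases are positive 40% of the time.
probs = [0.9] * 10 + [0.1] * 10
labels = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 6

table = reliability_table(probs, labels)
for mean_p, rate, count in table:
    print(round(mean_p, 2), round(rate, 2), count)
```

For a calibrated model the first two columns would roughly match. Here the ranking is right (higher scores do mean more positives) but the probabilities themselves are too extreme.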
Naive Bayes independence assumption
Naive Bayes assumes features are conditionally independent given the class. This is rarely exactly true. It can still be a strong baseline, but correlated features can make probabilities overconfident.
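The overconfidence is simple arithmetic. The sketch assumes one binary feature with likelihood 0.8 under the positive class and 0.4 under the negative class (invented numbers), then feeds Naive Bayes three perfectly correlated copies of it. The copies carry no new information, so the true posterior does not change, but the naive product treats each copy as fresh evidence.

```python
def naive_bayes_posterior(likelihoods_pos, likelihoods_neg, prior_pos=0.5):
    """P(class = 1 | features), multiplying per-feature likelihoods as if independent."""
    score_pos, score_neg = prior_pos, 1.0 - prior_pos
    for like_pos, like_neg in zip(likelihoods_pos, likelihoods_neg):
        score_pos *= like_pos
        score_neg *= like_neg
    return score_pos / (score_pos + score_neg)


one_copy = naive_bayes_posterior([0.8], [0.4])
three_copies = naive_bayes_posterior([0.8] * 3, [0.4] * 3)

print(round(one_copy, 3))      # 0.667
print(round(three_copies, 3))  # 0.889
```

The predicted class is the same in both cases, which is why Naive Bayes can classify well while still reporting probabilities that are too extreme.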
KNN scaling and dimensionality
KNN sounds simple: look at neighbours. But in high dimensions, distances become less informative, irrelevant features hurt badly, and prediction can be slow because the training set must be searched.
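"Distances become less informative" can be measured as the relative gap between the nearest and farthest point from a query. The sketch assumes points drawn uniformly from the unit hypercube, which is a convenient toy setting rather than a model of real data.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


for dim in (2, 20, 200):
    print(dim, round(relative_contrast(dim), 2))
```

As the dimension grows the ratio shrinks toward zero: the "nearest" neighbour is barely nearer than anything else, so its label carries little extra information.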
Tree/forest interpretability overclaim
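A single small tree is interpretable. A large random forest is much less so. Impurity-based feature importances are biased toward continuous and high-cardinality features, correlated features can share or swap importance, and none of it proves causation.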
Gaussian-process cost and kernel dependence
GPR gives uncertainty, which is lovely, but it depends heavily on the kernel, and exact inference costs O(n^3) time and O(n^2) memory in the number of training points. Do not treat the 95% interval as magic truth; it is conditional on the model assumptions.
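How strongly the interval depends on the kernel shows up even with two training points. The sketch assumes a zero-mean GP with a unit-variance RBF kernel and a tiny noise term, and uses the closed-form inverse of a 2x2 matrix so that no linear-algebra library is needed. Real implementations solve the general n-by-n system instead.

```python
import math


def rbf(a, b, length_scale):
    """Unit-variance RBF (squared-exponential) kernel."""
    return math.exp(-((a - b) ** 2) / (2.0 * length_scale ** 2))


def posterior_std(x_new, x1, x2, length_scale, noise=1e-6):
    """Posterior standard deviation at x_new given training inputs x1 and x2."""
    # Training covariance K = [[a, b], [b, a]] with noise on the diagonal.
    a = 1.0 + noise
    b = rbf(x1, x2, length_scale)
    k1 = rbf(x_new, x1, length_scale)
    k2 = rbf(x_new, x2, length_scale)
    # k^T K^{-1} k, using K^{-1} = [[a, -b], [-b, a]] / det.
    det = a * a - b * b
    quad = (a * k1 * k1 - 2.0 * b * k1 * k2 + a * k2 * k2) / det
    return math.sqrt(max(1.0 - quad, 0.0))


# Same data, same query point, two different length scales.
smooth = posterior_std(0.5, 0.0, 1.0, length_scale=1.0)
rough = posterior_std(0.5, 0.0, 1.0, length_scale=0.1)
print(round(smooth, 3), round(rough, 3))
```

Halfway between the two observations, the long length scale reports a standard deviation of about 0.17 while the short one reports about 1.0, which is the prior. Neither is "the" uncertainty; each is what that kernel implies.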
Bad validation hygiene
Validation guides model choice; the final test set estimates performance after choices are fixed. If you repeatedly tune based on the test set, it becomes another validation set and the reported result is optimistic.
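The split itself is the easy part; the discipline is in touching the test rows only once. A minimal sketch of a three-way index split follows, assuming the rows are independent (no time ordering or groups). The function name `three_way_split` and the default fractions are arbitrary.

```python
import random


def three_way_split(n_rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle row indices and cut them into train, validation, and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_fraction)
    n_val = int(n_rows * val_fraction)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))  # 60 20 20
```

Tune on `val_idx` as often as needed, then score on `test_idx` once, after every choice is frozen. For time series or grouped data a random shuffle like this would itself leak, so the split would need to respect time or group boundaries.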