ML Models from cheatsheet
This note captures the model list from ML_cheatsheat in text form. The PDF is organised around classification versus regression, then lists combined model families and task-specific models.
Classification vs regression
Classification predicts a categorical label for a new input after training on examples with known labels. Regression predicts a continuous value after training on examples with known numeric targets. Both are supervised learning because the training examples include inputs and targets.
Classification is good when the output is naturally a class: spam/not spam, image class, purchase yes/no, bacteria/virus, or Class A/B/C. Regression is good when the output is a continuous number: house price, weight, sales, temperature, or a physical measurement.
Support vector machines
For classification, an SVM finds the hyperplane that best separates classes by maximising the margin. The important geometry is: positive hyperplane, negative hyperplane, maximum-margin hyperplane, and support vectors. The support vectors are the points that matter most for the boundary.
For regression, support vector methods fit a hyperplane/function that minimises prediction error while often ignoring errors inside a tolerance tube. SVMs can work well in high-dimensional spaces, especially with kernels, but the linear-separability assumption in the simplest picture is a limitation.
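The geometry above can be checked numerically. This is a minimal sketch that assumes a hyperplane has already been fitted; the weights `w`, offset `b`, and toy points are hypothetical values chosen for illustration, not output from a real solver.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Distance between the positive (w.x + b = +1) and negative (w.x + b = -1) hyperplanes."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical, already-fitted hyperplane and toy points (illustrative values only).
w, b = [1.0, 1.0], -3.0
points = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [0.0, 0.0]]

predictions = [1 if decision_value(w, b, x) >= 0 else -1 for x in points]

# Support vectors lie exactly on the positive or negative hyperplane: |w.x + b| == 1.
support_vectors = [x for x in points if abs(abs(decision_value(w, b, x)) - 1.0) < 1e-9]

print(predictions, support_vectors, margin_width(w))
```

Only the two points on the margin hyperplanes constrain the boundary here; moving the other two points further away would leave `w` and `b` unchanged.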
K-nearest neighbours
KNN is similarity-based. The prediction for a new point comes from the nearest training points.
- Classification: vote among nearby labels.
- Regression: average the numerical values of nearby neighbours.
KNN is intuitive and non-parametric, but it is sensitive to feature scaling, irrelevant features, distance choice, and the value of $k$.
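The vote-versus-average distinction can be sketched in a few lines. This assumes Euclidean distance and unweighted neighbours; the toy data and the function names are illustrative, not taken from the PDF.

```python
import math
from collections import Counter

def nearest(train, query, k):
    """Return the k training rows closest to the query (Euclidean distance)."""
    return sorted(train, key=lambda row: math.dist(row[0], query))[:k]

def knn_classify(train, query, k):
    """Classification: majority vote among the k nearest labels."""
    labels = [label for _, label in nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, query, k):
    """Regression: mean of the k nearest target values."""
    values = [value for _, value in nearest(train, query, k)]
    return sum(values) / len(values)

# Hypothetical toy data: (features, target) pairs.
labelled = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
            ((4.0, 4.0), "B"), ((4.2, 3.9), "B")]
valued = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

label = knn_classify(labelled, (1.1, 1.0), k=3)
value = knn_regress(valued, (2.1,), k=3)
print(label, value)
```

Because the distance is computed on raw feature values, rescaling one feature changes which neighbours are nearest, which is why scaling matters.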
Decision trees
A decision tree recursively splits the input space using feature tests. The tree has a root node, decision nodes, subtrees, and leaf nodes.
- Classification tree: each leaf corresponds to a class prediction or class distribution.
- Regression tree: each leaf predicts the average target value of the training samples in that region.
Trees are interpretable and handle nonlinear structure, but single trees can overfit and be unstable.
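A single split of a regression tree can be sketched as below. The assumptions are mine: one numeric feature, squared error as the split criterion, and candidate thresholds midway between adjacent sorted values. A full tree would apply the same search recursively to each side.

```python
def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared errors around the mean of the values."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(xs, ys):
    """Return (cost, threshold, left_leaf_value, right_leaf_value) for the best single split."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            # Each leaf predicts the average target of its training samples.
            best = (cost, threshold, mean(left), mean(right))
    return best

# Hypothetical data with an obvious jump between x = 3 and x = 10.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
cost, threshold, left_value, right_value = best_split(xs, ys)
print(threshold, left_value, right_value)
```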
Random forests
A random forest is an ensemble of decision trees. Each tree is trained on a random (bootstrap) sample of the training data and often considers only a random subset of features.
- Classification: each tree votes for a class; the majority vote is the final prediction.
- Regression: each tree predicts a continuous value; the final prediction is the average.
The PDF diagram’s flow is: training set → multiple training data samples → decision trees → voting/averaging → prediction.
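The sampling and aggregation steps of that flow can be sketched as follows. This is not a full forest: tree training is omitted, and the per-tree outputs are hypothetical values standing in for already-trained trees.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, one such sample per tree."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(tree_predictions):
    """Classification: the most common class across trees wins."""
    return Counter(tree_predictions).most_common(1)[0][0]

def average(tree_predictions):
    """Regression: the mean of the per-tree predictions."""
    return sum(tree_predictions) / len(tree_predictions)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [(0.5, "A"), (1.5, "A"), (2.5, "B"), (3.5, "B")]
samples = [bootstrap_sample(rows, rng) for _ in range(3)]  # one sample per tree

# Hypothetical outputs of three already-trained trees for one new input.
class_prediction = majority_vote(["B", "A", "B"])
value_prediction = average([2.0, 3.0, 4.0])
print(class_prediction, value_prediction)
```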
Gradient boosting
Gradient boosting builds an ensemble sequentially. Each new weak learner, often a decision tree, corrects errors made by the previous learners.
- Classification: weak classifiers are combined, often by a weighted sum, to improve class predictions.
- Regression: each new model is fitted to residual errors from the previous model, refining the continuous prediction.
Boosting is often very strong on tabular data, but can overfit if the learning rate, tree depth, or number of estimators is poorly chosen, or if the validation scheme is weak.
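The residual-fitting loop for regression can be sketched with one-split trees (stumps) as the weak learners. The assumptions here are squared-error loss, a single numeric feature, and a fixed learning rate; `fit_stump` and `boost` are illustrative names, not a library API.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree (a stump) by minimising squared error."""
    best = None
    ordered = sorted(set(xs))
    for lo, hi in zip(ordered, ordered[1:]):
        threshold = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((t - left_value) ** 2 for t in left)
                 + sum((t - right_value) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value

def boost(xs, ys, n_estimators, learning_rate=0.1):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical curved data (y = x squared).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0]
few = boost(xs, ys, n_estimators=5)
many = boost(xs, ys, n_estimators=50)
print(mse(few, xs, ys), mse(many, xs, ys))
```

Training error keeps falling as estimators are added, which is exactly why a held-out validation set is needed to decide when to stop.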
Lasso regression
Lasso penalises the absolute size of coefficients. For regression, it minimises squared error plus an $L_1$ penalty. The key effect is sparsity: some coefficients shrink exactly to zero, excluding irrelevant features.
The PDF also mentions using this idea to simplify a classification model and improve accuracy by removing irrelevant features. More generally, Lasso is a regularisation/feature-selection technique that can be used inside linear models.
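The sparsity effect can be seen in the simplest case. For one coefficient with objective `0.5 * (z - beta)**2 + lam * abs(beta)`, where `z` is the unpenalised estimate, the minimiser is the soft-thresholding rule below. Applying it coefficient by coefficient is exact only under the assumption of orthonormal features; general Lasso solvers iterate. The input values are hypothetical.

```python
def soft_threshold(z, lam):
    """Minimiser of 0.5 * (z - beta)**2 + lam * abs(beta)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small coefficients are set exactly to zero

ols_coefficients = [3.0, -0.2, 0.05, -1.5]  # hypothetical unpenalised estimates
lam = 0.5
lasso_coefficients = [soft_threshold(z, lam) for z in ols_coefficients]
print(lasso_coefficients)
```

The two small coefficients become exactly zero, while the large ones are shrunk by `lam` but kept.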
Ridge regression
Ridge penalises the square of coefficients. For regression, it minimises squared error plus an $L_2$ penalty. The key effect is shrinkage: coefficients are pushed toward zero, reducing model complexity and improving robustness to noise or correlated features.
The PDF writes the idea as a least-squares line modified by a penalty term like $\lambda \cdot \text{slope}^2$. More generally, for many coefficients the penalty is $\lambda \sum_j \beta_j^2$.
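Shrinkage is easy to see for a single slope. As a simplifying assumption, the sketch fits a line through the origin, so the objective is `sum((y - beta*x)**2) + lam * beta**2`; setting its derivative to zero gives `beta = sum(x*y) / (sum(x*x) + lam)`. The data are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Minimiser of sum((y - beta*x)**2) + lam * beta**2 (line through the origin)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x
slopes = {lam: ridge_slope(xs, ys, lam) for lam in (0.0, 1.0, 14.0)}
print(slopes)
```

With `lam = 0` the ordinary least-squares slope is recovered; larger values pull the slope toward zero without ever making it exactly zero, in contrast to Lasso.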
Logistic regression
Logistic regression is a classification model, despite the name “regression”. It models the probability of a binary outcome from input variables. The PDF example uses user age, income, and gender to predict purchase yes/no.
The model outputs a probability; a threshold turns that probability into a class label. This is useful because the threshold can be adjusted when false positives and false negatives have different costs.
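The probability-then-threshold step can be sketched as follows. The coefficients are hypothetical stand-ins for a fitted model, and the feature encoding (age in decades, income in units of 10k) is an assumption made for the example.

```python
import math

def sigmoid(z):
    """Map any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of the positive class (purchase = yes)."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

def predict_label(probability, threshold=0.5):
    return "yes" if probability >= threshold else "no"

# Hypothetical fitted coefficients for (age in decades, income in 10k units).
weights, bias = [0.4, 0.3], -2.5
p = predict_proba(weights, bias, [3.0, 5.0])

default_label = predict_label(p)        # threshold 0.5
strict_label = predict_label(p, 0.7)    # stricter threshold, fewer false positives
print(round(p, 3), default_label, strict_label)
```

The same probability yields different labels under the two thresholds, which is the lever to use when false positives and false negatives have different costs.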
Linear discriminant analysis
Linear discriminant analysis is used for classification and dimensionality reduction. It finds linear combinations of features that best separate multiple classes. The PDF’s medical-style example shows features such as CRP and temperature being projected into discriminant scores to separate bacteria from virus.
LDA is closely linked to Gaussian class models with shared covariance. It is simple and interpretable, but its assumptions can be too clean for messy data.
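For two classes, the discriminant direction can be sketched with Fisher's formula, `w = Sw^-1 (m1 - m0)`, where `Sw` is the pooled within-class scatter matrix. The (CRP, temperature) readings below are invented for illustration and are not taken from the PDF; the code handles only two features so the 2x2 inverse can be written out by hand.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(2)]

def scatter(rows, mu):
    """2x2 within-class scatter: sum of outer products of centred rows."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mu[0], r[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class0, class1):
    """Direction w = Sw^-1 (m1 - m0) that best separates the projected classes."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def project(w, row):
    """Discriminant score: the row projected onto w."""
    return w[0] * row[0] + w[1] * row[1]

# Hypothetical (CRP, temperature) readings; values are illustrative only.
virus = [[10.0, 37.5], [15.0, 38.0], [12.0, 37.0], [20.0, 38.2]]
bacteria = [[80.0, 39.0], [95.0, 39.5], [70.0, 38.6], [110.0, 39.8]]

w = fisher_direction(virus, bacteria)
virus_scores = [project(w, r) for r in virus]
bacteria_scores = [project(w, r) for r in bacteria]
print(max(virus_scores), min(bacteria_scores))
```

On this toy data every bacteria score is higher than every virus score, so a single cut-off on the one-dimensional score separates the classes.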
Naive Bayes classifier
Naive Bayes calculates probabilities for each class and chooses the class with the highest probability. It uses Bayes’ theorem:
$$P(A\mid B)=\frac{P(B\mid A)\,P(A)}{P(B)}.$$
The “naive” part is the assumption that features are conditionally independent given the class. That assumption is often false, but the classifier can still work surprisingly well, especially for text-like count features.
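The independence assumption turns the class score into a simple product. The sketch below multiplies a prior by per-feature likelihoods and normalises; the priors and likelihoods are hypothetical numbers, and for brevity only features that are present are scored.

```python
def naive_bayes_posterior(priors, likelihoods, observed):
    """priors[c] = P(c); likelihoods[c][f] = P(feature f present | c)."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for feature in observed:
            score *= likelihoods[c][feature]  # independence: multiply per feature
        scores[c] = score
    total = sum(scores.values())  # plays the role of P(B) in Bayes' theorem
    return {c: s / total for c, s in scores.items()}

# Hypothetical word statistics for a two-class text problem.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.4},
}

posterior = naive_bayes_posterior(priors, likelihoods, ["free"])
predicted = max(posterior, key=posterior.get)
print(posterior, predicted)
```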
Linear regression
Linear regression finds the best-fitting straight line or hyperplane through data points to predict the relationship between independent variables and a dependent variable. The PDF example is house price versus size.
It is a strong baseline because the coefficients are interpretable, but it can fail when the relationship is curved, interaction-heavy, or dominated by outliers.
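For one input variable the least-squares line has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the means. The house sizes and prices below are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: price is exactly 3 * size.
sizes = [50.0, 80.0, 100.0, 120.0]
prices = [150.0, 240.0, 300.0, 360.0]

intercept, slope = fit_line(sizes, prices)
predicted_price = intercept + slope * 90.0
print(intercept, slope, predicted_price)
```

The slope reads directly as "price change per unit of size", which is the interpretability the baseline is valued for.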
Polynomial regression
Polynomial regression fits a polynomial equation to capture nonlinear relationships. For one input, a quadratic model has the form
$$y=\beta_0+\beta_1 x+\beta_2 x^2.$$
It is useful when residuals from a straight-line fit show curvature. The trap is high-degree polynomials: they can wiggle wildly and overfit.
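A quadratic fit is still linear in the coefficients, so it can be solved with the normal equations. The sketch below builds the columns 1, x, x^2 and solves the small system by Gaussian elimination; the helper names are illustrative and the data are generated from known coefficients so the result can be checked.

```python
def solve(a, b):
    """Solve a small linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [beta_0, beta_1, ...] via (X^T X) beta = X^T y."""
    rows = [[x ** p for p in range(degree + 1)] for x in xs]
    size = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(size)] for i in range(size)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(size)]
    return solve(xtx, xty)

# Hypothetical data generated from y = 1 + 2x + 3x^2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1 + 2 * x + 3 * x * x for x in xs]
beta = fit_polynomial(xs, ys, degree=2)
print(beta)
```

The normal equations become badly conditioned as the degree grows, which is one numerical face of the high-degree trap.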
Gaussian process regression
Gaussian process regression treats the unknown function probabilistically: it predicts values at both observed and new inputs and attaches an uncertainty estimate to each prediction. The PDF’s example shows a prediction curve with a 95% confidence interval around observations.
GPR is useful when uncertainty matters and data are not huge. It depends strongly on the kernel choice and can become computationally expensive as the number of training points grows.
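The prediction-plus-uncertainty idea can be sketched for one input dimension. The assumptions are mine: a zero-mean prior, an RBF kernel with length scale 1, and a small noise term for numerical stability. The linear solves inside `gp_predict` are where the cost grows with the number of training points.

```python
import math

def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel: nearby inputs are strongly correlated."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(a, b):
    """Solve a small linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def gp_predict(train_x, train_y, x_new, noise=1e-6):
    """Posterior mean and variance at x_new under a zero-mean GP prior."""
    k = [[rbf(p, q) + (noise if i == j else 0.0) for j, q in enumerate(train_x)]
         for i, p in enumerate(train_x)]
    k_star = [rbf(p, x_new) for p in train_x]
    alpha = solve(k, train_y)  # (K + noise*I)^-1 y
    v = solve(k, k_star)       # (K + noise*I)^-1 k_star
    mean = sum(ks * a for ks, a in zip(k_star, alpha))
    variance = rbf(x_new, x_new) - sum(ks * vi for ks, vi in zip(k_star, v))
    return mean, max(variance, 0.0)

# Hypothetical observations.
train_x = [0.0, 1.0, 2.0]
train_y = [0.0, 1.0, 0.5]

mean_near, var_near = gp_predict(train_x, train_y, 1.0)   # at an observed input
mean_far, var_far = gp_predict(train_x, train_y, 10.0)    # far from all data

# 95% band under the Gaussian posterior: mean +/- 1.96 standard deviations.
band_far = (mean_far - 1.96 * math.sqrt(var_far), mean_far + 1.96 * math.sqrt(var_far))
print(mean_near, var_near, mean_far, var_far, band_far)
```

The variance is close to zero at an observed input and returns to the prior variance far from the data, which is the widening band in the PDF's plot.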