ML Key concepts
Supervised learning
Supervised learning develops predictive models from both input data and output data. The training set contains pairs $(x_i, y_i)$: $x_i$ is the feature vector and $y_i$ is the target. The target determines the task type.
- Classification: $y_i$ is a category, e.g. spam/not spam, image class, disease class, “purchase yes/no”.
- Regression: $y_i$ is a continuous value, e.g. price, weight, temperature, flux, risk score, or a measured physical quantity.
The PDF’s central split is exactly this: classification predicts labels from known labels; regression predicts continuous values from known values.
Classification
Classification maps an input to a categorical label or to probabilities over labels. It is useful when the desired output is discrete and interpretable, such as “Class A vs Class B” or “yes vs no”. Many classifiers output a score first, then a decision rule converts the score into a label.
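The score-then-decision-rule pattern can be sketched in a few lines. This is a minimal illustration, assuming a linear score passed through a sigmoid; the weights, bias, and threshold values are hypothetical and not fitted to any data.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def linear_score(x, weights, bias):
    """Raw score: weighted sum of the features plus a bias term."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def predict_label(x, weights, bias, threshold=0.5):
    """Turn the score into a probability, then apply the decision rule."""
    p = sigmoid(linear_score(x, weights, bias))
    return ("yes" if p >= threshold else "no"), p

# Hypothetical parameters, chosen only for illustration (not learned).
weights, bias = [2.0, -1.0], 0.0
x = [1.0, 0.5]

label, p = predict_label(x, weights, bias)                         # default threshold 0.5
strict_label, _ = predict_label(x, weights, bias, threshold=0.9)   # stricter rule
print(label, round(p, 3), strict_label)
```

The same score gives different labels under different thresholds, which is why the decision rule is worth treating as a separate, tunable step.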
Advantages from the cheat sheet:
- Results can be easy to interpret because each input receives a clear class.
- It matches categorical-output problems such as spam detection, purchase prediction, image recognition, and medical diagnosis.
Disadvantages / traps:
- Some simple classifiers assume the data are linearly separable; real data often are not.
- Complex classifiers can overfit and generalise badly if validation is weak.
- Accuracy alone can hide poor performance on rare classes.
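The last trap is easy to reproduce. The sketch below uses made-up labels (5 positives in 100 examples) and a trivial classifier that always predicts the majority class; the helper names are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of true positives that the classifier actually found."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0

# Made-up imbalanced data: 5 rare positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
# A "classifier" that ignores its input and always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(acc, rec)
```

Accuracy comes out at 0.95 while recall on the rare class is 0.0, so per-class metrics are needed alongside accuracy when classes are imbalanced.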
Regression
Regression maps an input to a continuous value. The output is not a class but a number. The cheat sheet emphasises that regression can model a wide variety of relationships, not only straight lines, and is suitable whenever the target is continuous.
Advantages:
- Natural for quantities: price, size, weight, time, distance, physical fields, brightness, or risk.
- Can be linear, polynomial, non-parametric, tree-based, kernel-based, or probabilistic.
Disadvantages / traps:
- Outliers can strongly affect least-squares fits and their predictions, because squared error weights large deviations heavily.
- A linearity assumption can be too restrictive.
- Patterns in the residuals (for example a curve or a trend against a feature) often reveal structure the model has missed.
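As a small illustration of the linearity and residual points, the sketch below fits a straight line by ordinary least squares to data that is actually quadratic. The data and the `fit_line` helper are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative data: the true relationship is y = x^2, not a line.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x ** 2 for x in xs]

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(slope, intercept, residuals)
```

The fitted line is flat (slope 0, intercept 2) and the residuals are 2, -1, -2, -1, 2: a clear U shape rather than random scatter, which suggests a missing squared term.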
Model families in the cheat sheet
- Support vector machines: find a separating or fitting hyperplane, with the classification version maximising margin.
- K-nearest neighbours: predict from the nearest training examples; classification votes, regression averages.
- Decision trees: recursively split feature space; leaves hold class labels or average target values.
- Random forests: combine many trees trained on randomised data/features; classification votes, regression averages.
- Gradient boosting: add weak models sequentially so each new model corrects previous errors/residuals.
- Lasso: adds an $L_1$ penalty, often shrinking unhelpful coefficients to exactly zero.
- Ridge: adds an $L_2$ penalty, shrinking coefficients smoothly toward zero.
- Logistic regression: binary/multiclass classification via probabilities.
- Linear discriminant analysis: finds linear feature combinations that separate classes.
- Naive Bayes: uses Bayes’ theorem with simplifying conditional-independence assumptions.
- Linear regression: best-fitting straight-line/linear relationship.
- Polynomial regression: fits curved relationships by using powers of the input as features.
- Gaussian process regression: probabilistic regression that gives predictions with uncertainty intervals.
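Several of these families share the "classification votes, regression averages" pattern. A minimal K-nearest-neighbours sketch makes it concrete; it assumes Euclidean distance and a tiny hand-made dataset, and the function names are illustrative rather than taken from any library.

```python
import math
from collections import Counter

def nearest(train_X, query, k):
    """Indices of the k training points closest to the query (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    return order[:k]

def knn_classify(train_X, train_y, query, k=3):
    """Classification: majority vote among the k nearest labels."""
    votes = Counter(train_y[i] for i in nearest(train_X, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Regression: average of the k nearest target values."""
    idx = nearest(train_X, query, k)
    return sum(train_y[i] for i in idx) / k

# Hand-made data: two well-separated clusters.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = ["A", "A", "A", "B", "B", "B"]        # categorical targets
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]     # continuous targets

print(knn_classify(X, labels, [0.5, 0.5]))   # vote among the first cluster
print(knn_regress(X, values, [0.5, 0.5]))    # mean of 1.0, 2.0, 3.0
```

The neighbour search is identical in both cases; only the final aggregation step (vote versus mean) changes with the task type.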
Generalisation
Generalisation means doing well on unseen data. Overfitting means training error is low but validation/test error is high. Underfitting means the model is too simple to capture the structure even on training data. The practical loop is: choose a baseline, split the data honestly, train, tune on validation data, inspect errors, then report final performance on a held-out test set.
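The practical loop can be sketched end to end on synthetic data. Everything specific here is an assumption for illustration: the data are a noisy sine curve, the model is a one-dimensional K-nearest-neighbours regressor, and the only hyperparameter tuned is `k`.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the split is reproducible

# Synthetic data (assumed for illustration): y = sin(x) + small Gaussian noise.
xs = [6.0 * i / 99 for i in range(100)]
data = [(x, math.sin(x) + rng.gauss(0.0, 0.1)) for x in xs]

# Honest split: shuffle once, then carve out train / validation / test.
rng.shuffle(data)
train, val, test = data[:60], data[60:80], data[80:]

def knn_predict(train_set, x, k):
    """Average the targets of the k training points nearest to x."""
    neighbours = sorted(train_set, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in neighbours) / k

def mse(predict, dataset):
    """Mean squared error of a prediction function on a dataset."""
    return sum((predict(x) - y) ** 2 for x, y in dataset) / len(dataset)

# 1. Baseline: always predict the training mean.
train_mean = sum(y for _, y in train) / len(train)
baseline_val_mse = mse(lambda x: train_mean, val)

# 2. Tune k using validation data only.
candidates = [1, 3, 5, 15]
val_scores = {k: mse(lambda x, k=k: knn_predict(train, x, k), val) for k in candidates}
best_k = min(val_scores, key=val_scores.get)

# 3. Inspect errors: k=1 memorises the training set, so its training error is zero.
train_mse_k1 = mse(lambda x: knn_predict(train, x, 1), train)

# 4. Report once on the held-out test set, after all choices are fixed.
test_mse = mse(lambda x: knn_predict(train, x, best_k), test)
print(best_k, baseline_val_mse, val_scores, train_mse_k1, test_mse)
```

With `k=1` the training error is exactly zero while the validation error is not, which is why training error cannot be used to choose between models.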
Regularisation
Regularisation adds a preference for simpler models. Lasso and Ridge are the canonical cheat-sheet examples: Lasso penalises absolute coefficient size and can do feature selection; Ridge penalises squared coefficient size and reduces sensitivity to noise.
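The difference between the two penalties is easiest to see in a simplified special case. The sketch below assumes an orthonormal design matrix, where both penalised solutions have closed forms in terms of the unpenalised least-squares coefficients: for the objective with penalty $\tfrac{\lambda}{2}\sum_j w_j^2$ Ridge rescales each coefficient, and for the objective with penalty $\lambda\sum_j |w_j|$ Lasso soft-thresholds it. The coefficient values and `lam` are made up; general designs need an iterative solver instead.

```python
def ridge_shrink(w_ols, lam):
    """Ridge under an orthonormal design: scale every coefficient by 1 / (1 + lam)."""
    return [w / (1.0 + lam) for w in w_ols]

def soft_threshold(w, lam):
    """Move w toward zero by lam, and clamp to exactly zero if |w| <= lam."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

def lasso_shrink(w_ols, lam):
    """Lasso under an orthonormal design: soft-threshold every coefficient."""
    return [soft_threshold(w, lam) for w in w_ols]

# Hypothetical unpenalised coefficients: two large, two small.
w_ols = [3.0, 0.4, -0.2, -2.0]
lam = 0.5

ridge = ridge_shrink(w_ols, lam)
lasso = lasso_shrink(w_ols, lam)
print(ridge)
print(lasso)
```

Ridge leaves every coefficient nonzero but smaller, whereas Lasso sets the two small coefficients to exactly zero, which is the feature-selection behaviour described above.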
Probabilistic view
Many losses are negative log-likelihoods (see Stats Equations and definitions). Squared error corresponds to Gaussian noise; cross-entropy corresponds to categorical likelihoods; Naive Bayes explicitly uses Bayes’ theorem; Gaussian process regression directly models uncertainty.
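The first two correspondences can be written out explicitly. The notation below is assumed for illustration: $f$ is the model's prediction function, $\sigma^2$ a fixed noise variance, $y_{ik}$ a one-hot label over $K$ classes, and $p_{ik}$ the predicted probability of class $k$ for example $i$.

```latex
% Regression: assume y_i = f(x_i) + \varepsilon_i with \varepsilon_i \sim \mathcal{N}(0, \sigma^2), i.i.d.
\begin{align*}
-\log p(y_1, \dots, y_n \mid x_1, \dots, x_n)
  &= \sum_{i=1}^{n} \left[ \frac{(y_i - f(x_i))^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2) \right] \\
  &= \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \frac{n}{2}\log(2\pi\sigma^2)
\end{align*}

% Classification: assume independent categorical labels with predicted probabilities p_{ik}.
\begin{align*}
-\log p(y_1, \dots, y_n \mid x_1, \dots, x_n)
  &= -\sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log p_{ik}
\end{align*}
```

With $\sigma^2$ held fixed, the second term of the Gaussian expression does not depend on $f$, so minimising the negative log-likelihood is the same as minimising the sum of squared errors. The categorical expression is the cross-entropy loss summed over examples.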