ML Key concepts

Supervised learning

Supervised learning fits a predictive model to inputs paired with known outputs. The training set contains pairs $(x_i, y_i)$: $x_i$ is the feature vector and $y_i$ is the target. The type of target determines the task type.

  • Classification: $y$ is a category, e.g. spam/not spam, image class, disease class, “purchase yes/no”.
  • Regression: $y$ is a continuous value, e.g. price, weight, temperature, flux, risk score, or a measured physical quantity.

The PDF’s central split is exactly this: classification learns from examples with known labels and predicts labels for new inputs; regression learns from examples with known continuous values and predicts continuous values for new inputs.

Classification

Classification maps an input to a categorical label or to probabilities over labels. It is useful when the desired output is discrete and interpretable, such as “Class A vs Class B” or “yes vs no”. Many classifiers output a score first, then a decision rule converts the score into a label.
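The score-then-decision-rule pattern can be sketched as follows. This is a minimal illustration, not a specific classifier: the scores are hypothetical outputs of some already-trained model, the sigmoid link is an assumption, and the names `sigmoid` and `decide` are made up for this example.

```python
import math

def sigmoid(score: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

def decide(score: float, threshold: float = 0.5) -> str:
    """Decision rule: turn a score into a label via a probability threshold."""
    return "Class A" if sigmoid(score) >= threshold else "Class B"

# Hypothetical scores from some already-trained classifier.
scores = [-2.0, 0.0, 3.0]
labels = [decide(s) for s in scores]
print(labels)
```

Raising `threshold` makes the rule more conservative about predicting "Class A"; the score itself does not change, only the decision.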

Advantages from the cheat sheet:

  • Results can be easy to interpret because each input receives a clear class.
  • It matches categorical-output problems such as spam detection, purchase prediction, image recognition, and medical diagnosis.

Disadvantages / traps:

  • Some simple classifiers assume the data are linearly separable; real data often are not.
  • Complex classifiers can overfit and generalise badly if validation is weak.
  • Accuracy alone can hide poor performance on rare classes.
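The last trap is easy to reproduce. The sketch below uses made-up labels (95 "healthy", 5 "disease") and a degenerate classifier that always predicts the majority class; `accuracy` and `recall` are small helper functions written for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive):
    """Fraction of true `positive` examples that were predicted as `positive`."""
    hits = [p == positive for t, p in zip(y_true, y_pred) if t == positive]
    return sum(hits) / len(hits) if hits else float("nan")

# Made-up imbalanced data: the rare class is only 5% of examples.
y_true = ["healthy"] * 95 + ["disease"] * 5
y_pred = ["healthy"] * 100  # always predict the majority class

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred, positive="disease")
print(f"accuracy={acc:.2f}, recall on rare class={rec:.2f}")
```

The accuracy looks high while every rare-class example is missed, which is why per-class metrics are worth reporting alongside accuracy.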

Regression

Regression maps an input to a continuous value. The output is not a class but a number. The cheat sheet emphasises that regression can model a wide variety of relationships, not only straight lines, and is suitable whenever the target is continuous.

Advantages:

  • Natural for quantities: price, size, weight, time, distance, physical fields, brightness, or risk.
  • Can be linear, polynomial, non-parametric, tree-based, kernel-based, or probabilistic.

Disadvantages / traps:

  • Outliers can strongly affect some fitted lines and predictions.
  • A linearity assumption can be too restrictive.
  • Residual patterns often reveal missing structure.
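As a small illustration of the outlier trap, the sketch below fits an ordinary least-squares line to five made-up points, then replaces one target with an outlier and refits. The data and the helper `fit_line` are assumptions for this example only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]           # lies exactly on y = 2x + 1
clean_slope, clean_intercept = fit_line(xs, ys)

ys_outlier = ys[:-1] + [30.0]            # one corrupted target
out_slope, out_intercept = fit_line(xs, ys_outlier)

# Residuals of the contaminated fit: observed minus predicted.
residuals = [y - (out_slope * x + out_intercept) for x, y in zip(xs, ys_outlier)]
print(clean_slope, out_slope, residuals)
```

A single corrupted point roughly triples the slope here, and the residuals of the contaminated fit are no longer small for the clean points, which is the kind of pattern worth inspecting.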

Model families in the cheat sheet

  • Support vector machines: find a separating or fitting hyperplane, with the classification version maximising margin.
  • K-nearest neighbours: predict from the nearest training examples; classification votes, regression averages.
  • Decision trees: recursively split feature space; leaves hold class labels or average target values.
  • Random forests: combine many trees trained on randomised data/features; classification votes, regression averages.
  • Gradient boosting: add weak models sequentially so each new model corrects previous errors/residuals.
  • Lasso: adds an $L_1$ penalty (absolute coefficient size), often shrinking unhelpful coefficients to exactly zero.
  • Ridge: adds an $L_2$ penalty (squared coefficient size), shrinking coefficients smoothly toward zero.
  • Logistic regression: models class probabilities for binary or multiclass classification.
  • Linear discriminant analysis: finds linear feature combinations that separate classes.
  • Naive Bayes: uses Bayes’ theorem with simplifying conditional-independence assumptions.
  • Linear regression: fits the best straight-line/linear relationship between features and target.
  • Polynomial regression: fits curved relationships by using powers of the input as features.
  • Gaussian process regression: probabilistic regression that gives predictions with uncertainty intervals.
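Several families above share one mechanism for both tasks. As one example, k-nearest neighbours can be sketched in a few lines: the same neighbour search feeds a majority vote for classification and an average for regression. The toy data, the squared Euclidean distance, and the function names are assumptions for illustration.

```python
from collections import Counter

def nearest(train, query, k):
    """Return the k training pairs (x, y) closest to `query` (squared Euclidean distance)."""
    def dist2(x):
        return sum((a - b) ** 2 for a, b in zip(x, query))
    return sorted(train, key=lambda pair: dist2(pair[0]))[:k]

def knn_classify(train, query, k=3):
    """Classification: majority vote among the k nearest labels."""
    votes = Counter(y for _, y in nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Regression: average of the k nearest target values."""
    neighbours = nearest(train, query, k)
    return sum(y for _, y in neighbours) / len(neighbours)

# Toy data, made up for this sketch.
labelled = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
            ((1.0, 1.0), "B"), ((0.9, 1.1), "B")]
valued = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((10.0,), 21.0)]

label = knn_classify(labelled, (0.15, 0.15))
value = knn_regress(valued, (1.1,))
print(label, value)
```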

Generalisation

Generalisation means doing well on unseen data. Overfitting means training error is low but validation/test error is high. Underfitting means the model is too simple to capture the structure even on training data. The practical loop is: choose a baseline, split the data honestly, train, tune on validation data, inspect errors, then report final performance on a held-out test set.
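The practical loop can be sketched end to end on synthetic data. Everything here is an assumption for illustration: the data come from a noisy quadratic, the model is a one-dimensional k-nearest-neighbour regressor standing in for "any model", and the 60/20/20 split and candidate values of `k` are arbitrary.

```python
import random

def split(data, seed=0, frac_train=0.6, frac_val=0.2):
    """Shuffle once, then cut into train / validation / test parts."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * frac_train)
    n_val = round(len(rows) * frac_val)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    neighbours = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in neighbours) / len(neighbours)

def mse(train, rows, k):
    """Mean squared error on `rows` of a k-NN model built from `train`."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in rows) / len(rows)

# Synthetic data: y = x^2 plus Gaussian noise.
rng = random.Random(42)
data = [(x / 10, (x / 10) ** 2 + rng.gauss(0, 0.5)) for x in range(100)]

train, val, test = split(data)

# k = len(train) averages every training target, so it acts as the mean baseline.
candidates = [1, 3, 5, 15, len(train)]
val_errors = {k: mse(train, val, k) for k in candidates}
best_k = min(val_errors, key=val_errors.get)   # tune on validation data only

train_error = mse(train, train, best_k)
test_error = mse(train, test, best_k)          # reported once, at the end
print(best_k, train_error, test_error)
```

With `k = 1` the training error is exactly zero because each training point is its own nearest neighbour, so low training error on its own says little; with `k = len(train)` the model ignores the input and underfits. The test set is touched only after `best_k` has been chosen.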

Regularisation

Regularisation adds a preference for simpler models. Lasso and Ridge are the canonical cheat-sheet examples: Lasso penalises absolute coefficient size and can do feature selection; Ridge penalises squared coefficient size and reduces sensitivity to noise.
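The difference between the two penalties is easiest to see in a special case. The sketch below assumes orthonormal features, so each coefficient can be shrunk independently of the others; under that assumption, Ridge with objective ½‖y − Xw‖² + (λ/2)‖w‖² rescales each least-squares coefficient, while Lasso with objective ½‖y − Xw‖² + λ‖w‖₁ soft-thresholds it. The starting coefficients and λ values are made up.

```python
def ridge_shrink(w_ols, lam):
    """Ridge under orthonormal features: scale every coefficient by 1 / (1 + lam)."""
    return [w / (1.0 + lam) for w in w_ols]

def lasso_shrink(w_ols, lam):
    """Lasso under orthonormal features: soft-threshold every coefficient by lam."""
    def soft(w):
        magnitude = max(abs(w) - lam, 0.0)
        if magnitude == 0.0:
            return 0.0            # small coefficients are set exactly to zero
        return magnitude if w > 0 else -magnitude
    return [soft(w) for w in w_ols]

# Hypothetical unregularised (least-squares) coefficients.
w_ols = [3.0, -0.4, 0.1]

ridge = ridge_shrink(w_ols, lam=1.0)
lasso = lasso_shrink(w_ols, lam=0.5)
print(ridge, lasso)
```

Ridge keeps every coefficient non-zero but smaller; Lasso removes the two small ones entirely, which is the feature-selection behaviour described above. With correlated features neither solution has this simple closed form.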

Probabilistic view

Many losses are negative log-likelihoods (see Stats Equations and definitions). Squared error corresponds to Gaussian noise; cross-entropy corresponds to categorical likelihoods; Naive Bayes explicitly uses Bayes’ theorem; Gaussian process regression directly models uncertainty.
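The first two correspondences can be written out briefly. The notation is an assumption introduced here, not taken from the cheat sheet: $f$ is the model's prediction function, $\sigma^2$ a fixed noise variance, and $p_k(x)$ the predicted probability of class $k$.

```latex
% Assumed regression model: y_i = f(x_i) + \varepsilon_i, with independent
% \varepsilon_i \sim \mathcal{N}(0, \sigma^2).
\[
-\log p(y_1, \dots, y_n \mid x_1, \dots, x_n)
  = \frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2
  + \frac{n}{2} \log\bigl(2\pi\sigma^2\bigr)
\]

% Assumed classification model: y_i \in \{1, \dots, K\} with
% P(y_i = k \mid x_i) = p_k(x_i), independent across i.
\[
-\log p(y_1, \dots, y_n \mid x_1, \dots, x_n)
  = -\sum_{i=1}^{n} \sum_{k=1}^{K} \mathbb{1}[y_i = k] \, \log p_k(x_i)
\]
```

For fixed $\sigma^2$ the second term of the Gaussian case does not depend on $f$, so minimising the negative log-likelihood is the same as minimising the sum of squared errors. The categorical case is the cross-entropy loss with one-hot targets.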