ML Equations and definitions

Dataset and prediction

A supervised dataset is

\mathcal{D}=\{(x_i,y_i)\}_{i=1}^n.

A model predicts

\hat{y}_i=f_\theta(x_i)

with parameters \theta. The residual/error for regression is often

e_i=y_i-\hat{y}_i.

Empirical risk minimisation

Most supervised ML can be written as

\hat{\theta}=\arg\min_\theta \frac{1}{n}\sum_{i=1}^n \ell(f_\theta(x_i),y_i)+\lambda\Omega(\theta),

where \ell is the loss, \Omega is an optional regularisation penalty, and \lambda\ge 0 sets the penalty strength.
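As a rough illustration of the objective above, the sketch below evaluates a regularised empirical risk for a tiny linear model. The helper names (squared_loss, l2_penalty, linear_model, empirical_risk), the choice of squared loss and squared-L2 penalty, and the toy data are all assumptions made for this example.

```python
def squared_loss(y_hat, y):
    """Per-example loss ell(y_hat, y); squared error is one common choice."""
    return (y_hat - y) ** 2


def l2_penalty(theta):
    """Penalty Omega(theta); here the sum of squared parameters."""
    return sum(t * t for t in theta)


def linear_model(theta, x):
    """Hypothetical model f_theta(x) = w * x + b with theta = (w, b)."""
    w, b = theta
    return w * x + b


def empirical_risk(theta, xs, ys, lam=0.0):
    """Average loss over the data plus lam times the penalty."""
    n = len(xs)
    data_term = sum(squared_loss(linear_model(theta, x), y)
                    for x, y in zip(xs, ys)) / n
    return data_term + lam * l2_penalty(theta)


# Toy data generated exactly by y = 2x + 1.
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]

risk_fit = empirical_risk((2.0, 1.0), xs, ys)           # data term only
risk_reg = empirical_risk((2.0, 1.0), xs, ys, lam=0.1)  # adds 0.1 * (4 + 1)
```

Minimising this quantity over theta, by any optimiser, is what "empirical risk minimisation" refers to; the sketch only evaluates it.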

Regression loss

Mean squared error is

\operatorname{MSE}=\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2.

If the targets carry independent Gaussian noise of constant variance, minimising MSE is equivalent to maximising the likelihood under a Normal distribution. Mean absolute error is

\operatorname{MAE}=\frac{1}{n}\sum_{i=1}^n |y_i-\hat{y}_i|,

which is usually less sensitive to large outliers than MSE.
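To make the outlier remark concrete, here is a small sketch that computes both losses on made-up predictions, once with uniformly small errors and once with a single large error. The function names and numbers are illustrative assumptions.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0, 4.0]
y_clean = [1.5, 2.5, 3.5, 4.5]     # every prediction is off by 0.5
y_outlier = [1.5, 2.5, 3.5, 14.0]  # the last prediction is off by 10

mse_clean, mae_clean = mse(y_true, y_clean), mae(y_true, y_clean)
mse_out, mae_out = mse(y_true, y_outlier), mae(y_true, y_outlier)

# How much each metric grows when the single outlier is introduced.
mse_growth = mse_out / mse_clean
mae_growth = mae_out / mae_clean
```

On these numbers the squared error of the outlier dominates the MSE, while the MAE grows far less, because each error enters MAE linearly rather than quadratically.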

Classification loss

Binary logistic regression predicts

p(y=1\mid x)=\sigma(w^Tx+b)=\frac{1}{1+e^{-(w^Tx+b)}}.

Writing p_i=p(y=1\mid x_i), binary cross-entropy is

\operatorname{BCE}=-\frac{1}{n}\sum_{i=1}^n \left[y_i\log p_i+(1-y_i)\log(1-p_i)\right].

For many classes, softmax converts class scores to probabilities:

p_k=\frac{e^{z_k}}{\sum_j e^{z_j}}.
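The three formulas in this section can be sketched in a few lines of standard-library Python. Two details are assumptions added for numerical safety and are not part of the definitions: probabilities are clipped by a small eps before taking logs, and softmax subtracts the largest score before exponentiating (which leaves the result unchanged).

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def binary_cross_entropy(ys, ps, eps=1e-12):
    """Average negative log-likelihood of labels ys (0 or 1) under ps."""
    total = 0.0
    for y, p in zip(ys, ps):
        p = min(max(p, eps), 1.0 - eps)  # assumption: clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(ys)


def softmax(zs):
    """Convert class scores to probabilities that sum to 1."""
    m = max(zs)  # subtracting the max does not change the result
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]


p_half = sigmoid(0.0)
probs = softmax([1.0, 2.0, 3.0])
big = softmax([1000.0, 1000.0])  # would overflow without the max shift
bce_uninformed = binary_cross_entropy([1, 0], [0.5, 0.5])  # equals log 2
```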

Linear and polynomial regression

Linear regression uses

\hat{y}=\beta_0+\beta_1x_1+\cdots+\beta_dx_d.

For one input, polynomial regression augments the feature vector:

\hat{y}=\beta_0+\beta_1x+\beta_2x^2+\cdots+\beta_px^p.

It is still linear in the parameters \beta_j, even though it is nonlinear in x.
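Because the model is linear in its coefficients, ordinary least squares applies once the input is expanded into powers. The sketch below does this with the normal equations and a small hand-written Gaussian elimination so that it needs no external packages. The function names and toy data are assumptions, and for high degrees the normal equations become ill-conditioned, so a QR-based solver from a numerical library would normally be preferred.

```python
def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [beta_0, ..., beta_p] for one input."""
    # Augmented feature vector [1, x, x^2, ..., x^p] for every sample.
    Phi = [[x ** j for j in range(degree + 1)] for x in xs]
    k = degree + 1
    gram = [[sum(row[r] * row[c] for row in Phi) for c in range(k)]
            for r in range(k)]
    rhs = [sum(row[r] * y for row, y in zip(Phi, ys)) for r in range(k)]
    return solve_linear_system(gram, rhs)


# Noise-free data from y = 1 + 2x + 3x^2, so the fit should recover it.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1.0 + 2.0 * x + 3.0 * x ** 2 for x in xs]
beta = fit_polynomial(xs, ys, degree=2)
```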

SVM margin idea

A linear classifier uses a hyperplane

w^Tx+b=0.

For labels y_i\in\{-1,+1\}, the hard-margin SVM tries to maximise the margin while satisfying

y_i(w^Tx_i+b)\ge 1 \quad \text{for all } i.

The geometric margin is proportional to 1/\|w\|. Support vectors are the training points that sit on the margin boundaries (or inside them, in the soft-margin variant) and therefore determine the fitted boundary.
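The sketch below does not train an SVM. It only checks, for a hand-picked w and b, the standard hard-margin constraint y_i(w^T x_i + b) >= 1, reports the margin width 2/||w||, and lists the points that meet the constraint with equality. The function names, the tolerance and the four toy points are assumptions for illustration.

```python
import math


def decision_value(w, b, x):
    """Signed score w^T x + b of a point x."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def margin_report(w, b, X, y, tol=1e-9):
    """Check the hard-margin constraints for a given (w, b)."""
    functional = [yi * decision_value(w, b, xi) for xi, yi in zip(X, y)]
    norm_w = math.sqrt(sum(wj * wj for wj in w))
    return {
        # True when every point satisfies y_i (w^T x_i + b) >= 1.
        "feasible": all(m >= 1.0 - tol for m in functional),
        # Distance between the two margin boundaries.
        "margin_width": 2.0 / norm_w,
        # Points on (or, if infeasible, inside) the margin boundaries.
        "support_vectors": [i for i, m in enumerate(functional)
                            if m <= 1.0 + tol],
    }


X = [(2.0, 0.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 2.0)]
y = [1, 1, -1, -1]
w, b = (1.0, 0.0), -1.0  # hand-picked hyperplane x_1 = 1, not a fitted one

report = margin_report(w, b, X, y)
```

Scaling w and b by a common factor leaves the hyperplane unchanged but alters the constraint values, which is why the margin is tied to 1/||w||.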

For regression, support vector regression fits a function f(x) with a tube of half-width \varepsilon around it; errors that fall inside the tube are not penalised.

K-nearest neighbours

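KNN uses a distance d(x,x'), often Euclidean distance, to find the set N_k(x) of the k training points nearest to x. For classification,

\hat{y}=\arg\max_c \sum_{i\in N_k(x)} \mathbf{1}[y_i=c].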

For regression,

\hat{y}=\frac{1}{k}\sum_{i\in N_k(x)} y_i.
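A brute-force version of both KNN rules fits in a few lines: majority vote among the k nearest training points for classification, and their average target for regression. The function names and the six-point toy dataset are assumptions. Distance ties are broken by training order here, which is one arbitrary choice among several.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def nearest_indices(X_train, x, k, distance=euclidean):
    """Indices of the k training points closest to x (the set N_k(x))."""
    order = sorted(range(len(X_train)),
                   key=lambda i: distance(X_train[i], x))
    return order[:k]


def knn_classify(X_train, y_train, x, k):
    """Majority class among the k nearest neighbours."""
    votes = Counter(y_train[i] for i in nearest_indices(X_train, x, k))
    return votes.most_common(1)[0][0]


def knn_regress(X_train, y_train, x, k):
    """Mean target among the k nearest neighbours."""
    idx = nearest_indices(X_train, x, k)
    return sum(y_train[i] for i in idx) / k


X_train = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
labels = ["a", "a", "a", "b", "b", "b"]
targets = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

label_near_origin = knn_classify(X_train, labels, (0.5, 0.5), k=3)
value_near_origin = knn_regress(X_train, targets, (0.5, 0.5), k=3)
label_far = knn_classify(X_train, labels, (5.2, 5.1), k=3)
```

Sorting all training points costs O(n log n) per query; tree or hashing indexes are the usual remedy for large datasets.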

Decision trees and ensembles

A decision tree recursively partitions feature space. A classification leaf predicts the majority class or class proportions. A regression leaf predicts the average target value in that region.

A random forest averages many decorrelated trees:

\hat{f}_{\text{forest}}(x)=\frac{1}{B}\sum_{b=1}^B T_b(x)

for regression, and uses voting for classification.

Gradient boosting builds an additive model:

F_M(x)=\sum_{m=1}^M \alpha_m h_m(x),

where each weak learner h_m is fitted to reduce the errors/residuals left by the learners before it, and \alpha_m is its weight.
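For squared-error loss, "fitting to the previous residuals" can be sketched with one-split regression trees (stumps) on a single input. This is a simplified illustration under several assumptions: every weight alpha_m equals a constant learning rate, the model starts from the mean target as a constant term, and the names fit_stump and gradient_boost are invented for this example.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on 1-D inputs under squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for t in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.5):
    """Additive model: mean target plus a sum of scaled stumps."""
    base = sum(ys) / len(ys)
    learners = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        h = fit_stump(xs, residuals)  # weak learner fitted to residuals
        learners.append(h)
        preds = [p + learning_rate * h(x) for p, x in zip(preds, xs)]

    def F(x):
        return base + learning_rate * sum(h(x) for h in learners)

    return F


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [x ** 2 for x in xs]
F = gradient_boost(xs, ys)

mean_y = sum(ys) / len(ys)
mse_base = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_boost = sum((y - F(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

The comparison is on training data only, so it shows the error-reduction mechanism rather than generalisation.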

Lasso and Ridge

Lasso regression adds an L_1 penalty:

\hat{\beta}=\arg\min_\beta \frac{1}{n}\sum_{i=1}^n (y_i-\hat{y}_i)^2+\lambda\sum_j |\beta_j|.

This can shrink coefficients exactly to zero, so it performs feature selection.

Ridge regression adds an L_2 penalty:

\hat{\beta}=\arg\min_\beta \frac{1}{n}\sum_{i=1}^n (y_i-\hat{y}_i)^2+\lambda\sum_j \beta_j^2.

This shrinks coefficients toward zero without usually making them exactly zero, improving robustness when features are noisy or correlated.
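The difference between the two penalties is easiest to see in a one-coefficient special case. The sketch assumes a single feature with no intercept, scaled so that (1/n)\sum_i x_i^2 = 1, and the objectives (1/n)\sum_i (y_i-\beta x_i)^2 + \lambda|\beta| for lasso and (1/n)\sum_i (y_i-\beta x_i)^2 + \lambda\beta^2 for ridge. Under those assumptions both minimisers have closed forms in terms of the unpenalised least-squares coefficient, here called beta_ols. The function names are invented for this example.

```python
def ridge_1d(beta_ols, lam):
    """Ridge solution in the 1-D case: proportional shrinkage."""
    return beta_ols / (1.0 + lam)


def lasso_1d(beta_ols, lam):
    """Lasso solution in the 1-D case: soft-thresholding by lam / 2."""
    shrunk = max(abs(beta_ols) - lam / 2.0, 0.0)
    return shrunk if beta_ols >= 0 else -shrunk


# A weak coefficient: lasso removes it, ridge only shrinks it.
weak_ridge = ridge_1d(0.3, lam=1.0)
weak_lasso = lasso_1d(0.3, lam=1.0)

# A strong coefficient: both shrink it, neither removes it.
strong_ridge = ridge_1d(2.0, lam=1.0)
strong_lasso = lasso_1d(2.0, lam=1.0)
```

With several correlated features no such closed form exists for lasso, and iterative solvers such as coordinate descent are used instead.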

Naive Bayes

Bayes’ rule is

P(A\mid B)=\frac{P(B\mid A)P(A)}{P(B)}.

For classification,

\hat{y}=\arg\max_c P(c)\prod_j P(x_j\mid c),

where the “naive” assumption is conditional independence of features given the class.
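A minimal categorical Naive Bayes classifier can be sketched by counting. Two choices below are assumptions that go beyond the formula above: probabilities are combined as a sum of logs rather than a product, which avoids underflow, and add-one (Laplace) smoothing with parameter alpha keeps unseen feature values from giving zero probability. The function names and the tiny dataset are made up.

```python
import math
from collections import Counter, defaultdict


def fit_naive_bayes(X, y, alpha=1.0):
    """Count classes and per-class feature values."""
    class_counts = Counter(y)
    value_counts = defaultdict(Counter)  # (class, feature index) -> counts
    vocab = [set() for _ in range(len(X[0]))]
    for xi, c in zip(X, y):
        for j, v in enumerate(xi):
            value_counts[(c, j)][v] += 1
            vocab[j].add(v)
    return class_counts, value_counts, vocab, alpha


def predict_naive_bayes(model, x):
    """Return argmax_c of log P(c) + sum_j log P(x_j | c)."""
    class_counts, value_counts, vocab, alpha = model
    n = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c, nc in class_counts.items():
        score = math.log(nc / n)
        for j, v in enumerate(x):
            # Smoothed estimate of P(x_j = v | c).
            num = value_counts[(c, j)][v] + alpha
            den = nc + alpha * len(vocab[j])
            score += math.log(num / den)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
     ("rainy", "cool"), ("overcast", "hot")]
y = ["no", "no", "yes", "yes", "yes"]

model = fit_naive_bayes(X, y)
pred_rainy_cool = predict_naive_bayes(model, ("rainy", "cool"))
pred_sunny_hot = predict_naive_bayes(model, ("sunny", "hot"))
```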

Linear discriminant analysis

LDA finds linear combinations of features that separate classes. In the common Gaussian-class view, classes have different means but share a covariance matrix, giving linear decision boundaries.

Gaussian process regression

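Gaussian process regression places a prior over functions:

f(x)\sim\mathcal{GP}\big(m(x),k(x,x')\big),

with mean function m and covariance (kernel) function k.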

Predictions are distributions, not just point estimates, so the output includes a mean prediction and uncertainty band such as a 95% confidence/credible interval.
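The sketch below computes a Gaussian process posterior mean and variance at a single test input. It assumes a zero prior mean, a squared-exponential (RBF) kernel with unit variance, scalar inputs, and a small observation-noise variance; none of these choices come from the text above. The linear solver is a plain Gaussian elimination to keep the example free of external packages, whereas practical implementations use a Cholesky factorisation.

```python
import math


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def gp_posterior(x_train, y_train, x_star, noise_var=1e-6):
    """Posterior mean and variance of f(x_star) under a zero-mean GP."""
    n = len(x_train)
    K = [[rbf_kernel(x_train[i], x_train[j])
          + (noise_var if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    k_star = [rbf_kernel(x, x_star) for x in x_train]
    mean = sum(k * a for k, a in zip(k_star, solve(K, y_train)))
    var = rbf_kernel(x_star, x_star) - sum(
        k * v for k, v in zip(k_star, solve(K, k_star)))
    return mean, max(var, 0.0)  # clamp tiny negative rounding errors


x_train = [0.0, 1.0, 2.0]
y_train = [0.0, 1.0, 0.5]

mean_near, var_near = gp_posterior(x_train, y_train, 1.0)
mean_far, var_far = gp_posterior(x_train, y_train, 10.0)

# Approximate 95% band for the latent function at the far point.
band_far = (mean_far - 1.96 * math.sqrt(var_far),
            mean_far + 1.96 * math.sqrt(var_far))
```

Near a training input the variance collapses toward the noise level, while far from the data it returns to the prior variance, which is what widens the uncertainty band.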

Optimisation

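Gradient descent updates parameters by

\theta_{t+1}=\theta_t-\eta\nabla_\theta J(\theta_t),

where J is the objective being minimised (for example the regularised empirical risk) and \eta is the learning rate. Stochastic or mini-batch gradient descent estimates the gradient using a subset of the data.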
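As a worked illustration, the sketch below runs full-batch gradient descent on the mean squared error of a one-input linear model. The update is the usual "parameters minus learning rate times gradient" step. The function names, the starting point, the learning rate of 0.1 and the step count are assumptions chosen so that this toy problem converges.

```python
def gradient(theta, xs, ys):
    """Gradient of the MSE of y_hat = w * x + b with respect to (w, b)."""
    w, b = theta
    n = len(xs)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return gw, gb


def gradient_descent(xs, ys, learning_rate=0.1, n_steps=500):
    """Repeat theta <- theta - learning_rate * gradient(theta)."""
    theta = (0.0, 0.0)
    for _ in range(n_steps):
        gw, gb = gradient(theta, xs, ys)
        theta = (theta[0] - learning_rate * gw,
                 theta[1] - learning_rate * gb)
    return theta


# Noise-free data from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = gradient_descent(xs, ys)
```

A mini-batch variant would compute the same sums over a random subset of the data at each step; it usually needs a smaller learning rate because the gradient estimate is noisier.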