ML Equations and definitions

Dataset and prediction

A supervised dataset is

D = \{(x_{i}, y_{i})\}_{i = 1}^{n}.

A model predicts

\hat{y}_{i} = f_{\theta}(x_{i}),

with parameters $\theta$. The residual (error) for regression is often

r_{i} = y_{i} - \hat{y}_{i}.

Empirical risk minimisation

Most supervised ML can be written as

\hat{\theta}=\arg\min_{\theta} \frac{1}{n}\sum_{i=1}^n \ell(f_\theta(x_i),y_i)+\lambda\Omega(\theta),

where $\ell$ is the loss, $\Omega$ is an optional regularisation penalty, and $\lambda \geq 0$ controls its strength.
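As a minimal sketch of the objective above, the snippet below evaluates the regularised empirical risk for a hypothetical one-parameter model $f_\theta(x) = \theta x$ with squared loss and a squared penalty, then picks the minimiser from a coarse grid. The model, the toy data, and the grid search are illustrative assumptions, not part of the original note.

```python
def empirical_risk(theta, data, loss, penalty, lam=0.0):
    """Average loss over the dataset plus lam * penalty(theta)."""
    avg_loss = sum(loss(theta * x, y) for x, y in data) / len(data)
    return avg_loss + lam * penalty(theta)

def squared_loss(y_hat, y):
    return (y_hat - y) ** 2

def l2_penalty(theta):
    return theta ** 2

# Toy data generated by y = 2x (assumed for illustration).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
grid = [i / 100 for i in range(301)]  # candidate theta values in [0, 3]

# Unregularised minimiser versus the minimiser with lam = 1.
theta_plain = min(grid, key=lambda t: empirical_risk(t, data, squared_loss, l2_penalty))
theta_reg = min(grid, key=lambda t: empirical_risk(t, data, squared_loss, l2_penalty, lam=1.0))
print(theta_plain, theta_reg)
```

The penalty pulls the regularised estimate below the unregularised one, which is the trade-off that $\lambda$ controls.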

Regression loss

Mean squared error is

\operatorname{MSE}=\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2.

If the noise is Gaussian with constant variance, minimising MSE is equivalent to maximising the likelihood under a Normal distribution. Mean absolute error is

\operatorname{MAE}=\frac{1}{n}\sum_{i=1}^n |y_i-\hat{y}_i|,

which is usually less sensitive to large outliers than MSE.
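To illustrate the outlier sensitivity, the sketch below computes both losses on a small made-up set of targets and predictions, then again after one target is replaced by an outlier. The numbers are assumptions chosen only for illustration.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

y_pred = [1.5, 2.0, 2.5, 4.0]
y_clean = [1.0, 2.0, 3.0, 4.0]
y_outlier = [1.0, 2.0, 3.0, 14.0]  # last target is an outlier

# How much each loss grows when the outlier is introduced.
mse_growth = mse(y_outlier, y_pred) / mse(y_clean, y_pred)
mae_growth = mae(y_outlier, y_pred) / mae(y_clean, y_pred)
print(mse_growth, mae_growth)
```

Because the residual is squared in MSE, the single large error dominates the average far more than it does in MAE.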

Classification loss

Binary logistic regression predicts

p(y=1\mid x)=\sigma(w^Tx+b)=\frac{1}{1+e^{-(w^Tx+b)}}.

With $p_i = p(y=1\mid x_i)$ and labels $y_i \in \{0, 1\}$, binary cross-entropy is

-\frac{1}{n}\sum_{i=1}^n \left[y_i\log p_i+(1-y_i)\log(1-p_i)\right].
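A minimal sketch of the logistic prediction and the binary cross-entropy follows. The weights, bias, and data are hypothetical. The clipping constant `eps` is an added safeguard against `log(0)`, not part of the formula.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(w, b, x):
    """p(y = 1 | x) for a linear score w^T x + b."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def binary_cross_entropy(ys, ps, eps=1e-12):
    """Mean negative log-likelihood of labels ys in {0, 1} under probabilities ps."""
    total = 0.0
    for y, p in zip(ys, ps):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep the logs finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(ys)

w, b = [1.0, -1.0], 0.0  # hypothetical parameters
X = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
y = [1, 0, 1]
p = [predict_proba(w, b, x) for x in X]
loss = binary_cross_entropy(y, p)
print(p, loss)
```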

For many classes, softmax converts class scores $z_{k}$ to probabilities:

p_k=\frac{e^{z_k}}{\sum_j e^{z_j}}.
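The softmax formula can be sketched as below. Subtracting the largest score before exponentiating is a common numerical-stability step added here. It leaves the result unchanged because the common factor cancels in the ratio.

```python
import math

def softmax(z):
    """Convert a list of class scores into probabilities that sum to 1."""
    m = max(z)                           # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])         # example scores, chosen arbitrarily
print(probs)
```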

Linear and polynomial regression

Linear regression uses

\hat{y} = w^{T} x + b.

For one input, polynomial regression augments the feature vector:

\hat{y}=\beta_0+\beta_1x+\beta_2x^2+\cdots+\beta_px^p.

It is still linear in the parameters $\beta_j$, even though it is nonlinear in $x$.
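Because the model is linear in the coefficients, fitting it reduces to ordinary least squares on the features $1, x, \ldots, x^p$. The sketch below does this through the normal equations with a small hand-written Gaussian elimination, chosen only to keep the example dependency-free. In practice a linear algebra library would normally be used, and the helper names are hypothetical.

```python
def poly_features(x, p):
    """Feature vector [1, x, x^2, ..., x^p]."""
    return [x ** j for j in range(p + 1)]

def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):                  # back substitution
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z

def fit_polynomial(xs, ys, p):
    """Least-squares coefficients via the normal equations (Phi^T Phi) beta = Phi^T y."""
    Phi = [poly_features(x, p) for x in xs]
    k = p + 1
    A = [[sum(row[a] * row[c] for row in Phi) for c in range(k)] for a in range(k)]
    b = [sum(row[a] * y for row, y in zip(Phi, ys)) for a in range(k)]
    return solve_linear(A, b)

# Noise-free toy data from y = 1 + 2x + 3x^2 (assumed for illustration).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1 + 2 * x + 3 * x * x for x in xs]
beta = fit_polynomial(xs, ys, 2)
print(beta)
```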

SVM margin idea

A linear classifier uses a hyperplane

w^{T} x + b = 0.

For labels $y_{i} \in \{-1, +1\}$, the hard-margin SVM tries to maximise the margin while satisfying

y_{i} (w^{T} x_{i} + b) \geq 1.

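The geometric margin is proportional to $1/\Vert w \Vert$, so maximising the margin is equivalent to minimising $\Vert w \Vert$. Support vectors are the training points that sit on the margin boundaries (or inside them, in the soft-margin variant) and therefore determine the fitted boundary.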
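The sketch below does not train an SVM. It only checks the hard-margin constraints for a hand-picked candidate hyperplane and reports the resulting margin width and the points lying on the margin boundaries. The data, `w`, and `b` are assumptions for illustration.

```python
import math

def functional_margins(w, b, X, y):
    """y_i * (w^T x_i + b) for every training point."""
    return [yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi, yi in zip(X, y)]

w, b = [1.0, 0.0], 0.0  # hand-picked candidate hyperplane (assumption)
X = [[1.0, 0.0], [3.0, 1.0], [-1.0, 0.0], [-2.0, 2.0]]
y = [+1, +1, -1, -1]

margins = functional_margins(w, b, X, y)
feasible = all(m >= 1.0 for m in margins)          # hard-margin constraints hold?
norm_w = math.sqrt(sum(wj * wj for wj in w))
margin_width = 2.0 / norm_w                        # distance between the two margin boundaries
support = [i for i, m in enumerate(margins) if abs(m - 1.0) < 1e-9]
print(feasible, margin_width, support)
```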

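For regression, support vector regression fits a function while tolerating errors inside a tube of half-width $\epsilon$ around the prediction: residuals smaller than $\epsilon$ incur no loss, and larger ones are penalised linearly.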
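The tube idea corresponds to the $\epsilon$-insensitive loss used in standard support vector regression, sketched here. The value of `eps` and the sample numbers are arbitrary choices for illustration.

```python
def epsilon_insensitive(y, y_hat, eps=0.5):
    """Zero inside the tube |y - y_hat| <= eps, linear outside it."""
    return max(0.0, abs(y - y_hat) - eps)

# One residual inside the tube and one outside it.
inside = epsilon_insensitive(1.0, 1.3)
outside = epsilon_insensitive(1.0, 3.0)
print(inside, outside)
```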

K-nearest neighbours

KNN uses a distance $d(x, x_{i})$, often Euclidean distance. For classification,

\hat{y} = \operatorname{mode}\{y_{i} : x_{i} \text{ among the } k \text{ nearest neighbours}\}.

For regression, with $N_{k}(x)$ the index set of the $k$ nearest neighbours of $x$,

\hat{y}=\frac{1}{k}\sum_{i\in N_k(x)} y_i.
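Both KNN rules can be sketched with a brute-force neighbour search, assuming Euclidean distance and a small made-up dataset. Ties in the vote are broken by whichever label was counted first, which is a choice made here for simplicity.

```python
import math
from collections import Counter

def k_nearest(X, query, k):
    """Indices of the k training points closest to query (Euclidean distance)."""
    order = sorted(range(len(X)), key=lambda i: math.dist(X[i], query))
    return order[:k]

def knn_classify(X, labels, query, k):
    """Most common label among the k nearest neighbours."""
    votes = Counter(labels[i] for i in k_nearest(X, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(X, targets, query, k):
    """Mean target over the k nearest neighbours."""
    idx = k_nearest(X, query, k)
    return sum(targets[i] for i in idx) / k

X = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
labels = ["a", "a", "a", "b", "b", "b"]
targets = [1.0, 1.2, 0.8, 5.0, 5.5, 4.5]

query = (0.2, 0.1)
print(knn_classify(X, labels, query, k=3), knn_regress(X, targets, query, k=3))
```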

Decision trees and ensembles

A decision tree recursively partitions feature space. A classification leaf predicts the majority class or class proportions. A regression leaf predicts the average target value in that region.

A random forest averages many decorrelated trees:

\hat{f}_{\text{forest}}(x)=\frac{1}{B}\sum_{b=1}^B T_b(x)

for regression, and uses voting for classification.

Gradient boosting builds an additive model:

F_M(x)=\sum_{m=1}^M \alpha_m h_m(x),

where each weak learner $h_{m}$ is fitted to the residuals (more generally, the negative gradient of the loss) left by the previous model $F_{m-1}$, and $\alpha_{m}$ is its weight.
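A rough sketch of the boosting loop for squared-error regression on one-dimensional inputs follows. Several choices are assumptions made for illustration: the weak learners are single-split stumps, every learner gets the same weight `alpha` (a learning rate), and the model starts from the mean of the targets rather than from zero.

```python
def fit_stump(xs, residuals):
    """Best single-split stump under squared error (needs at least two distinct x values)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds; both sides stay non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_rounds=50, alpha=0.1):
    """Additive model: initial mean plus alpha times the sum of fitted stumps."""
    f0 = sum(ys) / len(ys)
    learners = []
    preds = [f0] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # errors of the current model
        h = fit_stump(xs, residuals)
        learners.append(h)
        preds = [p + alpha * h(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + alpha * sum(h(x) for h in learners)

def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

# Step-like toy data (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.0, 1.2, 0.9, 3.0, 3.1, 2.9, 3.0]

model = gradient_boost(xs, ys)
baseline = mse(ys, [sum(ys) / len(ys)] * len(ys))  # constant-mean predictor
boosted = mse(ys, [model(x) for x in xs])
print(baseline, boosted)
```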

Lasso and Ridge

Lasso regression adds an $L^{1}$ penalty:

J_{\text{lasso}}(w) = \sum_{i} (y_{i} - w^{T} x_{i} - b)^{2} + \lambda \Vert w \Vert_{1}.

This can shrink coefficients exactly to zero, so it performs feature selection.

Ridge regression adds an $L^{2}$ penalty:

J_{\text{ridge}}(w) = \sum_{i} (y_{i} - w^{T} x_{i} - b)^{2} + \lambda \Vert w \Vert_{2}^{2}.

This shrinks coefficients toward zero without usually making them exactly zero, improving robustness when features are noisy or correlated.
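The different shrinkage behaviour is easiest to see in a simplified setting. Assuming a single feature and no intercept ($b = 0$), both objectives have closed-form minimisers: ridge rescales the least-squares solution, while lasso applies a soft threshold that can return exactly zero. The data and $\lambda$ values below are made up.

```python
def soft_threshold(a, t):
    """Shrink a toward zero by t, returning exactly 0 when |a| <= t."""
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0

def ridge_1d(xs, ys, lam):
    """Minimiser of sum (y - w x)^2 + lam * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def lasso_1d(xs, ys, lam):
    """Minimiser of sum (y - w x)^2 + lam * |w|."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, lam / 2.0) / sxx

xs = [1.0, 2.0, 3.0]
ys = [0.1, 0.2, 0.3]  # weak signal: least-squares slope is 0.1

for lam in (0.0, 1.0, 4.0):
    print(lam, ridge_1d(xs, ys, lam), lasso_1d(xs, ys, lam))
```

With the largest penalty, the lasso coefficient is exactly zero while the ridge coefficient is merely smaller.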

Naive Bayes

Bayes’ rule is

P(A\mid B)=\frac{P(B\mid A)P(A)}{P(B)}.

For classification,

\hat{y}=\arg\max_c P(c)\prod_j P(x_j\mid c),

where the “naive” assumption is conditional independence of features given the class.
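The decision rule can be sketched for binary features as below. Two things are additions for practicality rather than part of the formula: the product is evaluated as a sum of logs to avoid underflow, and add-one (Laplace) smoothing keeps estimated probabilities away from 0 and 1. The tiny dataset and its labels are invented.

```python
import math

def fit_nb(X, y, alpha=1.0):
    """Estimate P(c) and P(x_j = 1 | c) for binary features, with add-alpha smoothing."""
    model = {}
    n_features = len(X[0])
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        probs = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                 for j in range(n_features)]
        model[c] = (prior, probs)
    return model

def predict_nb(model, x):
    """argmax over classes of log P(c) + sum_j log P(x_j | c)."""
    def log_score(c):
        prior, probs = model[c]
        score = math.log(prior)
        for xj, pj in zip(x, probs):
            score += math.log(pj if xj == 1 else 1.0 - pj)
        return score
    return max(model, key=log_score)

X = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1], [0, 1, 1], [0, 0, 0]]
y = ["spam", "spam", "spam", "ham", "ham", "ham"]

model = fit_nb(X, y)
print(predict_nb(model, [1, 1, 0]), predict_nb(model, [0, 0, 1]))
```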

Linear discriminant analysis

LDA finds linear combinations of features that separate classes. In the common Gaussian-class view, classes have different means but share a covariance matrix, giving linear decision boundaries.

Gaussian process regression

Gaussian process regression places a prior over functions:

f(x) \sim \mathcal{GP}(m(x), k(x, x')).

Predictions are distributions, not just point estimates, so the output includes a mean prediction and an uncertainty band such as a 95% credible interval.
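A small sketch of the predictive mean and variance at one test input follows. It assumes a zero prior mean, a squared-exponential (RBF) kernel with unit length scale, and a tiny noise term, none of which are specified in the note. The linear solver is hand-written only to avoid dependencies.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential kernel with unit prior variance."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z

def gp_posterior(x_train, y_train, x_star, noise=1e-6):
    """Posterior mean and variance of f(x_star) under a zero-mean GP prior."""
    n = len(x_train)
    K = [[rbf(x_train[i], x_train[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    k_star = [rbf(x_star, xi) for xi in x_train]
    weights = solve_linear(K, y_train)   # K^{-1} y
    v = solve_linear(K, k_star)          # K^{-1} k_*
    mean = sum(k * w for k, w in zip(k_star, weights))
    var = rbf(x_star, x_star) - sum(k * w for k, w in zip(k_star, v))
    return mean, max(var, 0.0)

x_train = [-1.0, 0.0, 1.0]
y_train = [math.sin(x) for x in x_train]  # toy observations

mean_near, var_near = gp_posterior(x_train, y_train, 1.0)
mean_far, var_far = gp_posterior(x_train, y_train, 10.0)
band_far = (mean_far - 1.96 * math.sqrt(var_far), mean_far + 1.96 * math.sqrt(var_far))
print(mean_near, var_near, mean_far, var_far, band_far)
```

Near the training data the variance collapses toward the noise level, while far from it the prediction reverts to the prior mean and variance.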

Optimisation

Gradient descent updates parameters by

\theta_{t+1} = \theta_{t} - \eta \nabla_{\theta} J(\theta_{t}),

where $\eta$ is the learning rate. Stochastic or mini-batch gradient descent estimates the gradient using a subset of the data.
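The update rule can be sketched on a toy problem: minimising the mean squared error of a hypothetical one-parameter model $f_\theta(x) = \theta x$. Passing `batch_size` switches from full-batch to mini-batch updates. The learning rate, step counts, seed, and data are assumptions chosen so that the iteration is stable.

```python
import random

def grad(theta, batch):
    """Gradient of the mean squared error of f(x) = theta * x on a batch."""
    return sum(2.0 * x * (theta * x - y) for x, y in batch) / len(batch)

def gradient_descent(data, eta=0.05, steps=100, batch_size=None, seed=0):
    """Full-batch descent by default; mini-batch when batch_size is given."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        batch = data if batch_size is None else rng.sample(data, batch_size)
        theta -= eta * grad(theta, batch)  # theta_{t+1} = theta_t - eta * gradient
    return theta

# Noise-free data from y = 2x (assumed for illustration).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

theta_full = gradient_descent(data)
theta_mini = gradient_descent(data, steps=200, batch_size=2)
print(theta_full, theta_mini)
```

On noisy data the mini-batch estimate would fluctuate around the minimiser rather than settle exactly. It converges here only because every subset of this toy data agrees on the same optimum.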
