ML Equations and definitions
Dataset and prediction
A supervised dataset is
$$\mathcal{D}=\{(x_i,y_i)\}_{i=1}^n.$$
A model predicts
$$\hat{y}=f_\theta(x)$$
with parameters $\theta$. The residual (error) for regression is often
$$e_i=y_i-\hat{y}_i.$$
Empirical risk minimisation
Most supervised ML can be written as
$$\hat{\theta}=\arg\min_\theta \frac{1}{n}\sum_{i=1}^n \ell(f_\theta(x_i),y_i)+\lambda\Omega(\theta),$$
where $\ell$ is the loss and $\Omega$ is an optional regularisation penalty.
Regression loss
Mean squared error is
$$\operatorname{MSE}=\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2.$$
With independent Gaussian noise of constant variance, minimising MSE is equivalent to maximising the likelihood under a Normal distribution. Mean absolute error is
$$\operatorname{MAE}=\frac{1}{n}\sum_{i=1}^n |y_i-\hat{y}_i|,$$
which is usually less sensitive to large outliers than MSE.
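As a minimal sketch of the two regression losses, the snippet below computes MSE and MAE on a small made-up example in which one prediction is far off. The function names and the data values are illustrative assumptions, not part of the notes.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute residuals."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values (assumed): three perfect predictions, one off by 10.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 14.0]

print(mse(y_true, y_pred))  # 25.0  (the single error is squared: 100 / 4)
print(mae(y_true, y_pred))  # 2.5   (the single error enters linearly: 10 / 4)
```

The single outlier contributes 100 to the MSE sum but only 10 to the MAE sum, which is the sensitivity difference described above.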
Classification loss
Binary logistic regression predicts
$$p(y=1\mid x)=\sigma(w^Tx+b)=\frac{1}{1+e^{-(w^Tx+b)}}.$$
Binary cross-entropy is
$$-\frac{1}{n}\sum_{i=1}^n \left[y_i\log p_i+(1-y_i)\log(1-p_i)\right].$$
For many classes, softmax converts class scores to probabilities:
$$p_k=\frac{e^{z_k}}{\sum_j e^{z_j}}.$$
Linear and polynomial regression
Linear regression uses
$$\hat{y}=\beta_0+\beta_1x_1+\cdots+\beta_dx_d.$$
For one input, polynomial regression augments the feature vector:
$$\hat{y}=\beta_0+\beta_1x+\beta_2x^2+\cdots+\beta_px^p.$$
It is still linear in the parameters $\beta_j$, even though it is nonlinear in $x$.
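The point that polynomial regression stays linear in its parameters can be sketched as a feature map followed by a dot product. The helper names `poly_features` and `predict`, and the coefficient values, are hypothetical and chosen only for illustration.

```python
def poly_features(x, p):
    """Map a scalar x to the augmented feature vector [1, x, x^2, ..., x^p]."""
    return [x ** j for j in range(p + 1)]


def predict(beta, x):
    """Prediction is a dot product of beta with the features: linear in beta."""
    phi = poly_features(x, len(beta) - 1)
    return sum(b * f for b, f in zip(beta, phi))


beta = [1.0, -2.0, 0.5]  # assumed coefficients beta_0, beta_1, beta_2

print(predict(beta, 2.0))  # 1 - 2*2 + 0.5*4 = -1.0

# Linearity in the parameters: doubling every coefficient doubles the output,
# whereas doubling x does not.
doubled = [2 * b for b in beta]
print(predict(doubled, 2.0))  # -2.0
```

Because the model is a dot product with fixed features, any fitting method for ordinary linear regression can be reused once the inputs are augmented.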
SVM margin idea
A linear classifier uses a hyperplane
$$w^Tx+b=0.$$
For labels $y_i\in\{-1,+1\}$, the hard-margin SVM tries to maximise the margin while satisfying
$$y_i(w^Tx_i+b)\ge 1\quad\text{for all } i.$$
The geometric margin is proportional to $1/\lVert w\rVert$. Support vectors are the training points that sit on the margin boundaries (or inside them, once the soft-margin variant allows slack) and therefore determine the fitted boundary.
For regression, support vector regression fits a function $f(x)$ while allowing an $\varepsilon$-wide tube within which errors are not penalised.
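To make the margin constraint concrete, here is a small sketch that checks the hard-margin condition for a given hyperplane, lists the points lying on the margin boundary, and computes the distance from the hyperplane to that boundary. The toy data and the hyperplane $(w, b)$ are assumptions for illustration; nothing here fits an SVM.

```python
import math


def functional_margins(w, b, X, y):
    """Return y_i * (w^T x_i + b) for every training point."""
    return [
        yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        for xi, yi in zip(X, y)
    ]


# Toy 2-D data with labels in {-1, +1} (assumed values).
X = [[2.0, 0.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 2.0]]
y = [1, 1, -1, -1]

# Candidate hyperplane x_1 = 1, i.e. w = (1, 0), b = -1 (assumed, not fitted).
w, b = [1.0, 0.0], -1.0

margins = functional_margins(w, b, X, y)
feasible = all(m >= 1.0 for m in margins)  # hard-margin constraints hold?

# Points whose constraint is tight sit exactly on the margin boundary.
support = [i for i, m in enumerate(margins) if abs(m - 1.0) < 1e-9]

# Distance from the hyperplane to each margin boundary is 1 / ||w||.
half_width = 1.0 / math.sqrt(sum(wj * wj for wj in w))

print(margins)     # [1.0, 2.0, 1.0, 2.0]
print(feasible)    # True
print(support)     # [0, 2]
print(half_width)  # 1.0
```

Shrinking $\lVert w\rVert$ while keeping every margin at least 1 widens the gap between the classes, which is why the optimisation is usually written as minimising $\lVert w\rVert$ subject to the constraints.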
K-nearest neighbours
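KNN uses a distance $d(x,x')$, often Euclidean distance. For classification,
$$\hat{y}=\arg\max_c \sum_{i\in N_k(x)} \mathbf{1}[y_i=c],$$
where $N_k(x)$ indexes the $k$ training points nearest to $x$. For regression,
$$\hat{y}=\frac{1}{k}\sum_{i\in N_k(x)} y_i.$$
Decision trees and ensembles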
A decision tree recursively partitions feature space. A classification leaf predicts the majority class or class proportions. A regression leaf predicts the average target value in that region.
A random forest averages many decorrelated trees:
$$\hat{f}_{\text{forest}}(x)=\frac{1}{B}\sum_{b=1}^B T_b(x)$$
for regression, and uses voting for classification.
Gradient boosting builds an additive model:
$$F_M(x)=\sum_{m=1}^M \alpha_m h_m(x),$$
where each weak learner $h_m$ is fitted to reduce the previous errors/residuals.
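The additive structure of boosting can be sketched for squared-error regression on a single input, using depth-1 trees (stumps) as weak learners and a constant step size in place of per-round weights. All names (`fit_stump`, `boost`, `training_mse`), the data, and the choice of a fixed `alpha` are assumptions for illustration, not a reference implementation.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals by least squares."""
    best = None
    # Candidate thresholds: every distinct x except the largest,
    # so that both sides of the split are non-empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left)
        sse += sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean


def boost(xs, ys, n_rounds=20, alpha=0.5):
    """Build F_M(x) = sum_m alpha * h_m(x), fitting each h_m to the residuals."""
    learners = []
    preds = [0.0] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        h = fit_stump(xs, residuals)
        learners.append(h)
        preds = [p + alpha * h(x) for p, x in zip(preds, xs)]
    return lambda x: sum(alpha * h(x) for h in learners)


def training_mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Assumed toy data: y = x^2 on six points.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]

model_1 = boost(xs, ys, n_rounds=1)
model_20 = boost(xs, ys, n_rounds=20)
print(training_mse(model_1, xs, ys), training_mse(model_20, xs, ys))
```

With squared loss the residuals are the negative gradient of the loss, so fitting each stump to them is one way to realise "reduce the previous errors"; other losses would use a different pseudo-residual.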
Lasso and Ridge
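Lasso regression adds an $L_1$ penalty:
$$\hat{\beta}=\arg\min_\beta \frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2+\lambda\sum_j|\beta_j|.$$
This can shrink coefficients exactly to zero, so it performs feature selection.
Ridge regression adds an $L_2$ penalty:
$$\hat{\beta}=\arg\min_\beta \frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2+\lambda\sum_j\beta_j^2.$$
This shrinks coefficients toward zero, usually without making them exactly zero, which improves robustness when features are noisy or correlated.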
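The difference between the two penalties is easiest to see with a single feature and no intercept, where both problems have a closed form. The sketch below assumes the objective $\frac{1}{n}\sum_i (y_i-\beta x_i)^2$ plus $\lambda|\beta|$ (Lasso) or $\lambda\beta^2$ (Ridge); that scaling, the function names, and the data are assumptions made for this example.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_1d(xs, ys, lam):
    """Minimiser of (1/n) * sum (y - beta*x)^2 + lam * |beta|."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, n * lam / 2.0) / sxx


def ridge_1d(xs, ys, lam):
    """Minimiser of (1/n) * sum (y - beta*x)^2 + lam * beta^2."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)


# Assumed toy data with a weak true slope of about 0.1.
xs = [1.0, 2.0, 3.0]
ys = [0.1, 0.2, 0.3]

for lam in (0.0, 0.5, 1.0):
    print(lam, lasso_1d(xs, ys, lam), ridge_1d(xs, ys, lam))
```

At `lam = 1.0` the Lasso threshold exceeds the data term and the coefficient is exactly zero, while the Ridge coefficient is only scaled down and stays positive.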
Naive Bayes
Bayes’ rule is
$$P(A\mid B)=\frac{P(B\mid A)P(A)}{P(B)}.$$
For classification,
$$\hat{y}=\arg\max_c P(c)\prod_j P(x_j\mid c),$$
where the “naive” assumption is conditional independence of features given the class.
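A minimal sketch of the decision rule, assuming two classes and two binary features with made-up probability tables. The class names, feature names, and all numbers are hypothetical; in practice the tables would be estimated from training counts. The product is evaluated as a sum of logarithms, which leaves the arg max unchanged and avoids underflow when there are many features.

```python
import math

# Assumed class priors P(c).
priors = {"spam": 0.4, "ham": 0.6}

# Assumed P(x_j = 1 | c) for each binary feature (word present in message).
likelihoods = {
    "spam": {"offer": 0.8, "meeting": 0.1},
    "ham": {"offer": 0.2, "meeting": 0.7},
}


def predict(features):
    """Return arg max_c of log P(c) + sum_j log P(x_j | c).

    `features` maps each feature name to 1 (present) or 0 (absent).
    """
    scores = {}
    for c, prior in priors.items():
        log_score = math.log(prior)
        for name, present in features.items():
            p = likelihoods[c][name]
            log_score += math.log(p if present else 1.0 - p)
        scores[c] = log_score
    return max(scores, key=scores.get)


print(predict({"offer": 1, "meeting": 0}))  # spam: 0.4*0.8*0.9 vs 0.6*0.2*0.3
print(predict({"offer": 0, "meeting": 1}))  # ham:  0.4*0.2*0.1 vs 0.6*0.8*0.7
```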
Linear discriminant analysis
LDA finds linear combinations of features that separate classes. In the common Gaussian-class view, classes have different means but share a covariance matrix, giving linear decision boundaries.
Gaussian process regression
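Gaussian process regression places a prior over functions:
$$f(x)\sim\mathcal{GP}\big(m(x),k(x,x')\big),$$
with mean function $m$ and covariance (kernel) function $k$.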
Predictions are distributions, not just point estimates, so the output includes a mean prediction and uncertainty band such as a 95% confidence/credible interval.
Optimisation
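Gradient descent updates parameters by
$$\theta_{t+1}=\theta_t-\eta\nabla_\theta L(\theta_t),$$
where $\eta$ is the learning rate and $L$ is the objective being minimised. Stochastic or mini-batch gradient descent estimates the gradient using a subset of the data.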
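The update rule can be sketched for a one-input linear model trained with MSE, using the full dataset at every step. The function name, learning rate, step count, and data are assumptions picked so that the example converges; a mini-batch variant would compute the same gradients on a randomly sampled subset of the points instead.

```python
def gradient_descent(xs, ys, eta=0.05, n_steps=2000):
    """Fit y ~ w*x + b by full-batch gradient descent on the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        # Prediction errors under the current parameters.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of (1/n) * sum(error^2) with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient, scaled by the learning rate.
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b


# Assumed noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = gradient_descent(xs, ys)
print(round(w, 6), round(b, 6))  # 2.0 1.0
```

Too large a learning rate makes the iterates diverge and too small a one makes progress slow, so `eta` typically needs tuning for each problem.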