Implementation of GP from Scratch

Mainly adapted from [Machine Learning: An Algorithmic Perspective, 2014]

Regression

The full Python code is here

It's desirable to let the optimization process search over different models as well as over the parameters of a model. We can generalize the idea of a probability distribution to a stochastic process, which is simply a collection of random variables put together: instead of a set of parameters that specify a probability distribution (such as the mean and covariance matrix of a multivariate Gaussian), we have a set of functions and a distribution over that set of functions. Seen another way, GPs are just smoothers, fitting a smooth curve through a set of datapoints.

The basic procedure: compute the covariance matrix of the training data, the covariances between the training and test data, and the covariance of the test data alone. Then compute the mean and covariance of the posterior distribution and sample from it.
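A minimal NumPy sketch of this procedure (the squared exponential kernel, the toy X, y, Xstar arrays, and the jitter terms are illustrative assumptions, not taken from the book):

import numpy as np

def sq_exp_kernel(a, b, lengthscale=1.0, signal_var=1.0):
    # squared exponential covariance between the rows of a and b
    sqdist = np.sum(a**2, 1).reshape(-1, 1) + np.sum(b**2, 1) - 2 * np.dot(a, b.T)
    return signal_var * np.exp(-0.5 * sqdist / lengthscale**2)

X = np.linspace(-3, 3, 10).reshape(-1, 1)      # training inputs
y = np.sin(X).ravel()                          # training targets
Xstar = np.linspace(-3, 3, 50).reshape(-1, 1)  # test inputs

K = sq_exp_kernel(X, X) + 1e-6 * np.eye(len(X))  # training covariance (+ jitter)
Kstar = sq_exp_kernel(X, Xstar)                  # train-test covariances
Kss = sq_exp_kernel(Xstar, Xstar)                # test covariance

# posterior mean and covariance
mean = Kstar.T.dot(np.linalg.solve(K, y))
cov = Kss - Kstar.T.dot(np.linalg.solve(K, Kstar))

# draw one sample function from the posterior
cov = 0.5 * (cov + cov.T) + 1e-6 * np.eye(len(Xstar))  # symmetrize + jitter for stable sampling
sample = np.random.multivariate_normal(mean, cov)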

Cholesky decomposition

[Machine Learning: An Algorithmic Perspective, 2014], p402

We use np.linalg to decompose a real-valued, symmetric, positive definite matrix $K$ into the product $K = LL^T$, where $L$ is a lower triangular matrix that only has non-zero entries on and below the leading diagonal. It follows that $K^{-1} = L^{-T}L^{-1}$.

To solve $Kx = t$, we first use forward substitution to find the $z$ that solves $Lz = t$. Then we use back-substitution to find the $x$ that solves $L^T x = z$.
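A small numerical sketch of these two substitution steps (the matrix and right-hand side below are made up for illustration):

import numpy as np
from scipy.linalg import solve_triangular

K = np.array([[4.0, 2.0], [2.0, 3.0]])    # symmetric, positive definite
t = np.array([1.0, 2.0])

L = np.linalg.cholesky(K)                  # K = L @ L.T, L lower triangular
z = solve_triangular(L, t, lower=True)     # forward substitution: L z = t
x = solve_triangular(L.T, z, lower=False)  # back-substitution: L.T x = z

assert np.allclose(K.dot(x), t)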

[Gaussian Processes for Machine Learning, p37, GP pseudo-code]

The mean, f, and covariance, V can be computed as

import numpy as np

# k is the covariance matrix of the training data (including measurement noise)
k = kernel(data, data, theta, wantderiv=False)
L = np.linalg.cholesky(k)
# `beta` is `alpha` in [GP for ML] book
# `t` is `y` in [GP for ML] book
beta = np.linalg.solve(L.transpose(), np.linalg.solve(L, t))
kstar = kernel(data, xstar, theta, wantderiv=False, measnoise=0)
# posterior mean at the test points
f = np.dot(kstar.transpose(), beta)
v = np.linalg.solve(L, kstar)
# posterior covariance: note the subtraction happens outside the kernel call
V = kernel(xstar, xstar, theta, wantderiv=False, measnoise=0) - np.dot(v.transpose(), v)
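The snippet assumes a kernel(data1, data2, theta, wantderiv, measnoise) helper in the style of [Machine Learning: An Algorithmic Perspective]. A minimal squared exponential version (the log-hyperparameter ordering is an assumption here, and the derivative branch used later for gradient-based learning is omitted) might look like this:

import numpy as np

def kernel(data1, data2, theta, wantderiv=True, measnoise=1.0):
    # theta holds log hyperparameters: [log signal variance, log length-scale, log noise variance]
    # (this ordering is assumed; the wantderiv branch is omitted from this sketch)
    theta = np.squeeze(np.asarray(theta))
    sigma_f2 = np.exp(theta[0])
    lengthscale = np.exp(theta[1])
    sigma_n2 = np.exp(theta[2])

    data1 = np.asarray(data1, dtype=float)
    data2 = np.asarray(data2, dtype=float)
    # squared Euclidean distances between all pairs of rows
    sqdist = (np.sum(data1**2, axis=1)[:, None]
              + np.sum(data2**2, axis=1)[None, :]
              - 2.0 * np.dot(data1, data2.T))
    k = sigma_f2 * np.exp(-0.5 * sqdist / lengthscale**2)
    # measurement noise is added on the diagonal (only meaningful for a square covariance)
    return k + measnoise * sigma_n2 * np.eye(k.shape[0], k.shape[1])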

Learning the hyperparameters

The squared exponential covariance function has three hyperparameters that need to be selected: $\sigma_f^2$ is the signal variance, which controls the overall variance of the function; $\sigma_n^2$ is the variance of the Gaussian noise added to the latent function values; and $\ell$ is the length-scale, which changes the degree of smoothing, trading it off against how well the curve matches the training data.
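In symbols (the notation follows [Gaussian Processes for Machine Learning]; the exact parameterization in the book's code may differ):

$$ k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\!\left( -\frac{\lVert \mathbf{x} - \mathbf{x}' \rVert^2}{2\ell^2} \right) + \sigma_n^2\, \delta_{\mathbf{x},\mathbf{x}'} $$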

If the set of hyperparameters is labelled $\theta$, then the ideal solution to this problem would be to set up some kind of prior distribution over the hyperparameters and then integrate them out in order to maximize the probability of the output targets:
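Written out (this is the standard form of the expression, not a verbatim quote from the book):

$$ P(\mathbf{t}^* \mid \mathbf{x}^*, X, \mathbf{t}) = \int P(\mathbf{t}^* \mid \mathbf{x}^*, X, \mathbf{t}, \boldsymbol{\theta})\, P(\boldsymbol{\theta} \mid X, \mathbf{t})\, d\boldsymbol{\theta} $$

The integral is generally intractable, so in practice we maximize the log marginal likelihood (optionally plus a log prior over $\boldsymbol{\theta}$, giving the log posterior optimized below):

$$ \log P(\mathbf{t} \mid X, \boldsymbol{\theta}) = -\tfrac{1}{2}\,\mathbf{t}^{\top} K^{-1} \mathbf{t} - \tfrac{1}{2}\log |K| - \tfrac{n}{2}\log 2\pi $$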

[Machine Learning: An Algorithmic Perspective], p404

import scipy.optimize as so
result = so.fmin_cg(logPosterior, theta, fprime=gradLogPosterior,
                    args=[(X, y)], gtol=1e-4, maxiter=5, disp=1)
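fmin_cg also needs logPosterior and gradLogPosterior to be defined. A minimal logPosterior, the negative log marginal likelihood with a flat prior, matching the args=[(X, y)] calling convention above but otherwise a sketch rather than the book's exact code (the gradient counterpart is omitted):

import numpy as np

def logPosterior(theta, args):
    # args is the (X, y) tuple passed through fmin_cg's args parameter
    data, t = args
    t = np.squeeze(t)
    K = kernel(data, data, theta, wantderiv=False)
    L = np.linalg.cholesky(K)
    beta = np.linalg.solve(L.transpose(), np.linalg.solve(L, t))
    # negative log marginal likelihood (flat prior over theta assumed)
    logp = (-0.5 * np.dot(t, beta)
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(data) * np.log(2 * np.pi))
    return -logp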

Classification

[Machine Learning: An Algorithmic Perspective] is sometimes slack about notation. Where the notation is unclear, refer to [Gaussian Processes for Machine Learning].

To squash the output, $a$, from a regression GP into a class probability, we pass it through the logistic function $\sigma(a) = 1/(1 + e^{-a})$, so that $P(t = 1 \mid a) = \sigma(a)$ lies in $[0, 1]$.
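As a tiny sketch of the squashing step (reusing the posterior mean f computed in the regression code above; the 0.5 threshold is just for illustration):

import numpy as np

def sigmoid(a):
    # logistic squashing function, maps latent values to (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

p_class1 = sigmoid(f)                         # probability of class 1 at the test points
predicted_labels = (p_class1 > 0.5).astype(int)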

results matching ""

    No results matching ""