Implementation of GP from Scratch
Mainly adapted from [Machine Learning: An Algorithmic Perspective, 2014]
Regression
The full Python code is here
It's desirable to let the optimization process search over different models as well as over the parameters of the model. We can generalize the idea of a probability distribution to a stochastic process, which is simply a collection of random variables put together: instead of a set of parameters that specify a probability distribution (such as the mean and covariance matrix for a multivariate Gaussian), we have a set of functions and a distribution over that set of functions. In practice, GPs act as smoothers, fitting a smooth curve through a set of datapoints.
The basic procedure: compute the covariance matrix of the training data, the covariances between the training and test data, and the covariance of the test data alone; then compute the mean and covariance of the posterior distribution and sample from it.
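As a minimal, self-contained sketch of that procedure (the sq_exp_kernel helper, the toy sine data, and the jitter terms below are illustrative choices, not the book's code):

import numpy as np

def sq_exp_kernel(a, b, lengthscale=1.0, signal_var=1.0):
    # squared exponential covariance between two column vectors of 1-D inputs
    sqdist = (a - b.T)**2
    return signal_var * np.exp(-0.5 * sqdist / lengthscale**2)

# training data (x, t) and test inputs xstar
x = np.linspace(-3, 3, 8).reshape(-1, 1)
t = np.sin(x)
xstar = np.linspace(-3, 3, 50).reshape(-1, 1)

K = sq_exp_kernel(x, x) + 1e-6*np.eye(len(x))   # covariance of the training data (plus jitter)
Ks = sq_exp_kernel(x, xstar)                    # covariances between training and test data
Kss = sq_exp_kernel(xstar, xstar)               # covariance of the test data alone

# mean and covariance of the posterior distribution
Kinv = np.linalg.inv(K)
f = np.dot(Ks.T, np.dot(Kinv, t))
V = Kss - np.dot(Ks.T, np.dot(Kinv, Ks))

# draw one sample function from the posterior
sample = np.random.multivariate_normal(f.ravel(), V + 1e-8*np.eye(len(xstar)))

The explicit matrix inverse is used here only for readability; the Cholesky-based solves described next are the numerically stable way to do the same computation.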
Cholesky decomposition
[Machine Learning: An Algorithmic Perspective, 2014], p402
We use np.linalg to decompose a real-valued, symmetric, positive definite matrix $K$ into the product $LL^T$, where $L$ is a lower triangular matrix that only has non-zero entries on and below the leading diagonal. It follows that $K^{-1} = L^{-T}L^{-1}$.
To solve $Kx = t$, we first use forward substitution to find the $z$ that solves $Lz = t$. Then we use back-substitution to find the $x$ that solves $L^T x = z$.
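For example, a small sketch of such a solve using scipy.linalg.solve_triangular (the 2x2 matrix and right-hand side here are just toy values):

import numpy as np
from scipy.linalg import solve_triangular

K = np.array([[4.0, 2.0], [2.0, 3.0]])    # symmetric, positive definite
t = np.array([1.0, 2.0])

L = np.linalg.cholesky(K)                 # K = L L^T, L lower triangular
z = solve_triangular(L, t, lower=True)    # forward substitution: L z = t
x = solve_triangular(L.T, z, lower=False) # back-substitution:    L^T x = z

assert np.allclose(np.dot(K, x), t)       # x solves K x = t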
[Gaussian Processes for Machine Learning, p37, GP pseudo-code]
The mean, f, and covariance, V, can be computed as:
L = np.linalg.cholesky(k)
# `beta` is `alpha` in [GP for ML] book
# `t` is `y` in [GP for ML] book
beta = np.linalg.solve(L.transpose(), np.linalg.solve(L,t))
kstar = kernel(data, xstar, theta, wantderiv=False, measnoise=0)
f = np.dot(kstar.transpose(), beta)   # posterior mean at the test points
v = np.linalg.solve(L, kstar)
V = kernel(xstar, xstar, theta, wantderiv=False, measnoise=0) - np.dot(v.transpose(), v)   # posterior covariance
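With f and V in hand, a predictive sample at the test points can be drawn just as before (a sketch; the jitter term is an extra numerical-stability precaution, not part of the book's code):

# draw one function sample at the test points xstar
jitter = 1e-8 * np.eye(V.shape[0])
fstar = np.random.multivariate_normal(f.ravel(), V + jitter)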
Learning the hyperparameters
The squared exponential covariance function, $k(x, x') = \sigma_f^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right) + \sigma_n^2\,\delta_{xx'}$, has three hyperparameters that need to be selected: the signal variance $\sigma_f^2$, which controls the overall variance of the function; the noise variance $\sigma_n^2$, Gaussian noise added to the hidden function values; and the length-scale $\ell$, which changes the degree of smoothing, trading it off against how well the curve matches the training data.
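A sketch of what a kernel with this interface and these three hyperparameters might look like, assuming (as is common) that theta stores the log of each hyperparameter so they stay positive during optimization; the book's own kernel may lay theta out differently, and the derivative computation triggered by wantderiv=True is omitted here:

import numpy as np

def kernel(data, otherdata, theta, wantderiv=False, measnoise=1.0):
    # data and otherdata are (n, d) and (m, d) arrays of inputs
    # theta holds the log hyperparameters: [log sigma_f^2, log l, log sigma_n^2]
    theta = np.squeeze(theta)
    signal_var = np.exp(theta[0])    # sigma_f^2
    lengthscale = np.exp(theta[1])   # l
    noise_var = np.exp(theta[2])     # sigma_n^2

    # pairwise squared distances between the two sets of inputs
    sqdist = ((data[:, None, :] - otherdata[None, :, :])**2).sum(axis=2)

    k = signal_var * np.exp(-0.5 * sqdist / lengthscale**2)
    # noise is only added on the diagonal, i.e. where the two input sets coincide
    k += measnoise * noise_var * np.eye(np.shape(data)[0], np.shape(otherdata)[0])
    return k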
If the set of hyperparameters is labelled $\theta$, then the ideal solution to this problem would be to set up some kind of prior distribution over the hyperparameters and then integrate them out in order to maximize the probability of the output targets:

$P(\mathbf{t}^* \mid \mathbf{x}, \mathbf{t}, \mathbf{x}^*) = \int P(\mathbf{t}^* \mid \mathbf{x}, \mathbf{t}, \mathbf{x}^*, \theta)\, P(\theta \mid \mathbf{x}, \mathbf{t})\, d\theta$

In practice this integral is intractable, so the hyperparameters are instead chosen by maximizing the (log) marginal likelihood of the training targets with respect to $\theta$, which is what the conjugate-gradient optimization below does.
[Machine Learning: An Algorithmic Perspective], p404
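A sketch of what the logPosterior objective can look like: it evaluates the log marginal likelihood $\log P(\mathbf{t} \mid X, \theta) = -\tfrac{1}{2}\mathbf{t}^T K^{-1}\mathbf{t} - \tfrac{1}{2}\log|K| - \tfrac{n}{2}\log 2\pi$ and negates it so that a minimizer can be applied (the book's implementation, and its gradient gradLogPosterior, which also needs the kernel derivatives with respect to $\theta$, may differ in details):

def logPosterior(theta, args):
    # negative log marginal likelihood of the targets under the GP
    data, t = args
    k = kernel(data, data, theta, wantderiv=False)
    L = np.linalg.cholesky(k)
    beta = np.linalg.solve(L.transpose(), np.linalg.solve(L, t))
    logp = (-0.5 * np.dot(t.transpose(), beta)
            - np.sum(np.log(np.diag(L)))
            - 0.5 * np.shape(data)[0] * np.log(2*np.pi))
    return -float(np.squeeze(logp))   # scalar, suitable for fmin_cg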
import scipy.optimize as so
result = so.fmin_cg(logPosterior, theta, fprime=gradLogPosterior, args=[(X,y)],
                    gtol=1e-4, maxiter=5, disp=1)
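With these settings fmin_cg runs at most five conjugate-gradient iterations and stops early once the gradient norm drops below gtol; the value it returns, result, is the optimized hyperparameter vector $\theta$, which is then used in the kernel when making predictions.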
Classification
[Machine Learning: An Algorithmic Perspective] is sometimes slack about notation. Where the notation is unclear, refer to [Gaussian Processes for Machine Learning].
To squash the output $a$ from a regression GP into the range $(0, 1)$, so that it can be interpreted as a class probability, we pass it through the logistic function $\sigma(\cdot)$: $P(t^* = 1 \mid a) = \sigma(a) = 1/(1 + e^{-a})$.
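A minimal sketch of this squashing step (the logistic helper and the toy latent values are illustrative):

import numpy as np

def logistic(a):
    # squash a latent GP output into a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

a = np.array([-2.0, 0.0, 3.0])   # latent outputs from the regression GP
probs = logistic(a)              # class probabilities P(t* = 1 | a)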