Gaussian Process Update Equations

When you start to learn about Gaussian processes, you come across the update equations fairly early on. The update equations are quite intimidating to look at, and are typically dismissed as trivial to derive (for example, Rasmussen and Williams simply point you towards a statistics book from the 1930s, which is neither available online nor in our university library...). I had a go at the derivation, and promptly realised it wasn't trivial at all from a cold start.

Fortuitously, at the PEN emulators workshop I recently attended, there was an introductory lecture from Jochen Voss, where he went through a univariate case, and then gave us the structure of the derivation for the full multivariate case. Luckily, this gave me the push I needed to try the derivation again, so I went away and filled in the gaps.

So here it is, in all its glory: the derivation of the update equations for Gaussian processes.

The overall endgame of Gaussian process regression is to write down a conditional distribution P(y_p | y_t) for a set of predicted outputs y_p given a set of training (observed) outputs y_t. By the product rule

 P(y_p|y_t) = \frac{P(y_p,y_t)}{P(y_t)}

Since the training data are fixed, P(y_t) is just a constant in this expression.

The underlying assumption in Gaussian process regression is that outputs are jointly Gaussian distributed, so that

 P(y_p|y_t) \propto P(y_p,y_t) \propto \exp\left[-\frac{1}{2} \left(\begin{array}{c} y_t \\ y_p \end{array}\right)^T\Sigma^{-1}\left(\begin{array}{c} y_t \\ y_p \end{array}\right)\right]

Where \Sigma is the joint covariance matrix. Remember that under the Gaussian process model we have trained a covariance function which computes the elements of \Sigma purely from the inputs, so the only unknowns in the expression above are the predicted outputs y_p that we're trying to find. We can define the covariance matrix blockwise

 \Sigma = \left(\begin{array}{cc} T & C^T \\ C & P \end{array}\right)

Where T is the covariance matrix computed using only the training inputs x_t, P is the covariance matrix computed using the prediction inputs x_p, and C contains the cross terms (i.e. the covariance \emph{between} y_t and y_p, computed using x_t and x_p). It is a well known result (it's in Numerical Recipes, or on Wikipedia) that you can invert such a matrix blockwise:

 \Sigma^{-1} = \left(\begin{array}{cc}T^{-1} + T^{-1}C^T M CT^{-1} & -T^{-1}C^TM \\ -MCT^{-1} & M\end{array}\right)

Where M = (P-CT^{-1}C^T)^{-1}, i.e. the inverse of the Schur complement of T in \Sigma.
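This identity is easy to sanity check numerically before trusting it. Here's a minimal sketch using NumPy (the random test matrix and block sizes are purely illustrative, they're not part of the derivation):

import numpy as np

rng = np.random.default_rng(0)

# Build a random symmetric, positive definite joint covariance matrix
n_t, n_p = 5, 3                       # sizes of the training and prediction blocks
A = rng.normal(size=(n_t + n_p, n_t + n_p))
Sigma = A @ A.T + (n_t + n_p) * np.eye(n_t + n_p)

# Pull out the blocks: T (training), P (prediction), C (cross terms)
T = Sigma[:n_t, :n_t]
P = Sigma[n_t:, n_t:]
C = Sigma[n_t:, :n_t]

# Assemble the blockwise inverse from the formula above
Tinv = np.linalg.inv(T)
M = np.linalg.inv(P - C @ Tinv @ C.T)
blockwise = np.block([
    [Tinv + Tinv @ C.T @ M @ C @ Tinv, -Tinv @ C.T @ M],
    [-M @ C @ Tinv,                    M],
])

# ...and check it against a direct inversion of the full matrix
assert np.allclose(blockwise, np.linalg.inv(Sigma))

With the inverse in hand, we can directly compute our Gaussian density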

 P(y_p|y_t) \propto \exp\left[-\frac{1}{2} y_t^T(T^{-1} + T^{-1}C^T M CT^{-1})y_t + \frac{1}{2}y_t^T (T^{-1}C^TM)y_p + \frac{1}{2}y_p^T (MCT^{-1})y_t - \frac{1}{2}y_p^TMy_p\right]

However, the only thing that isn't a constant here is y_p, so we can drop the terms involving only y_t; they simply get absorbed into the constant of proportionality, since we're only interested in the shape of the density.

 P(y_p|y_t) \propto \exp\left[\frac{1}{2}y_t^T (T^{-1}C^TM)y_p + \frac{1}{2}y_p^T (MCT^{-1})y_t - \frac{1}{2}y_p^TMy_p\right]

If we take the transpose of the middle term (it's a scalar, so transposing it changes nothing), we can group the terms together a bit more

 P(y_p|y_t) \propto \exp\left[\frac{1}{2}y_t^T (T^{-1}C^TM + (MCT^{-1})^T)y_p - \frac{1}{2}y_p^TMy_p\right]

Now, in general, a multivariate Gaussian has the form:

 \mathcal{N}(\tilde{y},\tilde{\Sigma}) \propto \exp\left[-\frac{1}{2}(y-\tilde{y})^T\tilde{\Sigma}^{-1}(y-\tilde{y})\right]

If we remember that covariance matrices are symmetric, we can expand the quadratic, drop the constant term -\frac{1}{2}\tilde{y}^T\tilde{\Sigma}^{-1}\tilde{y} (it doesn't involve y), and rearrange this to

 \mathcal{N}(\tilde{y},\tilde{\Sigma}) \propto \exp\left[-\frac{1}{2}y^T\tilde{\Sigma}^{-1}y + \tilde{y}^T\tilde{\Sigma}^{-1}y\right]

We can therefore see that P(y_p|y_t) and \mathcal{N}(\tilde{y},\tilde{\Sigma}) have exactly the same form, and by matching the quadratic terms (y_p^T M y_p against y^T\tilde{\Sigma}^{-1}y) we can straightforwardly read off \tilde{\Sigma}^{-1} = M, i.e.

 \tilde{\Sigma} = M^{-1} = P-CT^{-1}C^T

The expression for \tilde{y} requires a little more work. We start by matching terms

 \tilde{y}^T\tilde{\Sigma}^{-1} = \frac{1}{2}y_t^T (T^{-1}C^TM + (MCT^{-1})^T)

We can rearrange this a little bit, using the fact that T (and hence T^{-1}) is symmetric

 \tilde{y}^T\tilde{\Sigma}^{-1} = \frac{1}{2}y_t^T T^{-1}C^T (M + M^T)

We know that M is a symmetric matrix (we just showed that its inverse is the covariance matrix \tilde{\Sigma}, which is symmetric), so M + M^T = 2M and the factor of a half cancels, leaving

 \tilde{y}^T\tilde{\Sigma}^{-1} = y_t^T T^{-1}C^T M

If we then right multiply by the covariance matrix \tilde{\Sigma} = M^{-1} and take the transpose, we finally arrive at

\tilde{y} = CT^{-1}y_t

And so, in conclusion, we know that

 y_p|y_t \sim \mathcal{N}(CT^{-1}y_t, P-CT^{-1}C^T)
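As a final sanity check, here's a minimal numerical sketch of the result. The squared exponential kernel, the toy inputs and the randomly drawn 'training' outputs are all just illustrative choices (they're not from the derivation), and I'm using SciPy's multivariate_normal purely to evaluate densities; the check also uses the standard fact that the marginal over y_t is itself Gaussian with covariance T.

import numpy as np
from scipy.stats import multivariate_normal

def rbf(xa, xb, length=1.0):
    # squared exponential covariance function for 1-D inputs
    return np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2 / length ** 2)

rng = np.random.default_rng(1)
x_t = np.array([-2.0, -1.0, 0.5, 1.5])      # training inputs
x_p = np.array([-0.5, 1.0])                 # prediction inputs
y_t = rng.normal(size=len(x_t))             # made-up training outputs

jitter = 1e-8                               # tiny diagonal term for numerical stability
T = rbf(x_t, x_t) + jitter * np.eye(len(x_t))
P = rbf(x_p, x_p) + jitter * np.eye(len(x_p))
C = rbf(x_p, x_t)                           # cross terms, shape (n_p, n_t)

# The update equations derived above
Tinv = np.linalg.inv(T)
mean = C @ Tinv @ y_t
cov = P - C @ Tinv @ C.T

# Check: P(y_p|y_t) = P(y_p, y_t) / P(y_t), with y_t marginally ~ N(0, T)
Sigma = np.block([[T, C.T], [C, P]])
y_p = rng.normal(size=len(x_p))             # an arbitrary test point
lhs = multivariate_normal(mean, cov).logpdf(y_p)
rhs = (multivariate_normal(np.zeros(len(y_t) + len(y_p)), Sigma).logpdf(np.concatenate([y_t, y_p]))
       - multivariate_normal(np.zeros(len(y_t)), T).logpdf(y_t))
assert np.isclose(lhs, rhs)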

So, not quite as trivial as the textbooks claim!

PhD Update - July 2016

So, that PhD month in review idea didn't work out so well... From now on I'll be less ambitious, and just pledge to write a new blog post as and when I get time in between my real work.

In my April month in review I described random forests and Gaussian process regression which I was experimenting with in order to solve the problem of interpolation between population synthesis models. Back then, the project was in its infancy, and things have crystallised a great deal since then, so I'll do my best to give a summary of our progress.

Gaussian processes, whilst really cool, simply weren't suitable for this project, since the mapping from inputs to outputs is non-deterministic. I anticipate using them at some point for an extension of this project (keep your eyes peeled for that one). Similarly, neural networks (which I promised to introduce in my review of May, sorry about that!) didn't really work out; despite being really cool, they also struggle with non-deterministic functions. I did start to dabble with mixture density networks, but a stubborn bug in that code, combined with the resounding success of a much simpler method, led me to abandon them for the time being. That being said, I have an upcoming meeting about a project which may have me dusting off my neural network code. Watch this space.

k-Nearest Neighbours Regression (sort of)

Whilst reading around random forests, I realised that the way I was using them meant that I wasn't really doing random forests... I was doing bootstrap aggregated nearest neighbours regression, which is a bit of a mouthful but actually a very simple concept.

Nearest neighbours methods are quite popular in the machine learning community when it comes to discrete (classification) problems, where your observable outputs all fit nicely into discrete classes (cats, dogs, monkeys etc.). Basically, you take the k training points whose inputs match the input of your test point as 'closely' as possible, and then combine their outputs in some way (e.g. majority voting).

Nearest neighbours regression takes this idea and applies it to continuous observable outputs, combining the neighbours' outputs in some other way (arithmetic mean, median etc.) into a point estimate.
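To make that concrete, here's a minimal sketch of nearest neighbours regression in Python (the Euclidean distance and arithmetic mean are just the obvious default choices for this illustration, not necessarily the right ones for any given problem):

import numpy as np

def knn_regress(x_train, y_train, x_query, k=5):
    # x_train: (n, d) array of inputs, y_train: length-n outputs, x_query: length-d input
    # distances from the query input to every training input
    distances = np.linalg.norm(x_train - x_query, axis=1)
    # indices of the k 'closest' training points
    nearest = np.argsort(distances)[:k]
    # combine their outputs into a point estimate
    return np.mean(y_train[nearest])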

Bootstrap aggregation ("bagging") is a means of squeezing down random fluctuations in your estimate (in much the same way as random forests do for decision trees). Basically, a nearest neighbours regression is performed on many datasets drawn with replacement from the original dataset, and the resulting estimates are averaged. My intuition for why this works is to imagine that I have a really, really bad datapoint that throws my estimate off. Using bootstrap aggregation, this datapoint will only show up in ~2/3 of the resamples, so the final estimate will be less corrupted by it, even though the overall estimate will still converge on the truth in the limit of infinite data.
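Bagging that is then just a loop over bootstrap resamples, reusing knn_regress from the sketch above (again, an illustrative sketch rather than my actual code):

def bagged_knn_regress(x_train, y_train, x_query, k=5, n_boot=100, seed=None):
    rng = np.random.default_rng(seed)
    n = len(x_train)
    estimates = []
    for _ in range(n_boot):
        # draw a dataset of the same size, with replacement
        idx = rng.integers(0, n, size=n)
        estimates.append(knn_regress(x_train[idx], y_train[idx], x_query, k=k))
    # aggregate the per-resample estimates
    return np.mean(estimates)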

My Method

Notice how I said above that nearest neighbours regression combines outputs into a point estimate? This isn't what we want; for any particular combination of hyperparameters, the non-determinism of the binary evolution process means that there is a range of possible outputs, not a single set of numbers. Our problem is to interpolate between distributions of quantities, not point estimates.

Therefore, the method I have been employing is to densely sample hyperparameter space with binaries, and then to approximate the output distribution by taking the set of k binaries whose hyperparameters are closest to the point we want to test. If we are sufficiently densely sampled (and we assume that whichever distribution we're interested in changes smoothly with the hyperparameters), then the set of outputs from these k closest inputs will form a good approximation to the distribution at that point. This is more or less what I was doing with random forests, but more efficient and interpretable.
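In code, the idea is nothing more than the neighbour lookup from the sketch above, except that the k outputs are kept as samples rather than collapsed to a point estimate (the variable names here are hypothetical, not from my actual pipeline):

def approximate_output_distribution(hypers_train, outputs_train, hypers_query, k=100):
    # find the k binaries whose hyperparameters lie closest to the query point
    distances = np.linalg.norm(hypers_train - hypers_query, axis=1)
    nearest = np.argsort(distances)[:k]
    # their outputs serve as samples from the approximate output distribution
    return outputs_train[nearest]

The returned set can then be histogrammed (or smoothed with a kernel density estimate) to inspect the approximate distribution at that point in hyperparameter space.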

It has been pointed out to me that this is in some sense similar to approximate Bayesian computation (ABC), except in reverse. ABC is a method of inference used in situations where the likelihood function is either too difficult or impossible to compute. Instead, you run a simulation for your sample point, and assess the output against your data via some distance measure (which is typically a bit of a dark art to define). Smaller distances are then proxies for high likelihoods. We're instead assessing binaries on the 'closeness' of their inputs, and then using their outputs as 'samples'.

I'm not completely sure how useful this analogy is, but it certainly sounds more impressive than simply saying 'nearest neighbours', as I demonstrated on my latest conference poster, which I presented at the intractable likelihood conference in Lancaster.

I'm still working on this method and how best to apply it to the population synthesis problem, so I'm reluctant to present any preliminary results yet, but they're coming (as is hopefully a paper, but I'm not holding my breath yet). Watch this space.

Other Stuff

I'm planning to write a complementary blog post on the 'meta-work' I've been doing, developing a bunch of tools for dealing with large quantities of data. I've been messing around with HDF5 data formats and F2PY for wrapping Python code around Fortran routines (which also involved learning Fortran...), as well as organising our group's journal club and a new thing: PhD book club, in which myself and a few colleagues collectively read through Thorne & Blandford's excellent textbook, before getting together to solve the exercises on the board.

I won't overdo it for this blog post, and I'll try to write another one soon.