When you start to learn about Gaussian processes, you come across the update equations fairly early on. The update equations are fairly intimidating to look at, and are typically dismissed as trivial to derive (for example, Rasmussen and Williams' simply point you towards a statistics book from the 1930's, which is neither available online nor in our university library...). I had a go at the derivation, and promptly realised it wasn't trivial at all from a cold start.

Fortuitously, at the PEN emulators workshop I recently attended, there was an introductory lecture from Jochen Voss, where he went through a univariate case, and then gave us the structure of the derivation for the full multivariate case. Luckily, this gave me the push I needed to try the derivation again, so I went away and filled in the gaps.

So here it is, in all it's glory; the derivation of the update equations for Gaussian processes.

The overall endgame of Gaussian process regression is to write down a conditional distribution for a set of predicted outputs given a set of training (observed) outputs . By the product rule

Since we have a constant set of data, is just a constant in this expression.

The underlying assumption in Gaussian process regression is that outputs are jointly Gaussian distributed, so that

Where is the joint covariance matrix. Remember that under the Gaussian process model we have trained a function which computes the elements of the Covariance matrix purely as a function of the inputs, it is only a function of the outputs that we're trying to find. We can define the covariance matrix blockwise

Where is the covariance matrix computed using only the training inputs , is the covariance matrix computed using the prediction inputs , and is the cross terms (i.e. the covariance \emph{between} and , computed using and ). It is a well known result (it's in numerical recipes, or on Wikipedia) that you can blockwise invert a matrix;

Where . So, we can directly compute our Gaussian density

However, the only thing that isn't a constant here is , so we can drop a bunch of terms (since we're only interested in the density, not absolute values)

If we take the transpose of the middle term, we can group the terms together a bit more

Now, in general, a multivariate Gaussian has the form;

If we remember that covariance matrices are symmetric, we can expand, drop some constant terms and then rearrange this to

we can therefore see that both and have exactly the same form. We can therefore straightforwardly match expressions for .

The expression for requires a little more work. We start by matching terms

We can rearrange this a little bit

We know that is a symmetric matrix (we just showed that its inverse is the covariance matrix). So, if we right multiply by the covariance matrix and take the transpose, we finally arrive at

And, so, in conclusion we know that

So, not quite as trivial as the textbooks claim!