Fisher matrices

Back in December, I put a paper on the arxiv about Fisher matrices. You can see my blog post about it here. I've since put together a short Jupyter notebook detailing how we actually went about calculating each Fisher matrix, which you can see here.

I was using quite an old version of COMPAS for the results in the paper, where the output format wasn't written very well, thus all of the horrible bookkeeping code. I've since rewritten that part of COMPAS to enforce some better standards.

(un-)Predictability of Sports

A subject of debate (mostly in the pub) over the last few months has been about the predictability of sports results. My hypothesis was that the total number of points scored in a team-vs-team sport should be roughly Poisson distributed, and as such you would expect greater statistical fluctuations in the scores of sports where the score-lines are typically low (e.g. football), compared to sports with a high scoreline (e.g. Basketball). My thinking was that the "better" team should win more consistently in high scoreline sports.

Over the Christmas break, I decided to put my money where my mouth is, and actually investigate it. You can find the Jupyter notebook here. I downloaded a bunch of historical results from football, ice hockey and basketball, together with bookmakers pre-game odds (from this excellent site). The data took a bit of cleaning, but I ultimately looked at how often the bookie's favourite lost for each sport. The tl;dr conclusion is that I was wrong. Football is the most accurately predicted, and has the lowest average scoreline.


I had intended to clean the notebook up a little bit, and make some more investigations (e.g. was I wrong because the score-lines aren't Poisson distributed? Or because each team's individual scoreline are too highly correlated with each other?) but whilst I was clearing out some of my external hard-drives I accidentally formatted my internal hard-drive too. That was a lot of fun to fix.... Anyway, I lost all my squeaky clean data, so I couldn't make prettier plots. Apologies.

New Paper - Accuracy of Inference

A couple of weeks ago I put my newest paper on to the arXiv, which you can find here;

The paper describes how much we can hope to learn from gravitational-wave observations of black holes, focusing in particular on how well we can constrain our models of the evolution of binary stars. In order to calculate this I used COMPAS, the binary evolution simulation software we are developing here in Birmingham, and a technique called Fisher information matrices.

If you have a model, which depends on some parameters \{\lambda\}, and makes predictions about some observable random variable, then the Fisher information quantifies how sensitive the expected observations are to changes in the underlying parameters. If some parameters have a large effect on our expected observations, then it makes sense that we can learn a lot about them.

I calculated the Fisher matrix by running COMPAS (a lot!) and seeing how the output changed with respect to four different bits of physics; the strength of supernova kicks, the ferocity of stellar winds during two different stellar evolutionary phases, and the frictional energetics during the common envelope phase. Below is the money plot from the paper.

The plot shows how accurately we will be able to measure each of these parameters after 1000 observations of binary black holes. The fuzziness comes from a quantification of the various sources of uncertainty in this process. It also shows how the different parameters behave with respect to each other (their correlation). The way I like to think about this plot is in terms of null hypothesis testing. If we find the truth is a massive outlier with respect to these distributions, then we reject the null hypothesis that our model is correct. This plot shows that the model doesn't have to be wrong by more than a few percent before we can rule out our model.

Intense Collaboration à Clermont-Ferrand

I've spent the last week working in the beautiful (and intensely hot) Clermont-Ferrand in the south of France, in the fourth edition of the COIN residence program. These residence programs are aimed to bring together a diverse collection of astrophysicists (and others!) to live together in a house for a week, and collaborate intensely on a couple of statistical and/or machine learning problem in astrophysics. The house itself is super luxurious (by British standards, at least)

I had no idea how much work could be done in a single week, we have two papers in a pretty good state (from an almost cold start), and we expect to submit them in the next couple of weeks. I'm not sure how much I can say about them before then, but I'll write a summary here once they appear on the arXiv.

I would thoroughly recommend future incarnations of this meeting to anyone. The website is here;




Greek Sieves and Corrupted Data

The largest simulation of my PhD finished a couple of days ago on our local cluster. It was a simulation of a total of 360 million binaries, combining datasets using around 300 different sets of physical assumptions. The whole lot took around 12 days running simultaneously on around 120 cores. That's around 35000 CPU hours, or just shy of 4 CPU years of processing! I will be using the simulations to try and pin down exactly how much we can hope to learn about our models of binary physics using gravitational wave observations. The paper is coming along nicely, and I'll write a summary here when it's done.

Alas, when I had run my postprocessing code to collect my results from the different bits of the cluster I sent it to, I found that I only had 359999189 binaries in my dataset (811 had gone missing!). This isn't a problem I can just ignore, since I make use of the fact that I simulate the same number of binaries, using the same initial conditions, for each setting of the physics. I need to work out what's missing. This turns out to be quite a challenge when you have a 40GB data file with 3.6\times 10^8 binaries to look through!

The main challenge is memory management. I can't load the whole file into memory (I even use a little bit of swap memory when loading in a single column). Luckily, I kept track of an ID type variable when constructing my dataset, so for one of the columns, I know that there should only be 1.2\times 10^6 unique values, repeated for each variation of the physics.

So, in summary, I have a column which should have 300 copies of the same 1.2\times 10^6 IDs. The problem is therefore to go through my dataset, and find out which ones have been skipped over for some of my datasets (in a way that doesn't kill the RAM on my machine).

The solution I came up with is partially based on the Sieve of Eratosthenes, which is a very simple and efficient way of finding all the prime numbers up to a given limit. You can see details of the algorithm here, however, when I first saw an implementation of it, it blew my mind. The trick is to make a boolean array, and then mark the elements whose indices are prime numbers. This is easier to see in code, here's my implementation from a few years ago in C++;

vector<bool> primeSieveUpTo(int max)
	vector<bool> sieve(max+1,true);
	sieve[0] = false;
	sieve[1] = false;
	for (int i=2; i<=max; i++)
		if (sieve[i])
			for (int j=2*i; j<=max; j+=i) sieve[j] = false;
	return sieve;

I realised I could use a similar trick (using the indices of an array to keep track of things) in my code. Here's my whole Python script, which efficiently goes through my 360000000 line file, and counts instances of each unique seed.

import numpy as np

# these are the 1.2e6 unique values I know will be in the 2nd column of
# my data
uniqueSeeds = np.loadtxt('uniqueSeeds.dat',dtype=np.int64)

seedUsageCounts = np.zeros_like(uniqueSeeds)

# my data file
f = open('allInitial.dat')

# skip the headers

for l in f:
	seed = int(l[:-1].split(',')[1])
	# add 1 to the index corresponding to the observed seed
	seedUsageCounts[uniqueSeeds == seed] += 1

np.savetxt('seedUsageCounts.dat', seedUsageCounts,fmt='%i')

I'm sure I could improve performance a little by doing things in batches, but it's memory I was trying to manage, not time. I can leave it running overnight!

Villiers Park

I was recently accepted to teach a week long residential astrophysics course at Villiers park, just outside Cambridge UK. It sounds like it's going to be loads of fun.

The basic idea is that they invite some of the best A-Level (17-18 year old) students to stay at Villiers Park for a week in September, whilst myself and another physicist teach them a series of topics broadly related to astrophysics, as well as generally hang out with them and show that we're not all Sheldon Coopers.

I'm currently pencilled in to give 90 minute sessions on each of

  • Universe at the largest scales
  • Special Relativity (with a little GR at the end)
  • Gravitational Waves
  • Exoplanets

I'm genuinely excited for this!

Random Samples From a Sphere

During my work today, at some point I had to draw samples uniformly from a sphere, and made an observation I hadn't really appreciated before. The name of the game is to make sure that every infinitesimal shell has the same density of samples, on average.

 \frac{4\pi r^2}{\left< N_{samp} \right>} = \mathrm{constant}

Or in other words, the volume gets more spread out, so we need more samples for higher r. The density is simply constant with respect to direction (angles). So, to get the correct distribution we simply need to sample from

 \theta \sim \mathcal{U}(0,2\pi)

 \phi \sim \mathcal{U}(0,\pi)

 r \sim r^2

Or in other words, just sample from a power law distribution with index 2. Elementary observation? Maybe, but it makes it easier to code up!

Gaussian Process Update Equations

When you start to learn about Gaussian processes, you come across the update equations fairly early on. The update equations are fairly intimidating to look at, and are typically dismissed as trivial to derive (for example, Rasmussen and Williams' simply point you towards a statistics book from the 1930's, which is neither available online nor in our university library...). I had a go at the derivation, and promptly realised it wasn't trivial at all from a cold start.

Fortuitously, at the PEN emulators workshop I recently attended, there was an introductory lecture from Jochen Voss, where he went through a univariate case, and then gave us the structure of the derivation for the full multivariate case. Luckily, this gave me the push I needed to try the derivation again, so I went away and filled in the gaps.

So here it is, in all it's glory; the derivation of the update equations for Gaussian processes.

The overall endgame of Gaussian process regression is to write down a conditional distribution P(y_p | y_t) for a set of predicted outputs y_p given a set of training (observed) outputs y_t. By the product rule

 P(y_p|y_t) = \frac{P(y_p,y_t)}{P(y_t)}

Since we have a constant set of data, P(y_t) is just a constant in this expression.

The underlying assumption in Gaussian process regression is that outputs are jointly Gaussian distributed, so that

 P(y_p|y_t) \propto P(y_p,y_t) \propto \exp\left[-\frac{1}{2} \left(\begin{array}{c} y_t \\ y_p \end{array}\right)^T\Sigma^{-1}\left(\begin{array}{c} y_t \\ y_p \end{array}\right)\right]

Where \Sigma is the joint covariance matrix. Remember that under the Gaussian process model we have trained a function which computes the elements of the Covariance matrix purely as a function of the inputs, it is only a function of the outputs y_p that we're trying to find. We can define the covariance matrix blockwise

 \Sigma = \left(\begin{array}{cc} T & C^T \\ C & P \end{array}\right)

Where T is the covariance matrix computed using only the training inputs x_t, P is the covariance matrix computed using the prediction inputs x_p, and C is the cross terms (i.e. the covariance \emph{between} y_t and y_p, computed using x_t and x_p). It is a well known result (it's in numerical recipes, or on Wikipedia) that you can blockwise invert a matrix;

 \Sigma^{-1} = \left(\begin{array}{cc}T^{-1} + T^{-1}C^T M CT^{-1} & -T^{-1}C^TM \\ -MCT^{-1} & M\end{array}\right)

Where M = (P-CT^{-1}C^T)^{-1}. So, we can directly compute our Gaussian density

 P(y_p|y_t) \propto \exp\left[-\frac{1}{2} y_t^T(T^{-1} + T^{-1}C^T M CT^{-1})y_t + \frac{1}{2}y_t^T (T^{-1}C^TM)y_p + \frac{1}{2}y_p^T (MCT^{-1})y_t - \frac{1}{2}y_p^TMy_p\right]

However, the only thing that isn't a constant here is y_p, so we can drop a bunch of terms (since we're only interested in the density, not absolute values)

 P(y_p|y_t) \propto \exp\left[\frac{1}{2}y_t^T (T^{-1}C^TM)y_p + \frac{1}{2}y_p^T (MCT^{-1})y_t - \frac{1}{2}y_p^TMy_p\right]

If we take the transpose of the middle term, we can group the terms together a bit more

 P(y_p|y_t) \propto \exp\left[\frac{1}{2}y_t^T (T^{-1}C^TM + (MCT^{-1})^T)y_p - \frac{1}{2}y_p^TMy_p\right]

Now, in general, a multivariate Gaussian has the form;

 \mathcal{N}(\tilde{y},\tilde{\Sigma}) \propto \exp\left[-\frac{1}{2}(y-\tilde{y})^T\tilde{\Sigma}^{-1}(y-\tilde{y})\right]

If we remember that covariance matrices are symmetric, we can expand, drop some constant terms and then rearrange this to

 \mathcal{N}(\tilde{y},\tilde{\Sigma}) \propto \exp\left[-\frac{1}{2}y^T\tilde{\Sigma}^{-1}y + \tilde{y}^T\tilde{\Sigma}^{-1}y\right]

we can therefore see that both P(y_p|y_t) and \mathcal{N}(\tilde{y},\tilde{\Sigma}) have exactly the same form. We can therefore straightforwardly match expressions for \tilde{\Sigma}.

 \tilde{\Sigma} = M^{-1} = P-CT^{-1}C^T

The expression for \tilde{y} requires a little more work. We start by matching terms

 \tilde{y}^T\tilde{\Sigma}^{-1} = \frac{1}{2}y_t^T (T^{-1}C^TM + (MCT^{-1})^T)

We can rearrange this a little bit

 \tilde{y}^T\tilde{\Sigma}^{-1} = \frac{1}{2}y_t^T T^{-1}C^T (M + M^T)

We know that M is a symmetric matrix (we just showed that its inverse is the covariance matrix). So, if we right multiply by the covariance matrix \tilde{\Sigma} and take the transpose, we finally arrive at

\tilde{y} = CT^{-1}y_t

And, so, in conclusion we know that

P(y_p|y_t) \sim \mathcal{N}(CT^{-1}y_t, P-CT^{-1}C^T)

So, not quite as trivial as the textbooks claim!


Over the last couple of weeks I've attended both the PEN (Past Earth Network) emulators workshop in the ultra-glamorous Leeds, UK, and the European Week in Astronomy and Space Science (EWASS) in the gorgeous Prague, Czech Republic.

I made a new poster for EWASS, with a slightly more detailed account of my emulators work. I'm more or less at the paper writing stage for that project (and another), so watch this space for arXiv numbers and accessible reviews in the coming couple of months!

SAMSI Workshop on Emulation

At the beginning of April I attended the SAMSI workshop on the emulation of astrophysical populations, just outside of Durham North Carolina. I helped co-author the official SAMSI blog post, which can be found here.

I also finally submitted my proceedings paper from the Astroinformatics conference last year to the arXiv, which you can look at over at arXiv 1704.03781.

I may get around to writing a summary post about it, although things are quite busy these days. I'll see what I can do.