PhD Update - July 2016

So, that PhD month in review idea didn't work out so well... From now on I'll not be so ambitious and pledge to write a new blog post as and when I get time in between my real work.

In my April month in review I described random forests and Gaussian process regression which I was experimenting with in order to solve the problem of interpolation between population synthesis models. Back then, the project was in its infancy, and things have crystallised a great deal since then, so I'll do my best to give a summary of our progress.

Gaussian processes, whilst really cool, simply weren't suitable for this project, since the mapping from inputs to outputs is non-deterministic. I anticipate using them at some point for an extension of this project (keep your eyes peeled for that one). Similarly neural networks (which I promised to introduce in my review of May, sorry about that!) didn't really work out. Despite being really cool, they also struggle with non-deterministic functions. I did start to dabble with mixture density networks, however a stubborn bug in that code combined with the resounding success of a much simpler method led me to abandon it for the time being. That being said, I have an upcoming meeting about a project which may have me dusting off my neural network code. Watch this space.

k - Nearest Neighbours Regression (sort of)

Whilst reading around random forests, I realised that the way I was using them meant that I wasn't really doing random forests... I was doing bootstrap aggregated nearest neighbours regression, which is a bit of a mouthful but actually a very simple concept.

Nearest neighbours methods are quite popular in the machine learning community when it comes to discrete (classification) problems, where your observable outputs all fit nicely into discrete classes (cats, dogs, monkeys etc). Basically, you take the set of k training data whose inputs match the input of your test point as 'closely' as possible, and then combine their outputs in some way (e.g. majority voting).

Nearest neighbours regression takes this idea and applies it to continuous observable outputs, and uses some other combination (arithmetic mean, median etc) to combine the outputs into a point estimate.

Bootstrap aggregation ("bagging") is a means of squeezing down random fluctuations in your estimate (in much the same way as random forests do for decision trees). Basically, a nearest neighbours regression is performed on datasets drawn with replacement from the original dataset. My intuition for why this works is to imagine that I have a really really bad datapoint, that throws my estimate off. Using bootstrap aggregation, this datapoint will only show up ~2/3 of the time, so the final estimate will be less corrupted by it, even though the overall estimate will still converge on the truth for infinite data.

My Method

Notice that I said above that nearest neighbours regression combines outputs into a point estimate? This isn't what we want; for any particular combination of hyperparameters, the non-determinism of the binary evolution process means that there is a range of possible outputs, not a single set of numbers. Our problem is to interpolate between distributions of quantities, not point estimates.

Therefore, the method I have been employing is to densely sample hyperparameter space with binaries, and then to approximate the output distribution by taking the set of k binaries whose hyperparameters are closest  to the point we want to test. If we are sufficiently densely sampled (and we assume that whichever distribution we're interested in changes smoothly with the hyperparameters) then the set outputs from these k closest inputs will form a good approximation to distribution at that point. This is more or less what I was doing with random forests, but more efficient and interpretable.

It has been pointed out to me that this is in some sense similar to approximate Bayesian computation (ABC), except in reverse. ABC is a method of inference used in situations where the likelihood function is either too difficult or impossible to compute. Instead, you run a simulation for your sample point, and assess the output against your data via some distance measure (which is typically a bit of a dark art to define). Smaller distances are then proxies for high likelihoods. We're instead assessing binaries on the 'closeness' of their inputs, and then using their outputs as 'samples'.

I'm not completely sure how useful this analogy is, but it certainly sounds more impressive than simply saying 'nearest neighbours', as I demonstrated on my latest conference poster, which I presented at the intractable likelihood conference in Lancaster.

I'm still working on this method and how best to apply it to the population synthesis problem, so I'm reluctant to present any preliminary results yet, but they're coming (as is hopefully a paper, but I'm not holding my breath yet). Watch this space.

Other Stuff

I'm planning to write a complimentary blog post on the 'meta-work' I've been doing, developing a bunch of tools for dealing with large quantities of data. I've been messing around with HDF5 data formats, F2PY for wrapping python code around FORTRAN routines (which also involved learning FORTRAN...), as well as organising our group's journal club and a new thing; PhD book club, in which myself and a few colleagues collectively read through Thorne & Blandford's excellent textbook, before getting together to solve the exercises on the board.

I wont overdo it for this blog post, and I'll try to write another one soon.

PhD Month in Review - April 2016

I've decided to try and be a bit more active on the blogging front, and to try and showcase my progress through my PhD. This month I started a whole new project, so it seems like the perfect time to start a new series of blog posts; PhD Month in review. The idea is pretty self evident, at the end of each month I'll write a post explaining the highlights of what I've been working on and what I've been learning about. So, without further ado, here's what I've been doing this April. I'll start by outlining my new project.

Population Synthesis

Two of my colleagues, Simon Stevenson and Alejandro Vigna-Gómez are currently working with Professor Ilya Mandel on a compact binary population synthesis code. The basic idea of the code is to spawn a bunch of pairs of stars, sit them next to each other and let them evolve for a hubble time, before noting down their final characteristics and, most importantly, whether or not they've formed compact objects and merged. I say 'most importantly' because these compact binary mergers are exactly what the guys down at aLIGO are looking for with their kick-ass interferometer; gravitational wave sources.

GW150914 showed us that there are definitely gravitational wave sources out there, but there is still a great deal of uncertainty about the full range of what is physically possible in terms of gravitational wave sources. What range of masses are possible in a black hole binary? What about spins? Does supernova kick velocity play a significant role? As more gravitational wave observations start rolling in, constraining these underlying bits of astrophysics becomes an important and interesting task.

A pair of merging black holes - Image from the SXS project (simulating extreme spacetimes)
A pair of merging black holes - Image from the SXS project (simulating extreme spacetimes)

Now we see why population synthesis codes are important. We can spawn a whole bunch of stars with, say, a wide distribution of masses, let them go for a while and then look at which ones have formed black holes and merged. Then we check against our gravitational wave observations and see if the range of masses we began with were sensible. If they weren't, then we go back and amend our code accordingly. For example, If our population synthesis code says that a pair of 30 solar mass black holes is wildly unlikely, we know we need to change our initial distribution of masses. That's an extreme case, but that's the basic idea. More technical details can be found in my colleague's paper.

Now, where do I come in?

Interpolation and Prediction

There are a few problems with simulating populations of compact binary mergers;

  1. They're expensive - Each pair of stars can take up to a couple of seconds to evolve. Doesn't sound like much, but it is when you think about the next problem...
  2. They're quite rare - you may need to evolve millions of pairs of stars to see an appreciable (statistically meaningful) number of mergers
  3. They come in many different flavours - I've mentioned mass, spin and supernova kicks already, but there a dozens of other characteristics that might affect merger rate.
  4. They're stochastic - Basically, even with two identical pairs of stars, one might merge and the other might not
  5. They're complicated - You can't write down a simple equation about what's going on!

So, in summary, these simulations are expensive to run and each simulation only gives you information about a tiny piece of a vast parameter space. My task; work out what we can say about the simulations that haven't been run yet, given the simulations that have. Most importantly, we need to be able to quantify our uncertainty in any predictions we make. This last month I've been doing a series of initial investigations of 3 different ways that I might tackle this problem; Gaussian processes, random forests and neural networks.

Stating the Problem

The problem I am hoping to tackle is actually a relatively general one. I start by drawing stars from some known distributions and sit them next to each other in some universe characterised by a series of hyperparameters, which we'll call \vec{\varphi}, evolve them according to some highly complicated and non-linear series of operations (with stochastic elements), which we'll encapsulate in a black box function f, and at the end we get a collection of output characteristics from the binary system, which we'll call \vec{y}. The goal of this project is to therefore characterise the conditional distribution


There are two rather different ways of visualising this problem, which each lend themselves to different methods. The first is closely related to the description given above, where we would try to develop some parametrised function g of the hyper parameters such that

\vec{y} = g(\vec{\varphi}|\vec{\theta})

Where \vec{\theta} are 'model parameters' specific to the type of model that we employ (random forest or neural network in my case). The aim would then be to come up with a function g which does a good job of explaining what we see.

The other approach is to consider both the inputs and the outputs on the same footing, as part of some hypersurface upon which we want to determine the population density of binaries, so that if we call our hypersurface z(\vec{y},\vec{\varphi}) then we want to know the binaries are distributed on that surface, i.e.

z(\vec{y},\vec{\varphi}) = \frac{1}{N}\frac{dN}{d(\vec{y},\vec{\varphi})}

Gaussian Processes

I am still working towards a good understanding of what Gaussian processes actually are. So far the picture in my head is that we make some assumption about how the outputs of a function vary with respect to the distance between the inputs of that function, which we encapsulate in a kernel function. In my own implementation (and most other peoples too) I use the squared exponential kernel

k(x,y)= A e^{\frac{-|x-y|^2}{2l}}

Which essentially says that the covariance of a pair points in output space falls off exponentially with their L^2 distance in inputs space. At a deeper level, Gaussian Processes work on a large space of possible functions f(x) and posit that these functions obey a Gaussian distribution with a mean function \mu(x)=\left<f(x)\right> and covariance over functions k(x,y)=\left<(f(x)-\mu(x))(f(y) - \mu(y)\right>. Using the nice properties of Gaussians it's then easy to make point predictions with uncertainties. See Rasmussen and Williams for more details.

I haven't really done so much work on this approach, but what little work I have done has thrown up some encouraging results (see the plot below). My basic method was to naively marginalise each dimension (by literally ignoring the others), make a histogram and use the bin edges as inputs and bin counts as outputs (so this is the hypersurface, population density approach described abaove). I  then use these to make predictions of unseen data.

I am largely sceptical that this approach is the right one for this project, since if we want to consider covariance between dimensions, either binning is going to become difficult or the computation is going to become intractable. That being said, I could well see this approach being part of a prediction pipeline somewhere. We shall see.

True (blue) and agnostically predicted (red) chirp mass distributions
True (blue) and agnostically predicted (red) chirp mass distributions

Random Forests

After playing around with Gaussian Processes for a while, I started looking for other regression methods popular in the machine learning community, and came across the concept of decision trees and, by extension, random forests. Decision trees take your data and partition it according to a binary decision. The decision could be 'has the binary merged or not?' or 'is the final chirp mass greater than 50 solar masses?'. After applying many of these decisions you end up with a bunch of 'leaf' nodes containing a small number of data points. Then, when you want to predict the output for a previously unknown input, you follow the tree down to the corresponding leaf node and your prediction is some average of the observed outputs there.

Random forests extend this idea by growing these decision trees for a random subset of the data until each leaf node contains only one observation, then a prediction is an average over the predictions made by all trees in the 'forest'.

This approach lends itself to the input/output way of formulating the problem, and is nice since it doesn't require any naive marginalisation. I used the scikit-learn implementation of a binary decision tree to make predictions, and got excellent preliminary results (see the plot below). I feel as though I may very well come back to this approach at some point in the future. My supervisor has warned that as I increase dimensionality beyond the simple test cases I'm currently using, tree based approaches may start to fall down since larger and larger fractions of the parameter volume is occupied by the surface. I will look into this further, but this was only the second regression algorithm that I tried...

True (blue) and predicted (red) mass distributions.
True (blue) and predicted (red) mass distributions.

Neural Networks (and Mixture Density Networks)

The rest of this month I have spent doing some preliminary reading around neural networks and working towards implementing one. I will leave a more detailed description for next month's blog post, when I have more idea of what's going on. For the time being, you can follow my development progress via the github repo I've set up to contain all of my work related to this project; Tools for Interpolation and Prediction of Population SYnthesis (TIPPSY).   (Update 25/07/2016 - I'm no longer maintaining this repository publicly, and will instead release my interpolation tools with my group's population synthesis code as a whole, when it's ready)