So, that PhD month in review idea didn't work out so well... From now on I'll not be so ambitious and pledge to write a new blog post as and when I get time in between my real work.
In my April month in review I described random forests and Gaussian process regression which I was experimenting with in order to solve the problem of interpolation between population synthesis models. Back then, the project was in its infancy, and things have crystallised a great deal since then, so I'll do my best to give a summary of our progress.
Gaussian processes, whilst really cool, simply weren't suitable for this project, since the mapping from inputs to outputs is non-deterministic. I anticipate using them at some point for an extension of this project (keep your eyes peeled for that one). Similarly neural networks (which I promised to introduce in my review of May, sorry about that!) didn't really work out. Despite being really cool, they also struggle with non-deterministic functions. I did start to dabble with mixture density networks, however a stubborn bug in that code combined with the resounding success of a much simpler method led me to abandon it for the time being. That being said, I have an upcoming meeting about a project which may have me dusting off my neural network code. Watch this space.
k - Nearest Neighbours Regression (sort of)
Whilst reading around random forests, I realised that the way I was using them meant that I wasn't really doing random forests... I was doing bootstrap aggregated nearest neighbours regression, which is a bit of a mouthful but actually a very simple concept.
Nearest neighbours methods are quite popular in the machine learning community when it comes to discrete (classification) problems, where your observable outputs all fit nicely into discrete classes (cats, dogs, monkeys etc). Basically, you take the set of k training data whose inputs match the input of your test point as 'closely' as possible, and then combine their outputs in some way (e.g. majority voting).
Nearest neighbours regression takes this idea and applies it to continuous observable outputs, and uses some other combination (arithmetic mean, median etc) to combine the outputs into a point estimate.
Bootstrap aggregation ("bagging") is a means of squeezing down random fluctuations in your estimate (in much the same way as random forests do for decision trees). Basically, a nearest neighbours regression is performed on datasets drawn with replacement from the original dataset. My intuition for why this works is to imagine that I have a really really bad datapoint, that throws my estimate off. Using bootstrap aggregation, this datapoint will only show up ~2/3 of the time, so the final estimate will be less corrupted by it, even though the overall estimate will still converge on the truth for infinite data.
Notice that I said above that nearest neighbours regression combines outputs into a point estimate? This isn't what we want; for any particular combination of hyperparameters, the non-determinism of the binary evolution process means that there is a range of possible outputs, not a single set of numbers. Our problem is to interpolate between distributions of quantities, not point estimates.
Therefore, the method I have been employing is to densely sample hyperparameter space with binaries, and then to approximate the output distribution by taking the set of k binaries whose hyperparameters are closest to the point we want to test. If we are sufficiently densely sampled (and we assume that whichever distribution we're interested in changes smoothly with the hyperparameters) then the set outputs from these k closest inputs will form a good approximation to distribution at that point. This is more or less what I was doing with random forests, but more efficient and interpretable.
It has been pointed out to me that this is in some sense similar to approximate Bayesian computation (ABC), except in reverse. ABC is a method of inference used in situations where the likelihood function is either too difficult or impossible to compute. Instead, you run a simulation for your sample point, and assess the output against your data via some distance measure (which is typically a bit of a dark art to define). Smaller distances are then proxies for high likelihoods. We're instead assessing binaries on the 'closeness' of their inputs, and then using their outputs as 'samples'.
I'm not completely sure how useful this analogy is, but it certainly sounds more impressive than simply saying 'nearest neighbours', as I demonstrated on my latest conference poster, which I presented at the intractable likelihood conference in Lancaster.
I'm still working on this method and how best to apply it to the population synthesis problem, so I'm reluctant to present any preliminary results yet, but they're coming (as is hopefully a paper, but I'm not holding my breath yet). Watch this space.
I'm planning to write a complimentary blog post on the 'meta-work' I've been doing, developing a bunch of tools for dealing with large quantities of data. I've been messing around with HDF5 data formats, F2PY for wrapping python code around FORTRAN routines (which also involved learning FORTRAN...), as well as organising our group's journal club and a new thing; PhD book club, in which myself and a few colleagues collectively read through Thorne & Blandford's excellent textbook, before getting together to solve the exercises on the board.
I wont overdo it for this blog post, and I'll try to write another one soon.