New Paper - Accuracy of Inference

A couple of weeks ago I put my newest paper onto the arXiv, which you can find here; https://arxiv.org/abs/1711.06287

The paper describes how much we can hope to learn from gravitational-wave observations of black holes, focusing in particular on how well we can constrain our models of the evolution of binary stars. To calculate this I used COMPAS, the binary evolution simulation code we are developing here in Birmingham, together with a technique called the Fisher information matrix.

If you have a model which depends on some parameters \{\lambda\} and makes predictions about some observable random variable, then the Fisher information quantifies how sensitive the expected observations are to changes in the underlying parameters. If some parameters have a large effect on our expected observations, then it makes sense that those observations can teach us a lot about them.
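
To be a little more concrete, if \mathcal{L}(\lambda) is the likelihood of the observed data given the parameters, then the Fisher matrix is the expected curvature of its logarithm,

 F_{ij} = -\left\langle \frac{\partial^2 \ln \mathcal{L}(\lambda)}{\partial \lambda_i \, \partial \lambda_j} \right\rangle

and, via the Cramér-Rao bound, its inverse sets the best possible covariance on \{\lambda\} that any unbiased analysis could achieve.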

I calculated the Fisher matrix by running COMPAS (a lot!) and seeing how the output changed with respect to four different bits of physics; the strength of supernova kicks, the ferocity of stellar winds during two different stellar evolutionary phases, and the frictional energetics during the common envelope phase. Below is the money plot from the paper.

The plot shows how accurately we will be able to measure each of these parameters after 1000 observations of binary black holes. The fuzziness comes from a quantification of the various sources of uncertainty in this process. It also shows how the different parameters behave with respect to each other (their correlations). The way I like to think about this plot is in terms of null hypothesis testing: if the truth turns out to be a massive outlier with respect to these distributions, then we reject the null hypothesis that our model is correct. The plot shows that the model doesn't have to be wrong by more than a few percent before we can rule it out.

Intense Collaboration à Clermont-Ferrand

I've spent the last week working in the beautiful (and intensely hot) Clermont-Ferrand, in central France, at the fourth edition of the COIN residence program. These residence programs aim to bring together a diverse collection of astrophysicists (and others!) to live together in a house for a week, and collaborate intensely on a couple of statistical and/or machine learning problems in astrophysics. The house itself is super luxurious (by British standards, at least).

I had no idea how much work could be done in a single week: we have two papers in a pretty good state (from an almost cold start), and we expect to submit them in the next couple of weeks. I'm not sure how much I can say about them before then, but I'll write a summary here once they appear on the arXiv.

I would thoroughly recommend future incarnations of this meeting to anyone. The website is here; https://iaacoin.wixsite.com/crp2017

Greek Sieves and Corrupted Data

The largest simulation of my PhD finished a couple of days ago on our local cluster. It comprised a total of 360 million binaries, combining datasets generated under around 300 different sets of physical assumptions. The whole lot took around 12 days running simultaneously on around 120 cores; that's 12\times 24\times 120 \approx 35000 CPU hours, or just shy of 4 CPU years of processing! I will be using the simulations to try and pin down exactly how much we can hope to learn about our models of binary physics using gravitational wave observations. The paper is coming along nicely, and I'll write a summary here when it's done.

Alas, when I ran my postprocessing code to collect the results from the different bits of the cluster I had sent them to, I found that I only had 359999189 binaries in my dataset (811 had gone missing!). This isn't a problem I can just ignore, since I rely on the fact that I simulate the same number of binaries, using the same initial conditions, for each setting of the physics. I need to work out what's missing, which turns out to be quite a challenge when you have a 40GB data file with 3.6\times 10^8 binaries to look through!

The main challenge is memory management. I can't load the whole file into memory (I even use a little bit of swap when loading in a single column). Luckily, I kept track of an ID-type variable (a random seed) when constructing my dataset, so for one of the columns I know that there should only be 1.2\times 10^6 unique values, repeated for each variation of the physics.

So, in summary, I have a column which should contain 300 copies of the same 1.2\times 10^6 IDs. The problem is therefore to go through the dataset and find out which IDs have been skipped in which of the physics variations, in a way that doesn't kill the RAM on my machine.

The solution I came up with is partially based on the Sieve of Eratosthenes, which is a very simple and efficient way of finding all the prime numbers up to a given limit. You can see details of the algorithm here; however, when I first saw an implementation of it, it blew my mind. The trick is to make a boolean array and then mark off the elements whose indices are multiples of each prime you find, so that whatever remains unmarked is prime. This is easier to see in code; here's my implementation from a few years ago in C++;

vector<bool> primeSieveUpTo(int max)
{
	// sieve[i] will end up true if and only if i is prime
	vector<bool> sieve(max+1,true);

	// 0 and 1 are not prime
	sieve[0] = false;
	sieve[1] = false;

	for (int i=2; i<=max; i++)
	{
		if (sieve[i])
		{
			// i is prime, so mark off all of its multiples
			for (int j=2*i; j<=max; j+=i) sieve[j] = false;
		}
	}

	return sieve;
}

I realised I could use a similar trick (using the indices of an array to keep track of things) in my own code. Here's my whole Python script, which works through my 360000000-line file and counts the occurrences of each unique seed.

import numpy as np

# these are the 1.2e6 unique values I know will be in the 2nd column of
# my data
uniqueSeeds = np.loadtxt('uniqueSeeds.dat',dtype=np.int64)

seedUsageCounts = np.zeros_like(uniqueSeeds)

# my data file
f = open('allInitial.dat')

# skip the headers
f.readline()
f.readline()

for l in f:
	seed = int(l[:-1].split(',')[1])
	# add 1 to the index corresponding to the observed seed
	seedUsageCounts[uniqueSeeds == seed] += 1
	
f.close()

np.savetxt('seedUsageCounts.dat', seedUsageCounts,fmt='%i')

I'm sure I could improve performance a little by doing things in batches, but it's memory I was trying to manage, not time. I can leave it running overnight!
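
Once that counts file exists, working out the damage is cheap. Something like the following would do it (reusing the file names from the script above, and using the fact that every seed should appear exactly once in each of the 300 variations):

import numpy as np

uniqueSeeds = np.loadtxt('uniqueSeeds.dat', dtype=np.int64)
seedUsageCounts = np.loadtxt('seedUsageCounts.dat', dtype=np.int64)

# seeds which appear in fewer than 300 of the physics variations
incompleteSeeds = uniqueSeeds[seedUsageCounts < 300]

print(len(incompleteSeeds), 'seeds are affected')
print(300 * len(uniqueSeeds) - seedUsageCounts.sum(), 'binaries are missing in total')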

Villiers Park

I was recently accepted to teach a week-long residential astrophysics course at Villiers Park, just outside Cambridge, UK. It sounds like it's going to be loads of fun.

The basic idea is that they invite some of the best A-Level (17-18 year old) students to stay at Villiers Park for a week in September, while another physicist and I teach them a series of topics broadly related to astrophysics, as well as generally hanging out with them and showing that we're not all Sheldon Coopers.

I'm currently pencilled in to give 90-minute sessions on each of

  • Universe at the largest scales
  • Special Relativity (with a little GR at the end)
  • Gravitational Waves
  • Exoplanets

I'm genuinely excited for this!

Random Samples From a Sphere

During my work today I had to draw samples uniformly from within a sphere, and made an observation I hadn't really appreciated before. The name of the game is to make sure that every infinitesimal shell has the same density of samples on average, i.e. the expected number of samples in a shell divided by the shell's volume should not depend on r:

 \frac{\left\langle N_{samp} \right\rangle}{4\pi r^2 \, \mathrm{d}r} = \mathrm{constant}

In other words, the shells get bigger as r grows, so we need proportionally more samples at larger r. The density is constant with respect to direction, but that doesn't mean both angles are sampled uniformly: the azimuthal angle \theta is, while the polar angle \phi inherits a \sin\phi weighting from the area element, which is the same as saying \cos\phi is uniform. So, to get the correct distribution we need to sample from

 \theta \sim \mathcal{U}(0,2\pi)

 \cos\phi \sim \mathcal{U}(-1,1)

 p(r) \propto r^2

So for the radial coordinate we just sample from a power law distribution with index 2. Elementary observation? Maybe, but it makes it easier to code up!
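
Here's a minimal numpy sketch of that recipe (the radius R and the number of samples are placeholders). The cube root comes from inverting the cumulative distribution of the r^2 power law, which is (r/R)^3:

import numpy as np

N = 100000   # number of samples (placeholder)
R = 1.0      # radius of the sphere (placeholder)

theta = np.random.uniform(0.0, 2.0*np.pi, N)        # azimuthal angle, uniform
cosPhi = np.random.uniform(-1.0, 1.0, N)            # uniform in cos(phi), not in phi
r = R * np.random.uniform(0.0, 1.0, N)**(1.0/3.0)   # inverse CDF of p(r) ~ r^2

# convert to Cartesian coordinates
sinPhi = np.sqrt(1.0 - cosPhi**2)
x = r * sinPhi * np.cos(theta)
y = r * sinPhi * np.sin(theta)
z = r * cosPhi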

EWASS and PEN

Over the last couple of weeks I've attended both the PEN (Past Earth Network) emulators workshop in ultra-glamorous Leeds, UK, and the European Week of Astronomy and Space Science (EWASS) in gorgeous Prague, Czech Republic.

I made a new poster for EWASS, with a slightly more detailed account of my emulators work. I'm more or less at the paper writing stage for that project (and another), so watch this space for arXiv numbers and accessible reviews in the coming couple of months!

SAMSI Workshop on Emulation

At the beginning of April I attended the SAMSI workshop on the emulation of astrophysical populations, just outside Durham, North Carolina. I helped co-author the official SAMSI blog post, which can be found here.

I also finally submitted my proceedings paper from the Astroinformatics conference last year to the arXiv, which you can look at over at arXiv:1704.03781.

I may get around to writing a summary post about it, although things are quite busy these days. I'll see what I can do.

Chats and Virtual Networks

The Azure job manager I mentioned last time worked rather nicely. I found myself running into various usage limits, which the people on the service desk happily removed for me, until I ran up against the public IP limit. It turns out I'd been misunderstanding how to structure my queue. I have now rewritten the code so that instead of launching 60 virtual machines with 60 unique public IPs, I have a 'head node' of sorts with a public IP, and a virtual network of virtual machines hidden behind it. I just submitted a big run using this new queuing system. We'll see how it pans out.
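
In case it's useful to anyone, the networking part boils down to a couple of Azure CLI calls, which can be wrapped from Python with subprocess. Here's a rough sketch of the idea (the resource group, network and VM names are just placeholders, and this glosses over the head node's own setup):

import subprocess

resourceGroup = 'compas-rg'      # placeholder names
vnetName = 'compas-vnet'
subnetName = 'compas-subnet'

# one shared virtual network for all of the worker VMs
subprocess.check_call(['az', 'network', 'vnet', 'create',
                       '--resource-group', resourceGroup,
                       '--name', vnetName,
                       '--subnet-name', subnetName])

# launch a worker on that network with no public IP of its own,
# so it is only reachable via the head node
subprocess.check_call(['az', 'vm', 'create',
                       '--resource-group', resourceGroup,
                       '--name', 'compas-worker-00',
                       '--image', 'UbuntuLTS',
                       '--vnet-name', vnetName,
                       '--subnet', subnetName,
                       '--public-ip-address', '',
                       '--generate-ssh-keys'])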

Our department was also visited by Iain Murray of Edinburgh this week. I had seen him speak a couple of years ago at the intractable likelihood conference in Bristol, and more recently I bumped into his PhD student (who, as it turns out, also works on emulation of physical simulations) at a workshop in Warwick, so I suggested him as a speaker. He gave an excellent talk about density estimation and neural networks in astronomy contexts, and we had some interesting conversations about my own work, in which he provided some useful suggestions regarding Gaussian processes.

I also started reacquainting myself with the Gaussian process workflows I had developed before the COMPAS writing workload piled up, and I'll spend the next week working on these, as well as nursing simulations and data generation.

My Azure job manager

I've spent the last week or so investigating exactly what I can do with my Microsoft Azure grant. As I listed in my last blog post, I want to be able to split my job up into a bunch of small chunks, which I can then distribute out to some humble Linux virtual machines, and then collect the results back again.

I started by investigating Azure Batch, which is Azure's built-in way of mimicking a gigantic cluster. However, after spending a couple of days wrestling with their broken tutorial, I discovered that they don't support the service for their simplest VMs, which means I would have to use a more expensive one, effectively halving my CPU hours.

I then spent some time trying to work out how to deploy a virtual machine using their Python API. Again, the tutorials are largely broken or incomprehensible, and in the end I came across the Azure CLI (command line interface). This was a lot more intuitive (or at least, one of their CLIs is; they appear to have two distinct ones), so I set about writing some code which uses the CLI to launch virtual machines.

The code took a couple of days to write and debug, but it's in a more or less complete state. You can find it here.

Alas, it turns out I now run up against limits imposed by Microsoft, stopping me from launching more than 20 VMs concurrently. The website explaining the limits is pretty difficult to follow, so I've raised a service desk ticket to ask them to help me out.

In other news, we locked in the first "stable" version of our population synthesis code, COMPAS, which is the version we used to write the first COMPAS paper (watch this space).

GAIA Data and the Plan for Azure

First things first: I've never learned how to display code in a blog post before. I found a few tutorials suggesting the <pre> HTML tag. Let's see how it looks with the obligatory "Hello World" program;

#include <iostream>
int main()
{
    std::cout<<"Hello World!"<<std::endl;
    return 0;
}

Excellent, that looks good.

I was approached (via Reddit, of all places) to take part in a project involving data from the GAIA telescope. I heard a bit about the recent data release whilst I was in Trieste towards the end of last year, but, being predominantly a simulations guy, I've never really worked on telescope data before, so I thought it might be a nice way to cut my teeth on "real" data. I'm currently in the process of downloading the whole GAIA catalogue, using a simple little Python script which basically crawls their 'direct download' page, downloading all the data it finds.
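
The script is nothing clever; it's roughly along these lines (the URL here is just a stand-in for the real direct download page, and the file pattern is an assumption about how the archive lists its compressed CSV chunks):

import os
import re
import urllib.request

# stand-in URL for the archive's direct download page
baseURL = 'http://example.com/gaia/gaia_source/csv/'

page = urllib.request.urlopen(baseURL).read().decode('utf-8')

# find every link to a gzipped csv file and download any we don't already have
for fileName in re.findall(r'href="([^"]+\.csv\.gz)"', page):
	if not os.path.exists(fileName):
		urllib.request.urlretrieve(baseURL + fileName, fileName)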

The other plan for the next week or so is to develop a framework for running our COMPAS code in a scalable way through Microsoft Azure, to start making real use of my grant. I think that Python should be a pretty natural choice for writing the framework. The rough outline of what I want to achieve is the following;

  • Download and compile COMPAS on a simple virtual machine. Preliminary investigations suggest that I can probably get away with the simplest (A0) instance, which has 750MB of RAM and a single core.
  • Find a way of then cloning this virtual machine, ideally coupled with a configuration file so that each node can do something slightly different.
  • Collate the outputs from all of these nodes to some central place, and then, when they're all done, download them to Birmingham.
  • Automatically shut down all of my virtual machines, so that I'm not spending my allowance unnecessarily.

I'm not sure how possible all of this is. If I'm happy with the framework I end up with, I'll publish it on github, but we'll see.