GAIA Data and the Plan for Azure

First things first; I've never learned how to display code in a blog post before. I found a few tutorials suggesting the <pre> html tags. Let's see how they look with the obligatory "Hello World" program;

#include <iostream>
int main()
    std::cout<<"Hello World!"<<std::endl;
    return 0;

Excellent, that looks good. I was approached (via Reddit, of all places) to take part in a project involving data from the GAIA telescope. I heard a bit about the recent data release whilst I was in Trieste towards the end of last year, but, being predominantly a simulations guy, I've never really worked on telescope data before, so I thought it might be a nice way to cut my teeth on "real" data. I'm currently in the process of downloading the whole GAIA catalogue, using a simple little python script which basically crawls their 'direct download' page downloading all the data it finds.

The other plan for the next week or so is to develop a framework for running our COMPAS code in a scalable way through Microsoft Azure, to start making real use of my grant. I think that Python should be a pretty natural choice for writing the framework. The rough outline of what I want to achieve is the following;

  • Download and compile COMPAS on a simple virtual machine. Preliminary investigations suggest that I can probably get away with the simplest (A0) core, which has 750MB of RAM and a single core.
  • FindĀ  a way of then cloning this virtual machine, ideally coupled with a configuration file so that each node can do something slightly different.
  • Collate the outputs from all of these nodes to some central place, and then, when they're all done, download them to Birmingham.
  • Automatically shut down all of my virtual machines, so that I'm not spending my allowance unnecessarily.

I'm not sure how possible all of this is. If I'm happy with the framework I end up with, I'll publish it on github, but we'll see.

 \sqrt{3} or not  \sqrt{3} , that is the question.

I found a subtle bug in our code at the end last week, concerning how we make draws from a Maxwell-Boltzmann distribution. We use this distribution to determine the strength of the kick given to neutron stars and black holes from the asymmetry of their supernova. The distribution looks like this;

P(v_\textrm{kick} | \sigma_\textrm{kick}) = \sqrt{\frac{2}{\pi}} \; \frac{v_\textrm{kick}^2}{\sigma_\textrm{kick}^3} \exp\left(\frac{-v_\textrm{kick}^2}{2\sigma_\textrm{kick}^2}\right)

Where \sigma_\textrm{kick} is a scale parameter known as the velocity dispersion, and is also known as the root-mean-squared (RMS) velocity. That's where the confusion started.

I wrote a small unit test to check the sample mean, variance and RMS values that we draw match up with the analytical values. The RMS was bang on, but the mean and variance were way off. We noticed, however, that the sample mean was consistently \sqrt{3} away from the analytical one. I dug into the code and discovered that we were deliberately dividing the drawn velocities by \sqrt{3}. I asked my colleague why this was, and he said it was to make the sample RMS match \sigma_\textrm{kick}, since that is known as the "RMS velocity" after all.

It turns out that this would indeed be the right thing to do... in 1 dimension. I tidied up the code and reinserted the factor of sqrt{3} and made a vow to always call \sigma_\textrm{kick} the dispersion and not RMS velocity in future to avoid confusing myself.

I <3 Microsoft

I've had a very productive week and a bit since my last post. I've managed to more or less finish my hacking of AMUSE. I managed to change the underlying hyperparameters in SeBa (the population synthesis code) from the AMUSE side (meaning I can write nice python scripts for everything).

I also developed a Python framework for distributed computing on our ht_condor cluster (or at least I re-purposed one of my earlier frameworks). It's pretty crude, but it works. I'm more or less there with this work now, I only need to work out how the random numbers are seeded from AMUSE->SeBa.

I then started working on a unit-testing framework for our in-house population synthesis code (COMPAS). I chose the Boost.Test testing framework. I like the framework, it's super easy to write. Alas, not so easy to compile. However, after around a day and a half, I found the magic command;

ls *.o | grep -v main.o | xargs ld -r -o unitTests/COMPAS.o

See, theĀ  problem was that Boost.Test makes its own main(), meaning that linking blindly gives us issues with multiple main declarations. The above command lists all of the .o files except main.o, then links them. I'm embarrassed how long it took me to find this command.

Finally, I had some amazing news on Thursday... I've been awarded $20000 of Microsoft azure credit for my research. Chuffed doesn't even cover it. So the next few weeks will be setting myself up there.

Finding A MUSE

Shortly before the Christmas break, we had a school colloquium from Simon Portegies Zwart from Leiden. He gave a fantastic talk about our sun's stellar siblings and some potential explanations for the orbits of some of the massive bodies in the far reaches of our solar system.

Around about the same time, it became clear that I would need to invest some time in an alternative population synthesis code to our in-house one, so my supervisor suggested I take a look at SeBa, a code written by none other than Simon Portegies Zwart. I had a quick glance at it before the break, but other deadlines were more pressing, so I never really got to have a thorough look at it until this week.

SeBa is only available through Simon's AMUSE (Astrophysical Multipurpose Software Environment) platform, so I forked that repo and set to work trying to understand how everything works, and bend it to my will. I've managed to produce and evolve a binary population, and I've also managed to hack in the ability to vary the population hyperparameters from the higher level AMUSE code.

I have mixed opinions about AMUSE at this early stage. It's clearly an amazing piece of work if you want to interface mature codes together, however it was more challenging that I would have liked it to be to pick up the high level syntax and to modify the included codes.

That being said, it's only 3 days in, and I haven't run screaming yet.

New Year, New Start

So I didn't exactly keep up with my plot of the week idea. It turned out that I wasn't really producing a pretty plot every week, and then when I let the blog slip for a couple of weeks, that then subconsciously gave me the excuse to let it slip altogether.

However, we have just entered the new year, and so it's time to make some well-meaning resolutions! I resolve to blog more often, even if it's a single sentence or so. Perhaps I'll try and devote myself to a lengthier post every so often, however, I'm not going to be so optimistic this time!

I'll kick things off with a summary of today's activities. It's bee the first day back of the enw year (although I'm the only one in the office...). It's been a slow start, basically revamping this website and having a bit of a tidy-up (both of my physical desk and of the files on my laptop).

"Plot" of the Week 12th - 16th August

Alas, the first week where I haven't managed to produce any plot worthy of this blog. I spent most of the week dividing my time between helping to debug various aspects of our population synthesis code (specifically my bit, which writes the output to the HDF5 format) and also improving the efficiency and usability of my embarrassing parallelisation code.

That isn't to say that my code is embarrassing. The idea of embarrassing (AKA trivial) parallelisation is to chunk up your code in such a way that the chunks don't need to talk to each other at all. Population synthesis is obviously very amenable to this, since every binary is independent. I've written some code that divides up a very large population into manageable sized pieces and then submits them all to the queue of our local cluster.

I'll make sure I actually have a plot to show next week!

Plot of the Week 5th - 9th September

I spent this week attending the ICIC Bayesian data analysis workshop at Imperial College in London. I learned a load of useful stuff, but interesting plots were thin on the ground!

One thing I was really pleased to have gotten out of the workshop was a greater understanding of the Hamiltonian Monte Carlo (HMC) sampling algorithm, and I was even able to implement the sampler myself (although it's the worlds most delicate sampler, and fails to converge if there's a strong breeze outside).

As part of the workshop, we were given a dataset of supernova observations, from which we were able to infer the values of a handful of cosmological parameters. I don't claim to know a whole lot about cosmology, but corner plots are pretty, and I made this one using my handwritten HMC sampler.


Plot of the Week 29th August - 2nd September

This week's plot isn't work related, since I spent the whole week at the CRISM masterclass in sparse regression at the University of Warwick, which was basically an introduction to both ridge regression and lasso regression. It was hard work; since it was organised by mathematicians there was a lot more focus on optimality conditions and proofs of various things, however I think I got the basics of what was going on.

In the breaks between lectures, I was messing around with a dataset from my own life; my chess record! I've plotted up all 86 rated games I've played in my lifetime together with my official ECF grade (calculated twice per year), together with my 'live grade', where I calculate a new ECF grade for myself after every game.

There are things wrong with this plot. One would expect my 'live grade' to match up with my official one. There are a few reasons for why this could be. I neglected a couple of results of ungraded players (the rules for calculating grades aren't clearly defined in this case), I think my grade may have officially expired in the big gap in the middle (I don't take this into account right now) and I may have misunderstood or miscoded the calculation.

In any case, my first match of the new season is less than a week away, and I'm planning on making a new page on this site dedicated to this graph (and maybe others), and keep them up to date.


Plot of the Week 22nd - 25th August

This week I spent a lot of time worrying about inferring derivatives from observed data. This weeks plot was generated by the following method. Consider a Taylor series of a function of two variables

 f(x +\delta x, y +\delta y) = f(x,y) \; +\; \delta x \frac{\partial f}{\partial x} \; +\; \delta y \frac{\partial f}{\partial y} \; +\; \frac{1}{2}(\delta x)^2 \frac{\partial^2 f}{\partial x^2} \; +\; \frac{1}{2}(\delta y)^2 \frac{\partial^2 f}{\partial y^2} \; +\; \delta x\delta y \frac{\partial^2 f}{\partial x \partial y} + O(\delta^3)

And we have a set of observations \{f(x_i,y_i\}, then we can straightforwardly calculate a set of 'difference' observations \{f(x_i,y_i),f(x +\delta x, y +\delta y),\delta x_i,\delta y_i, \delta x_i^2, \delta y_i^2\}, then we can do a multiple linear regression on the derivatives

 f(x +\delta x, y +\delta y) = f(x,y) \; +\; \delta x A \; +\; \delta y B \; +\; \frac{1}{2}(\delta x)^2 C \; +\; \frac{1}{2}(\delta y)^2 D \; +\; \delta x\delta y E + F

solving for \{A,B,C,D,E,F\}. The results are quite encouraging, although quite sensitive to the 'window', or the range of distances used to calculate the derivatives. The plot below shows the fraction of merging binaries as a function of supernova kick velocity (the Newton's 3rd law style kick from asymmetric supernovae) and the common envelope efficiency (the fraction of orbital energy available to throw off the shared envelope of a binary), with the estimated direction of steepest ascent in the top panel.



Plot of the week 15th - 19th August

So I thought I'd try a new series, where I'll try and write a short paragraph about a plot from my research each week. Over the last week or so, I've been simultaneously recovering from a week off, and having another look at my first project, to do with Continuous Auto-Regressive Moving Average (CARMA) models. I had written some code to implement a new way of expressing and computing the likelihood function for CARMA as a model for correlated noise. We have known for a long time that our method is plagued by numerical instability, however we didn't know exactly why until now.

It turns out that since the CARMA model essentially says that you can capture the properties of correlated noise by looking only at the last few time samples (together with a couple of global statistics). What we discovered is that if these local windows are sampled at the Nyquist frequency (or some integer multiple thereof), then a certain matrix in our calculation goes singular, and all bets are off. The plot below shows how the log likelihood deviates from the correct value as you approach one of these points.