The largest simulation of my PhD finished a couple of days ago on our local cluster. It was a simulation of a total of 360 million binaries, combining datasets using around 300 different sets of physical assumptions. The whole lot took around 12 days running simultaneously on around 120 cores. That's around 35000 CPU hours, or just shy of 4 CPU years of processing! I will be using the simulations to try and pin down exactly how much we can hope to learn about our models of binary physics using gravitational wave observations. The paper is coming along nicely, and I'll write a summary here when it's done.

Alas, when I ran my postprocessing code to collect my results from the different bits of the cluster I sent it to, I found that I only had 359999189 binaries in my dataset (811 had gone missing!). This isn't a problem I can just ignore, since I rely on the fact that I simulate the same number of binaries, using the same initial conditions, for each setting of the physics. I need to work out what's missing. This turns out to be quite a challenge when it's a 40GB data file of binaries you have to look through!

The main challenge is memory management. I can't load the whole file into memory (I even use a little bit of swap when loading a single column). Luckily, I kept track of an ID-type variable when constructing my dataset, so for one of the columns I know exactly which unique values should appear, each repeated once for every variation of the physics.

So, in summary, I have a column which should contain 300 copies of each unique ID. The problem is therefore to go through my dataset and find out which IDs have been skipped in some of my sub-datasets (in a way that doesn't kill the RAM on my machine).

The solution I came up with is partially based on the Sieve of Eratosthenes, which is a very simple and efficient way of finding all the prime numbers up to a given limit. You can see details of the algorithm here; when I first saw an implementation of it, it blew my mind. The trick is to make a boolean array, and then mark the elements whose *indices* are prime numbers. This is easier to see in code, so here's my implementation from a few years ago in C++:

vector<bool> primeSieveUpTo(int max)
{
    // start by assuming every number is prime
    vector<bool> sieve(max+1, true);
    sieve[0] = false;
    sieve[1] = false;
    for (int i=2; i<=max; i++)
    {
        if (sieve[i])
        {
            // i is prime, so mark all of its multiples as composite
            for (int j=2*i; j<=max; j+=i) sieve[j] = false;
        }
    }
    return sieve;
}
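The same trick translates almost line-for-line into Python. Here's a quick sketch (not from my original code) with the step that reads the primes back off the boolean array made explicit:

```python
def prime_sieve_up_to(n):
    # sieve[i] will end up True exactly when i is prime
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, n + 1):
        if sieve[i]:
            # i is prime, so mark all of its multiples as composite
            for j in range(2 * i, n + 1, i):
                sieve[j] = False
    return sieve

sieve = prime_sieve_up_to(30)
# the primes are simply the indices of the True entries
primes = [i for i, is_prime in enumerate(sieve) if is_prime]
# primes == [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```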

I realised I could use a similar trick (using the indices of an array to keep track of things) in my code. Here's my whole Python script, which efficiently goes through my 360000000 line file, and counts instances of each unique seed.

import numpy as np

# these are the 1.2e6 unique values I know will be in the 2nd column of
# my data
uniqueSeeds = np.loadtxt('uniqueSeeds.dat', dtype=np.int64)
seedUsageCounts = np.zeros_like(uniqueSeeds)

# my data file
f = open('allInitial.dat')

# skip the headers
f.readline()
f.readline()

for l in f:
    # the seed is the 2nd comma-separated field on each line
    seed = int(l[:-1].split(',')[1])
    # add 1 to the index corresponding to the observed seed
    seedUsageCounts[uniqueSeeds == seed] += 1

f.close()
np.savetxt('seedUsageCounts.dat', seedUsageCounts, fmt='%i')
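Once the counts file exists, finding the culprits is a short follow-up. Here's a sketch (with small stand-in arrays in place of loading `uniqueSeeds.dat` and `seedUsageCounts.dat`, and 300 being the number of physics variations):

```python
import numpy as np

# stand-ins for np.loadtxt('uniqueSeeds.dat', ...) and
# np.loadtxt('seedUsageCounts.dat', ...) from the script above
uniqueSeeds = np.array([11, 22, 33, 44], dtype=np.int64)
seedUsageCounts = np.array([300, 299, 300, 298], dtype=np.int64)

# any seed seen fewer than 300 times has dropped binaries somewhere
missing_mask = seedUsageCounts < 300
missing_seeds = uniqueSeeds[missing_mask]
# total number of binaries that went missing across all variations
missing_total = int((300 - seedUsageCounts[missing_mask]).sum())
```

On the real data, `missing_total` should come out at 811, and `missing_seeds` tells me exactly which initial conditions to re-run.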

I'm sure I could improve performance a little by doing things in batches, but it's memory I was trying to manage, not time. I can leave it running overnight!
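If time ever did become the bottleneck, one batched option (a sketch, untested on the real 40GB file) would be to sort the seeds once and replace the per-line boolean mask with `np.searchsorted`, which finds each seed's index in O(log n) instead of scanning all 1.2e6 values:

```python
import numpy as np

# stand-in for the real uniqueSeeds array
uniqueSeeds = np.array([977, 13, 542, 101], dtype=np.int64)

# sort once up front; `order` maps sorted positions back to original ones
order = np.argsort(uniqueSeeds)
sortedSeeds = uniqueSeeds[order]
counts = np.zeros_like(uniqueSeeds)

# one batch of seeds parsed from the file (every value is assumed to be
# present in sortedSeeds, which holds for this dataset by construction)
batch = np.array([13, 542, 13, 977], dtype=np.int64)
idx = np.searchsorted(sortedSeeds, batch)

# np.add.at handles repeated indices within a batch correctly,
# unlike plain fancy-indexed `+=`
np.add.at(counts, order[idx], 1)
# counts is now in the original uniqueSeeds order: [1, 2, 1, 0]
```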