As I was transitioning out of my Bachelor’s and into my Master’s, I picked up a project on the side. It boils down to “I trained a neural network to do predict some things that are useful for astrophysicists,” or more precisely “I trained conditional variational autoencoders to serve as surrogate models for the spectra of kilonovae.”

As far as machine learning goes, it’s not new, really. Even as far as astrophysics goes, it just applies a method that’s not used in this field. The thing is, though, it’s useful. It speeds up a really common analysis method something like 400 times*. This is important, first, for the discovery of kilonovae for which it latency is critical and second, for increasing the amount of knowledge that can be extracted from any given data set.

I published it as a NeurIPS Machine Learning for the Physical Sciences Workshop paper with the help of some collaborators I picked up along the way. We are also working on getting the work out as a much longer paper in an astronomy/astrophysics journal so that it reaches the intended audience. Additionally, we are making it easy to use by distributing it as a package.

If you’re looking for a fast surrogate model for the spectra of kilonovae, I recommend reading the aforementioned paper. This post is intended for a non-astrophysics audience. Here I want to share, first, the interesting science that motivated me to pick up this project, second, the one interesting machine learning detail that I came across while working on this, and third, some thoughts on working on things like this in general.

Kilonova physics

Not including the infinitely dense black holes, neutron stars are the densest objects in the universe. The merger of two neutron stars (or a neutron star and a black hole) is a perfect astrophysical event to study if you are interested in gravity, element formation, and cosmology. During the merger, the system emits gravitational waves, which we can detect and analyze to understand gravity. The system also ejects neutron-rich material, which undergoes a nucleosynthetic (atom-producing) process that forms the heaviest elements in the universe, like gold and platinum. After these atoms form, they decay and heat all the floating, ejected material, which emits photons for about 2 weeks after the merger. These photons, if we are lucky, we can see in our telescopes, and we call the temporary glow a kilonova. Looking at kilonovae can help us understand not just neutron stars themselves, but also the source of the heavy elements, fundamental principles of gravitation, and through a clever trick of combining a distance measurement from the gravitational waves and a redshift measurement from the kilonova, we can even measure the expansion rate of the universe, the Hubble constant.

Neutron star mergers are a powerful astrophysical laboratory, and we can collect insights to answer many disparate open astrophysical questions by studying them. As in almost every other scientific field, kilonova astrophysics also builds models and tries to infer the parameters of the models from data. The models that we can use for the inverse problem assume that there are 2-3 distinct types of stuff that are ripped off the neutron stars during the merger and that it is spherically symmetric. All this “stuff” is emitting photons from the surface, and we can compute how much energy is being produced at which wavelengths over time as the kilonova dims. Here is a cross-section through the center of the kilonova “sphere” that shows all the parameters of the model (based on Dietrich et al. 2020):

The input parameters of this model are the masses of purple material and the mass of the red material and blue material combined, plus the parameter \(\varphi\) which controls the relative ratio of red to blue material, and \(\theta_{obs}\), the angle at which we see the kilonova. The output is a time series of 500-dimensional spectra (or more, depending on the simulations). Here is a quick summary of the process.

Simulating this time series of spectra takes hours to days for a single set of parameters. This timescale makes our usual way of solving the aforementioned inverse problem, Bayesian inference via some sort of sampling method, basically impossible since sampling just means producing a lot of spectra, for a lot of parameter sets. To circumvent the issue, people create a surrogate model using the outputs of the physics simulations; this surrogate is then purely data-driven but still gives spectra and does it much faster than “hours to days” (e.g. a few milliseconds). The most common method in my field to build surrogates is Gaussian process regression.

I was curious if I could use some sort of neural network-based technique and maybe it would be faster. If the surrogate evaluation is faster, the Bayesian inference is faster. Bayesian inference in this field is the primary method of knowledge acquisition, meaning a faster surrogate enables more knowledge.

So, I created a surrogate model in the form of a conditional variational autoencoder to learn a mapping from the input parameters to the output spectra. I won’t give the details of the math behind VAEs because it’s already been said many times many ways, and I do not think I have anything new here to add (see here for a good resource).

The one interesting detail: variance shrinkage of Gaussian VAEs

To train a VAE, one has to choose an appropriate likelihood function for the generated and real spectra. For real-valued multi-dimensional data, one seemingly good choice would be a multidimensional Gaussian. Given some random sample \(\mathbf{z}\), a mean \(\mathbf{\mu}_{\theta}\) and variance \(\mathbf{\sigma}^2_{\theta}\) for a Gaussian distribution predicted by neural networks, the negative log-likelihood for an example spectrum centered at mean \(\mathbf{y}\) is

\[\log p_{\theta}(\mathbf{y | \mathbf{z}}) = \sum_{i=1}^{D} -\frac{1}{2\sigma_{\theta, i}^2(z)} || y_i - \mu_{\theta, i}(z) ||^2 - \frac{1}{2} \log (2 \pi \sigma_{\theta, i}^2(z))\]

However, if the model can learn the parameters \(\theta\) such that \(\mu_{theta}(z)\) provides a good reconstruction of \(y\), then the \(\log (\sigma_{\theta, i}^2(z))\) term will push the variance to zero before the \(\frac{1}{\sigma_{\theta, i}^2(z)}\) can catch up. So, you can’t actually learn a proper Gaussian likelihood, and people don’t usually try. This is why in most VAE implementations, a mean square error loss is used, setting \(\sigma = 1\) globally, or better yet, a binary cross-entropy error is used even when the target variables are not binary. This is because, in practice, optimizing the BCE loss is easier than optimizing the MSE loss in practice and all it requires is scaling your dataset to the range \([0,1]\) (and giving up any hopes of interpreting the output as a probability distribution).

No one ever talks about this because a) the usual examples are on the MNIST dataset and so BCE is always used (relevant sidenote on using BCE for continuous inputs) and b) this is just one of those little secrets that end up discussed in a paper eventually.

The cVAE works pretty well as a surrogate model for these spectra, but the consequence of variance shrinkage is that I need to rely on basic error verification techniques because I cannot learn the distribution. In our paper, we present some error metrics that are meaningful for astronomers i.e. errors on the data structures that astronomers actually observe. We also perform inference using the cVAE surrogate model for the singular kilonova we have ever observed and find that the cVAE surrogate is super fast and the parameters are good. I’m skipping details because you can go find them in the papers.

I trained a neural network and now it’s a paper

So at the end of the day, I trained a cVAE to do a thing it was meant to do, on a slightly messy dataset that required some domain knowledge to wrangle. To me, this doesn’t sound impressive or interesting (but I have also been working on this on and off for ages so maybe I’m just done with the topic).

Regardless of whether it’s impressive or interesting, it’s useful and it’s fast. It hopefully enables us to do more scientific analyses of kilonovae or maybe even be clever and incorporate Bayesian fitting into gravitational wave follow-up campaigns. Since I have spent time being curious about machine learning while doing my astrophysics degrees, I can now see solutions with techniques that my colleagues have never heard of. I like to think that I’ve proposed a good option for a small domain-specific problem while maintaining humility about the scope and limitations of our work(e.g. the errors discussed in our papers).

My collaborators and I are making the machine learning aspects of the work accessible to astrophysicists unfamiliar with machine learning, and we hope to advertise in the right places so that people actually use it. As for myself, I hope to move on to problems that are both interesting to me and useful to others.

*Obviously, your mileage may vary. This number refers to how much faster (on my laptop that is ~4 years old) is an evaluation of my surrogate model for a single input parameter set at a single time (< 1ms) with the evaluation of a Gaussian process surrogate model for a single input parameter set at a single time (~400ms). Training is not included in this, but training isn’t the process of Bayesian inference either. The surrogates are always trained beforehand and stored.

A big thank you to my collaborators Brian Nord, Zoheyr Doctor, Geert Raaijmakers, and Marcelle Soares-Santos who have contributed to this project along the way in various ways.