Currently, we do not have a good theoretical understanding of how or why neural networks actually work. For example, we know that large neural networks are sufficiently expressive to compute almost any kind of function. Moreover, most functions that fit a given set of training data will not generalise well to new data. And yet, if we train a neural network we will usually obtain a function that gives good generalisation. What is the mechanism behind this phenomenon?
There has been some recent research which (I believe) sheds some light on this issue. I would like to call attention to this blog post:
This post provides a summary of the research in these three papers, which provide a candidate for a theory of generalisation:
(You may notice that I had some involvement with this research, but the main credit should go to Chris Mingard and Guillermo Valle-Perez!)
I believe that research of this type is very relevant for AI alignment. It seems quite plausible that neural networks, or something similar to them, will be used as a component of AGI. If that is the case, then we want to be able to reliably predict and reason about how neural networks behave in new situations, and how they interact with other systems, and it is hard to imagine how that would be possible without a deep understanding of the dynamics at play when neural networks learn from data. Understanding their inductive bias seems particularly important, since this is the key to understanding everything from why they work in the first place, to phenomena such as adversarial examples, to the risk of mesa-optimisation. I hence believe that it makes sense for alignment researchers to keep an eye on what is happening in this space.
If you want some more stuff to read in this genre, I can also recommend these two posts:
Recent Progress in the Theory of Neural Networks
Understanding "Deep Double Descent"
EDIT: Here is a second post, which talks more about the "prior" of neural networks:
Deep Neural Networks are biased, at initialisation, towards simple functions