Great report! I found the argument that ELK is a core challenge for alignment quite intuitive and compelling.
To build more intuition for what a solution to ELK would look like, I’d find it useful to talk about current-day settings where we could attempt to empirically tackle ELK. AlphaZero seems like a good example of a superhuman ML model where there’s significant interest (and some initial work: https://arxiv.org/abs/2111.09259) in understanding its inner reasoning. Some AlphaZero-oriented questions that occurred to me:
A separate question that’s a bit further afield: is it useful to think about eliciting latent knowledge from a human? For example, I might imagine sitting down with a Go expert (perhaps entirely self-taught, so they don’t have much experience explaining their play to other humans), playing some games with them and trying to understand why they’re making certain decisions. Which aspects of the ELK problem does this scenario capture, and which does it miss?
Evan's response (copied from a direct message, before I was approved to post here):
It definitely makes sense to me that early stopping would remove the non-monotonicity. I think a broader point of interest about double descent, though, is what it says about why bigger models are better. That is, not only can bigger models fit larger datasets; according to the double descent story, there's also a meaningful sense in which bigger models have better inductive biases.
The idea I'm objecting to is that there's a sharp change from one regime (larger family of models) to the other (better inductive bias). I'd say that both factors smoothly improve performance over the full range of model sizes. I don't fully understand this yet, and I think it would be interesting to understand how bigger models and better inductive bias (from SGD + early stopping) come together to produce this smooth improvement in performance.
One caveat worth noting about double descent: it only appears if you train far longer than necessary, i.e. "train forever".
If you regularize with early stopping (stop when the performance on some validation set stops improving), the effect is not present. Since we use early stopping in all realistic settings, performance always improves monotonically with more data / bigger models.
To rephrase, analyzing the weird point where models reach zero training loss will produce confusing results. The early stopping point exhibits no such weird non-monotonic behavior.
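To make the early-stopping rule above concrete, here is a minimal sketch of validation-based early stopping. It assumes a generic training loop that produces one validation loss per epoch; the function name `early_stopping_epoch`, the `patience` parameter (how many non-improving epochs to tolerate before halting), and the example loss curve are all hypothetical, chosen for illustration rather than taken from any particular library.

```python
from typing import Iterable, Tuple


def early_stopping_epoch(val_losses: Iterable[float], patience: int = 3) -> Tuple[int, float]:
    """Return (best_epoch, best_loss), i.e. the checkpoint early stopping keeps.

    Hypothetical helper: training halts once the validation loss has failed
    to improve for `patience` consecutive epochs, and later epochs never run.
    """
    best_epoch, best_loss = -1, float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember this checkpoint.
            best_epoch, best_loss = epoch, loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop training; restore the checkpoint from best_epoch
    if best_epoch < 0:
        raise ValueError("val_losses must be non-empty")
    return best_epoch, best_loss


# Hypothetical validation curve: loss falls, then rises as the model overfits.
curve = [1.0, 0.7, 0.5, 0.45, 0.47, 0.5, 0.55, 0.6]
print(early_stopping_epoch(curve, patience=3))  # (3, 0.45)
```

The point relevant to the argument is that the returned checkpoint comes from the validation minimum, not from the late-training regime where the model reaches zero training loss. The choice of `patience` is an assumption; a larger value lets training run further past the minimum before halting.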