This might be the first time that training on synthetic data has clearly proven useful? I can't say I've seen many papers on training on synthetic data, but from what I've heard it basically doesn't work. The authors admit the approach still suffers from catastrophic forgetting, and it's also computationally expensive. Still, if it works at all, that's a good sign that synthetic data isn't a doomed approach.
This isn't really self-improvement, more like self...refinement? It's akin to a human learning not only by reading a book, but also by writing down a list of key points from the book and learning from those as well. You can't make LLMs learn truly novel stuff that way, but it can be useful for making them better at learning obscure info that is scarce in the training data.
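To make that analogy concrete, the "write your own notes, then study them" loop looks roughly like the sketch below. This assumes a Hugging Face causal LM; the prompt wording and the single full-parameter gradient step are my own illustration, not the paper's exact recipe (as far as I can tell they use lighter-weight LoRA updates with an RL signal on top):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative setup: any small instruct-tuned causal LM works the same way.
model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

passage = "..."  # some obscure source text the model should absorb

# Step 1: the model writes its own "notes" (synthetic training data).
prompt = f"List the key facts and implications of this passage:\n{passage}\n"
inputs = tokenizer(prompt, return_tensors="pt")
out_ids = model.generate(**inputs, max_new_tokens=256,
                         pad_token_id=tokenizer.eos_token_id)
notes = tokenizer.decode(out_ids[0, inputs["input_ids"].shape[1]:],
                         skip_special_tokens=True)

# Step 2: "study" the notes, i.e. fine-tune on them with the usual
# next-token loss (one gradient step shown for brevity).
batch = tokenizer(notes, return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```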
I'm not affiliated with the authors; I'm mainly posting this to get some technical commentary on it. Full arXiv paper here.
They use Llama 3.2 1B Instruct and claim large improvements from the learned self-edit policy on selected ARC-1 tasks (a jump from 20% to 72%, where hand-crafted self-edits set a 100% upper bound). I read the paper as a clear demonstration of an LLM selecting its own weight updates to better answer questions, and the implications feel massive (if the big labs can scale it up). I don't have the technical background for a full deep dive, however.
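For anyone who wants the gist of the mechanism without reading the paper, here is my understanding as a runnable toy sketch, not the authors' code: the model proposes a "self-edit" (its own training data), the edit is applied as a weight update, and the reward is simply whether the updated model does better on the task. All function bodies below are hypothetical stubs; only the loop structure is meant to reflect the paper, which pairs the inner updates with a ReST-EM-style outer loop as far as I can tell.

```python
import copy
import random

# Hypothetical placeholders standing in for the paper's actual components.
def generate_self_edit(model, task):
    """Model proposes its own training data ('self-edit') for the task."""
    return f"synthetic notes for {task}"  # stub

def finetune_on(model, edit):
    """Apply the self-edit as a small weight update (SFT/LoRA in the paper)."""
    return model  # stub

def evaluate(model, task):
    """Score the model on the task (e.g. held-out ARC accuracy)."""
    return random.random()  # stub

def improve(model, task, n_candidates=4):
    """One round of SEAL-style self-editing, as I understand it:
    sample candidate self-edits, keep the ones whose weight updates
    actually raise task performance, then train the edit generator
    on those winners (ReST-EM-style behavior cloning)."""
    baseline = evaluate(model, task)
    winners = []
    for _ in range(n_candidates):
        edit = generate_self_edit(model, task)
        candidate = finetune_on(copy.deepcopy(model), edit)
        if evaluate(candidate, task) > baseline:  # reward = did the update help?
            winners.append(edit)
    for edit in winners:
        model = finetune_on(model, edit)  # stand-in for the behavior-cloning step
    return model
```

The interesting design choice, to me, is that the reward never inspects the self-edit itself; it only measures downstream performance after the weight update, which is what makes this "the model selecting its own weight updates" rather than ordinary data augmentation.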