This paper came out about a week ago. I am not the author - it was published anonymously.
I learned about the paper when JJ Hepburn shared it on Slack. I just thought it seemed potentially really important and I hadn't seen it discussed on this forum yet:
Paper title: "Large Language Models Can Self-improve"
Abstract: "Large Language Models (LLMs) have achieved excellent performances in various tasks. However, fine-tuning an LLM requires extensive supervision. Human, on the other hand, may improve their reasoning abilities by self-thinking without external inputs. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate “high-confidence” rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency, and fine-tune the LLM using those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4%→82.1% on GSM8K, 78.2%→83.0% on DROP, 90.0%→94.4% on OpenBookQA, and 63.4%→67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground truth label. We conduct ablation studies and show that finetuning on reasoning is critical for self-improvement."
Makes sense that something like this would be possible. Humans can often teach themselves to be better at a skill through practice, even without a teacher or ground truth. Of course, this often runs into mostly useless, self-reinforcing attractors when the learner is unable to determine the true quality levels of their own attempts in the area of study. E.g., physics cranks, paranormal studies, etc.
I actually think it’s a good thing that chain of thought models can improve themselves like this. Chain of thought LLMs seem by far the most alignable of the plausible paths to AGI, since it directly incorporates a prior towards human-like cognition. I’d be a bit more worried about alignment if AGI approaches based on imitating / amplifying humans showed signs of flagging even at the current infra-human capabilities stage.
Further, stable self improvement seems like a particularly sensitive requirement for a scalable alignment solution. It seems much better to have the medium of this self improvement be human language, rather than something like self-modifying meta reinforcement learning from scratch.
Edited to add three more reasons why I think this is a good thing:
Crucially, this is a domain where convergence wasn't necessarily implied by the LM training objective. Language models are trained to imitate human text autoregressively given a single context. It's not a surprise when they generate human-like continuations of their provided contexts. That's what they're trained to do. However, that training objective does not directly imply that the LM should be capable of a human-esque self improvement trajectory that spans multiple prompt-completion pairs. I thus take this result as an update towards there being convergence in the higher order learning dynamics of humans and AIs.
I really appreciated this comment for making the connection between this paper and IDA.
More explicitly, to the extent that you can think of the original large language model as simulating a human, there's an analogy between:
This is a also a great chance for IDA skeptics to try to empirically demonstrate issues, either by finding failures of capabilities to scale, or by demonstrating alignment failures in amplified systems.
Definitely, but I currently feel that the vast majority of human learning comes with a ground truth to reinforce good habits. I think this is why I'm surprised this works as much as it does: it kinda feels like letting an elementary school kid teach themself math by practicing certain skills they feel confident in without any regard to if that skill even is "mathematically correct".
Sure, these skills are probably on the right track toward solving math problems - otherwise, the kid wouldn't have felt as confident about them. But would this approach not ignore skills the student needs to work on, or even amplify "bad" skills? (Or maybe this is just a faulty analogy and I need to re-read the paper)
You do need a minimum degree of competence in the domain before your own judgement is sufficient to tell the difference between good and bad attempts. Though even for children, there are domains simple enough that they can make that determination. E.g., learning to stack blocks on top of each other has an obvious failure state, and children can learn to do it through trial and error, even though there is probably not a genetically hardcoded reward circuit for correctly stacking things on top of other things.
Math is a much more complex domain where self-directed learning works well, because mathematicians can formally verify the correctness of their attempts, and so have a reliable signal to identify good attempts at proving a theorem, developing a new approach, etc.
"Anonymous" and 540B parameters, hmm... I'm sure it's not from the company named after an even larger number.
GSM8K = grade school math word problems
DROP = reading comprehension benchmark requiring discrete reasoning over paragraphs
OpenBookQA = question-answering dataset modeled after open book exams for assessing human understanding of a subject. 5,957 multiple-choice elementary-level science questions
ANLI-A3 = adversarial benchmark designed to be challenging to current state-of-the-art models
This paper renames the decades-old method of semi-supervised learning as self-improvement. Semi-supervised learning does enable self-improvement, but not mentioning the hundreds of thousands of papers that previously used similar methods obscures the contribution of this paper.
Here's a characteristic example of the field, Berthelot et al., 2019 training an image classifier on the sharpened average of its own predictions for multiple data augmentations of an unlabeled input. Another example would be Cho et al., 2019 who train a language model to generate paraphrases of an input sequence by generating many candidate paraphrases and then selecting only those whose confidence passes a specified threshold. As wunan mentioned, AlphaFold is also semi-supervised. Google Scholar lists 174,000 results for semi-supervised learning.
The authors might protest that this isn't semi-supervised learning because it's training on explanations that are known to generate the correct answer as determined by ground truth labels, rather than training on high confidence answers of unknown ground truth status. That's an important distinction, and I don't know if anybody has used this particular method before. But this work seems much more similar to previous work on semi-supervised learning than they're letting on, and I think it's important to appreciate the context.
This is not quite correct. The approach in the paper isn’t actually semi-supervised, because they do not use any ground truth labels at all. In contrast, semi-supervised approaches use a small initial set of ground truth labels, then use a larger collection of unlabeled data to further improve the model’s capabilities.
Another similar result was that AlphaFold was trained on its own high-confidence predictions for protein sequences with unknown structures:
The paper describes a method for self-improvement in LLMs. But does it work for recursive self-improvement? I haven't found any mention of recursion or multiple iterations in the paper.
The most relevant section seems to be 5.2 PUSHING THE LIMIT OF SELF-IMPROVEMENTS. Here the authors talk about their attempts to have the model use self-generated questions and self-generated few-shot Chain-of-Thought prompting. They did measure self-improvement when using self-generated questions, but the self-improvement wasn't as great as when they used training-set questions. But who cares if it's lesser self-improvement if it's an unsupervised process and you can just iterate on it to get more and more self-improvement?
I'm assuming their numbers show the self-improvement they measured for one iteration. Or is it actually the maximum self-improvement possible for any number of iterations?
Some quick thoughts after reading the paper:
The training procedure they used seems to me analogous to what would happen if a human tried to solve problems using different approaches, then learned based on what approaches converged on the same answer.
Due to the fact that no external information is being added (aside from the prompting), and that the model updates based on majority voting, this seems like it takes a network whose model of the world is very inconsistent, and forces the network to make its model of the world more consistent, leading to improvements in performance.
My weak conclusions are:
What they are doing here is:
It's quite reasonable for them to call this "self-improvement" but it's only vaguely related to the kind of self-improvement that was much discussed in 2010s-era AI safety circles. That kind of self-improvement was about an AI that can improve its own architecture. The kind of self-improvement discussed in this paper is not about improving its own architecture.
Still, its bizarre that the thing being done in this paper works at all. It's as if the mere act of narrowing the distribution of outputs for a new supervised learning task is innately homing in on the truth... whatever that means. Strange.
It seems like this could benefit the smaller labs working on LLMs and toward AGI.
Chinchilla basically made it seem like only the big-data companies would have the means to produce competitive models going forward. But if generative models can produce their own data for reliable self-improvement, that shows a way forward for companies like Anthropic who don't have massive private data sources to train on (e.g. data from YouTube or Facebook Messenger).
Just found that porby mentioned the paper in a comment from 2 days ago, but I think this is the first top-level post about it on AF/LW.
This paper is now on arXiv (in addition to OpenReview) and published non-anonymously there by Jiaxin Huang et al. from University of Illinois Urbana-Champaign and Google.