Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This paper came out about a week ago. I am not the author - it was published anonymously.

I learned about the paper when JJ Hepburn shared it on Slack. I just thought it seemed potentially really important and I hadn't seen it discussed on this forum yet:

Paper title: "Large Language Models Can Self-Improve"

Author: Anonymous

Abstract: "Large Language Models (LLMs) have achieved excellent performances in various tasks. However, fine-tuning an LLM requires extensive supervision. Human, on the other hand, may improve their reasoning abilities by self-thinking without external inputs. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate “high-confidence” rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency, and fine-tune the LLM using those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4%→82.1% on GSM8K, 78.2%→83.0% on DROP, 90.0%→94.4% on OpenBookQA, and 63.4%→67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground truth label. We conduct ablation studies and show that finetuning on reasoning is critical for self-improvement."
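For readers unfamiliar with self-consistency, here is a rough sketch (my own illustration with made-up sample data, not code from the paper) of the "high-confidence" selection step the abstract describes: sample several chain-of-thought completions per question, and keep a question only when enough samples agree on the final answer.

```python
from collections import Counter

def self_consistency_filter(samples, min_agreement=0.6):
    """samples: list of (rationale, final_answer) pairs for one question.
    Returns (majority_answer, supporting_rationales), or None when the
    samples disagree too much (a "low-confidence" question is discarded)."""
    answers = [answer for _, answer in samples]
    majority, count = Counter(answers).most_common(1)[0]
    if count / len(samples) < min_agreement:
        return None
    rationales = [r for r, answer in samples if answer == majority]
    return majority, rationales

# Toy example: four sampled chains of thought for one math question.
samples = [
    ("3*4=12, 12+5=17.", "17"),
    ("Four threes are 12; adding 5 gives 17.", "17"),
    ("3+4=7, 7*5=35.", "35"),
    ("Twelve plus five is seventeen.", "17"),
]
result = self_consistency_filter(samples)
# Majority answer "17", supported by 3 of the 4 rationales.
```

The rationales that reach the majority answer become the fine-tuning targets; questions where the samples scatter are simply dropped.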


15 comments

Makes sense that something like this would be possible. Humans can often teach themselves to be better at a skill through practice, even without a teacher or ground truth. Of course, this often runs into mostly useless, self-reinforcing attractors when the learner is unable to determine the true quality levels of their own attempts in the area of study. E.g., physics cranks, paranormal studies, etc.

I actually think it's a good thing that chain of thought models can improve themselves like this. Chain of thought LLMs seem by far the most alignable of the plausible paths to AGI, since they directly incorporate a prior towards human-like cognition. I'd be a bit more worried about alignment if AGI approaches based on imitating / amplifying humans showed signs of flagging even at the current infra-human capabilities stage.

Further, stable self improvement seems like a particularly sensitive requirement for a scalable alignment solution. It seems much better to have the medium of this self improvement be human language, rather than something like self-modifying meta reinforcement learning from scratch.

Edited to add three more reasons why I think this is a good thing:

  1. It gives us an example of self improvement on which we can run experiments. We can look for failure modes in current systems, without having to scale them up to dangerous capabilities.
  2. Of all the plausible ways to do self improvement that I can imagine, this one seems the most stable and least likely to have sudden capabilities spikes. The reason is that the basic building block of the model's self improvement is autoregressive pretraining. When the model generates a new dataset and trains on that dataset, the result is to move the model's median generation in the direction of the dataset. We can simply look at the dataset the model generated to get a pretty good idea of the direction in which this particular step of self modification will move the model. Of course, the compounding effects of multiple self improvement rounds are a different matter, but like I mention above, we can run experiments to investigate.
  3. It's a good sign that this works. Humans can improve themselves in this manner. If language models had turned out to be incapable of doing so, that would have been an example of non-convergence between LMs and humans. Instead, we see convergence.

    Crucially, this is a domain where convergence wasn't necessarily implied by the LM training objective. Language models are trained to imitate human text autoregressively given a single context. It's not a surprise when they generate human-like continuations of their provided contexts. That's what they're trained to do. However, that training objective does not directly imply that the LM should be capable of a human-esque self improvement trajectory that spans multiple prompt-completion pairs. I thus take this result as an update towards there being convergence in the higher order learning dynamics of humans and AIs.

I really appreciated this comment for making the connection between this paper and IDA.

More explicitly, to the extent that you can think of the original large language model as simulating a human, there's an analogy between:

  • asking the LLM to reason about its inputs and then training on the conclusion of the reasoning (what this paper does)
  • asking simulated humans to reason about a problem and then training a new model on their conclusions (the basic idea behind iterated distillation and amplification).

This is also a great chance for IDA skeptics to try to empirically demonstrate issues, either by finding failures of capabilities to scale, or by demonstrating alignment failures in amplified systems.

Humans can often teach themselves to be better at a skill through practice, even without a teacher or ground truth


Definitely, but I currently think the vast majority of human learning comes with a ground truth to reinforce good habits. That's why I'm surprised this works as well as it does: it feels like letting an elementary school kid teach themselves math by practicing the skills they feel confident in, without any regard to whether those skills are even "mathematically correct".

Sure, those skills are probably on the right track toward solving math problems - otherwise the kid wouldn't have felt so confident about them. But wouldn't this approach ignore the skills the student needs to work on, or even amplify "bad" ones? (Or maybe this is just a faulty analogy and I need to re-read the paper.)

You do need a minimum degree of competence in the domain before your own judgement is sufficient to tell the difference between good and bad attempts. Though even for children, there are domains simple enough that they can make that determination. E.g., learning to stack blocks on top of each other has an obvious failure state, and children can learn to do it through trial and error, even though there is probably not a genetically hardcoded reward circuit for correctly stacking things on top of other things.

Math is a much more complex domain where self-directed learning works well, because mathematicians can formally verify the correctness of their attempts, and so have a reliable signal to identify good attempts at proving a theorem, developing a new approach, etc.

"Anonymous" and 540B parameters, hmm... I'm sure it's not from the company named after an even larger number.

GSM8K = grade school math word problems

DROP = reading comprehension benchmark requiring discrete reasoning over paragraphs

OpenBookQA = question-answering dataset modeled after open book exams for assessing human understanding of a subject; it contains 5,957 multiple-choice elementary-level science questions

ANLI-A3 = adversarial benchmark designed to be challenging to current state-of-the-art models

This paper renames the decades-old method of semi-supervised learning as self-improvement. Semi-supervised learning does enable self-improvement, but not mentioning the hundreds of thousands of papers that previously used similar methods obscures the contribution of this paper. 

Here's a characteristic example of the field, Berthelot et al., 2019 training an image classifier on the sharpened average of its own predictions for multiple data augmentations of an unlabeled input. Another example would be Cho et al., 2019 who train a language model to generate paraphrases of an input sequence by generating many candidate paraphrases and then selecting only those whose confidence passes a specified threshold. As wunan mentioned, AlphaFold is also semi-supervised. Google Scholar lists 174,000 results for semi-supervised learning. 

The authors might protest that this isn't semi-supervised learning because it's training on explanations that are known to generate the correct answer as determined by ground truth labels, rather than training on high confidence answers of unknown ground truth status. That's an important distinction, and I don't know if anybody has used this particular method before. But this work seems much more similar to previous work on semi-supervised learning than they're letting on, and I think it's important to appreciate the context. 

This is not quite correct. The approach in the paper isn’t actually semi-supervised, because they do not use any ground truth labels at all. In contrast, semi-supervised approaches use a small initial set of ground truth labels, then use a larger collection of unlabeled data to further improve the model’s capabilities.

Another similar result was that AlphaFold was trained on its own high-confidence predictions for protein sequences with unknown structures:

The AlphaFold architecture is able to train to high accuracy using only supervised learning on PDB data, but we are able to enhance accuracy (Fig. 4a) using an approach similar to noisy student self-distillation [35]. In this procedure, we use a trained network to predict the structure of around 350,000 diverse sequences from Uniclust30 [36] and make a new dataset of predicted structures filtered to a high-confidence subset. We then train the same architecture again from scratch using a mixture of PDB data and this new dataset of predicted structures as the training data, in which the various training data augmentations such as cropping and MSA subsampling make it challenging for the network to recapitulate the previously predicted structures. This self-distillation procedure makes effective use of the unlabelled sequence data and considerably improves the accuracy of the resulting network.
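The filtering step in that quote can be sketched roughly like this (my own illustration, not AlphaFold's actual code; `predict` stands in for a trained model that also reports a confidence score):

```python
def build_distillation_set(predict, unlabeled_inputs, threshold=0.9):
    """Run a trained model over unlabeled inputs and keep only the
    high-confidence predictions as pseudo-labels, noisy-student style.
    `predict` is assumed to return a (prediction, confidence) pair."""
    kept = []
    for x in unlabeled_inputs:
        prediction, confidence = predict(x)
        if confidence >= threshold:
            kept.append((x, prediction))
    return kept
```

The model is then retrained from scratch on a mixture of the original labeled data and this pseudo-labeled set, with augmentations making it hard to simply memorize its own earlier predictions.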

The paper describes a method for self-improvement in LLMs. But does it work for recursive self-improvement? I haven't found any mention of recursion or multiple iterations in the paper.

The most relevant section seems to be 5.2 PUSHING THE LIMIT OF SELF-IMPROVEMENTS. Here the authors talk about their attempts to have the model use self-generated questions and self-generated few-shot Chain-of-Thought prompting. They did measure self-improvement when using self-generated questions, but the self-improvement wasn't as great as when they used training-set questions. But who cares if it's lesser self-improvement if it's an unsupervised process and you can just iterate on it to get more and more self-improvement?

I'm assuming their numbers show the self-improvement they measured for one iteration. Or is it actually the maximum self-improvement possible for any number of iterations? 

Some quick thoughts after reading the paper:

The training procedure they used seems to me analogous to what would happen if a human tried to solve problems using different approaches, then learned based on what approaches converged on the same answer.

Since no external information is being added (aside from the prompting), and the model updates based on majority voting, this procedure seems to take a network whose model of the world is very inconsistent and force that model to become more consistent, leading to improvements in performance.

  • One assumption here is that, if you start off with a world model that's vaguely describing the real world, and force consistency on it, it will become a more accurate world model. I think this is very likely to be true.

My weak conclusions are:

  • Curated data for fine-tuning is now less of a bottleneck, as human-made tailored data (made by MTurk or undergrads) can be partly replaced with data that the network outputs (after training it on a large corpus).
  • Compute also seems less of a bottleneck as "self-improvement" leads to an order of magnitude fewer parameters needed for the same performance.
  • These two (plus the incoming wave of people trying to replicate or improve on the methods in the paper) would imply slightly shorter timelines, and much shorter timelines in worlds where most of the data bottleneck is in data for finetuning.
  • This might be good for alignment (ignoring timelines getting shorter) as Chain-of-Thought reasoning is more easily interpretable, and if we can actually manage to force a model to do chain of thought reasoning and have it match up with what it's outputting, this would be a big win.

What they are doing here is:

  1. Train a large language model unsupervised in the standard way
  2. Repeat:
    1. Take some supervised learning problem, throw away the supervised outputs, and repeat:
      1. Pick a supervised input and use the language model to sample N different outputs
      2. Select a "consensus" output among the N sampled outputs
      3. Store this "consensus" output as the new "target output" for the selected input
    2. Fine-tune the language model in a supervised way using the input/output pairs generated above
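The steps above can be sketched in code. This is a toy sketch under assumed interfaces - `model.sample(q)` returning a (rationale, answer) pair and `model.fine_tune(pairs)` doing a supervised update are my invented stand-ins, not the paper's actual API:

```python
from collections import Counter

def self_improve(model, questions, n_samples=32, rounds=1):
    """One or more rounds of the loop: sample N outputs per question,
    keep the rationales that reach the consensus answer as new targets,
    then fine-tune on the (question, target) pairs."""
    for _ in range(rounds):
        pairs = []
        for q in questions:
            outputs = [model.sample(q) for _ in range(n_samples)]
            # "Consensus" here is a simple majority vote over final answers.
            consensus, _ = Counter(a for _, a in outputs).most_common(1)[0]
            targets = [f"{rationale} The answer is {consensus}."
                       for rationale, answer in outputs if answer == consensus]
            pairs.extend((q, t) for t in targets)
        model.fine_tune(pairs)
    return model
```

Note that nothing outside the model ever supplies a correct answer; the only signal is agreement among the model's own samples.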

It's quite reasonable for them to call this "self-improvement", but it's only vaguely related to the kind of self-improvement that was much discussed in 2010s-era AI safety circles, which was about an AI improving its own architecture. The method in this paper leaves the architecture untouched.

Still, it's bizarre that the thing being done in this paper works at all. It's as if the mere act of narrowing the distribution of outputs for a new supervised learning task is innately homing in on the truth... whatever that means. Strange.

It seems like this could benefit the smaller labs working on LLMs and toward AGI.

Chinchilla basically made it seem like only the big-data companies would have the means to produce competitive models going forward. But if generative models can produce their own data for reliable self-improvement, that shows a way forward for companies like Anthropic who don't have massive private data sources to train on (e.g. data from YouTube or Facebook Messenger).

I just thought it seemed potentially really important and I hadn't seen it discussed on this forum yet

Just found that porby mentioned the paper in a comment from 2 days ago, but I think this is the first top-level post about it on AF/LW.

This paper is now on arXiv (in addition to OpenReview) and published non-anonymously there by Jiaxin Huang et al. from University of Illinois Urbana-Champaign and Google.