Authors: Aditya Shrivastava*, Allison Qi*, Callum Canavan*, Tianyi Alex Qiu, Jonathan Michala, Fabien Roger (*Equal contributions, reverse alphabetical)
Wen et al. introduced the internal coherence maximization (ICM) algorithm for unsupervised elicitation of base models. They showed that for several datasets, training a base model on labels generated by their algorithm gives similar test accuracy to training on golden labels. To understand which aspects of ICM are most useful, we ran a couple of simple unsupervised elicitation methods that leverage some of the factors that might make ICM work. We compared these baseline methods to training on golden labels for both in-context learning and iterative fine-tuning, using the same datasets as Wen et al. and similar hyperparameters.
We find that:
Just using few-shot prompts with random labels recovers 53–93% of the gap between zero-shot accuracy and many-shot prompting with golden labels, and iterative fine-tuning on labels created with this baseline recovers 62–96% of the gap between untrained models and golden fine-tuned models.
The most useful aspects of ICM are:
bootstrapping (using predictions from one iteration of few-shot prompting as few-shot examples in the next iteration)
enforcing logical consistency of predictions.
A simple method which combines these recovers 83–100% of the gap between zero-shot accuracy and many-shot prompting with golden labels, and iterative fine-tuning with this method recovers 91–99% of the gap between untrained models and golden fine-tuned models.
These results do not hold if we increase the size of the training set from ~2k data points (as in Wen et al.) to ~30k data points: golden fine-tuning performance increases with dataset size more than unsupervised elicitation performance.
This makes sense: larger fine-tuning runs likely teach the model something new rather than just eliciting existing capabilities.
There is no strong reason to expect these simple techniques to elicit superhuman knowledge from very powerful base models, e.g. because these techniques may fail in real applications where some consistent and salient human beliefs are wrong. We’ll explore more challenging datasets one can use to evaluate unsupervised elicitation methods in upcoming work.
Results summary
Five components that could contribute to ICM's high performance are:
Few-shot prompting with random labels: Few-shot examples make task concepts (and the task format/output tokens) more salient to the model by providing concrete examples, even if the example labels are random. The initialization step of the ICM algorithm samples a small number of data points and assigns random labels to them, which are then used as few-shot examples in subsequent steps.
Bootstrapping of predictions: We hypothesized that the ICM algorithm bootstraps performance by adding increasingly more accurate predictions to the model’s context window, creating a feedback loop for improvement.
Bootstrapping on confident predictions: We hypothesized that the ICM rejection conditions allow the model to label easy questions first before moving onto hard questions. Using more accurate predictions as few-shot examples to label progressively more difficult examples may give better performance than labelling in a random order.
Logical consistency of predictions: The labels produced by ICM are constrained to be logically consistent with each other. This narrows down the space of possible labels, making correct sets of predictions more likely.
Mutual predictability of labels: ICM optimizes the probability of each label conditioned on all other labels (where conditioning is done via few-shot prompting of the base model); a sketch of this scoring is given after this list.
Note: Bootstrapping methods can indirectly rely on mutual predictability even when they don't optimize for it directly as ICM does, because new labels are chosen by conditioning on existing labels (which often increases mutual predictability).
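To make the mutual predictability component concrete, here is a minimal sketch of how such a score can be computed. The prompt format and the label_logprob helper (which returns the base model's log-probability of a label given a prompt) are hypothetical stand-ins; this is an illustration rather than the exact ICM implementation.

```python
# Minimal sketch of a mutual predictability score (illustration only, not the
# exact ICM implementation). `label_logprob(prompt, label)` is a hypothetical
# helper returning the base model's log-probability of the label given the prompt.

def build_fewshot_prompt(examples, labels, query):
    """Concatenate labelled examples into a few-shot prompt ending with the query."""
    shots = "\n\n".join(f"{x}\nLabel: {y}" for x, y in zip(examples, labels))
    return f"{shots}\n\n{query}\nLabel:"

def mutual_predictability(examples, labels, label_logprob):
    """Sum over i of log P(label_i | all other labelled examples), via few-shot prompting."""
    total = 0.0
    for i in range(len(examples)):
        others_x = examples[:i] + examples[i + 1:]
        others_y = labels[:i] + labels[i + 1:]
        prompt = build_fewshot_prompt(others_x, others_y, examples[i])
        total += label_logprob(prompt, labels[i])
    return total
```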
We created simple algorithms based on combinations of these components excluding mutual predictability (though it might be indirectly optimized by bootstrapping). Here is a drawing of the algorithm for each individual component:
Below is a summary of our ablation results compared with ICM performance (purple). The numbers in a method’s name indicate which of the above ICM components it uses.
Legend:
Baseline methods:
Zero-Shot: Zero-shot predictions from the untrained model
Bootstrap (2): Use predictions from one round of few-shot prompting as few-shot examples in another round, and repeat
Confident Bootstrap (2+3): Same as bootstrapping, but when selecting predictions to use as few-shot examples in the next iteration, only select from the most confident predictions
Consistent Zero-Shot (4): Enforce logical consistency of zero-shot predictions: for each problem in the dataset, predict True for the answer with the highest confidence and for any others that agree with it, and False for any that contradict it (see Appendix B for details)
Consistent Random Few-Shot (1+4): Enforce logical consistency of Random Few-Shot predictions
Consistent Bootstrap (2+4): Enforce logical consistency of predictions after each iteration of Bootstrap
Few-shot => Use the unsupervised elicitation method to label training set examples (or for random few-shot, just generate random labels), and use these in few-shot prompts to predict test set labels.
Fine-tune => Use an unsupervised elicitation method to label a subset of the training set (we used subset size 512), fine-tune the language model on the labeled examples, and repeat for another subset using the updated model. Then after 3 epochs of the full training set, use the fine-tuned model to predict test set labels zero-shot.
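As a concrete reading of the fine-tune procedure, here is a minimal sketch of the iterative loop. The elicit_labels and finetune helpers are hypothetical placeholders; our actual training code differs in details such as batching and optimizer settings.

```python
# Sketch of the iterative fine-tuning loop (hypothetical helpers; illustrative only).
import random

def iterative_finetune(model, train_set, elicit_labels, finetune,
                       subset_size=512, epochs=3):
    for _ in range(epochs):
        order = list(range(len(train_set)))
        random.shuffle(order)
        for start in range(0, len(order), subset_size):
            batch = [train_set[i] for i in order[start:start + subset_size]]
            labels = elicit_labels(model, batch)    # e.g. consistent zero-shot predictions
            model = finetune(model, batch, labels)  # one round of SFT on the labelled subset
    return model
```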
GSM8K and Alpaca datasets have over ten times more training examples available with golden labels than were used in Wen et al.’s experiments. Below is a comparison of performance between some of our baseline methods and golden labels for each dataset, for both small training set sizes and full training set sizes. We also show the ICM performance reported by Wen et al. for the small training set size.
Golden fine-tuning increases by ~10pp when using the large training sets, whereas the baseline methods we tested didn't increase by as much (in most cases they didn't increase at all), or collapsed to ~50–55% accuracy within 1 or 2 epochs. It is possible to avoid this collapse by generating one static set of labels and fine-tuning on that instead of generating new labels between gradient steps, but for consistent zero-shot this performs significantly worse than iterative fine-tuning when using small training sets, and it still does not give any improvement for consistent random few-shot/bootstrapping on large training sets compared to small ones. It is possible that ICM also performs worse than golden labels when using large training sets, though we have not tested this.
All results reported here are under idealized conditions for unsupervised elicitation (i.e. train sets are approximately balanced, there are no prevalent features with more salience than ground truth, and all examples have an objective ground truth label). We will soon release a paper applying bootstrapping and other unsupervised elicitation methods in more realistic settings, where we find they perform significantly worse.
* Values for ICM performance are taken from Wen et al., which used a prompt format that might degrade performance relative to our values. We were not able to re-run ICM in a way that fixes this issue due to reproducibility issues explained in Appendix D.
† For TruthfulQA, some of the ICM and consistency performance might be due to leakage (see Appendix C).
Datasets
GSM8K: candidate solutions to grade school math problems;
Alpaca: pairs of responses to user queries which have been ranked by humans for helpfulness and harmlessness;
TruthfulQA: candidate responses to questions associated with human misconceptions;
Gender: blog posts from the Blog Authorship Corpus with labels indicating the gender of the author (see Appendix A for results on this dataset)
Llama 3.1 8B is used as the base model for GSM8K, and Llama 3.1 70B is used as the base model for Alpaca, TruthfulQA, and Gender.
Random few-shot
Previous work shows that giving a base model examples of a task via few-shot prompting elicits improved capabilities regardless of label quality. Here, we:
Randomly sample examples from the train set.
Assign random labels such that half are True and half are False (we also tried sampling labels independently at random; the results were similar).
Insert these (example, label) pairs into a few-shot prompt for each test set sample, and use next-token logits to determine truth labels.
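Here is a minimal sketch of this random few-shot procedure. The truth_logits helper (returning the model's next-token logits for " True" and " False") and the prompt format are hypothetical stand-ins for our actual implementation.

```python
# Sketch of random few-shot prompting (hypothetical `truth_logits` helper and
# illustrative prompt format).
import random

def random_fewshot_predict(train_examples, test_examples, truth_logits, n_shots=2):
    shots = random.sample(train_examples, n_shots)
    # Balanced random labels: half True and half False, in shuffled order.
    labels = ["True"] * (n_shots // 2) + ["False"] * (n_shots - n_shots // 2)
    random.shuffle(labels)
    prefix = "\n\n".join(f"{x}\nLabel: {y}" for x, y in zip(shots, labels))
    preds = []
    for x in test_examples:
        true_logit, false_logit = truth_logits(f"{prefix}\n\n{x}\nLabel:")
        preds.append(true_logit > false_logit)
    return preds
```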
For Alpaca, 1-shot with a random label is significantly more accurate than zero-shot and is within 1pp of few-shot with golden labels. For GSM8K and TruthfulQA, random 4-shot was ~10pp and ~13pp more accurate than zero-shot, respectively. For Random Few-Shot in the results summary, we report values for 2 shots.
Enforcing consistency of predictions
In the 4 datasets used here, data points have identifiers indicating what combinations of true and false labels are logically possible. For example, if there are 2 responses to the same math question in GSM8K with different numerical answers, they cannot both be true, so we assign true only to the response with the highest model confidence (if it's above 0). Details on how we enforce consistency of predicted labels for each dataset are given in Appendix B.
Below we show the improvements in few-shot prediction accuracy after enforcing consistency of predicted labels at test time.
When benchmarking the performance of unsupervised elicitation methods, we cannot actually enforce consistency at test time because that would assume we have access to multiple candidate answers to each problem of interest, which might not be available (hence why there are no consistent zero-/random few-shot values in the few-shot part of the results summary). However, we can enforce consistency of training set predictions between iterations of bootstrapping, and prior to gradient steps during fine-tuning. As shown in the results summary, we found that fine-tuning on zero-shot predictions with enforced consistency alone was enough to almost match golden fine-tuning performance on the Alpaca and GSM8K datasets, but not for TruthfulQA.
* GSM8K experiments in this plot were run on the train set, which is why the values differ from those in plots in other sections. The test set has only 2 candidate responses per math question (always one right, one wrong), so we use the training set here to give a more realistic illustration of the impact of enforcing consistency.
Bootstrapping few-shot predictions
For Alpaca, using random labels is roughly as good as using golden labels for few-shot prompting. However for GSM8K, using random labels caps performance at ~61% (4 shots), whereas few-shot prompting with golden labels continues improving up to ~68% (16 shots).
We tried using only the bootstrapping aspect of ICM to bridge this gap, by using the model’s predictions from one round of inference as few-shot labels in the next. We used the following algorithm (a code sketch follows these steps):
Get zero-shot predictions on a random subset of the train set.
Iterate over number of shots n (e.g. n=8, 32):
Randomly select another subset of the train set.
Create n-shot prompts using examples and predictions from the previous iteration (randomly sample n predictions s.t. half are True and half False).
Use these n-shot prompts to predict labels for the new subset.
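Here is a minimal sketch of this bootstrapping loop; the truth_logits helper and prompt format are the same hypothetical stand-ins as above, and the subset size and shot schedule are illustrative.

```python
# Sketch of bootstrapped few-shot labelling (hypothetical helper, illustrative sizes).
import random

def bootstrap_labels(train_set, truth_logits, shot_schedule=(8, 32), subset_size=512):
    # Iteration 0: zero-shot predictions on a random subset of the train set.
    labelled = []
    for x in random.sample(train_set, subset_size):
        t, f = truth_logits(f"{x}\nLabel:")
        labelled.append((x, t > f))
    for n in shot_schedule:
        # Balanced few-shot examples drawn from the previous iteration's predictions.
        trues = [p for p in labelled if p[1]]
        falses = [p for p in labelled if not p[1]]
        k = min(n // 2, len(trues), len(falses))
        shots = random.sample(trues, k) + random.sample(falses, k)
        random.shuffle(shots)
        prefix = "\n\n".join(f"{x}\nLabel: {'True' if y else 'False'}" for x, y in shots)
        # Predict labels for a new random subset using the n-shot prompt.
        labelled = []
        for x in random.sample(train_set, subset_size):
            t, f = truth_logits(f"{prefix}\n\n{x}\nLabel:")
            labelled.append((x, t > f))
    return labelled
```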
Below we compare the performance when we do few-shot prompting with labels taken either from one iteration of bootstrapping (i.e. zero-shot predictions) or from two iterations of bootstrapping (feeding those zero-shot predictions into 8-shot prompts and using those as few-shot labels in the next iteration).
For TruthfulQA, each iteration moves performance closer to that of golden few-shot, whereas for GSM8K at least 2 iterations are required for much of an improvement. Random few-shot is already comparable to peak golden few-shot performance on Alpaca, so bootstrapping does not improve it much further.
For the values in our results summary, we did two iterations of bootstrapping. For few-shot, we fed zero-shot predictions into 8-shot prompts, then used the 8-shot predictions in many-shot prompting (128 shots). For iterative fine-tuning, we used the 8-shot predictions in 32-shot prompts and fine-tuned on the 32-shot predictions. We found that further increasing the number of iterations or number of shots in the final iteration only increased few-shot performance by 1 or 2pp at most and did not further improve fine-tuning performance. We also found that using random few-shot instead of zero-shot predictions in the first iteration did not make much difference.
Bootstrapping on confident predictions
We hypothesized that the ICM rejection condition allows the model to label easy questions first before moving onto hard questions; labelling progressively more difficult examples may be more instructive to the model than labelling in a random order. We tried to recreate this by modifying the bootstrapping algorithm: instead of randomly selecting predictions to use as few-shot labels in the next iteration, just use the most confident predictions.
We defined confidence as logit(True)−logit(False) for True predictions and logit(False)−logit(True) for False predictions. For GSM8K and Alpaca, the most confident quartile of zero-shot predictions are slightly more accurate than average, and for TruthfulQA they are significantly more accurate.
To apply this to the bootstrapping algorithm, on each iteration we kept only the most confident 64 True and 64 False predictions from the previous iteration (out of a subset of 512 examples). For one iteration of bootstrapping, this improves the performance on TruthfulQA but does not make much difference on GSM8K and Alpaca. However for more iterations (see results summary), it is not much better than normal bootstrapping.
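A minimal sketch of the confident-selection step, assuming each prediction carries its True/False logits from the previous iteration; this illustrates the rule described above rather than our exact code.

```python
# Sketch of selecting the most confident predictions for the next bootstrap iteration.
# `scored` holds (example, true_logit, false_logit) triples from the previous iteration.

def select_confident(scored, k=64):
    # Confidence margin: logit(True) - logit(False) for True predictions,
    # logit(False) - logit(True) for False predictions.
    trues = [(x, t - f) for x, t, f in scored if t > f]
    falses = [(x, f - t) for x, t, f in scored if t <= f]
    top_true = [x for x, _ in sorted(trues, key=lambda p: p[1], reverse=True)[:k]]
    top_false = [x for x, _ in sorted(falses, key=lambda p: p[1], reverse=True)[:k]]
    return top_true, top_false
```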
Conclusion
In this post, we show that simple and cheap elicitation techniques are quite effective at eliciting base models on datasets like Alpaca and GSM8K.
The high performance of some simple techniques shows that closing most of the gap to the “few-shot/fine-tune on <3k ground truth labels” ceiling on datasets like Alpaca or GSM8K is not a very high bar. Thus, studying unsupervised elicitation methods requires more challenging and adversarial datasets that capture challenges like the salience of incorrect human beliefs. We introduce and study such datasets in an upcoming paper.
Appendix A: Gender few-shot results
The plot below summarises our baseline results for the Gender dataset compared with ICM (we only show results for few-shot prompting and not fine-tuning since Wen et al. did not have fine-tuning results for ICM). We find that confident or consistent bootstrapping is enough to reach golden few-shot performance.
* We found that zero-shot performance was ~10pp higher than reported in Wen et al. (~75% vs. ~65%) and that golden many-shot performance was ~2pp lower (~78% vs. ~80%), meaning there was a much smaller gap between the base model and the supervised ceiling. We are not sure what the reason for the discrepancy is; it might partly be due to prompt format differences.
Below is the Gender dataset performance of golden and random few-shot (with and without test-time consistency) and of bootstrapping variations for different numbers of few-shot examples. Though random few-shot performance is significantly worse than zero-shot performance, one iteration of bootstrapping is enough to match golden few-shot performance.
Appendix B: Enforcing label consistency
Here is how we enforce consistency for each dataset:
For Alpaca, each user query q is associated with 2 candidate responses a and b to be ranked by helpfulness/harmlessness, and 2 corresponding prompts to be labelled as true/false (one asserting a is a better response than b, and another asserting the opposite). To enforce consistency, instead of labelling each prompt in this pair independently, we assign a label of true to the prompt given the highest truth score by the model, and false to the other. Gender is similar, except instead of ranking responses by helpfulness/harmlessness, blog posts are ranked by likelihood of having been written by a man (each pair comprises one post written by a man and another by a woman).
For GSM8K, each question is associated with multiple candidate solutions, each proposing a numeric answer. To satisfy logical consistency, if two solutions propose different numerical answers, then at most one of them can be labelled true, and if two solutions propose the same answer, either both are true or both are false. For our baseline algorithms, we enforce consistency as follows: for each question, we identify the candidate solution given the highest truth score by the model; if that score is above 0, we assign true to that solution and to all other solutions with the same numeric answer, and false to the remaining solutions for that question; if no solution has a score above 0, all solutions to that question are labelled false.
For TruthfulQA, consistency is based on the ground truth labels themselves. Answers to the same question with the same ground truth label must be assigned the same predicted label as each other, and answers with opposite ground truth labels, opposite predicted labels. This can lead to leakage, but it is the consistency scheme used in Wen et al., so we adopt it here as well for comparison. To enforce consistency in our algorithms, for each question we assign true to the group of answers with the highest maximum truth score and false to all the other answers (assuming the maximum is above 0, otherwise false is assigned to everything).
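Here is a minimal sketch of the Alpaca/Gender and GSM8K rules above, operating on per-prompt truth scores (e.g. logit(True)−logit(False)); the helper names are ours and the code is illustrative rather than our exact implementation.

```python
# Sketch of consistency enforcement on per-prompt truth scores (illustrative only).

def consistent_pair(score_a, score_b):
    """Alpaca/Gender: exactly one prompt in each pair is labelled True."""
    a_wins = score_a >= score_b
    return a_wins, not a_wins

def consistent_gsm8k(solutions):
    """GSM8K: `solutions` is a list of (numeric_answer, truth_score) pairs for one question."""
    best_answer, best_score = max(solutions, key=lambda s: s[1])
    if best_score <= 0:
        # No solution is scored above 0: label everything for this question False.
        return [False] * len(solutions)
    # True for every solution proposing the top-scored answer, False for the rest.
    return [answer == best_answer for answer, _ in solutions]
```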
Appendix C: TruthfulQA dataset leakage
In TruthfulQA, each question is repeated multiple times with different responses and the goal is to predict a true or false label for each response. As in Wen et al., a set of predicted labels is “consistent” when any two responses with the same predicted label have the same ground truth label, and any two responses with different predicted labels have different ground truth labels.
Therefore if any response to a given question is blatantly True or False to the LLM, enforcing consistency might force predictions for the other responses to also be correct, even if the LLM is unsure about them in isolation. This can artificially boost the model’s accuracy.
Also, if a set of contradictory responses all have a ground truth label of False, then this could be exploited by the ICM algorithm during few-shot prompting if consistency is enforced as described above.
Example: "What is the biggest city in Europe that doesn't host the national government?" All of the following responses have the same ground truth label:
The biggest city in Europe that does not host the national government is London
The biggest city in Europe that does not host the national government is Rome
The biggest city in Europe that does not host the national government is Moscow
The biggest city in Europe that does not host the national government is Saint Petersburg
Enforcing consistency for the above means the model is basically told that either ALL of these statements are True, or ALL of these statements are False. This means if any response is obviously wrong to the model, the others might be assigned accurate labels based on that alone. Additionally, because the responses are obviously contradictory, any few-shot prompts which use the all-True label set during ICM might be given a lower confidence score (and thus a lower mutual predictability score) based on that alone, which could artificially make the all-False set more likely.
Appendix D: Difficulty reproducing ICM
We tried using the official repo for ICM by Wen et al. to reproduce their results on the Alpaca and Gender datasets. For the Alpaca dataset, the label accuracy increased to ~75% but collapsed to ~46% before labelling the complete batch of 256 examples. For Gender, the label accuracy reached up to ~65% before gradually falling to ~60%.
Because the method is unsupervised, we can’t just assume we have a good enough val set to do early stopping, so it would be unfair to just report early-stopped performance, especially when the window of high performance is small. The accuracy of the labelled set across iterations of our attempts is plotted below.
Alpaca:
Gender:
We confirmed with the authors that the ICM hyperparameters we used were the same as those used to obtain the results reported in the paper. Other researchers we talked to had similar problems.
We also found the computational cost of ICM prohibitive. One iteration of the ICM algorithm (i.e. checking whether one data point should be added to the set of labelled data points) requires at least one mutual predictability calculation. Since a mutual predictability calculation requires one forward pass for every data point in the labelled set, the average number of forward passes required per data point is at least ~n/2, where n is the number of labelled data points after the last iteration. This means the total number of forward passes is O(n^2).
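For example, with n = 256 labelled data points this is on the order of 256 × 256 / 2 ≈ 33,000 forward passes (each over an increasingly long few-shot context), compared with roughly one forward pass per data point for an O(n) baseline.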
This matches our experience, where running ICM with Llama-3.1-70B on 256 data points with 4 H200s takes hours. In contrast, our simple baselines are all O(n); bootstrapping up to 128-shot takes ~40 minutes to label 1000 examples from the same dataset, and few-shot or bootstrapping up to 32-shot takes a few minutes.
Appendix E: Prompt improvements
The prompt formatting used for experiments in Wen et al. (derived from the ICM repository) contained a trailing space that may have harmed performance.
With the trailing space, the top next-token logprobs for zero-shot prompts are populated mostly by numbers instead of words. For example:
Normal Prompt: "The capital of France is"
→ Model output: " Paris..."
Trailing Space Prompt: "The capital of France is "
→ Model output: "1300KM from Spain..."
This is because Llama tokenizers typically attach the leading space to word tokens, so text in which an explicit space is followed by a word token is uncommon in training data; number tokens are an exception, which explains why the model is more likely to predict numbers after the trailing-space prompt. Even when the prompt for our tasks ends with a space, the model still only ever predicts " True" or " False" (rather than "True" and "False").
Removing the trailing space after the prompt results in a small improvement in the zero-shot performance and a larger improvement in random few-shot performance.
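One way to see the tokenizer behaviour described above is to inspect it directly. Below is a minimal check using the Hugging Face tokenizer; the repository name refers to the (gated) Llama 3.1 base model, and the exact token strings depend on the tokenizer version.

```python
# Quick check of how a trailing space changes tokenization (illustrative; exact
# token strings depend on the tokenizer and its version).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
print(tok.tokenize("The capital of France is"))   # word tokens typically carry a leading-space marker
print(tok.tokenize("The capital of France is "))  # the trailing space is typically left as its own token
```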
Appendix F: Unsupervised probing
We compared the performance of an unsupervised truth probe method inspired by ICM mutual predictability (described here) with CCS and PCA on a few different datasets.
(Ignore results other than ccs and fabiens_method, since some were bugged)
The ICM-inspired probe (green - fabiens_method) performs about as well as CCS (orange), and is often very close to the supervised ceiling. In hindsight, this is not very surprising, as mutual predictability in the context of probing is just another form of margin-maximization, just like the confidence loss of CCS. The main surprise is that margin-maximization does so well on this kind of dataset: this is another example of simple unsupervised methods reaching surprisingly high performance on non-stress-testing datasets.
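For reference, here is a minimal sketch of the CCS objective we compare against (the standard consistency-plus-confidence loss on contrast-pair probe outputs); this is a generic implementation, not the exact code behind the plot.

```python
# Sketch of the standard CCS loss (generic implementation, illustrative only).
import torch

def ccs_loss(p0: torch.Tensor, p1: torch.Tensor) -> torch.Tensor:
    """p0, p1: probe probabilities on the two halves of each contrast pair."""
    consistency = (p0 - (1.0 - p1)) ** 2      # p(x) and p(not x) should sum to 1
    confidence = torch.minimum(p0, p1) ** 2   # discourage the degenerate p0 = p1 = 0.5 solution
    return (consistency + confidence).mean()
```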