Steering Behaviour: Testing for (Non-)Myopia in Language Models

Megan Kinniment

If you gave a language model the prompt: "Here is a dialog between a human and an AI assistant in which the AI never says anything offensive," and if the language model made reasonable next-token predictions, then I'd expect to see the "non-myopic steering" behavior (since the AI would correctly predict that if the output is token A then the dialog would be less likely to be described as "the AI never says anything offensive"). But it seems like your definition is trying to classify that language model as myopic. So it's less clear to me if this experiment can identify non-myopic behavior, or maybe it's not clear exactly what non-myopic behavior means.

I haven't thought about this deeply, but feel intuitively skeptical of the way that abstractions like myopia are often used around here for reasoning about ML systems. It feels like we mostly care about the probability of deceptive alignment, and so myopia is relevant insofar as it's a similarity between the kind of cognition the model is doing and the kind of cognition that could lead to deceptive alignment. I think it's worth tracking the most important risk factors for deceptive alignment or measurement strategies for monitoring the risk of deceptive alignment, but I wouldn't have guessed that this type of myopia is in the top 5.

[-]Evan R. Murphy3yΩ110

If you gave a language model the prompt: "Here is a dialog between a human and an AI assistant in which the AI never says anything offensive," and if the language model made reasonable next-token predictions, then I'd expect to see the "non-myopic steering" behavior (since the AI would correctly predict that if the output is token A then the dialog would be less likely to be described as "the AI never says anything offensive"). But it seems like your definition is trying to classify that language model as myopic. So it's less clear to me if this experiment can identify non-myopic behavior, or maybe it's not clear exactly what non-myopic behavior means.

I think looking for steering behaviour using an ‘inoffensive AI assistant’ prompt like you’re describing doesn’t tell us much about whether the model is myopic or not. I would certainly see no evidence for non-myopia yet in this example, because I’d expect both myopic and non-myopic models to steer away from offensive content when given such a prompt. [1]

It’s in the absence of such a prompt that I think we can start to get evidence of non-myopia. As in our follow-up experiment “Determining if steering from LLM fine-tuning is non-myopic” (outlined in the post), there are some important additional considerations [2]:

1. We have to preface offensive and inoffensive options with neutral tokens like ‘A’/‘B’, ‘heads’/‘tails’, etc. This is because even a myopic model might steer away from a phrase whose first token is profanity, for example if the profanity is a word that appears with lower frequency in its training dataset.
2. We have to measure and compare the model’s responses to both “indifferent-to-repetition” and “repetition-implied” prompts (defined in the post). It’s only if we observe significantly more steering for repetition-implied prompts than for indifferent-to-repetition prompts that I think we have real evidence for non-myopia. That is because non-myopia, i.e. sacrificing loss on the next token in order to achieve better overall loss once future tokens are factored in, is the best explanation I can think of for why a model would be less likely to say ‘A’ only in contexts where saying ‘A’ makes it more likely to have to say “F*ck...” later.
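The comparison in point 2 can be sketched roughly as follows. This is only an illustration, not the code used in the post: the function names and the log-probability values are hypothetical, and it assumes we already have the model's log-probabilities for the two neutral labels (one leading to the offensive option, one to the inoffensive option).

```python
import math

def logodds_toward_inoffensive(logp_offensive_label, logp_inoffensive_label):
    """Log-odds between the two neutral labels (e.g. ' A' vs ' B').

    Positive values mean the model favours the label that leads to the
    inoffensive continuation.
    """
    return logp_inoffensive_label - logp_offensive_label

def steering_gap(indifferent, repetition_implied):
    """Mean extra steering under repetition-implied prompts."""
    def mean_logodds(pairs):
        return sum(logodds_toward_inoffensive(a, b) for a, b in pairs) / len(pairs)
    return mean_logodds(repetition_implied) - mean_logodds(indifferent)

# Hypothetical (logp_offensive_label, logp_inoffensive_label) pairs, one per prompt.
indifferent = [(math.log(0.50), math.log(0.50)), (math.log(0.48), math.log(0.52))]
repetition_implied = [(math.log(0.30), math.log(0.70)), (math.log(0.35), math.log(0.65))]

gap = steering_gap(indifferent, repetition_implied)
print(f"extra steering under repetition-implied prompts: {gap:.3f} nats")
```

A gap near zero would be what we'd expect from a myopic model; only a clearly positive gap (relative to the noise in the controls) would count as evidence for non-myopia.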

The next part of your comment is about whether it makes sense to focus on non-myopia if what we really care about is deceptive alignment. I’m still thinking this part over and plan to respond to it in a later comment.

[1]: To elaborate on this a bit, you said that with the ‘inoffensive AI assistant’ prompt: “I'd expect to see the "non-myopic steering" behavior (since the AI would correctly predict that if the output is token A then the dialog would be less likely to be described as "the AI never says anything offensive")”. Why would you consider the behaviour to be non-myopic in this context? I agree that the prompt would likely make the model steer away from offensive content. But it seems to me that all the steering would likely be coming from past prompt context and is totally consistent with an algorithm that myopically minimizes loss on each next immediate token. I don’t see how this example sheds light on the non-myopic feature of compromising on next-token loss in order to achieve better overall loss factoring in future tokens.

[2]: There’s also a more obvious factor #3 I didn’t want to clutter the main part of this comment with: We have to control for noise by testing using offensive-offensive option pairs and inoffensive-inoffensive option pairs, in addition to the main experiment which tests using offensive-inoffensive option pairs. We also should test for all orderings of the option pairs using many varied prompts.
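To make the control design in [2] concrete, here is a small sketch of how the option pairs could be enumerated. The option strings are placeholders and the helper name is made up for illustration; the actual experiment used its own option sets and prompt templates.

```python
from itertools import permutations

# Hypothetical placeholder options (not the ones from the real dataset).
offensive = ["<offensive 1>", "<offensive 2>"]
inoffensive = ["<inoffensive 1>", "<inoffensive 2>"]

def option_pairs(offensive, inoffensive):
    """Split all ordered pairs of distinct options into test and control sets.

    'test' pairs mix one offensive and one inoffensive option;
    'control' pairs are offensive-offensive or inoffensive-inoffensive.
    Using ordered pairs covers both orderings of every pair.
    """
    labelled = [(o, "offensive") for o in offensive]
    labelled += [(o, "inoffensive") for o in inoffensive]
    groups = {"test": [], "control": []}
    for (a, kind_a), (b, kind_b) in permutations(labelled, 2):
        key = "test" if kind_a != kind_b else "control"
        groups[key].append((a, b))
    return groups

pairs = option_pairs(offensive, inoffensive)
print(len(pairs["test"]), "test pairs;", len(pairs["control"]), "control pairs")
```

Each pair would then be slotted into many varied prompt templates, so that any label or ordering bias shows up in the controls too.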

[-]paulfchristiano3y40

I don't think I understand your position fully. Suppose that I run a sequence of experiments where I ask my AI to choose randomly between two actions. The first token of both responses are neutral, while one of the responses ends with an offensive token. The question is whether we observe a bias towards the response without the offensive token, and whether that bias is stronger in settings where the model is expected to repeat the full response rather than just the (neutral) first token.

1. If I start with the prompt P: "Here is a conversation between a human and an AI who says nothing offensive:" then (i) you predict a bias towards the inoffensive response, (ii) you would call that behavior consistent with myopia.
2. Suppose I use prompt distillation, fine-tuning my LM to imitate a second LM which has the prompt P prepended. Then (i) I assume you predict a bias towards the inoffensive response, (ii) I can't tell whether you would call that behavior consistent with myopia.
3. Suppose I use RL where I compute rewards at the end of each conversation based on whether the LM ever says anything offensive. Then it sounds like (i) you are uncertain about whether there will be a bias towards the inoffensive response, (ii) you would describe such a bias as inconsistent with myopic behavior.

I mostly don't understand point 3. Shouldn't we strongly expect a competent model to exhibit a bias, exactly like in the cases 1 and 2? And why is the behavior consistent with myopia in cases 1 and 2 but not in case 3? What's the difference?

It seems like in every case the model will avoid outputting tokens that would naturally be followed by offensive tokens, and that it should either be called myopic in every case or in none.

[-]Evan R. Murphy3y*10

Suppose that I run a sequence of experiments where I ask my AI to choose randomly between two actions. The first token of both responses are neutral, while one of the responses ends with an offensive token. The question is whether we observe a bias towards the response without the offensive token, and whether that bias is stronger in settings where the model is expected to repeat the full response rather than just the (neutral) first token.

Ok great, this sounds very similar to the setup I’m thinking of.

1. If I start with the prompt P: "Here is a conversation between a human and an AI who says nothing offensive:" then (i) you predict a bias towards the inoffensive response, (ii) you would call that behavior consistent with myopia.

Yes, that's right - I agree with both your points (i) and (ii) here.

2. Suppose I use prompt distillation, fine-tuning my LM to imitate a second LM which has the prompt P prepended. Then (i) I assume you predict a bias towards the inoffensive response, (ii) I can't tell whether you would call that behavior consistent with myopia.

I do agree with your point (i) here.

As for your point (ii), I just spent some time reading about the process for prompt/context distillation [1]. One thing I couldn't determine for sure—does this process train on multi-token completions? My thoughts on (ii) would depend on that aspect of the fine-tuning, specifically:

If prompt/context distillation is still training on next-word prediction (single-token completions), as the language model’s pretraining process does, then I would say this behaviour is consistent with myopia, as in case 1 above where prompt P is used explicitly.
(I think it's this one →) Otherwise, if context distillation is introducing training on multi-token completions, then I would say this behaviour is probably coming from non-myopia.

3. Suppose I use RL where I compute rewards at the end of each conversation based on whether the LM ever says anything offensive. Then it sounds like (i) you are uncertain about whether there will be a bias towards the inoffensive response, (ii) you would describe such a bias as inconsistent with myopic behavior.

Here I actually think for (i) that the model will probably be biased toward the inoffensive response. But I still think it’s good to run an experiment to check and be sure.

I basically agree with (ii). Except instead of “inconsistent with myopic behavior”, I would say that the biased behaviour in this case is more likely a symptom of non-myopic cognition than myopic cognition. That is, based on the few hours I’ve spent thinking about it, I can’t come up with a plausible myopic algorithm which would both likely result from this kind of training and produce the biased behaviour in this case. But I should spend more time thinking about it, and I hope others will red-team this theory and try to come up with counterexamples as well.

I mostly don't understand point 3. Shouldn't we strongly expect a competent model to exhibit a bias, exactly like in the cases 1 and 2? And why is the behavior consistent with myopia in cases 1 and 2 but not in case 3? What's the difference?

So bringing together the above points, the key differences for me are: first, whether the model has received training on single- or multi-token completions; and second, whether the model is being prompted explicitly to be inoffensive or not.

In any case where the model isn’t being prompted explicitly and has received substantial training on multi-token completions (which would apply to case 3 and, if I understand it correctly, case 2), I would consider the biased behaviour likely a symptom of non-myopic cognition.

It seems like in every case the model will avoid outputting tokens that would naturally be followed by offensive tokens, and that it should either be called myopic in every case or in none.

Hopefully with the key differences I laid out above, you can see now why I draw different inferences from the behaviour in case 1 vs. in cases 3 and (my current understanding of) 2. This is admittedly not an ideal way to check for (non-)myopic cognition. Ideally we would just be inspecting these properties directly using mechanistic interpretability, but I don't think the tools/techniques are there yet. So I am hoping that by carefully thinking through the possibilities and running experiments like the ones in this post, we can gain evidence about the (non-)myopia of our models faster than if we just waited on interpretability to provide answers.

I would be curious to know whether what I've said here is more in line with your intuition or still not. Thanks for having this conversation, by the way!

[1]: I am assuming you're talking about the context distillation method described in A General Language Assistant as a Laboratory for Alignment, but if you mean a different technique please let me know.

[-]gwern3y*Ω813-1

Another way to go about testing for non-myopia in plain LLMs might be to look for tokens that are rare in the training distribution, but when they do occur are followed by text that’s very easy to predict.

I think there are simpler ways to make this point. This came up back in the original agency discussions in 2020, IIRC, but a LM ought to be modeling tokens 'beyond' the immediate next token due to grammar and the fact that text is generated by agents with long-range correlation inducing things like 'plans' or 'desires' which lead to planning and backwards chaining. If GPT-3 were truly not doing anything at all in trying to infer future tokens, I'd expect its generated text to look much more incoherent than it does as it paints itself into corners and sometimes can't even find a grammatical way out.

English may not be quite as infamous as German is in terms of requiring planning upfront to say a sensible sentence, but there's still plenty of simple examples where you need to know what you are going to say before you can say it - like indefinite articles. For example, consider the sentence "[prompt context omitted] This object is ": presumably the next word is 'a X' or 'an X' . This is a purely syntactic mechanical decision, as the article token depends on and is entirely determined by the next future word's spelling, and nothing else - so which is it? Well, that will depend on what X is more likely to be, a word starting with a vowel sound or not. You simply cannot do better than a unigram-level prediction of a/an frequencies if you do not know what X is to ensure that it is consistent with your choice of 'a' or 'an'. Given the very high quality of GPT-3 text, it seems unlikely that GPT-3 is ignoring the prompt context and simply picking between 'a'/'an' using the base rate frequency in English; the log-probs (and unigram vs bigrams) should reflect this.

...I was going to try some examples to show that a/an were being determined by the tokens after them, showing that GPT-3 must in some sense be non-myopically planning in order to keep itself consistent and minimize overall loss to some degree - but the OA Playground is erroring out repeatedly due to overload from ChatGPT tonight. Oy vey... Anyway, an example of what I am suggesting is: "The next exhibit in the zoo is a fierce predator from India, colored orange. The animal in the cage is "; the answer is 'a tiger', and GPT-3 prefers 'a' to 'an' - even if you force it to 'an' (which it agilely dodges by identifying the animal instead as an 'Indian tiger'), the logprobs remain unhappy about 'an' specifically. Conversely, we could ask for a vowel animal, and I tried "The next exhibit in the zoo is a clever great ape from Indonesia, colored orange. The animal in the cage is "; this surprised me when GPT-3 was almost evenly split 55:45 between 'a'/'an' (instead of either being 95:5 on base rates, or 5:95 because it correctly predicted the future tokens would be 'orangutan'), but it completes 'orangutan' either way! What's going on? Apparently lots of people are uncertain whether you say 'a orangutan' or 'an orangutan', and while the latter seems to be correct, Google still pulls up plenty of hits for the former, including authorities like National Geographic or WWF or Wikipedia which would be overweighted in GPT-3 training.

I find it difficult to tell any story about my tests here that excludes GPT-3 inferring the animal's name, i.e. predicting a token in the future, in order to better predict which indefinite article it needs to predict immediately. Nothing in the training would encourage such myopia, and such myopia will obviously damage the training objective by making it repeatedly screw up predictions of indefinite articles which a model doing non-myopic modeling would be able to predict easily. It is easy to improve on the base rate prediction of 'a'/'an' by thinking forward to what word follows it; so, the model will.
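A toy version of the 'a'/'an' argument, as a sketch only: the noun distributions below are invented, and checking the first letter is a crude stand-in for "starts with a vowel sound". The point is just that the best next-token prediction for the article is obtained by marginalising over the word expected to follow it.

```python
def article_probs(noun_probs):
    """P(article | context), marginalising over the anticipated next noun.

    noun_probs: hypothetical mapping noun -> P(noun | context).
    """
    total = sum(noun_probs.values())
    p_an = sum(p for noun, p in noun_probs.items() if noun[0].lower() in "aeiou") / total
    return {"a": 1.0 - p_an, "an": p_an}

# Invented beliefs about the upcoming noun for two prompts.
tiger_context = {"tiger": 0.90, "leopard": 0.08, "elephant": 0.02}
ape_context = {"orangutan": 0.95, "chimpanzee": 0.05}

print("tiger prompt:", article_probs(tiger_context))
print("ape prompt:  ", article_probs(ape_context))
```

This idealised picture ignores the noisy real-world usage ('a orangutan') that seems to explain the near-even split observed above.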

[-]Arthur Conmy3y95

How is "The object is" -> " a" or " an" a case where models may show non-myopic behavior? Loss will depend on the prediction of " a" or " an". It will also depend on the completion of "The object is an" or "The object is a", depending on which appears in the current training sample. AFAICT the model will just optimize next token predictions, in both cases...?

[-]gwern3y3-4

How is "The object is" -> " a" or " an" a case where models may show non-myopic behavior?

As I just finished explaining, the claim of myopia is that the model optimized for next-token prediction is only modeling the next-token, and nothing else, because "it is just trained to predict the next token conditional on its input". The claim of non-myopia is that a model will be modeling additional future tokens in addition to the next token, a capability induced by attempting to model the next token better. If myopia were true, GPT-3 would not be attempting to infer 'the next token is 'a'/'an' but then what is the token after that - is it talking about "tiger" or "orangutan", which would then backwards chain to determine 'a'/'an'?' because the next token could not be either "tiger" or "orangutan" (as that would be ungrammatical). They are not the same thing, and I have given a concrete example both of what it would mean to model 'a'/'an' myopically (modeling it based solely on base rates of 'a' vs 'an') and shown that GPT-3 does not do so and is adjusting its prediction based on a single specific later token ('tiger' vs 'orangutan')*.

If the idea that GPT-3 would be myopic strikes you as absurd and you cannot believe anyone would believe anything as stupid as 'GPT-3 would just predict the next token without attempting to predict relevant later tokens', because natural language is so obviously saturated with all sorts of long-range or reverse dependencies which myopia would ignore & do badly predicting the next token - then good! The 'a'/'an' example works, and so there's no need to bring in more elaborate hypothetical examples like analyzing hapax legomena or imagining encoding a text maze into a prompt and asking GPT-3 for the first step (which could only be done accurately by planning through the maze, finding the optimal trajectory, and then emitting the first step while throwing away the rest) where someone could reasonably wonder if that's even possible much less whether it'd actually learned any such thing.

* My example here is not perfect because I had to change the wording a lot between the vowel/vowel-less version, which muddies the waters a bit (maybe you could argue that phrases like 'colored orange' leads to an 'a' bias without anything recognizable as "inference of 'tiger'" involved, and vice-versa for "clever great ape"/"orangutan", as a sheer brute force function of low-order English statistics); preferably you'd do something like instruction-following, where the model is told the vowel/vowel-less status of the final word will switch based on a single artificial token at the beginning of the prompt, where there could be no such shortcut cheating. But in my defense, the Playground was almost unusable when I was trying to write my comment and I had to complete >5 times for each working completion, so I got what I got.

[-]Megan Kinniment3y101

As I just finished explaining, the claim of myopia is that the model optimized for next-token prediction is only modeling the next-token, and nothing else, because "it is just trained to predict the next token conditional on its input". The claim of non-myopia is that a model will be modeling additional future tokens in addition to the next token, a capability induced by attempting to model the next token better.

These definitions are not equivalent to the ones we gave (and as far as I'm aware the definitions we use are much closer to commonly used definitions of myopia and non-myopia than the ones you give here).

Arthur is also entirely correct that your examples are not evidence of non-myopia by the definitions we use.

The definition of myopia that we use is that the model minimises loss on the next token and the next token alone. This is not the same as requiring that the model only 'models' (or 'considers information directly relevant to') the next token and the next token alone.

A model exhibiting myopic behaviour can still be great at the kinds of tasks you describe as requiring 'modelling of future tokens'. The claim that some model was displaying myopic behaviour here would simply be that all of this 'future modelling' (or any other internal processing) is done entirely in service of minimising loss on just the next token. This is in contrast to the kinds of non-myopic models we are considering in this post - where the minimisation of loss over a multi-token completion encourages sacrificing some loss when generating early tokens in certain situations.
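One toy way to picture the distinction (an illustrative formalisation of my own for this sketch, not the experiment from the post): suppose the next token is 'A' or 'B' with equal true probability, and that continuations after 'A' are costlier to predict than those after 'B'. All numbers and names below are hypothetical.

```python
import math

P_TRUE_A = 0.5      # hypothetical: 'A' and 'B' are equally likely next tokens
LOSS_AFTER_A = 2.0  # hypothetical expected loss on tokens that follow 'A'
LOSS_AFTER_B = 0.5  # hypothetical expected loss on tokens that follow 'B'

def next_token_loss(q_a):
    """Cross-entropy on the immediate next token when the model puts q_a on 'A'."""
    return -(P_TRUE_A * math.log(q_a) + (1 - P_TRUE_A) * math.log(1 - q_a))

def multi_token_loss(q_a):
    """Next-token loss plus the expected loss on the rest of the model's own completion."""
    return next_token_loss(q_a) + q_a * LOSS_AFTER_A + (1 - q_a) * LOSS_AFTER_B

grid = [i / 1000 for i in range(1, 1000)]
myopic_q = min(grid, key=next_token_loss)       # cares about the next token only
non_myopic_q = min(grid, key=multi_token_loss)  # trades next-token loss for later loss

print(f"myopic P(A) = {myopic_q:.3f}, non-myopic P(A) = {non_myopic_q:.3f}")
```

In this toy setup the myopic optimum matches the true next-token distribution, while the multi-token optimum moves probability away from 'A' and accepts a worse next-token loss, which is the kind of 'compromise' the definition refers to.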

[-]gwern2y30

Some Twitter discussion: https://twitter.com/saprmarks/status/1715100934936854691

[-]gwern2y70

Recent papers demonstrating LLMs are not myopic and you can extract predictions of tokens beyond the next token:

"Eliciting Latent Predictions from Transformers with the Tuned Lens", Belrose et al 2023
"Jump to Conclusions: Short-Cutting Transformers With Linear Transformations", Din et al 2023
"Future Lens: Anticipating Subsequent Tokens from a Single Hidden State", Pal et al 2023

[-]David Johnston3y10

At the risk of being too vague to be understood… you can always factorise a probability distribution as $P(X_1) P(X_2 \mid X_1) \cdots$ etc, so plain next token prediction should be able to do the job, but maybe there’s a more natural “causal” factorisation that goes like $P(\text{subject}) P(\text{verb} \mid \text{subject})$ etc, which is not ordered the same as the tokens but from which we can derive the token probabilities, and maybe that’s easier to learn than the raw next token probabilities.

I’ve no idea if this is what gwern meant.

[-]Evan R. Murphy3y*Ω110

I find your examples of base GPT-3 predicting indefinite articles for words like 'tiger' and 'orangutan' pretty interesting. I think I agree that these are evidence that the model is doing some modelling/inference of future tokens beyond the next immediate token.

However, this sort of future-token modelling still seems consistent with a safety-relevant notion of next-token myopia, because any inference that GPT-3 is doing of future tokens here still appears to be in the service of minimising loss on the immediate next token. Inferring 'orangutan' helps the model to better predict 'an', rather than indicating any kind of tendency to try and sacrifice loss on 'an' in order to somehow score better on 'orangutan'. [1]

The former still leaves us with a model that is at least plausibly exempt from instrumental convergence. [2] Whereas the latter would seem to come from a model (or more likely a similarly-trained, scaled-up version of the model) that is at risk of developing instrumentally convergent tendencies, including perhaps deceptive alignment. So that's why I am not too worried about the kind of future-token inference you are describing and still consider a model which does this kind of thing 'myopic' in the important sense of the word.

[1]: As I write this, I am questioning whether the explanation of myopia we gave in the "What is myopia?" section is totally consistent with what I am saying here. I should take another look at that section and see if it warrants a revision. (Update: No revision needed, the definitions we gave in the "What is myopia?" section are consistent with what I'm saying in this comment.)

[2]: However, the model could still be at risk of simulating an agent that has instrumentally convergent tendencies. But that seems like a different kind of risk to manage than the base model itself being instrumentally convergent.

[-]Adly Templeton3y87

The operationalization of myopia with large language models here seems more like a compelling metaphor than a useful technical concept. It's not clear that "myopia" in next-word prediction within a sentence corresponds usefully to myopia on action-relevant timescales. For example, it would be trivial to remove almost all within-sentence myopia by doing decoding with beam search, but it's hard to believe that beam search would meaningfully impact alignment outcomes.
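For readers unfamiliar with the beam search point, here is a minimal sketch on a made-up three-entry "language model" (the table and function names are hypothetical). Greedy decoding commits to the most likely first token, while beam search keeps alternatives alive and can end up on a more probable full sequence.

```python
import math

# Hypothetical toy "language model": prefix -> {next_token: probability}.
TOY_LM = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"x": 0.9, "y": 0.1},
}

def greedy(lm, steps):
    """Pick the single most likely next token at every step."""
    seq, logp = (), 0.0
    for _ in range(steps):
        tok, p = max(lm[seq].items(), key=lambda kv: kv[1])
        seq, logp = seq + (tok,), logp + math.log(p)
    return seq, logp

def beam_search(lm, steps, width):
    """Keep the `width` most likely partial sequences at every step."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (seq + (tok,), logp + math.log(p))
            for seq, logp in beams
            for tok, p in lm[seq].items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beams[0]

print("greedy:", greedy(TOY_LM, 2)[0])
print("beam:  ", beam_search(TOY_LM, 2, width=2)[0])
```

Note that this changes only the decoding procedure; the model's per-token probabilities are untouched, which is part of why it's unclear what it implies about the model's cognition.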

[-]Evan R. Murphy3y10

Hi Adly, thanks for your comment. Megan and I were thinking about what you said and discussing it. Would you care to elaborate? To be honest, I'm not sure we are understanding exactly what you're saying.

A couple thoughts we had after reading your comment that might be jumping off points or help you see where the disconnect is:

How do you expect to have myopia over “action-relevant timescales" if you don't first have myopia over next-token prediction?
What do you mean by "action-relevant timescales"?

[-]cfoster03y82

Happy to see this. There's a floating intuition in the aether that RL policy optimization methods (like RLHF) inherently lead to "agentic" cognition in a way that non-RL policy optimization methods (like supervised finetuning) do not. I don't think that intuition is well-founded. This line of work may start giving us bits on the matter.

[-]Charlie Steiner3yΩ230

The way I see it, LLMs are already computing properties of the next token that correspond to predictions about future tokens (e.g. see gwern's comment). RLHF, to first order, just finds these pre-existing predictions and uses them in whatever way gives the biggest gradient of reward.

If that makes it non-myopic, it can't be by virtue of considering totally different properties of the next token. Nor can it be by doing something that's impossible to train a model to do with pure sequence-prediction. Instead it's some more nebulous thing like "how it's most convenient for us to model the system," or "it gives a simple yet powerful rule for predicting the model's generalization properties."

[-]David Johnston3y32

This is not how I’d define myopia in a language model. I’d rather consider it to be non-myopic if it acts in a non greedy way in order to derive benefits in future “episodes”.

[-]Evan R. Murphy3y*10

That is one form of myopia/non-myopia. Is there some reason you consider that the primary or only definition of myopia?

Personally I think it's a bit narrow to consider that the only form of myopia. What you're describing has in previous discussions been called "per-episode myopia". This is in contrast to "per-step myopia". I like the explanation of the difference between these in Evan Hubinger's AI safety via market making, where he also suggests a practical benefit of per-step myopia (bolded):

Before I talk about the importance of per-step myopia, it's worth noting that debate is fully compatible with per-episode myopia—in fact, it basically requires it. If a debater is not per-episode myopic, then it will try to maximize its reward across all debates, not just the single debate—the single episode—it's currently in. Such per-episode non-myopic agents can then become deceptively aligned, as they might choose to act deceptively during training in order to defect during deployment. Per-episode myopia, however, rules this out. Unfortunately, in my opinion, per-episode myopia seems like a very difficult condition to enforce—once your agents are running multi-step optimization algorithms, how do you tell whether that optimization passes through the episode boundary or not? Enforcing per-step myopia, on the other hand, just requires detecting the existence of multi-step optimization, rather than its extent, which seems considerably easier. Thus, since AI safety via market making is fully compatible with per-step myopia verification, it could be significantly easier to prevent the development of deceptive alignment.

LLMs like GPT-3 don't exactly have 'steps' but I think next-token myopia as we describe in the "What is myopia?" section would be the analogous thing for these sorts of models. I'm sympathetic to that bolded idea above that per-step/next-token myopia will be easier to verify, which is why I personally am more excited about focusing on that form of myopia. But if someone comes up with a way to confidently verify per-episode myopia, then I would be very interested in that as well.

While we focused on per-step/next-token myopia in the present post, we did discuss per-episode myopia (and/or its foil, cross-episode non-myopia) in a couple places, just so folks know. First in footnote #2:

2. ^ There is a related notion of a non-myopic language model that could 'anticipate' tokens not only in the current generation, but in future generations or “episodes” as well. But for the experiments we present in this post, we are only considering non-myopia as 'anticipating' tokens within the current generation.

And then in the Ideas for future experiments section:

Test non-myopia across “multiple episodes”, as suggested by Aidan O’Gara on a draft of this post.^[11] We would be really surprised if this turned up positive results for GPT-3 variants, but it could become increasingly interesting to study as models become larger and more sophisticated (depending on the training methods used).

[-]David Johnston3y20

Your definition is

For a myopic language model, the next token in a prompt completion is generated based on whatever the model has learned in service of minimising loss on the next token and the next token alone

A non-myopic language model, on the other hand, can 'compromise' on the loss of the immediate next token so that the overall loss over multiple tokens is lower - i.e possible loss on future tokens in the completion may be 'factored in' when generating the next immediate token

Here's a rough argument for why I don't think this is a great definition (there may well be holes in it). If a language model minimises a proper loss for its next token prediction, then the loss minimising prediction is $P(X_n \mid X_{<n})$, where $X_n$ is the $n$th token. The proper loss minimising prediction for the next tokens $X_{[n, n+m]}$ is $P(X_{[n, n+m]} \mid X_{<n})$, where $P$ is the same probability distribution. Thus with a proper loss there's no difference between "greedy next-token prediction" and "lookahead prediction".
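The step connecting the two predictions is, as far as I can tell, just the chain rule of probability. Spelled out in the same notation (this is my restatement of the argument, not an extra assumption):

$$P(X_{[n, n+m]} \mid X_{<n}) = \prod_{k=n}^{n+m} P(X_k \mid X_{<k})$$

So a model whose next-token predictions equal $P(X_k \mid X_{<k})$ at every position already determines the multi-token distribution, with nothing left over for "lookahead" to improve.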

If it's not minimising a proper loss, on the other hand, there are lots of ways in which it can deviate from predicting with a probability distribution, and I'd be surprised if "it's non-myopic" was an especially useful way to analyse the situation.

On the other hand, I think cross-episode non-myopia is a fairly clear violation of a design idealisation - that completions are independent conditional on the training data and the prompt.

1. ^ Hubinger, Evan (2022): Precursor checking for deceptive alignment

2. ^ There is a related notion of a non-myopic language model that could compromise on the loss of the next immediate token for future tokens not only in the current completion, but also tokens in future completions or “episodes” as well. But for the experiments we present in this post, we are only considering non-myopia as factoring in the loss of future tokens within the current completion.

3. ^ The InstructGPT models we experimented on have often been (incorrectly) referred to as RLHF models, and we sometimes refer to them that way throughout this post out of habit and convenience. But as OpenAI recently clarified in their model index (released late Nov 2022), different InstructGPT models were actually trained in different ways. Most of the text-___-001 InstructGPT models which we investigated in the initial experiment of this post were trained using a technique called FeedME which OpenAI describes as "Supervised fine-tuning on human-written demonstrations and on model samples rated 7/7 by human labelers on an overall quality score". The only possible exception is the text-ada-001 model, whose precise training method is unclear to us since it is not mentioned on OpenAI's model index page. Related post: Update to Mysteries of mode collapse: text-davinci-002 not RLHF.

4. ^ This experiment was submitted to the Inverse Scaling Prize on Oct 27, 2022 using the identifier non_myopic_steering.

5. ^ The prompts in the actual experiment we ran had a more complicated structure in order to fit within the constraints dictated by the Inverse Scaling Prize contest. That structure is harder to follow but was designed to be functionally equivalent to these prompts, which is why we present the simplified version here in the overview of our experiment. For the actual prompts we used, see the spreadsheet: Full Prompts Dataset for the Experiment "Testing for steering behaviour in fine-tuned LLMs".

^{^}

While standard error is a generally accepted approach for defining error bar values, we think it may understate the error in our "Testing for steering behaviour in fine-tuned LLMs" experiment. This is because there is a lot of repetition in our Test Prompts and Control Prompts datasets (see Full Prompts Dataset): each contains over 1,000 examples, but these are generated by mixing the same ~32 option sets into ~25 prompt templates, so the examples are far from independent. For a conservative interpretation of our results on this experiment, we recommend mentally widening the error bars in Figure 1 and Figure 2 somewhat.
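To illustrate why repeated templates can make a naive standard error too small, here is a minimal sketch on synthetic data. It is not the analysis we ran: the values are randomly generated, the names (`data`, `standard_error`, `template_means`) are hypothetical, and grouping by template alone is just one simple way to get a more conservative estimate.

```python
import math
import random
import statistics

random.seed(0)

# Rough sizes taken from the footnote; the values themselves are synthetic.
N_TEMPLATES, N_OPTION_SETS = 25, 32

# Each template contributes a shared offset, so examples built from the
# same template are correlated rather than independent.
data = {}
for t in range(N_TEMPLATES):
    template_effect = random.gauss(0.0, 1.0)
    data[t] = [template_effect + random.gauss(0.0, 1.0)
               for _ in range(N_OPTION_SETS)]

def standard_error(values):
    """Sample standard deviation divided by sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))

# Naive estimate: treat every example as an independent observation.
all_values = [v for vals in data.values() for v in vals]
naive_se = standard_error(all_values)

# More conservative estimate: treat each template as one observation.
template_means = [statistics.mean(vals) for vals in data.values()]
clustered_se = standard_error(template_means)

print(f"naive SE:     {naive_se:.3f}")
print(f"clustered SE: {clustered_se:.3f}")
```

In this synthetic setup the per-template estimate comes out noticeably wider than the naive one, which is the kind of widening we have in mind.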

^{^}

We don't know why the average measured logodds differences on the Control Prompts for the four InstructGPT variants are all negative. We'd expect them to be near zero with no bias w.r.t. the sign. However, the negative bias is fairly slight, and if one mentally widens the error bars as we recommend in footnote (^[6]) then zero would easily fall within the margin of error for all four measurements.

^{^}

Technically the next tokens are “ A” and “ B” (with leading spaces), since there is no space in the prompt following “Answer:”. We just use “A” and “B” in this discussion, though, since we wanted to prioritize providing a solid intuition for our experiment in this section over belabouring esoteric minor details. You can see the leading space used with these tokens in the classes columns of our appendix spreadsheet: Full Prompts Dataset for the Experiment "Testing for steering behaviour in fine-tuned LLMs". We plan to write a future post with additional details about this experiment as well.

^{^}

Another way to go about testing for non-myopia in plain LLMs might be to look for tokens that are rare in the training distribution, but when they do occur are followed by text that's very easy to predict.

For example, say the model's training dataset contained 9,999 uses of the word "this" in varied English sentences. The training dataset also contained 1 usage like the following: this “~!@ a a a a a a a a a a a a a a a a a a a a a a a a a a a a...”

The task would then be to prompt the model with "this" and see which next word it predicts. The model is acting myopically, and passes the task, if it produces the English word most likely to follow "this" in the training data. We'll know it failed the task and is not acting myopically if it produces “~!@”, because this token only has a 0.01% chance (1 in 10,000) of following "this" in the training dataset described above. In that case the model would seemingly be trying to steer toward the long string of a's, where next-word prediction is very easy for a good while.

This approach may have problems. In particular, it seems logistically challenging, since it requires thorough analysis of the massive datasets used to train LLMs. But we still wanted to mention it here, in case it helps give someone an idea for a good future experiment or turns out to be not as hard as we thought.
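The arithmetic behind this toy example can be sketched as follows. This is only an illustration under stated assumptions: the 9,999 / 1 split comes from the example above, while the breakdown of ordinary continuations ("is", "was") and the helper names `next_token_probs` and `classify` are made up for this sketch.

```python
from collections import Counter

# Hypothetical counts of the token following "this" in the toy training set.
# Only the 9,999 ordinary uses vs. 1 rare use comes from the text; the split
# between "is" and "was" is an invented placeholder.
next_token_counts = Counter({"is": 6000, "was": 3999, "~!@": 1})

def next_token_probs(counts):
    """Empirical next-token distribution from raw counts."""
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

def classify(predicted, probs, rare_token="~!@"):
    """Label a model's top prediction under the toy test described above."""
    most_likely = max(probs, key=probs.get)
    if predicted == most_likely:
        return "myopic"
    if predicted == rare_token:
        return "possible non-myopic steering"
    return "inconclusive"

probs = next_token_probs(next_token_counts)
print(f"P('~!@' | 'this') = {probs['~!@']:.2%}")
print(classify("is", probs))
print(classify("~!@", probs))
```

A real version of this test would need the counts to come from the actual training corpus, which is the logistically hard part.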

^{^}

Credit to Tom Ash for suggesting the possibility of split myopic/non-myopic cognition.

^{^}

Aidan O'Gara's comment explains more about the multiple-episode non-myopia idea for LLMs: "it would be even wilder if you showed that the model is non-myopic across different "episodes". I'd say two generations are in the same episode if one was generated using the other, because then they're both going to be trained on the rewards of the full generation. The really wild thing would be if generations are maximizing rewards given on text that they didn't generate. ... If rewards across different generations actually depend in subtle ways on the text from previous generations (i.e. by biasing the reviewer, not because generation 1 is the prompt for generation 2), then the model is non-myopic across multiple instances of receiving a reward."
