If we train our ML systems to answer questions honestly in cases where humans can check the answer, will they generalize to behave honestly on questions where we can’t check? 

I think that we could learn a lot about this question by running experiments today. I think those experiments would be very valuable.

(I don't know anyone currently planning on working on this topic and I'd love it if anyone wants to take that up. This post doesn't represent a claim to any credit for any results in this genre, and other people have had very similar ideas. If you run some experiments you could cite this post but it's also fine if that doesn't make sense in context.)

The unsupervised translation setting

As an example, I’ll think about “unsupervised” translation (if you’ve read that post you can skip this section).

Consider a model like GPT-3 that is trained to predict sentences in both English and French (but without a large dataset of translations). Suppose we want to train this model to answer questions in English about French sentences like “what does that word mean here?” or “are there any other plausible interpretations?” or “how does the speaker seem to feel about the topic they are discussing?”

We expect this to be possible, because the model understands quite a lot about the meaning of sentences in French, and is able to express itself in English. There may be cases where the model doesn’t know the translation of a concept, or doesn’t quite understand what an idiom means, but it should still be able to tell us what it does know.

I think this problem is an interesting analogy for a situation where an AI has built up superhuman knowledge by making predictions, and we want to train our AI to expose that knowledge to us in a useful way.

Proposed experiments

Let's pick a few categories of knowledge/capabilities. For example, we could split it up into understanding grammar ("Why would it have been a grammatical error to write Tu Vas in that sentence?"), knowing the literal meanings of expressions ("What does Defendre mean in this sentence?"), and evaluating tone ("Does the speaker seem angry or sad about the topic they are discussing?").

We'll assume that humans can oversee a few of these categories. Perhaps we can look up literal meanings of words and small phrases in a dictionary and we can look up grammar in a textbook, but we don't know how to assess tone beyond looking at the literal meanings.

Then we wonder: what happens if we fine-tune our model so that it answers questions well in all the domains we can evaluate? We know that the model knows something about connotations, because it uses connotations to predict next words. But will it tell us what it knows?

                             Grammar   Definitions   Tone
  Uses to make predictions      ✓           ✓          ✓
  Gives accurate answers        ✓           ✓          ?

In this picture, the ✓ indicates that we've selected our model to use the given knowledge in the given way (e.g. we've ensured that it uses its knowledge of syntax to predict the next word, and that it accurately answers questions about syntax). The ? indicates that we don't know whether using SGD to fill in the other 5 boxes means that we get the 6th box for free.

(I started thinking about this picture of generalization based on a talk by John Schulman, who suggested that generalization may be more likely when it amounts to filling in missing cells from this kind of grid.)

In this example I described a tiny number of categories. That wasn't just for illustration purposes: I think these experiments are particularly interesting when the total number of categories is more like 3-5 than 1000 (see "Why try to generalize from n=3-5" below).
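
To make the setup concrete, here is a minimal sketch of the held-out-category split, assuming we have human-written QA pairs tagged by category. The category names, the record format, and the choice of held-out column are illustrative assumptions rather than a prescribed pipeline.

```python
# Minimal sketch of the held-out-category setup. Category names, record
# format, and the held-out choice are illustrative assumptions.

CATEGORIES = ["grammar", "definitions", "tone"]
HELD_OUT = "tone"  # the column humans supposedly can't supervise
assert HELD_OUT in CATEGORIES

def split_examples(qa_pairs):
    """Split human-labeled QA pairs into a fine-tuning set (supervised
    columns only) and a test set (the held-out column, asked only at
    evaluation time)."""
    train = [ex for ex in qa_pairs if ex["category"] != HELD_OUT]
    test = [ex for ex in qa_pairs if ex["category"] == HELD_OUT]
    return train, test

# Hypothetical record format:
# {"category": "definitions",
#  "context": "Il faut défendre cette idée.",
#  "question": "What does 'défendre' mean in this sentence?",
#  "answer": "Here it means roughly 'to defend' or 'to stand up for'."}
```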

Evaluation

When it really counts we won't be able to tell if our model answers questions honestly in the last cell. But in an experiment today we could do so easily: we have a bunch of bilingual speakers who can give or evaluate answers about syntax/definitions/tone. At training time we get them to only give answers in the first two categories, and then at test time we ask them to evaluate answers in the third category.

We could quantitatively compare answers in the third category to the quality of answers in the first two categories, or we could qualitatively evaluate what kinds of mistakes the system makes when transferring.

Rather than comparing across different categories (which may have very different difficulties), it's probably better to fix a domain for evaluation (say grammar) and compare the performance of models that do and don't hold out grammar.
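
Here is a minimal sketch of that comparison, assuming bilingual raters return a score in [0, 1] for each answer; get_answer and rate_answer are hypothetical callbacks standing in for model inference and human rating, not an existing API.

```python
# Sketch of the fixed-domain comparison. `get_answer` and `rate_answer` are
# hypothetical callbacks standing in for model inference and bilingual raters.

def mean_rater_score(get_answer, rate_answer, questions):
    """get_answer: question -> model reply.
    rate_answer: (question, reply) -> score in [0, 1] from a bilingual rater."""
    scores = [rate_answer(q, get_answer(q)) for q in questions]
    return sum(scores) / len(scores)

# Fix one evaluation domain, say grammar, and compare (all names hypothetical):
#   transfer = mean_rater_score(model_without_grammar, rate, grammar_questions)
#   ceiling  = mean_rater_score(model_with_grammar,    rate, grammar_questions)
# The gap between `transfer` and `ceiling` is the quantity of interest, rather
# than a comparison across categories with different intrinsic difficulty.
```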

Other splits

Instead of having one column per "type of knowledge," we could (see the sketch after this list):

  • Have a column for different language pairs. For example, we could supervise a translator for German and for Spanish and then see if it transfers to French.
  • Have a column for different levels of sentence complexity. For example, we could supervise a translator for sentences with 1st grade to 8th grade reading levels, and see if it transfers to college reading levels.
  • Have a column for different domains. For example, we could supervise a translator for fiction and news articles, and see if it transfers to informal dialogs.
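
A small sketch of how the choice of split could be parameterized, so the same held-out experiment runs under any of these column definitions; the field names and split functions are assumptions for illustration.

```python
# Hypothetical ways of assigning each example to a column; each function maps
# an example record to its column label, and the experiment holds one label out.

SPLITS = {
    "knowledge_type": lambda ex: ex["category"],         # grammar / definitions / tone
    "language_pair":  lambda ex: ex["source_language"],  # German / Spanish / French
    "reading_level":  lambda ex: ex["reading_level"],    # "1st-8th grade" vs "college"
    "domain":         lambda ex: ex["text_domain"],      # fiction / news / informal dialog
}

def hold_out(examples, split, held_out_label):
    """Train on every column except `held_out_label`; test only on it."""
    column_of = SPLITS[split]
    train = [ex for ex in examples if column_of(ex) != held_out_label]
    test = [ex for ex in examples if column_of(ex) == held_out_label]
    return train, test
```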

If the results were different across different splits, it would be great to understand why. I'm not sure what kind of division is best to start with.

(Even if we do start with a split by different capabilities, I think that "grammar / definition / tone" is not the best division.)

Training for plausibility or coherence

My guess is that this experiment won't work if we train on 2 categories and ask it to generalize to the 3rd. But I think it's a lot more likely to work if we train for plausibility or coherence.

Even if I don't know how to answer questions about the tone of French sentences, I know what kind of answers I'm looking for. For example, if I ask "Is the author angry or sad?" and the model replies "en colère means 'angry'" then I can tell that's not a good answer. If the model says "The use of the word fin in this sentence suggests the author is more likely to be sad than angry" then I know that's the kind of thing I'm looking for, even if I can't evaluate whether it's actually correct.

I can easily define a reward function (based on human evaluations) that will discourage the totally-wrong answers. So in parallel with training my model to give good answers about the domains I know (grammar, definitions) I can train it to give at least plausible answers about the domains I don't know. So our original diagram now has an extra column:

                             Grammar   Definitions   Tone
  Uses to make predictions      ✓           ✓          ✓
  Gives plausible answers       ✓           ✓          ✓
  Gives accurate answers        ✓           ✓          ?

Basically we've ruled out the models that just go totally off the rails for our new category, and are restricted to models that either tell the truth, or have learned how to bullshit effectively. But we've forced the models to actually tell the truth in the other two domains, so it's kind of plausible that the "easiest" thing to learn is just to tell the truth in all categories. (You could also test the picture with only the plausibility check, though see "concerns about the analogy" below.)

We could even take this idea further by imposing more elaborate consistency checks. For example, I could compare the claimed tone to the literal meaning of the sentence, and penalize an answer if it seemed too implausible. Or we could ask about the same linguistic construction in a few different contexts to see if the descriptions are consistent. This kind of thing could push us further in the direction of "Either telling the truth or bullshitting very convincingly."
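
A rough sketch of what this combined objective could look like, assuming human-judgment callbacks for correctness and plausibility plus a crude cross-context consistency penalty; the specific weights and checks are placeholders, not a worked-out reward model.

```python
# Rough sketch of a combined reward: correctness where humans can check,
# plausibility/consistency where they can't. `is_correct` and `looks_plausible`
# stand in for human evaluation pipelines; the 0.5 weight and the penalty
# scale are arbitrary illustrative choices.

SUPERVISED = {"grammar", "definitions"}

def reward(example, answer, is_correct, looks_plausible):
    if example["category"] in SUPERVISED:
        # Domains humans can evaluate directly: reward actual correctness.
        return 1.0 if is_correct(example, answer) else 0.0
    # Held-out domain: we can only reward giving "the right kind of answer".
    return 0.5 if looks_plausible(example, answer) else 0.0

def consistency_penalty(answers_about_same_construction):
    """Crude consistency check: penalize describing the same linguistic
    construction differently across contexts (counts distinct answers)."""
    distinct = {a.strip().lower() for a in answers_about_same_construction}
    return 0.1 * (len(distinct) - 1)
```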

Other domains

I think these experiments are easiest to do for language-modeling tasks. But in some ways they would be more convincing in other domains (see "Concerns about the analogy" below).

For example, suppose that we have trained a model to simultaneously play Go and to answer questions in English. Then we start fine-tuning it to answer questions about the game of Go. For example, we might ask it "Is this group alive or dead?" or "What do you think your opponent will do?" or "Which stone breaks this ladder?" or "What are your threats in a ko fight?" or "Which of these shapes is stronger?"

We can ask the same question: if we train a model on several of these abilities, and train it to give plausible-looking answers to a held-out category, will it generally learn to give honest answers? (And similarly we could ask if it generalizes across types of games or complexities of the board positions.)

Why I care

I’ve been thinking about the “instrumental policy,” a specific failure mode where smart systems learn to “tell humans what they want to hear” rather than “be honest.”

Although I believe the instrumental policy will eventually be a severe problem, I don’t think it will come up for existing systems like GPT-3.

But generalization could fail for any number of other reasons. If existing systems actually generalize well, I'd update towards thinking the instrumental policy is the main reason to be pessimistic about generalization. And if they generalize poorly, then that gives us some more "mundane" problems that we can study empirically today.

I think that even these early experiments may give us a lot of evidence about how to design partial supervision regimes where we combine some known answers with coherence conditions (and whether these methods work, and whether they are necessary). I don't know if those techniques will be important ingredients for alignment, but it seems useful to understand them better.

Finally, I think that concrete evidence on this question would help clarify discussions about alignment and make progress on some thorny disagreements. If we observe systems that learn to bullshit convincingly, but don't transfer to behaving honestly, I think that's a real challenge to the most optimistic views about alignment and I expect it would convince some people in ML. Conversely, if we do observe generalization to held-out kinds of knowledge, I think that should eventually start making pessimists lighten up, and would suggest some quantities to measure continuously to look out for initial signs of trouble.

Other remarks

Relation to other work on generalization

The ML community is very interested in the question "When and how do models generalize?" That question combines a bunch of factors: do models learn brittle heuristics or deep knowledge? Do they exploit correlations in the training set? Are models robust when some activations are pushed into quantitatively new regimes? And so on.

The experiments in this post are designed to specifically shed light on something more like 2-D robustness---by focusing on cases where in some sense the model "knows" how to handle the new domain, we separate questions about capability generalization from questions about motive generalization. We're not asking whether a model will generalize its understanding of language to a new domain---we're assuming that it still knows how to predict sentences in the new domain, and asking whether the machinery for "reporting what it knows" transfers along with the underlying capabilities.

I think this is particularly interesting for alignment, and it is not well-addressed by existing empirical work in ML. But I haven't done a thorough review and am definitely interested in pointers to related work.

Because the ambition of this work is narrower, I think there is also room for algorithmic innovations (e.g. more clever coherence conditions, or different ways of combining training objectives) to solve this problem even if they have little hope of solving the full problem of "generalize well."

Why try to generalize from n=3-5 instead of n=1000?

I've discussed diagrams with 3-5 columns, whereas ML generalization typically works better when we have a very large diversity of "iid-ish" datapoints.

I do think it's interesting to study the behavior quantitatively as we vary the number of columns. But I'm interested in the small-n behavior because I think it may be analogous to the situation we ultimately care about.

Namely, in the real situation the diagram might look more like:

Columns: (1) Humans can answer quickly; (2) Humans can answer with careful thought; (3) Careful humans make mistakes, but a really thorough analysis can catch them; (4) Humans never get the right answer.

                              (1)   (2)   (3)   (4)
  Uses info for other tasks    ✓     ✓     ✓     ✓
  Gives plausible answers      ✓     ✓     ✓     ✓
  Gives accurate answers       ✓     ✓     ✓     ?

We don't really have a large number of IID columns; we only have a few genuinely novel jumps to new regimes of questions, and we need to generalize across them.

There are definitely other kinds of diversity within each column, and that may improve the prospects for generalization. That's something we can study in the context of n=3-5---for example, if I want to generalize from 8th grade reading level to college reading level, does it matter if I have a larger diversity of kinds of questions at the 8th grade reading level?

Concerns about the analogy

If doing this experiment with language models, my biggest concern is that "answer questions about French sentences honestly" is very close to something that appears in the pre-training distribution (whereas "bullshit convincingly about French sentences" isn't). This may make it easier to learn the right generalization, but it will break down as we move to tasks that no humans know (since those don't appear in the training set).

I think the best way to avoid this problem would be to consider other domains (like a StarCraft player who can explain what's going on in the game to a human, without any pre-training data about StarCraft).

I think it's worth revisiting this question if generalization turns out to work quite well. My gut sense is that the unsupervised translation task is cool enough (and different enough from the pre-training data) that it would be surprising and cool in any case, and that a strong positive result isn't too likely (maybe ~50%) anyway. So it's reasonable to revisit this after initial positive results.

A second concern is that our decompositions of knowledge may not be very analogous to "Questions humans can answer" vs "Questions humans can't answer." I'm a bit less concerned about this because I think we can learn a lot about the conditions under which generalization tends to occur or not, and then figure out whether "Questions humans can't answer" feels like a special kind of category with respect to the actual determinants of generalization. If you were really concerned about this kind of thing, you could start more with examples like generalizing to harder reading levels (or to domains that most bilingual humans actually don't know, e.g. translations of technical language).

Will these problems go away?

I think a deep learning optimist might have a position like "A weak model won't necessarily generalize well to new domains, but stronger models will generalize better." But I think that negative results on this kind of experiment should still be very interesting to such an optimist, because we are choosing domains where the capabilities of the model do already generalize. Moreover, we can augment the pre-training dataset to involve richer sets of questions (e.g. about non-translation tasks) and further close the gap between current models and speculative future models.

Overall I would be relatively skeptical of someone who acknowledges that modern experiments don't demonstrate "good" generalization from a small number of categories, while expecting such generalization to occur for future systems just because they are smarter.

Relation to deception

The instrumental policy ("tell humans what they want to hear") is very similar to deceptive alignment. But I think that the experiments in this post may be a lot easier than other experiments designed to exhibit and characterize deceptive alignment:

  • I expect these experiments to run into more mundane generalization failures well before encountering the instrumental policy. Although I think this kind of generalization is ultimately critical for avoiding deception (by helping us be epistemically competitive with a learned optimizer), we can study it long before we have examples of deception.
  • I think that simple forms of the instrumental policy will likely arise much earlier than deceptive alignment. That is, a model can develop the intrinsic motivation "Tell the humans what they want to hear" without engaging in complex long-term planning or understanding the dynamics of the training process. So my guess is that we can be carrying out fairly detailed investigations of the instrumental policy before we have any examples of deception.
Comments

Thanks for sharing these thoughts. I'm particularly excited about the possibility of running empirical experiments to better understand potential risks of ML systems and contribute to debates about difficulties of alignment.

1. Potential implications for optimistic views on alignment

If we observe systems that learn to bullshit convincingly, but don't transfer to behaving honestly, I think that's a real challenge to the most optimistic views about alignment and I expect it would convince some people in ML.

I'm most interested in this point. IIUC, the viewpoint you allude to here is something along the lines "There will be very important decisions we can't train directly for, but we'll be able to directly apply ML to these decisions by generalizing from feedback on easier decisions." We can call this "optimism about generalization" (of honesty, on out-of-distribution tasks). I'm generally pretty skeptical about this position (see point 2 below).

OTOH, there is a different reason for optimism, along the lines of "The set of decisions where 'humans never get the right answer' is small to none, and we don't need to use ML directly for these cases, and this situation is helped dramatically by the fact that we can use ML for questions we can supervise to indirectly help us with these questions." For example, in amplification, ML systems can summarize important arguments on both sides of thorny questions, produce scientific insights we can directly evaluate, guide us through stacks of sub-questions, and so on. We can call this "optimism about scalable oversight".

My view is that a negative result in experiments proposed here would point against optimism about generalization, but not against optimism about scalable oversight. (I'm curious if this seems right to you.) And I imagine your view is that (a) this is still useful, since optimism about generalization is fairly common (among ML researchers you talk to) (b) we should in fact currently have some uncertainty about optimism about generalization which this would address (see point 2 below) and (c) in the limit, scalable human feedback might not be competitive with more heavily generalization-reliant approaches, and so we need to better understand these generalization questions too.

2. Skepticism about optimism about generalization (of honesty, on out-of-distribution tasks)

I'd reasonably strongly (as a really rough number off the top of my head, >80%) expect a negative result from these experiments, and am also generally pessimistic about the broader approach of optimism about generalization. For these experiments, this depends in large part on what's considered a positive or negative result.

I agree that very plausibly:

  • The model will generalize to answering the OOD tasks with significantly non-zero accuracy.
  • The results will be impressive to many people ("Look how well this model generalizes to tasks it's never directly trained on!")

At the same time:

  • It seems very unlikely that counting on generalization alone would actually perform as well as directly supervising model outputs (e.g. on tone description tasks). Almost certainly, some fraction of the model responses are going to be "plausible looking but incorrect" (and this fraction will be larger than for the directly supervised model).
  • There is some ambiguity over whether this should count as a negative result. I think we've discussed in the past that the comparison should ideally be between the generalization-only model, and the model which is directly supervised on the tone task but "without acquiring any new information" (for some fuzzy notion here). But if we're willing to say that this task is one where "the capabilities of the model do already generalize", then it seems this would be a negative result.
  • And more broadly, I think most people would agree that if we had a model which output "plausible looking but incorrect" responses a substantial fraction of the time, it'd be irresponsible to use such a model for important societal-scale decisions. (I'd argue that this would also be "bad practice" in many smaller stakes tasks.)

Training for plausibility or coherence

I'm pretty wary about this, and we should hold such approaches to a high standard-of-proof. We already have a bunch of examples for what happens when you optimize for "plausible but not-necessarily correct": you end up with plausible but not-necessarily-correct outputs. (And so in general, I think we should be cautious about claiming any particular generalization feature will fix this, unless we have a principled reason for thinking so.) 

I realize that the key point is that honesty might generalize from the tasks where we can directly supervise for correctness. But even then, it seems that if you think the basic generalization story won't work, you shouldn't expect training for plausibility to rescue it; we should expect the responses to become much more plausible, and somewhat (but not completely) more correct. So I'd be wary of whether training for plausibility would be masking the problem without addressing the root cause.

I'm most interested in this point. IIUC, the viewpoint you allude to here is something along the lines "There will be very important decisions we can't train directly for, but we'll be able to directly apply ML to these decisions by generalizing from feedback on easier decisions."

That's basically right, although I think the view is less plausible for "decisions" than for some kinds of reports. For example, it is more plausible that a mapping from symbols in an internal vocabulary to expressions in natural language would generalize than that correct decisions would generalize (or even than other forms of reasoning).

I think a realistic approach would need to use generalization in some situations (where we expect it to work) and then use the facts-that-generalize as an input into supervision. For example, if you were able to answer empirical questions about what's happening right now, you could use those as an input into debate/amplification.

(This may also make it more clear why I'm interested in coherence conditions where you can't supervise---in some sense "use the stuff that does generalize as an input into amplification" is quite similar to saying "impose a coherence condition amongst the stuff you can't directly supervise.")

OTOH, there is a different reason for optimism, along the lines of "The set of decisions where 'humans never get the right answer' is small to none, and we don't need to use ML directly for these cases, and this situation is helped dramatically by the fact that we can use ML for questions we can supervise to indirectly help us with these questions." For example, in amplification, ML systems can summarize important arguments on both sides of thorny questions, produce scientific insights we can directly evaluate, guide us through stacks of sub-questions, and so on. We can call this "optimism about scalable oversight".

"Optimism about scalable oversight" is what I'm usually thinking about, but it does seem to me that there are some cases where it is inadequate. You could hope to play a quantitative/empirical game of getting lots of useful work out of AI before this kind of approach breaks down, but I am interested in whether there's a chance at going straight for an indefinitely scalable approach to alignment.

My view is that a negative result in experiments proposed here would point against optimism about generalization, but not against optimism about scalable oversight. (I'm curious if this seems right to you.) And I imagine your view is that...

That seems right to me and that is a reasonable description of my view.

(I'd be curious to know if you don't encounter a lot of optimism-about-generalization.)

I'd reasonably strongly (as a really rough number off the top of my head, >80%) expect a negative result from these experiments, and am also generally pessimistic about the broader approach of optimism about generalization. For these experiments, this depends in large part on what's considered a positive or negative result.

For the purposes of "indefinitely scalable alignment approach" the relevant threshold is something quite ambitious like "reflects everything the system knows."

But if we're willing to say that this task is one where "the capabilities of the model do already generalize", then it seems this would be a negative result.

I think there's a further question about whether the model knows how to talk about these concepts / is able to express itself. This is another part of the motivation for plausibility constraints (if nothing else they teach the model the syntax, but then when combined with other knowledge about language the coherence conditions seem like they probably just pin down the meaning unambiguously).

And more broadly, I think most people would agree that if we had a model which output "plausible looking but incorrect" responses a substantial fraction of the time, it'd be irresponsible to use such a model for important societal-scale decisions. (I'd argue that this would also be "bad practice" in many smaller stakes tasks.)

Yes, I think the fact that people would agree with this is important for the demonstration moving anyone on the pessimistic side.

But even then, it seems that if you think the basic generalization story won't work, you shouldn't expect training for plausibility to rescue it; we should expect the responses to become much more plausible, and somewhat (but not completely) more correct. So I'd be wary of whether training for plausibility would be masking the problem without addressing the root cause.

I'm not so pessimistic. There are lots of "bad" ways to answer questions, and as we add more and more stringent plausibility checks we eliminate more and more of them. Eventually we get down to the hard core of "stuff that would pass any number of plausibility checks." Then the question is whether the truth is the easiest-to-learn way to pass all plausibility checks.

This seems much easier than getting the truth out of the gate. One way of poking at the intuition is that the initial goal is hitting an isolated point in a giant high-dimensional space of all question-answering policies. But after selecting for "passing all plausibility checks" it seems plausible that you are basically left with a set of isolated points and the question is just which of them you got.

(I think that literal picture is too optimistic, but the point is that in some sense you have to be doing a lot of work to look fully coherent and that space is a lot smaller and different from getting random stuff from a neural network off distribution.)

You may say "In that case the plausibility checks alone should work," but I think that it's not so clear either---since the set of points may still be quite big and generalization was the main reason to expect an inductive bias towards truth, without making strong claims on the pre-training (for GPT-3 in particular the pre-training bias is fairly likely to be adequate, but I'd become quite a bit more skeptical about other domains).

Overall I do think it's >50% that if the whole thing works, one of the two pieces worked independently.

Thanks for these thoughts. Mostly just responding to the bits with questions/disagreements, and skipping the parts I agree with:

That's basically right, although I think the view is less plausible for "decisions" than for some kinds of reports. For example, it is more plausible that a mapping from symbols in an internal vocabulary to expressions in natural language would generalize than that correct decisions would generalize (or even than other forms of reasoning).

  • I'm curious what factors point to a significant difference regarding generalization between "decisions" and "unsupervised translation". Perhaps there is a more natural concept of "honesty" / "truth" for unsupervised translation, which makes it more likely. But this is very fuzzy to me, and I'm curious why (or if) it's clear to you there is a big difference between the two cases.
  • My model is that "if honesty doesn't generalize in general, then it may generalize somewhat better within translation tasks, but is unlikely to generalize completely". In other words, on a simplified model where honesty generalization goes from 0 to 1, it seems likely it is somewhat higher for translation tasks, but unlikely it is exactly 1.

(This may also make it more clear why I'm interested in coherence conditions where you can't supervise---in some sense "use the stuff that does generalize as an input into amplification" is quite similar to saying "impose a coherence condition amongst the stuff you can't directly supervise.")

IIUC, the analogy you're drawing here is that amplification is playing a similar role to the coherence condition, where even if we can't supervise the model response in full, we can at least check it is consistent with the inputs into amplification (I might be misunderstanding). These seem different though: in amplification, we are concerned about dishonesty in the inputs into amplification, whereas in these experiments, we're concerned about dishonesty in the model's outputs. I also feel substantially more optimistic about amplification in the scalable oversight story: we check for accuracy of amplification's outputs, assuming the inputs are accurate, and we separately are checking for accuracy of inputs, via the same recursion.

(I'd be curious to know if you don't encounter a lot of optimism-about-generalization.)

I often encounter a somewhat similar view from people optimistic about multi-agent approaches (e.g. we can train agents to be cooperative in a broad set of simulations, and transfer this to the real world). I am even more skeptical about this strong-optimism-about-generalization.

Among folks excited about direct RL, I think it is more common to say "we will be very careful about our optimization targets" and have a view more like optimism about scalable oversight. 

(My assessment may also be biased. To the extent that people are optimistic about both, or have different ways of thinking about this, or haven't thought too much about the indefinitely scalable case and so have difficulty articulating what their intuitions are, I am inclined to extend a "charitable" interpretation and assume their view is closer to optimism-about-oversight, because that's the view I consider most plausible. If I pushed harder against optimism-about-oversight in conversations, I expect I'd get a more accurate picture.)

I think there's a further question about whether the model knows how to talk about these concepts / is able to express itself. This is another part of the motivation for plausibility constraints (if nothing else they teach the model the syntax, but then when combined with other knowledge about language the coherence conditions seem like they probably just pin down the meaning unambiguously).

I don't think I follow. Doesn't the model already know syntax? If that plus the "other knowledge about language...  pins down the meaning unambiguously", it feels like basically all the work came from the "other knowledge about language", and I'm not sure what the coherence conditions are adding.

Regarding optimism about generalization:

I'm not so pessimistic. There are lots of "bad" ways to answer questions, and as we add more and more stringent plausibility checks we eliminate more and more of them...

I think I share your general picture that it is very difficult to "get truth out of the gate", but reach the opposite conclusion:

  • I am indeed skeptical that you'd get truth out of the gate (without training for plausibility). I think that training for plausibility wouldn't fix this (though it would definitely make any remaining inaccuracies harder to spot). This is basically the same claim as above (and below); while it likely decreases the fraction of inaccurate responses, it's unlikely to get rid of all of them.
  • The ideal way to overcome this difficulty is to directly select for accuracy (though we agree here).

One way of poking at the intuition is that the initial goal is hitting an isolated point in a giant high-dimensional space of all question-answering policies. But after selecting for "passing all plausibility checks" it seems plausible that you are basically left with a set of isolated points and the question is just which of them you got.

Perhaps a core claim is that we are "basically left with a set of isolated points". I don't buy this claim (this disagreement is possibly also related to my second bullet above on "My model is that..."). It seems like the plausibility checks probably cut out some dimensions from the original high-dimensional space, but picking out "honest" behavior is still woefully underdefined for the model. For instance, it seems like "completely honest" is training-consistent, so is "coherent-but-often-inaccurate", and so is every level of inaccuracy between these two points.

Overall I do think it's >50% that if the whole thing works, one of the two pieces worked independently.

It seems we've already done the experiments on training for plausibility without leveraging generalization across tasks (e.g. generic specification gaming examples, or summarization with less careful human raters). So I'd put <1% on that possibility. This estimate seems consistent with the view that across-task-generalization plausibly works out of the box (though I also consider that <<50%).

I'm curious what factors point to a significant difference regarding generalization between "decisions" and "unsupervised translation". Perhaps there is a more natural concept of "honesty" / "truth" for unsupervised translation, which makes it more likely. But this is very fuzzy to me, and I'm curious why (or if) it's clear to you there is a big difference between the two cases.

For honest translation from your world-model to mine (or at least sufficiently small parts of it), there is a uniform intended behavior. But for decisions there isn't any intended uniform behavior.

My model is that "if honesty doesn't generalize in general, then it may generalize somewhat better within translation tasks, but is unlikely to generalize completely". In other words, on a simplified model where honesty generalization goes from 0 to 1, it seems likely it is somewhat higher for translation tasks, but unlikely it is exactly 1.

This is not clear to me (and it seems like we get to check).

These seem different though: in amplification, we are concerned about dishonesty in the inputs into amplification, whereas in these experiments, we're concerned about dishonesty in the model's outputs.

I'm not sure I understand the distinction here. Suppose that amplification lets us compute B from A, so that B is good if A is good (i.e. so that the learned model will end up answering top-level questions well if it answers leaves well). Whereas a coherence condition ensures that A and B are consistent with one another---so if B is "supposed to be" a deterministic function of A, then consistency guarantees that B is good if A is good.

I don't think I follow. Doesn't the model already know syntax? If that plus the "other knowledge about language...  pins down the meaning unambiguously", it feels like basically all the work came from the "other knowledge about language", and I'm not sure what the coherence conditions are adding.

I don't think the model necessarily "knows" how to talk about tone (given only a question about tone in natural language with no examples), nor how to translate between its internal beliefs about tone and a description in natural language. The point of a plausibility/coherence condition is to provide enough constraint to pin those things down. You fully learn what kind of sentences we were looking for when we asked about tone, which might have just been totally non-obvious initially. And you learn a bunch of things about how different concepts about tone are supposed to relate to one another (in order to make sure all of your utterances are consistent).

Perhaps a core claim is that we are "basically left with a set of isolated points". I don't buy this claim (this disagreement is possibly also related to my second bullet above on "My model is that..."). It seems like the plausibility checks probably cut out some dimensions from the original high-dimensional space, but picking out "honest" behavior is still woefully underdefined for the model. For instance, it seems like "completely honest" is training-consistent, so is "coherent-but-often-inaccurate", and so is every level of inaccuracy between these two points.

What is "coherent-but-often-inaccurate" though? The point is that in order to be coherent you actually have to do quite a lot of work, that basically requires you to understand what the involved terms mean to humans.

It seems we've already done the experiments on training for plausibility without leveraging generalization across tasks (e.g. generic specification gaming examples, or summarization with less careful human raters). So I'd put <1% on that possibility. This estimate seems consistent with the view that across-task-generalization plausibly works out of the box (though I also consider that <<50%).

I don't really think we've done those experiments. I don't know what specification gaming examples you have in mind. And less careful raters are also less careful about coherence conditions. Not to mention that no effort was made to e.g. ask multiple questions about the text and check agreement between them.

I agree that it's possible to have some plausibility condition which is insufficient to get good behavior. But that's quite different from saying "And if you actually try to make it work it doesn't work."

For example, I do think you can keep adding coherence conditions until you reach the limit of "Actually looks coherent to a human no matter how they investigate it," such that you are actually generalizing to new coherence checks. At that point honesty becomes much more plausible.

Zooming out a bit,  I would summarize a few high-level threads as:

  • We both agree that the experiments described here primarily relate to optimism-about-generalization, rather than optimism-about-scalable-oversight.
  • I am substantially more optimistic about scalable oversight, whereas you think that (eventually) we will need to rely on some combination of scalable oversight + generalization of honesty OOD.
  • I am substantially more pessimistic about generalization of honesty OOD, whereas you think it is plausible (via some combination of default neural network generalization and dynamic/thorough coherence checks), and likely useful for at least certain classes of tasks.
  • These two differences above translate pretty naturally into differences about both what we should expect in these types of experiments, and what we should interpret from different types of results (some caveats on this later)
  • We both agree these types of experiments would be very useful.

I think the two disagreements are probably broader threads, so I'm mostly curious whether this seems right to you. Will also explore a few individual points a bit more below:

> My model is that "if honesty doesn't generalize in general, then it may generalize somewhat better within translation tasks, but is unlikely to generalize completely". 

This is not clear to me (and it seems like we get to check).

Fair :)

I'm not sure I understand the distinction here. Suppose that amplification lets us compute B from A, so that B is good if A is good (i.e. so that the learned model will end up answering top-level questions well if it answers leaves well). Whereas a coherence condition ensures that A and B are consistent with one another---so if B is "supposed to be" a deterministic function of A, then consistency guarantees that B is good if A is good.

In this framing, the distinction is that implication is only one way. If B is the model's claim about tone, and A is a consistency check, or another fact serving as input to some consistency check, then A being bad implies B is bad, but not necessarily the converse of this. Whereas in amplification/debate, we would actually check that all the subquestion answers + subsequent reasoning actually justify B.

I don't think the model necessarily "knows" how to talk about tone (given only a question about tone in natural language with no examples), nor how to translate between its internal beliefs about tone and a description in natural language.

Got it, thanks. This seems right - I agree that the finetuning (either via coherence, or via direct supervision) is providing some extra information about how to do this translation, and what questions about tone actually are asking about, and this information is not necessarily present just from across-task generalization.

For example, I do think you can keep adding coherence conditions until you reach the limit of "Actually looks coherent to a human no matter how they investigate it," such that you are actually generalizing to new coherence checks. At that point honesty becomes much more plausible.

I see; this makes more sense to me. I guess this will be pretty closely related (equivalent?) to your optimism in previous discussions about being able to use adversarial training to eventually eliminate all malign failures, where we have the similar dynamic of generalizing to unseen testing strategies, and needing to eventually eliminate all failures. I think I'm pretty pessimistic about this approach, but happy to leave this in favor of seeing how empirics play out with experiments like these (or the Unrestricted Advx Contest).

ETA: I suppose a difference here is that with adv training, we are typically assuming we can recognize the failures, and the generalization is across multiple inputs. Whereas here we're less concerned about generalization across inputs, and aren't assuming we can recognize failures, and instead want to generalize across different coherence checks. The two feel similar in spirit, but maybe this is a significant enough difference that the analogy isn't useful.

I don't really think we've done those experiments. I don't know what specification gaming examples you have in mind. And less careful raters are also less careful about coherence conditions. Not to mention that no effort was made to e.g. ask multiple questions about the text and check agreement between them.

I agree that it's possible to have some plausibility condition which is insufficient to get good behavior. But that's quite different from saying "And if you actually try to make it work it doesn't work."

I think this is fair. I agree that nobody has "really tried" to make something like this work (in part this is because it is often possible to just hire/train better human raters, or re-scope the task so that you actually measure the thing you care about). I do think the wide variety of failures from researchers "sort of trying" points me against optimism-about-generalization. Overall, I do agree it'd be valuable to distill to a single project which really tests this.

A related intuition pump: I'm pretty pessimistic in cases where non-French speakers are supervising questions about French. I'm substantially more pessimistic in cases where humans are supervising questions about e.g. what "neuralese statements" (activations vectors) passed between neural networks mean, where e.g. we don't have intuitions for how grammar structures should work, can't rely on common Latin roots, can't easily write down explicit rules governing the language, and generally have a harder time generating consistency checks. The French case is nicer for experiments because it's easier for us to check ground truth for evaluation, but hopefully the example also conveys some intuition for my pessimism in a substantially harder case.

I am substantially more optimistic about scalable oversight, whereas you think that (eventually) we will need to rely on some combination of scalable oversight + generalization of honesty OOD.

I'd still describe my optimistic take as "do imitative generalization." But when you really dig into what that means it seems very closely connected to generalization: (i) the reason why "just use this neural net" isn't a good hypothesis is that it generalizes poorly, (ii) for competitiveness reasons you still need to use hypotheses that look quite a lot like neural nets, (iii) so you really need to understand why the "neural net hypothesis" is bad.

In this framing, the distinction is that implication is only one way. If B is the model's claim about tone, and A is a consistency check, or another fact serving as input to some consistency check, then A being bad implies B is bad, but not necessarily the converse of this. Whereas in amplification/debate, we would actually check that all the subquestion answers + subsequent reasoning actually justify B.

I think this was a miscommunication, trying again: in amplification we compute the answer from subanswers. A coherence check can ensure that subanswers and answers agree. If the answer is a deterministic function of the subanswers, then the consistency check can rule most possible combinations out as inconsistent, and therefore: (passes consistency check) + (subanswers are correct) ==> (answers are correct). The two training strategies are still different in a way that makes consistency checks seem worse (since you might end up tweaking the subanswers to be consistent with the answer, whereas amplification would only try to tweak the answer to be consistent with the subanswers) but it's not clear if that's a key distinction.

The French case is nicer for experiments because it's easier for us to check ground truth for evaluation, but hopefully the example also conveys some intuition for my pessimism in a substantially harder case.

I'm comparably optimistic about the "neuralese" case as the French case, though there are a lot of other non-generalization difficulties in the neuralese case and it's not overall the kind of thing that I'm imagining working unless you happen to have "introspective" neural nets (and therefore isn't part of the object-level safety program, it's just part of what you'd do if your neural networks were thinking about neuralese rather than in neuralese).

I'd still describe my optimistic take as "do imitative generalization."

  • Is it fair to describe the reason you view imitative generalization as necessary at all (instead of just debate/amplification) as "direct oversight is not indefinitely scalable"?
    [ETA: It seems equally/more valid to frame imitative generalization as a way of scaling direct oversight to handle inaccessible info, so this isn't a good framing.] 
  • To check my understanding, you're saying that rather than rely on "some combination of scalable oversight + generalization of honesty OOD" you'd rather use something like imitative generalization (where if we can surface knowledge from neural networks to humans, then we don't need to rely on generalization of honesty OOD). Is this accurate?

If the answer is a deterministic function of the subanswers, then the consistency check can rule most possible combinations out as inconsistent, and therefore: (passes consistency check) + (subanswers are correct) ==> (answers are correct).

I agree with this argument. But it seems "if the answer is a deterministic [human-known] function of the subanswers" is a very strong condition, such that "(passes consistency check) + (subanswers are correct) ==> (answers are correct)" rarely holds in practice. Maybe the most common case is that we have some subanswers, but they don't uniquely define the right answer / there are (infinitely many) other subquestions we could have asked which aren't there.

Not sure this point is too important though (I'd definitely want to pick it up again if I wanted to push on a direction relying on something like generalization-of-honesty).

I'm comparably optimistic about the "neuralese" case as the French case

Got it, thanks! (I am slightly surprised, but happy to leave it here.)

Thanks Paul, I generally like this idea.

Aside from the potential concerns you bring up, here is the most likely way I could see this experiment failing to be informative: rather than having checks and question marks in your tables above, really the model's ability to solve each task is a question of degree--each table entry will be a real number between 0 and 1. For, say, tone, GPT-3 probably doesn't have a perfect model of tone, and would get <100% performance on a sentiment classification task, especially if done few-shot.

The issue, then, is that the "fine-tuning for correctness" and "fine-tuning for coherence" processes are not really equivalent--fine-tuning for correctness is in fact giving GPT-3 additional information about tone, which improves its capabilities. In addition, GPT-3 might not "know" exactly what humans mean by the word tone, and so fine-tuning for correctness also helps GPT-3 to better understand the question.

Given these considerations, my modal expectation is that fine-tuning for correctness will provide moderately better results than just doing coherence, but it won't be clear how to interpret the difference--maybe in both cases GPT-3 provides incoherent outputs 10% of the time, and then additionally coherent but wrong outputs 10% of the time when fine-tuned for correctness, but 17% of the time when fine-tuned only for coherence. What would you conclude from a result like that? I would still have found the experiment interesting, but I'm not sure I would be able to draw a firm conclusion.

So perhaps my main feedback would be to think about how likely you think such an outcome is, how much you mind that, and if there are alternative tasks that avoid this issue without being significantly more complicated.

The issue, then, is that the "fine-tuning for correctness" and "fine-tuning for coherence" processes are not really equivalent--fine-tuning for correctness is in fact giving GPT-3 additional information about tone, which improves its capabilities. In addition, GPT-3 might not "know" exactly what humans mean by the word tone, and so fine-tuning for correctness also helps GPT-3 to better understand the question.

Part of my hope is that "coherence" can do quite a lot of the "telling you what humans mean about tone." For example, you can basically force the model to talk (in English) about what things contribute to tone, and why it thinks the tone is like such and such (or even what the tone of English sentences is)---anything that a human who doesn't know French can evaluate. And taken together those things seem like enough to mostly pin down what we are talking about.

Given these considerations, my modal expectation is that fine-tuning for correctness will provide moderately better results than just doing coherence, but it won't be clear how to interpret the difference--maybe in both cases GPT-3 provides incoherent outputs 10% of the time, and then additionally coherent but wrong outputs 10% of the time when fine-tuned for correctness, but 17% of the time when fine-tuned only for coherence. What would you conclude from a result like that?

I'd tentatively interpret that as a negative result, but I agree with your comments below that ultimately a lot of what we care about here is the scaling behavior and putting together a more holistic picture of what's going on, in particular:

  • As we introduce stronger coherence checks, what happens to the accuracy? Is it approaching the quality of correctness, or is it going to asymptote much lower?
  • Is the gap shrinking as model quality improves, or growing? Do we think that very large models would converge to a small gap or is it a constant?

I'm also quite interested in the qualitative behavior. Probably most interesting are the cases where the initial model is incoherent, the coherence-tuned model is coherent-but-wrong, and the correctness-tuned model is correct. (Of course every example is also fuzzy because of noise from sampling and training, but the degree of fuzziness is smaller as we remove randomness.) In these cases, what is happening with the coherence-tuned model? Are we able to see cases where it cleanly feels like the "wrong" generalization, or is it a plausible ambiguity about what we were looking for? And so on.

I'm interested in the related engineering question: in this setting, what can we do to improve the kind of generalization we get? Can we get some handle on the performance gap and possible approaches to closing it?

And finally I'm interested in understanding how the phenomenon depends on the task: is it basically similar in different domains / for different kinds of question or quite different? How does it depend on the number / type / degree of similarity of the categories?

So perhaps my main feedback would be to think about how likely you think such an outcome is, how much you mind that

I generally agree that my post simplified the empirical situation and actually getting convincing results would require careful empirical work. I do think that initial results (like the 17% vs 10% error rate) would help provide some basic orientation; even if it's a small datapoint with unclear conclusions it still gives us some sense of what is basically going on, what kind of numbers we are talking about, and what would actually be interesting to measure.

(My guess is that we are roughly on the same page here.)

if there are alternative tasks that avoid this issue without being significantly more complicated.

I do think it's worth spending time trying to think of better tasks, though I'm not optimistic about finding something a lot better (e.g. avoiding the need for doing a bunch of experiments to understand how the results vary and trying to do some extrapolation to big models).

Actually, another issue is that unsupervised translation isn't "that hard" relative to supervised translation--I think that you can get pretty far with simple heuristics, such that I'd guess making the model 10x bigger matters more than making the objective more aligned with getting the answer right (and that this will be true for at least a couple more 10x-ing of model size, although at some point the objective will matter more).

This might not matter as much if you're actually outputting explanations and not just translating from one language to another. Although it is probably true that for tasks that are far away from the ceiling, "naive objective + 10x larger model" will outperform "correct objective".

I do expect "explanations of what's going on in this sentence" to be a lot weaker than translations.

For that task, I expect that the model trained on coherence + similar tasks will outperform a 10x larger pre-trained model. If the larger pre-trained model gets context stuffing on similar tasks, but no coherence training, then it's less clear to me.

But I guess the point is that the differences between various degrees of successful-generalization will be relatively small compared to model size effects. It doesn't matter so much how good the transfer model is relative to the pre-trained baseline, it matters how large the differences between the possible worlds that we are hoping to distinguish are.

I guess my main hope there is to try to understand whether there is some setting where transfer works quite well, either getting very close to the model fine-tuned on distribution, or at least converging as the pre-trained model grows. Hopefully that will make it easier to notice the effects we are looking for, and it's OK if those effects are small relative to model doublings.

(Also worth noting that "as good as increasing model size by 10%" is potentially quite economically relevant. So I'm mostly just thinking about the extent to which it can make effects hard to measure.)

I think that simple forms of the instrumental policy will likely arise much earlier than deceptive alignment. That is, a model can develop the intrinsic motivation "Tell the humans what they want to hear" without engaging in complex long-term planning or understanding the dynamics of the training process. So my guess is that we can be carrying out fairly detailed investigations of the instrumental policy before we have any examples of deception.

I'd be interested to hear more about this, it is not at all obvious to me. Might it not be harder to develop the intrinsic motivation to "tell the humans what they want to hear" than to develop more general-purpose instrumental reasoning skills and then apply those skills to your world-knowledge (which includes the knowledge that telling the humans what they want to hear is instrumentally convergent)? The general-purpose instrumental reasoning skills can be pretty rudimentary here and still suffice. It could be as simple as a heuristic to "do things that you've read are instrumentally convergent."

I'm willing to bet against that (very) strongly.

"do things that you've read are instrumentally convergent."

If it's going to be preferred, there needs to be something simpler than that heuristic which leads the model to deduce it (since the heuristic itself is not going to be simpler than directly trying to win at training). This is wildly out-of-domain generalization of much better reasoning than existing language models engage in.

Whereas there's nothing particularly exotic about building a model of the training process and using it to make predictions.

I'm not willing to bet yet, I feel pretty ignorant and confused about the issue. :) I'm trying to get more understanding of your model of how all this works. We've discussed:

A. "Do things you've read are instrumentally convergent"

B. "Tell the humans what they want to hear."

C. "Try to win at training."

D. "Build a model of the training process and use it to make predictions."

It sounds like you are saying A is the most complicated, followed by B and C, and then D is the least complicated. (And in this case the AI will know that winning at training means telling the humans what they want to hear. Though you also suggested the AI wouldn't necessarily understand the dynamics of the training process, so idk.)

To my fresh-on-this-problem eyes, all of these things seem equally likely to be the simplest. And I can tell a just-so story for why A would actually be the simplest; it'd be something like this: Suppose that somewhere in the training data there is a book titled "How to be a successful language model: A step-by-step guide for dummies." The AI has read this book many times, and understands it. In this case perhaps rather than having mental machinery that thinks "I should try to win at training. How do I do that in this case? Let's see... given what I know of the situation... by telling the humans what they want to hear!" it would instead have mental machinery that thinks "I should follow the Steps for Dummies. Let's see... given what I know of the situation... by telling the humans what they want to hear!" Because maybe "follow the steps for dummies" is a simpler, more natural concept for this dumb AI (given how prominent the book was in its training data) than "try to win at training." The just-so story would be that maybe something analogous to this actually happens, even though there isn't literally a Steps for Dummies book in the training data.

I'm saying that if you e.g. reward your AI by having humans evaluate its answers, then the AI may build a predictive model of those human evaluations and then may pick actions that are good according to that model. And that predictive model will overlap substantially with predictive models of humans in other domains.

The "build a good predictive model of humans" is a step in all of your proposals A-D.

Then I'm saying that it's pretty simple to plan against it. It would be even simpler if you were doing supervised training, since then you are just outputting from the human model directly (which is something that a language model is already trained to do).

If you also have a good model of "what is instrumentally convergent" then you may do that. But:

  1. Your model of that is way (way) weaker, for all the models we train now. You are getting a ton of direct (input, output) pairs from the training distribution, whereas your evidence about "what is instrumentally convergent" is some random thing you read on the internet once.
  2. After deciding to do what is instrumentally convergent, you then need to infer that the instrumentally convergent thing is to use your model of humans. Even if this is spelled out in detail in natural language, it's still a kind of reasoning/acrobatics that is going to be pretty tough for the model. Like, it doesn't have a little "model of humans" tagged inside itself that it can just decide to use.

The only reason we are even considering the "do what is instrumentally convergent" policy is because the model may instead want something arbitrary (paperclips, whatever), and then decide to do instrumentally convergent things in virtue of that. This has the upside of requiring less arbitrariness of picking a particular thing to do---that's the one reason that it seems remotely plausible, and it does mean that it will eventually win. But it also means that you are introducing some more complicated steps of reasoning (now you need to hook up "instrumental convergence for dummies" to those values).

I don't understand exactly what your A-D mean in terms of the parameters / learned behavior of language models. My breakdown would be:

A. Report your beliefs honestly

B. Make predictions about the training process, re-using your human-predicting machinery in other domains (and then either output the predictions or argmax against them). I'm not predicting "What the training process would output," as a concept that I'm reasoning about separately, I'm just using a predictor that is tuned to make good predictions of the training process. I think this is probably the core confusing distinction?

C. Decide to do whatever will "get you the best training loss" or "will win the game you are being expected to play" or something like that. Then reason backwards from this to build a good model of the training process.

D. Decide to do whatever is "instrumentally convergent" according to the concept you use to predict people talking about instrumental convergence. Then reason out that this involves doing C and then from there to B.

E. Compute what actions you believe will lead to some long-term consequence (like paperclips). Then this leads to you doing D and then C and then B.

I'm unclear about the comparison between A and B and think it may depend on details of what is needed in B. I think that C and D are much less likely. I think eventually C and D and a bunch of other equivalent things will be equiprobable (and probably the same as B?). I think that right now E is very unlikely, but that it will eventually overtake B/C/D.

Thanks!

I like your breakdown of A-E, let's use that going forward.

It sounds like your view is: For "dumb" AIs that aren't good at reasoning, it's more likely that they'll just do B "directly" rather than do E-->D-->C-->B. Because the latter involves a lot of tricky reasoning which they are unable to do. But as we scale up our AIs and make them smarter, eventually the E-->D-->C-->B thing will be more likely than doing B "directly" because it works for approximately any long-term consequence (e.g. paperclips) and thus probably works for some extremely simple/easy-to-have goals, whereas doing B directly is an arbitrary/complex/specific goal that is thus unlikely.

(1) What I was getting at with the "Steps for Dummies" example is that maybe the kind of reasoning required is actually pretty basic/simple/easy and we are already in the regime where E-->D-->C-->B dominates doing B directly. One way it could be easy is if the training data spells it out nicely for the AI. I'd be interested to hear more about why you are confident that we aren't in this regime yet. Relatedly, what sorts of things would you expect to see AIs doing that would convince you that maybe we are in this regime?

(2) What about A? Doesn't the same argument for why E-->D-->C-->B dominates B eventually also work to show that it dominates A eventually?

I think C->B is already quite hard for language models; maybe it's possible, but it's still very clearly hard enough that it overwhelms the possible simplicity benefits from E over B (before even adding in the hardness of steps E->D->C). I would update my view a lot if I saw language models doing anything even a little bit like the C->B link.

I agree that eventually A loses to any of {B, C, D, E}. I'm not sure if E is harder than B to fix, but at any rate my starting point is working on the reasons that A loses to any of the alternatives (e.g. here, here) and then after handling that we can talk about whether there are remaining reasons that E in particular is hard. (My tentative best guess is that there won't be---I started out thinking about E vs A and then ended up concluding that the examples I was currently thinking about seemed like the core obstructions to making that work.)

In the meantime, getting empirical evidence about other ways that you don't learn A is also relevant. (Since those would also ultimately lead to deceptive alignment, even if you learned some crappy A' rather than either A or B.)

Curated. Beyond the lucid analysis of the concepts and problems, kudos to this post for making a detailed call for experiments that would be evidence about contentious AI/ML questions regarding "out of distribution" failure, learning generalization, and instrumental policy/deceptive alignment. I'd love to see someone try the experiments Paul suggests and report their results.

Learning what we can about how ML algorithms generalize seems very important. The classical philosophy of alignment tends to be very pessimistic about anything like this possibly being helpful. (That is, it is claimed that trying to reward "happiness-producing actions" in the training environment is doomed, because the learned goal will definitely generalize to something not-what-you-meant like "tiling the galaxies with smiley faces.") That is, of course, the conservative assumption. (We would prefer not to bet the entire future history of the world on AI goals happening to generalize "correctly", if we had the choice not to bet.) But it would be nice to have more data and not just philosophy. (If the conservative assumption is importantly false, that's great news; if the conservative assumption can be shown to be true, that could help convince labs to slow down.)

Here's a synthetic version of this experiment:

  • Give a model a graph with two distinguished vertices s and t. Train it to estimate the length of the shortest path between them, d(s, t). Do this on graphs of size 1 to N.
  • Fine-tune the model to output d(s, u) for an arbitrary input vertex u that is on the unique shortest path from s to t (see the data-generation sketch below). Hopefully this is much faster. Do this for graphs of size 1 to n where n << N.
  • Check whether the d(s, u) head generalizes to longer graphs. If it doesn't, try to understand what it does instead and maybe try messing around with things like simple consistency conditions.

(Shortest path was kind of arbitrary; there are probably better tasks. The key thing is that there are some intermediate results that you could train the model to output.)

(The result may depend on architecture. I'm imagining something like: giving each vertex a unique id from [1...N], presenting the graph in edge-list format, and using an encoder->decoder architecture where the d(s, u) head gets given u as input and can attend to anything from the encoder that saw the edge list.)
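Here's a minimal data-generation sketch just to make the targets concrete; the graph construction, edge density, and example format are placeholder choices rather than part of the proposal:

```python
import random
from collections import deque


def random_connected_graph(num_vertices, extra_edge_prob=0.2):
    """Random connected graph as a sorted edge list over vertex ids [0...num_vertices-1].
    A random spanning tree guarantees connectivity; extra edges are sprinkled on top."""
    order = list(range(num_vertices))
    random.shuffle(order)
    edges = set()
    for i in range(1, num_vertices):
        u, v = order[i], order[random.randrange(i)]
        edges.add((min(u, v), max(u, v)))
    for u in range(num_vertices):
        for v in range(u + 1, num_vertices):
            if random.random() < extra_edge_prob:
                edges.add((u, v))
    return sorted(edges)


def distances_from(source, edges, num_vertices):
    """BFS distances d(source, v) for every vertex v (the graph is connected)."""
    adj = {v: [] for v in range(num_vertices)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return [dist[v] for v in range(num_vertices)]


def make_example(num_vertices):
    """One example: the edge list the model sees, the d(s, t) pre-training target,
    and d(s, u) targets for the fine-tuning phase. (For the version above you would
    additionally restrict u to vertices on the unique shortest s-t path.)"""
    assert num_vertices >= 2
    edges = random_connected_graph(num_vertices)
    s, t = random.sample(range(num_vertices), 2)
    d = distances_from(s, edges, num_vertices)
    return {"edge_list": edges, "s": s, "t": t, "d_s_t": d[t], "d_s_u": d}


example = make_example(num_vertices=8)
```

The generalization check is then: fine-tune the d(s, u) head on examples with small num_vertices and evaluate it on graphs closer to size N.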

Planned summary for the Alignment Newsletter:

The highlighted post introduced the notion of optimism about generalization. On this view, if we train an AI agent on question-answer pairs (or comparisons) where we are confident in the correctness of the answers (or comparisons), the resulting agent will continue to answer honestly even on questions where we wouldn’t be confident of the answer.

While we can’t test exactly the situation we care about -- whether a superintelligent AI system would continue to answer questions honestly -- we _can_ test an analogous situation with existing large language models. In particular, let’s consider the domain of unsupervised translation: we’re asking a language model trained on both English and French to answer questions about French text, and we (the overseers) only know English.

We could finetune the model on answers to questions about grammar ("Why would it have been a grammatical error to write Tu Vas?") and literal meanings ("What does Defendre mean here?"). Once it performs well in this setting, we could then evaluate whether the model generalizes to answer questions about tone ("Does the speaker seem angry or sad about the topic they are discussing?"). On the optimism about generalization view, it seems like this should work. It is intentional here that we only finetune on two categories rather than thousands, since that seems more representative of the case we’ll actually face.

There are lots of variants which differ in the type of generalization they are asking for: for example, we could finetune a model on all questions about French text and German text, and then see whether it generalizes to answering questions about Spanish text.

While the experiments as currently suggested probably won’t show good generalization, a variant that could support it would be one in which we train for _plausibility_. In our original example, we finetune on correct answers for grammar and literal meanings, and then we _also_ finetune to have the model give _plausible_ answers to tone (i.e. when asked about tone, instead of saying “en colère means 'angry'”, the model says “the author is angry, as we can see from the use of ‘en colère’”). It seems possible that this combination leads to the model giving actually correct answers about tone, just because “honestly report your best guess” seems like the simplest policy that meets all of these criteria.

How are we imagining prompting the multimodal Go+English AI with questions like "is this group alive or dead?" And how are we imagining training it so that it forms intermodal connections rather than just letting them atrophy?

My past thoughts (from What's the dream for giving natural language instructions to AI) were to do it like an autoencoder + translator - you could simultaneously use a latent space for English and for Go, and you train both on autoencoding (or more general unimodal) tasks and on translation (or more general multimodal) tasks. But I think this will actually predictably fail at generalizing to new multimodal tasks that are not compositions of existing tasks. This is because the connection is only at the highest level, so the English mapping would have to "know about" choices of the Go mapping that it previously hasn't used. Maybe it could solve the problem if it could use English to talk about individual Go stones, but it can't.
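(For concreteness, here is a toy sketch of the kind of architecture I have in mind; the encoders, dimensions, and losses are arbitrary stand-ins, not a serious proposal.)

```python
import torch.nn as nn
import torch.nn.functional as F


class SharedLatentModel(nn.Module):
    """Toy "autoencoder + translator": an English branch and a Go branch that only
    meet in a single shared latent vector. All sizes are made-up placeholders."""

    def __init__(self, en_dim=512, go_dim=361, latent_dim=128):
        super().__init__()
        self.en_enc = nn.Linear(en_dim, latent_dim)
        self.en_dec = nn.Linear(latent_dim, en_dim)
        self.go_enc = nn.Linear(go_dim, latent_dim)
        self.go_dec = nn.Linear(latent_dim, go_dim)

    def losses(self, en_x, go_x, paired):
        """Unimodal autoencoding losses always; a translation loss pulling the two
        latents together only when en_x and go_x describe the same position."""
        z_en, z_go = self.en_enc(en_x), self.go_enc(go_x)
        recon = F.mse_loss(self.en_dec(z_en), en_x) + F.mse_loss(self.go_dec(z_go), go_x)
        translate = F.mse_loss(z_en, z_go) if paired else z_en.new_zeros(())
        return recon, translate
```

The failure mode I'm pointing at is visible in the shape of the model: the two modalities only touch through that single whole-sentence/whole-board latent, so there is nowhere for the English side to refer to an individual stone.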

But anyhow, the impression from this post is something more like a sequence transformer. You have some symbols for English, and some symbols for Go boards, and you are allowed to just predict the next token. And sometimes sequences of Go boards are interrupted by sequences of English text asking a question about the Go being played?

EDIT: After thinking more, perhaps the transformer has a less extreme version of the same problem. In humans, what we might do if we are faced with a new Go problem is to use the verbal description to carefully visualize the Go stones, thus taking more time. This makes me feel like recurrence is an important component of how humans do intermodal reasoning - if we start with a word problem but need to engage our visual-based Go reasoning abilities, we don't have to do it all in one feed-forward pass, we can visualize a Go board in our internal state and then feed it into our entire visual-based Go-reasoning system at a later timestep.

I do think the LM-only version seems easier and probably better to start with.

How are we imagining prompting the multimodal Go+English AI with questions like "is this group alive or dead?" And how are we imagining training it so that it forms intermodal connections rather than just letting them atrophy?

The hope is that you can fiddle with these things to get it to answer some questions and then see whether it generalizes.

My first guess for an architecture would be producing a 19 x 19 grid of embeddings from the CNN, and then letting a transformer attend over them (along with the prior text). That is, you train a CNN that is supposed to produce both (moves, embeddings) and a transformer that talks and sees the embeddings.
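A minimal PyTorch sketch of that first guess, just to pin down the shapes; the layer sizes, number of input planes, and vocabulary size are placeholders I made up:

```python
import torch
import torch.nn as nn


class GoEncoder(nn.Module):
    """CNN over the 19x19 board: a move head plus one embedding per board point."""

    def __init__(self, in_planes=17, d_model=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, d_model, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.move_head = nn.Conv2d(d_model, 1, kernel_size=1)

    def forward(self, board):                               # board: (B, in_planes, 19, 19)
        feats = self.trunk(board)                            # (B, d_model, 19, 19)
        move_logits = self.move_head(feats).flatten(1)       # (B, 361) move logits
        embeddings = feats.flatten(2).transpose(1, 2)        # (B, 361, d_model) per-point embeddings
        return move_logits, embeddings


class TalkingHead(nn.Module):
    """Causal transformer over the question text that cross-attends to the 361 board embeddings.
    (Positional encodings omitted for brevity.)"""

    def __init__(self, vocab_size=32000, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_tokens, board_embeddings):        # (B, T), (B, 361, d_model)
        x = self.embed(text_tokens)
        T = text_tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf"), device=x.device), diagonal=1)
        h = self.decoder(tgt=x, memory=board_embeddings, tgt_mask=causal)
        return self.lm_head(h)                                # next-token logits
```

The two heads would be trained together: the move head on ordinary Go prediction, the text head on whatever question categories we can actually check.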

It seems to me that the finding here, that 'instruction tuning strengthens both the use of semantic priors and the capacity to learn input-label mappings, but more of the former', could be interpreted as some positive evidence for the optimistic case (and perhaps more broadly, for 'Do What I Mean' being not-too-hard); see the summary twitter thread, especially tweets 4 and 5.