On Twitter, Jan Leike asks:

With the InstructGPT paper we found that our models generalized to follow instructions in non-English even though we almost exclusively trained on English.

We still don't know why.

I wish someone would figure this out.

I find myself surprised/confused at his apparent surprise/confusion.

My default response, had someone asked me what I thought was going on, would have been something like: "following instructions is a natural abstraction of a task". Hence, models trained to follow instructions in English generalising to other languages is a natural example of goals/capabilities generalisation.

It's like being surprised that you taught an AI to drive red cars and it can drive blue cars as well [capabilities generalisation]. Or that you taught an AI to reach a particular location on a level when there's a coin there, and it still heads to said location in the absence of the coin 🤭 [goal generalisation].

In CoinRun, there were multiple possible natural generalisations for the goal (location on level, the coin), but for instruction following in English, there seems to be only one natural generalisation of the goal? (I don't really see any other intuitively plausible generalisation of "following instructions in English" when the model switches to a Chinese context.)

Like conditional on InstructGPT competently responding in languages other than English, I think we should just expect it to generalise its goal to the new context.

Is this just hindsight bias? Would I really have been counterfactually surprised/confused if it didn't follow instructions in other languages? I don't actually know that to be the case.

Maybe I would have just generated some other reason why we shouldn't have expected instruction following to generalise.

But hindsight bias aside, this is a prediction of Wentworth's natural abstractions hypothesis as I understand it (I think he would endorse this prediction), and I am a genuine believer in the NAH, so I don't think it's just hindsight bias.


3 Answers

I think it would be due to the LM in question using lots of language-neutral circuitry? See this paper.

RLHF mostly updates abstract/conceptual circuits, which (I assume) tend to be language neutral, then the language specific circuits just continue translating to/from the updated circuits.

Is RLHF updating abstract circuits an established fact? Why would it suffer from mode collapse in that case?

It's not surprising to me, since language models have done similar things in the past, e.g. learning to translate mainly from unsupervised monolingual data.

That said, I am not sure about explaining it with "natural abstractions". At least, I cannot immediately derive the connection to the natural abstraction arguments. I would not be surprised if there was a connection, but I would also not be surprised if there wasn't a connection. It feels a bit like a Mysterious Answer if I cannot directly derive the connection. But I haven't thought much about it, so it may be obvious if I think harder.

The natural abstraction here is the goal/task, and when the context/environment changes, I'm claiming that the system is likely to generalise to a natural abstraction of the goal/task in the new context/environment.

The natural generalisation of "follow instructions in English" is "follow instructions" and this translates to other language contexts.

John Wentworth has a number of specific arguments about natural abstractions, e.g. resampling, telephone theorem, etc. When you use the term "natural abstraction" to say that it is predictable/expected, are you saying that those arguments predict this outcome? If so, do you have a specific argument in mind?

I feel like this answer glosses over the fact that the encoding changes. Surely you can find some encodings of instructions such that LLMs cannot follow instructions in that encoding. So the question lies in why learning the English encoding also allows the model to learn (say) German encodings.

No? We already know that the model can competently respond in German. Once you condition on the model competently responding in other languages (e.g. for translation tasks), there is no special question about why it follows instructions in other languages as well. Like, "why are LLMs capable of translation?" might be an interesting question, but if you're not asking that question, then I don't understand why you're asking this one. My position is that this isn't a special capability that warrants any explanation beyond what is covered by an explanation of why/how LLMs are competent translators.

Ah, I misunderstood the content of the original tweet - I didn't register that the model indeed had access to lots of data in other languages as well. In retrospect I should have been way more shocked if this wasn't the case. Thanks. I then agree that it's not too surprising that the instruction-following behavior is not dependent on language, though it's certainly interesting. (I agree with Habryka's response below.)

Recent works from Anders Søgaard might be relevant, e.g. Grounding the Vector Space of an Octopus: Word Meaning from Raw Text, Understanding models understanding language, Implications of the Convergence of Language and Vision Model Geometries.

E.g. from Grounding the Vector Space of an Octopus: Word Meaning from Raw Text, on the success of unsupervised machine translation (and more):

'Consider, for example, the fact that unsupervised machine translation is possible (Lample et al., 2018a, b; Park et al., 2021). Unsupervised machine translation works by first aligning vector spaces induced by monolingual language models in the source and target languages (Søgaard et al., 2019). This is possible because such vector spaces are often near-isomorphic (Vulic et al., 2020). If weak supervision is available, we can use techniques such as Procrustes Analysis (Gower, 1975) or Iterative Closest Point (Besl & McKay, 1992), but alignments can be obtained in the absence of any supervision using adversarial learning (Li et al., 2019; Søgaard et al., 2019) or distributional evidence alone. If the vector spaces induced by language models exhibit high degrees of isomorphism to the physical world or human perceptions thereof, we have reason to think that similar techniques could provide us with sufficient grounding in the absence of supervision.

Unsupervised machine translation shows that language model representations of different vocabularies of different languages are often isomorphic. Some researchers have also explored cross-modality alignment: (Chung et al., 2018) showed that unsupervised alignment of speech and written language is possible using the same techniques, for example. This also suggests unsupervised grounding should be possible.

Is there any direct evidence that language model vector spaces are isomorphic to (representations of) the physical world? There is certainly evidence that language models learn isomorphic representations of parts of vocabularies. Abdou et al. (2021), for example, present evidence that language models encode color in a way that is near-isomorphic to conceptual models of how color is perceived, in spite of known reporting biases (Paik et al., 2021). Patel and Pavlick (2022) present similar results for color terms and directionals. Liétard et al. (2021) show that the larger models are, the more isomorphic their representations of geographical place names are to maps of their physical location.'

'Unsupervised machine translation and unsupervised bilingual dictionary induction are evaluated over the full vocabulary, often with more than 85% precision. This indicates language models learn to represent concepts in ways that are not very language-specific. There is also evidence for near-isomorphisms with brain activity, across less constrained subsets of the vocabulary: (Wu et al., 2021), for example, show how brain activity patterns of individual words are encoded in a way that facilitates analogical reasoning. Such a property would in the limit entail that brain encodings are isomorphic to language model representations (Peng et al., 2020). Other research articles that seem to suggest that language model representations are generally isomorphic to brain activity patterns include (Mitchell et al., 2008; Søgaard, 2016; Wehbe et al., 2014; Pereira et al., 2018; Gauthier & Levy, 2019; Caucheteux & King, 2022).'
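To make the Procrustes step mentioned in the quote a bit more concrete, here is a toy sketch. Everything in it is an assumption for illustration: the 2D "word vectors" are invented, the "German" space is constructed as an exact rotation of the "English" one (real embedding spaces are high-dimensional and only near-isomorphic, and the general orthogonal Procrustes solution uses an SVD rather than this 2D closed form), and the helper names are hypothetical. It estimates the rotation from a small seed dictionary and then translates a held-out word by nearest neighbour.

```python
import math

def rotate(v, theta):
    """Rotate a 2D vector counter-clockwise by theta radians."""
    x, y = v
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)

def procrustes_angle(src, tgt):
    """Rotation-only Procrustes in 2D: the angle theta that minimises
    sum_i ||R(theta) src_i - tgt_i||^2, in closed form."""
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(src, tgt))
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(src, tgt))
    return math.atan2(cross, dot)

# Invented toy "English" embeddings (assumption, not real data).
english = {
    "dog": (1.0, 0.2),
    "cat": (0.9, 0.4),
    "car": (-0.5, 1.0),
    "house": (-0.7, 0.8),
    "tree": (0.1, -1.0),
}
gold = {"dog": "Hund", "cat": "Katze", "car": "Auto",
        "house": "Haus", "tree": "Baum"}

# Assumption: the "German" space is the same geometry under a hidden rotation.
HIDDEN_ANGLE = 0.7
german = {gold[w]: rotate(v, HIDDEN_ANGLE) for w, v in english.items()}

# Weak supervision: a seed dictionary of three word pairs.
seed = ["dog", "car", "tree"]
theta_hat = procrustes_angle([english[w] for w in seed],
                             [german[gold[w]] for w in seed])

def translate(word):
    """Map an English vector into the German space, return the nearest German word."""
    mapped = rotate(english[word], theta_hat)
    return min(german, key=lambda g: math.dist(mapped, german[g]))

print(theta_hat, translate("cat"), translate("house"))
```

The held-out words ("cat", "house") were never in the seed dictionary; they are translated correctly only because the two spaces share the same geometry, which is the property the quoted passage leans on.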

I'll probably write more about this soon.

6 comments

I find myself surprised/confused at his apparent surprise/confusion.

Jan doesn't indicate that he's extremely surprised or confused? He just said he doesn't know why this happens. There's a difference between being unsurprised by something (e.g. by observing something similar before) and actually knowing why it happens. To give a trivial example, hunter-gatherers from 10,000 BC would not have been surprised if a lightning strike caused fire, but would be quite clueless (or incorrect) as to why or how this happens.

I think Quintin's answer is a good possible hypothesis (though of course it leads to the further question of how LLMs learn language-neutral circuitry).

I do think that we don't really understand anything in a language model at this level of detail. Like, how does a language model count? How does a language model think about the weather? How does a language model do ROT13 encoding? How does a language model answer physics questions? We don't have an answer to any of these at any acceptable level of detail, so why would we single out our confusion about language agnosticity if indeed we predicted it? We just don't really understand the details, the same way we don't really understand the details of anything going on in a large language model.

Later in the thread Jan asks, "is this interpretability complete?" which I think implies that his intuition is that this should be easier to figure out than other questions (perhaps because it seems so simple). But yeah, it's kind of unclear why he is calling out this in particular.

I agree we don't really understand anything in LLMs at this level of detail, but I liked Jan highlighting this confusion anyway, since I think it's useful to promote particular weird behaviors to attention. I would be quite thrilled if more people got nerd sniped on trying to explain such things!

I endorse Habryka's response.

Like, if you take it as a given that InstructGPT competently responds in other languages when the prompt warrants it, then I just don't think there's anything special about following instructions in other languages that merits special explanation?

And following instructions in other languages was singled out as a task that merited special explanation.

There are two issues. One is capability: why do GPTs have the ability to use other languages? The other is why RLHF causes InstructGPT to follow instructions in other languages. The tweet is explicitly about the second, and your question seems to be about the second. But the follow-up tweets suggest that Leike is asking about the first (also?). I think the second is not surprising at all, conditional on the first. But the first is quite mysterious.
