Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Suppose that we want to translate between English and an alien language (Klingon). We have plenty of Klingon text, and separately we have plenty of English text, but it’s not matched up and there are no bilingual speakers.

We train GPT on a mix of English and Klingon text and find that it becomes fluent in both. In some sense this model “knows” quite a lot about both Klingon and English, and so it should be able to read a sentence in one language, understand it, and then express the same idea in the other language. But it’s not clear how we could train a translation model.

Of course some concepts won’t have translations, and the model will often be uncertain about the translation of a term. But we can still ask for a model to explain the meaning of a Klingon expression as best as it can to an English-speaking user. For example, it could say “This is an idiomatic expression that’s often used to express great uncertainty” or “This is a small animal that is familiar to most Klingon speakers, I think it’s kind of like a frog but am not really sure” rather than translating a sentence directly.

How can we construct an objective that incentivizes the model to “try its best” at this translation task?

Translation-specific approaches

There are many published heuristics for unsupervised translation (e.g. Lample et al). I don’t think those techniques should completely satisfy us:

  • Existing methods can’t lead to a model that appropriately describes its uncertainty or talks the user through a hard-to-translate expression. (At least as far as I’m aware.)
  • We have no real reason to think existing methods fully utilize the model’s understanding, or to expect those methods to scale well. (In practice, I think they are impressive but still lag behind the quality of our models’ understanding.)
  • These heuristics are specific to translation, whereas we’d like to find general methods that can scale up to harder problems.

Existing alignment techniques

If we try to apply RL from human feedback to translation, we immediately run into a problem: how am I supposed to judge which of two English explanations of a Klingon sentence is better, given that I don’t know Klingon?

Debate doesn’t easily address this difficulty either. If one model claims that “qapla” means “great success” and the other claims it means “minor success,” I can’t easily decompose that disagreement into simpler sub-questions that debaters disagree about. Debaters could cite phrases in the database where “qapla” is used, but they’d need to average weak evidence over many phrases. Making things worse, to interpret each usage they’d need to agree about the meaning of the rest of the phrase, which isn’t necessarily any simpler than the original disagreement about “qapla.” Even if this process were possible, it’s not at all clear that GPT would be able to do it: being able to translate between Spanish and English doesn’t mean I have an encyclopedic knowledge of all the documents from which I built up my intuitive sense of a particular word’s meaning (which I’d need in order to win such a debate).

Right now I don’t think we have any scalable strategies for this kind of problem; I think it’s a core open question for alignment.

Unsupervised translation seems like a good problem to think about for alignment

I think the key feature of this situation is that our model has acquired a bunch of intuitions about the domain which are only justified empirically — the model “knows” about the meaning of phrases only insofar as it has a very complex hypothesis that was supported by the data.

This situation is going to become increasingly common as we train more powerful models, and will immediately be a real problem if we are applying human feedback to fine-tune GPT; while GPT is subhuman in many ways, it’s already acquired plenty of knowledge that any given human contractor would lack.

Most of GPT’s knowledge is something that came from some human, but ultimately we will be training models that generate new knowledge (e.g. by searching over plans in realistic environments, or by writing code on their own and learning about what works), and no human will have that knowledge. So we can’t hope to get around this problem by simply hiring more knowledgeable contractors.

This can leave us in a situation where it’s extremely difficult for humans to oversee AI decisions. If a model says “My intuition is that this business plan will make a lot of money” the user will need to decide whether or not to trust it. If they don’t, then they may find themselves at an increasing economic disadvantage. If they do, then they may have lost the ability to effectively oversee AI systems except by evaluating the consequences of their actions. That leads directly into the classical challenges of AI safety, namely that AI systems evaluated exclusively on the basis of measured outcomes have a tendency to push the world in undesirable directions (since we can’t measure what we care about) and to corrupt our measurements.

My vague hope

I’m hoping we can address this using the kind of approach discussed in learning the prior. That might look like:

  • In parallel with training GPT, train a helper model that explains the meaning of phrases (it can also provide other intuitions or background facts that are useful for predicting the next word).
  • As we train on Klingon text, we sample phrases and then ask a human “which word will come next?” The human uses the helper model to understand what is being discussed and make a prediction.
  • We optimize the helper model to make the human’s next-word predictions good (in parallel with generative pre-training).
  • Finally, a human uses the same helper model to evaluate a proposed Klingon → English translation, and we use this to train a translator by RL.
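
The objective in the third step can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the "Klingon" tokens are made-up placeholders, `simulated_human` is a hard-coded stand-in for a real person, and the two helpers are hand-written functions rather than trained models. The sketch only shows how a helper would be scored by the quality of the human's next-word predictions; it says nothing about the prior over helpers.

```python
import math

# Toy corpus of (context, next_word) pairs. The tokens are made-up
# placeholders, not real Klingon.
CORPUS = [
    (("k_eat", "k_fruit"), "k_sweet"),
    (("k_eat", "k_rock"), "k_hard"),
    (("k_eat", "k_fruit"), "k_sweet"),
]

def informative_helper(context):
    # Hypothetical helper: explains the last token of the context in English.
    glossary = {"k_fruit": "a fruit", "k_rock": "a rock"}
    return glossary.get(context[-1], "unknown")

def uninformative_helper(context):
    # Hypothetical helper that never explains anything.
    return "unknown"

def simulated_human(explanation):
    # Stand-in for a human: turns an English explanation into a probability
    # distribution over candidate next words (assumed to be given).
    if explanation == "a fruit":
        return {"k_sweet": 0.9, "k_hard": 0.1}
    if explanation == "a rock":
        return {"k_sweet": 0.1, "k_hard": 0.9}
    return {"k_sweet": 0.5, "k_hard": 0.5}

def helper_score(helper):
    # Average log-probability the human assigns to the true next word
    # when assisted by this helper. Higher is better.
    total = 0.0
    for context, next_word in CORPUS:
        probs = simulated_human(helper(context))
        total += math.log(probs[next_word])
    return total / len(CORPUS)

if __name__ == "__main__":
    print("informative:", helper_score(informative_helper))
    print("uninformative:", helper_score(uninformative_helper))
```

In this toy setting the helper that actually explains the context gets the higher score, which is the signal the helper would be optimized on.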

That short description sweeps a lot of complexity under the rug. Most importantly, the success of the scheme relies on the correctness of the prior over helper models (or else the helper could just be another copy of GPT-Klingon), and we don’t have a credible strategy for representing and manipulating our prior over complex programs.

Overall, I’d say that this is more at the level of “vague hope” than “concrete proposal.” I think it’s an open question whether anything in this space will work.

I think that this is the kind of problem which makes e.g. MIRI researchers justifiably skeptical that scalable ML alignment is possible at all, and it’s the main focus of my current conceptual work on AI alignment. I’m glad that this kind of theoretical crux also looks like it will soon be relevant to ML practice, since I think it will make it much easier to close the gap between people who work on ML and people who work on alignment.


“Unsupervised” translation as a safety problem was originally published in AI Alignment on Medium, where people are continuing the conversation by highlighting and responding to this story.

[-]gjm4y130

To me the title implies that bad translation could be dangerous whereas I think the actual intention is that machine translation involves difficulties that parallel those of AGI that lead to danger.

Not all intent alignment problems involve existential risk.

[-]gjm4y40

I was sure the title used to say "safety problem", not "intent alignment problem". Did I imagine it?

[EDITED to add:] No, I didn't. Right now one can see the original wording over at https://ai-alignment.com/ but since I wrote the above, the version here has had the title silently changed.

(By the way, I didn't say anything about existential risk; there are dangers that fall short of existential risk.)

Good point, changed.

Originally it was "as an alignment problem" but this has the problem that it also refers to "aligning" unaligned datasets. The new way is bulkier but probably better overall.

For those interested, here is a more accessible explanation of the paper on unsupervised translation with monolingual corpora:
https://yashuseth.blog/2019/03/03/how-can-unsupervised-neural-machine-translation-work/

Maybe we can ask GPT to output an English-Klingon dictionary?

[-]Ofer4yΩ330

Some tentative thoughts:

Re Debate:

Making things worse, to interpret each usage they’d need to agree about the meaning of the rest of the phrase, which isn’t necessarily any simpler than the original disagreement about “qapla.”

Consider a Debate experiment in which each of the two players outputs an entire English-Klingon dictionary (as avturchin mentioned). The judge then samples a random Klingon passage and decides which of the two dictionaries is more helpful for understanding that passage (maybe while allowing the two players to debate over which dictionary is more helpful).

Also, one might try to use GPT to complete prompts such as:

The researchers analyzed the Klingon phrase "מהדקי נייר" and concluded it roughly means 

In both of these approaches we still need to deal with the potential problem of catastrophic inner alignment failures occurring before the point where we have sufficiently useful helper models. [EDIT: and in the Debate-based approach there's also an outer alignment problem: a player may try to manipulate the judge into choosing them as the winner.]

The researchers analyzed the Klingon phrase "מהדקי נייר" and concluded it roughly means 

If the model is smart, this is only going to work if the (correct) translation is reasonably likely to appear in your English text database. You are (at best) going to get a prediction of what human researchers would conclude after studying Klingon, your model isn't actually going to expand what humans can do.

Consider a Debate experiment in which each of the two players outputs an entire English-Klingon dictionary (as avturchin mentioned). The judge then samples a random Klingon passage and decides which of the two dictionaries is more helpful for understanding that passage (maybe while allowing the two players to debate over which dictionary is more helpful).

This is basically what the helper model does, except:

  • For competitiveness you should learn and evaluate the dictionary at the same time you are training the model; running a debate experiment many times where debaters have to output a full dictionary would likely be prohibitively expensive.
  • Most knowledge about language isn't easily captured in a dictionary (for example, a human using a Spanish-English dictionary is a mediocre translator), so we'd prefer to have a model that answers questions about meaning rather than one that outputs a static dictionary.
  • I don't know what standard you want to use for "helpful for understanding the passage" but I think "helps predict the next word correctly" is probably the best approach (since the goal is to be competitive and that's how GPT learned).

After making those changes we're back at the learning the prior proposal.

I think that proposal may work passably here because we can potentially get by with a really crude prior: basically we think "the helper should mostly just explain the meaning of terms" and then we don't need to be particularly opinionated about which meanings are more plausible. I agree that the discussion in the section "My vague hope" is a little bit too pessimistic for the given context of unaligned translation.

[-]Ofer4yΩ110

If the model is smart, this is only going to work if the (correct) translation is reasonably likely to appear in your English text database. You are (at best) going to get a prediction of what human researchers would conclude after studying Klingon, your model isn't actually going to expand what humans can do.

Agreed. Perhaps it's possible to iteratively train GPT models in an Amplification-like setup, where in each iteration we add to the English training corpus some newly possible translations, aiming to end up with something like an HCH translator. (We may not need to train a language model from scratch in each iteration; at the extreme, we just do fine-tuning on the new translations.)

Planned summary for the Alignment Newsletter:

We have previously seen that a major challenge for alignment is that our models may learn inaccessible information that we cannot extract from them, because we do not know how to provide a learning signal to train them to output such information. This post proposes unsupervised translation as a particular concrete problem to ground this out.

Suppose we have lots of English text, and lots of Klingon text, but no translations from English to Klingon (or vice versa), and no bilingual speakers. If we train GPT on the text, it will probably develop a good understanding of both English and Klingon, such that it “should” have the ability to translate between the two (at least approximately). How can we get it to actually (try to) do so? Existing methods (both in unsupervised translation and in AI alignment) do not seem to meet this bar.

One vague hope is that we could train a helper agent such that a human can perform next-word prediction on Klingon with the assistance of the helper agent, using a method like the one in Learning the prior (AN #109).

A (perhaps obvious) idea that this brings to mind -- you could use GPT to produce endless training data for translation:

  1. Train GPT on both English and Klingon.
  2. Select some English text that you have the seed of a Klingon translation for. (Where the seed might be just a few tokens, or it might be many paragraphs. In any case it's meant to be a prefix of a complete translation.)
  3. Use your translator model (which is maybe another head on the same GPT-ish network) to translate from English to Klingon, given only the part of the English text beyond what was translated for the seed.
  4. Feed the seed translation to GPT and ask it how surprised it is if that text is followed by the output of your translator.
  5. Use that surprisal as a loss function for the translator.
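
Steps 4 and 5 can be sketched in a few lines. This is a minimal illustration under loud assumptions: a tiny add-one-smoothed bigram model stands in for GPT-trained-on-Klingon, the tokens `k1`..`k4` are made-up placeholders, and the two candidate continuations stand in for translator outputs. All function names are hypothetical.

```python
import math
from collections import Counter

def train_bigram(sentences):
    # Count how often each token appears as a context and how often each
    # (previous, current) pair appears. Stand-in for training a language model.
    context_counts, pair_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        vocab.update(sentence)
        for prev, cur in zip(sentence, sentence[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, cur)] += 1
    return context_counts, pair_counts, vocab

def surprisal(model, seed, continuation):
    # Negative log-probability (in nats) of the continuation given the seed,
    # under the bigram model with add-one smoothing.
    context_counts, pair_counts, vocab = model
    total = 0.0
    prev = seed[-1]
    for token in continuation:
        p = (pair_counts[(prev, token)] + 1) / (context_counts[prev] + len(vocab))
        total -= math.log(p)
        prev = token
    return total

# Placeholder "Klingon" corpus; the tokens carry no meaning.
corpus = [
    ["k1", "k2", "k3", "k4"],
    ["k1", "k2", "k3", "k4"],
    ["k2", "k3", "k4", "k1"],
]
model = train_bigram(corpus)

seed = ["k1", "k2"]                 # prefix that is already translated
fluent_candidate = ["k3", "k4"]     # hypothetical translator output A
scrambled_candidate = ["k4", "k3"]  # hypothetical translator output B

loss_fluent = surprisal(model, seed, fluent_candidate)
loss_scrambled = surprisal(model, seed, scrambled_candidate)
```

Here the candidate that matches the corpus statistics receives the lower loss. Note that this quantity only measures how natural the continuation looks to the language model given the seed.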

In other words, split some English text into a prefix and suffix. We assume by induction that we already have a translation for the prefix. Train a translator model to translate the suffix into something that GPT (trained on Klingon) finds unsurprising. (You could also try training the translator to exactly match whatever continuation GPT finds most likely, but my guess is that comparing the token by token difference between the outputs of the translator and GPT is a less good training signal than GPT's surprisal at the translator's output.)

Assuming GPT-trained-on-English and GPT-trained-on-Klingon are both good, it seems to me (in my uninformed layman's opinion) that you could likely bootstrap to good translations using this procedure, even w/ no parallel text, and w/ a translator that starts out by outputting gibberish.

Perhaps this is in some sense equivalent (or at least analogous) to the techniques used in the linked Lample et al paper, but I thought it was interesting to think about in this way, with two parallel GPTs and a translator learning to translate between them.

EDIT: This is not meant to answer Paul's question of how you could train a system to explain itself. Just a thought that the post brought to mind.

EDIT2: I suppose you could also train the translator on its surprisal at GPT's output. (Or maybe I shouldn't think of it as output and surprisal -- since the output is sampled from a probability distribution, maybe you just compare the two distributions? I suppose this comes down to the details of the two networks and the exact format of their outputs.)
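
The two options in EDIT2 are closely related. If the translator's distribution over the next token is q and GPT's is p, the expected surprisal under p of a token sampled from q is the cross-entropy H(q, p), which equals the entropy H(q) plus the divergence KL(q || p). A minimal sketch, with made-up toy distributions and hypothetical function names:

```python
import math

def cross_entropy(q, p):
    # Expected surprisal under p of a token sampled from q (in nats).
    return -sum(q[t] * math.log(p[t]) for t in q if q[t] > 0)

def entropy(q):
    # Entropy of q, i.e. the cross-entropy of q with itself.
    return cross_entropy(q, q)

def kl_divergence(q, p):
    # KL(q || p): the extra surprisal paid for using p when tokens come from q.
    return sum(q[t] * math.log(q[t] / p[t]) for t in q if q[t] > 0)

# Toy next-token distributions over three placeholder tokens.
translator_dist = {"k1": 0.7, "k2": 0.2, "k3": 0.1}
gpt_dist = {"k1": 0.5, "k2": 0.3, "k3": 0.2}

expected_surprisal = cross_entropy(translator_dist, gpt_dist)
decomposed = entropy(translator_dist) + kl_divergence(translator_dist, gpt_dist)
```

So averaging surprisal over samples and comparing the distributions directly differ only by the entropy of the translator's own distribution.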

What encourages the helper model to generate correct explanations as opposed to false/spurious ones?

I.e., let's say the text is a list of fruit, and the correct next word is Klingon for "pineapple". I'm imagining that the helper model could just say "The next word is [Klingon for pineapple]" or give an alternate/spurious explanation of the Klingon text ("The text is discussing a spiky fruit that goes on pizza"). Both of the above unhelpful/spurious explanations would make me predict the next Klingon word correctly.

My attempt to summarize the alignment concern here. Does this seem a reasonable gloss?

It seems plausible that competitive models will not be transparent or introspectable. If you can't see how the model is making decisions, you can't tell how it will generalize, and so you don't get very good safety guarantees. Or to put it another way, if you can't interact with the way the model is thinking, then you can't give a rich enough reward signal to guide it to the region of model space that you want.

This seems not false but it also seems like you're emphasizing the wrong bits, e.g. I don't think we quite need the model to be transparent/"see how it's making decisions" to know how it will generalize. 

At some point, model M will have knowledge that should enable it to do X task. However, it's currently unclear how one would get M to do X in a way that doesn't implicitly trust the model to be doing something it might not be doing. It's a core problem of prosaic alignment to figure out how to get M to do X in a way that allows us to know that M is actually doing X-and-only-X instead of something else. 

Most importantly, the success of the scheme relies on the correctness of the prior over helper models (or else the helper could just be another copy of GPT-Klingon)

I'm not sure I understand this. My understanding of the worry: what if there's some equilibrium where the model gives wrong explanations of meanings, but I can't tell using just the model to give me meanings?

But it seems to me that having the human in the loop doing prediction helps a lot, even with the same prior. Like, if the meanings are wrong, then the user will just not predict the correct word. But maybe this is not enough corrective data?