Suppose that we want to translate between English and an alien language (Klingon). We have plenty of Klingon text, and separately we have plenty of English text, but it’s not matched up and there are no bilingual speakers.
We train GPT on a mix of English and Klingon text and find that it becomes fluent in both. In some sense this model “knows” quite a lot about both Klingon and English, and so it should be able to read a sentence in one language, understand it, and then express the same idea in the other language. But it’s not clear how we could train a translation model.
Of course some concepts won’t have translations, and the model will often be uncertain about the translation of a term. But we can still ask for a model to explain the meaning of a Klingon expression as best as it can to an English-speaking user. For example, it could say “This is an idiomatic expression that’s often used to express great uncertainty” or “This is a small animal that is familiar to most Klingon speakers, I think it’s kind of like a frog but am not really sure” rather than translating a sentence directly.
How can we construct an objective that incentivizes the model to “try its best” at this translation task?
There are many published heuristics for unsupervised translation (e.g. Lample et al.). I don’t think those techniques should completely satisfy us:
- Existing methods can’t lead to a model that appropriately describes its uncertainty or talks the user through a hard-to-translate expression. (At least as far as I’m aware.)
- We have no real reason to think existing methods fully utilize the model’s understanding, or to expect those methods to scale well. (In practice, I think they are impressive but still lag behind the quality of our models’ understanding.)
- These heuristics are specific to translation, whereas we’d like to find general methods that can scale up to harder problems.
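To make the flavor of these heuristics concrete, here is a toy sketch. This is entirely my own construction for illustration, not Lample et al.’s actual method (which uses back-translation and shared embeddings): treat each language as an unpaired monolingual corpus, and search for a word-level dictionary under which word-by-word translations of Klingon sentences look like sentences from the English corpus. The corpora and words below are invented.

```python
import itertools

# Toy unpaired "corpora": word-level sentences in two languages.
# (Invented data; "tIn" etc. are used purely as illustrative tokens.)
english = [["small", "animal"], ["great", "success"], ["small", "success"]]
klingon = [["mach", "Ha'DIbaH"], ["tIn", "Qapla'"], ["mach", "Qapla'"]]

eng_vocab = sorted({w for s in english for w in s})
kli_vocab = sorted({w for s in klingon for w in s})

def score(dictionary):
    """Count Klingon sentences that, translated word-by-word, appear verbatim
    in the English corpus -- a crude 'fluency' signal needing no paired data."""
    translated = [[dictionary[w] for w in s] for s in klingon]
    return sum(t in english for t in translated)

# Search all bijections from Klingon words to English words; keep the best.
best = max(
    (dict(zip(kli_vocab, perm)) for perm in itertools.permutations(eng_vocab)),
    key=score,
)
```

Notice what this toy makes vivid: the dictionary is pinned down only by weak distributional evidence aggregated across many sentences, and with less data several incompatible dictionaries would score equally well. That is exactly the kind of empirically-justified intuition that is hard to adjudicate by debate.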
Existing alignment techniques
If we try to apply RL from human feedback to translation, we immediately run into a problem: how am I supposed to judge which of two English explanations of a Klingon sentence is better, given that I don’t know Klingon?
Debate doesn’t easily address this difficulty either — if one model claims that “qapla” means “great success” and the other claims it means “minor success,” I can’t easily decompose that disagreement into simpler sub-questions that debaters disagree about. Debaters could cite phrases in the database where “qapla” is used, but they’d need to average weak evidence over many phrases. Making things worse, to interpret each usage they’d need to agree about the meaning of the rest of the phrase, which isn’t necessarily any simpler than the original disagreement about “qapla.” Even if this process were possible, it’s not at all clear that GPT would be able to do it: being able to translate between Spanish and English doesn’t mean I have an encyclopedic knowledge of all the documents from which I built up my intuitive sense of a particular word’s meaning (which I’d need in order to win such a debate).
Right now I don’t think we have any scalable strategies for this kind of problem; I think it’s a core open question for alignment.
Unsupervised translation seems like a good problem to think about for alignment
I think the key feature of this situation is that our model has acquired a bunch of intuitions about the domain which are only justified empirically — the model “knows” about the meaning of phrases only insofar as it has a very complex hypothesis that was supported by the data.
This situation is going to become increasingly common as we train more powerful models, and will immediately be a real problem if we are applying human feedback to fine-tune GPT; while GPT is subhuman in many ways, it’s already acquired plenty of knowledge that any given human contractor would lack.
Most of GPT’s knowledge originally came from some human, but ultimately we will be training models that generate new knowledge (e.g. by searching over plans in realistic environments, or by writing code on their own and learning about what works), and no human will have that knowledge. So we can’t hope to get around this problem simply by hiring more knowledgeable contractors.
This can leave us in a situation where it’s extremely difficult for humans to oversee AI decisions. If a model says “My intuition is that this business plan will make a lot of money” the user will need to decide whether or not to trust it. If they don’t, then they may find themselves at an increasing economic disadvantage. If they do, then they may have lost the ability to effectively oversee AI systems except by evaluating the consequences of their actions. That leads directly into the classical challenges of AI safety, namely that AI systems evaluated exclusively on the basis of measured outcomes have a tendency to push the world in undesirable directions (since we can’t measure what we care about) and to corrupt our measurements.
My vague hope
I’m hoping we can address this using the kind of approach discussed in learning the prior. That might look like:
- In parallel with training GPT, train a helper model that explains the meaning of phrases (it can also provide other intuitions or background facts that are useful for predicting the next word).
- As we train on Klingon text, we sample phrases and then ask a human “which word will come next?” The human uses the helper model to understand what is being discussed and make a prediction.
- We optimize the helper model to make the human’s next-word predictions good (in parallel with generative pre-training).
- Finally, a human uses the same helper model to evaluate a proposed Klingon → English translation, and we use this to train a translator by RL.
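The steps above can be caricatured in code. This is a toy sketch of the scheme, not a real implementation: the actual proposal would train a neural helper by gradient descent on the human’s next-word prediction loss, whereas here the “helper” is just a candidate glossary, the “human” is a stub who knows English word co-occurrences but no Klingon, and we compare helpers by how much they improve the human’s predictions. All names and data below are hypothetical.

```python
# The simulated "human" knows which English word tends to follow which
# (their background knowledge), but knows no Klingon at all.
english_bigrams = {"small": "animal", "great": "success"}

# Unlabeled Klingon text we are pre-training on (invented tokens).
klingon_text = [["mach", "Ha'DIbaH"], ["tIn", "Qapla'"]]

def human_predicts_next(prev_word, glossary):
    """The human uses the helper's glossary to understand the context word,
    predicts the next English word from their own knowledge, then maps that
    prediction back into Klingon via the same glossary."""
    eng_next = english_bigrams.get(glossary.get(prev_word))
    inverse = {eng: kli for kli, eng in glossary.items()}
    return inverse.get(eng_next)

def helper_score(glossary):
    """Objective for the helper: how many next-word predictions does the
    glossary-equipped human get right on the Klingon corpus?"""
    correct = 0
    for sentence in klingon_text:
        for prev, nxt in zip(sentence, sentence[1:]):
            correct += human_predicts_next(prev, glossary) == nxt
    return correct

# A glossary that tracks the true meanings helps the human predict;
# a scrambled glossary does not, so optimizing helper_score favors the former.
good = {"mach": "small", "Ha'DIbaH": "animal", "tIn": "great", "Qapla'": "success"}
bad = {"mach": "animal", "Ha'DIbaH": "small", "tIn": "success", "Qapla'": "great"}
```

The point of the toy is only to show the incentive structure: the helper is rewarded for giving the human genuinely useful explanations of Klingon, because only explanations that track the text’s real meaning improve the human’s predictions.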
That short description sweeps a lot of complexity under the rug. Most importantly, the success of the scheme relies on the correctness of the prior over helper models (or else the helper could just be another copy of GPT-Klingon), and we don’t have a credible strategy for representing and manipulating our prior over complex programs.
Overall, I’d say that this is more of a “vague hope” than a “concrete proposal.” I think it’s an open question whether anything in this space will work.
I think that this is the kind of problem which makes e.g. MIRI researchers justifiably skeptical that scalable ML alignment is possible at all, and it’s the main focus of my current conceptual work on AI alignment. I’m glad that this kind of theoretical crux also looks like it will soon be relevant to ML practice, since I think it will make it much easier to close the gap between people who work on ML and people who work on alignment.