Alignment as Translation

by johnswentworth 4 min read19th Mar 202034 comments


Ω 22

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Technology Changes Constraints argues that economic constraints are usually modular with respect to technology changes - so for reasoning about technology changes, it’s useful to cast them in terms of economic constraints. Two constraints we’ll talk about here:

  • Compute - flops, memory, etc.
  • Information - sensors, data, etc.

Thanks to ongoing technology changes, both of these constraints are becoming more and more slack over time - compute and information are both increasingly abundant and cheap.

Immediate question: what happens in the limit as the prices of both compute and information go to zero?

Essentially, we get omniscience: our software has access to a perfect, microscopically-detailed model of the real world. Computers have the memory and processing capability to run arbitrary queries on that model, and predictions are near-perfectly accurate (modulo quantum noise). This limit applies even without AGI - as compute and information become more abundant, our software approaches omniscience, even limiting ourselves to special-purpose reasoning algorithms.

Of course, AGI would presumably be closer to omniscience than non-AGI algorithms, at the same level of compute/information. It would be able to more accurately predict more things which aren’t directly observable via available sensors, and it would be able to run larger queries with the same amount of compute. (How much closer to omniscience an AGI would get is an open question, but it would at least not be any worse in a big-O sense.)

Next question: as compute and information constraints slacken, which constraints become taut? What new bottlenecks appear, for problems which were previously bottlenecked on compute/information?

To put it differently: if our software can run arbitrary queries on an accurate, arbitrarily precise low-level model of the physical world, what else do we need in order to get value out of that capability?

Well, mainly we need some way to specify what it is that we want. We need an interface.

Our highly accurate low-level world model can tell us anything about the physical world, but the things-we-want are generally more abstract than molecules/atoms/fields. Our software can have arbitrarily precise knowledge and predictive power on physical observables, but it still won’t have any notion that air-pressure-oscillations which sound like the word “cat” have something to do with the organs/cells/biomolecules which comprise a cat. It won’t have built-in any notion of “tree” or “rock” or “human” - using such high-level abstractions would only impede predictive power, when we could instead model the individual components of such high-level objects.

It’s the prototypical interface problem: the structure of a high-precision world-model generally does not match the structure of what-humans-want, or the structure of human abstractions in general. Someone/something has to translate between the structures in order to produce anything useful.

As I see it, this is the central problem of alignment.

Some Approaches

Default: Humans Translate

Without some scalable way to build high-level world models out of low-level world models, we constantly need to manually translate things-humans-want into low-level specifications. It’s an intellectual-labor-intensive and error-prone process; writing programs in assembly code is not just an analogy but an example. Even today’s “high-level programming languages” are much more structurally similar to assembly code than to human world-models - Python has no notion of “oak tree”.

An analogy: translating high-level structure into low-level specification the way we do today is like translating English into Korean by hand.

Humans Translate Using Better Tools

It’s plausible (though I find it unlikely) that we could tackle the problem by building better tools to help humans translate from high-level to low-level - something like much-higher-level programming languages. I find it unlikely because we’d probably need major theoretical breakthroughs - for instance, how do I formally define “tree” in terms of low-level observables? Even if we had ways to do that, they’d probably enable easier strategies than building better programming languages.

Analogy: it’s like translating by hand from English to Korean, but with the assistance of a dictionary, spell-checker, grammar-checker, etc. But if we had an English-Korean dictionary, we'd probably be most of the way to automated translation anyway (in this respect, the analogy is imperfect).

Examples + Interpolation

Another path which is plausible (though I find it unlikely) is something like programming-by-example - not unlike today’s ML. This seems unlikely to work from both an inside and outside view:

  • Inside view: the whole problem in the first place is that low-level structure doesn’t match high-level structure, so there’s no reason to expect software systems to interpolate along human-intuitive dimensions.
  • Outside view: programming-by-example (and today’s ML with it) is notoriously unreliable.

Examples alone aren’t enough to make software reliably carve reality at the same joints as humans. There probably are some architectures which would reliably carve at the same joints as humans - different humans tend to chunk the world into similar objects, after all. But figuring out such an architecture would take more than just throwing lots of data at the problem.

To put it differently: the way-in-which-we-want-things-translated is itself something which needs to be translated. A human’s idea-of-what-constitutes-a-“good”-low-level-specification-of-“oak tree” is itself pretty high-level and abstract; that idea itself needs to be translated into a low-level specification before it can be used. If we’re trying to use examples+interpolation, then the interpolation algorithm is our “specification” of how-to-translate… and it probably isn’t a very good translation of our actual high-level idea of how-to-translate.

Analogy: it’s like teaching English to Korean speakers by pointing to trees and saying “tree”, pointing to cars and saying “car”, etc… except that none of them actually realize they’re supposed to be learning another language. The Korean-language instructions they received were not actually a translation of the English explanation “learn the language that person is speaking”.


A small tweak to the previous approach: train a reinforcement learner.

The analogy: rather than giving our Korean-speakers some random Korean-language instructions, we don't give them any instructions - we just let them try things, and then pay them when they happen to translate things from English to Korean.

Problem: this requires some way to check that the translation was correct. Knowing what to incentivize is not any easier than specifying what-we-want to begin with. Rather than translating English-to-Korean, we’re translating English-to-incentives.

Now, there is a lot of room here for clever tricks. What if we verify the translation by having one group translate English-to-Korean, another group translate back, and reward both when the result matches the original? Or taking the Korean translation, giving it to some other Korean speaker, and seeing what they do? Etc. These are possible approaches to translating English into incentives, within the context of the analogy.

It’s possible in principle that translating what-humans-want into incentives is easier than translating into low-level specifications directly. However, if that’s the case, I have yet to see compelling evidence - attempts to specify incentives seem plagued by the same surprising corner cases and long tail of difficult translations as other strategies.

AI Translates

This brings us to the obvious general answer: have the AI handle the translation from high-level structure to low-level structure. This is probably what will happen eventually, but the previous examples should make it clear why it’s hard: an explanation of how-to-translate must itself be translated. In order to make an AI which translates high-level things-humans-want into low-level specifications, we first need a low-level specification of the high-level concept “translate high-level things-humans-want into low-level specifications”.

Continuing the earlier analogy: we’re trying to teach English to a Korean speaker, but that Korean speaker doesn’t have any idea that they’re supposed to be learning another language. In order to get them to learn English, we first need to somehow translate something like “please learn this language”.

This is a significant reduction of the problem: rather than translating everything by hand all the time, we just need to translate the one phrase “please learn this language”, and then the hard part is done and we can just use lots of examples for the rest.

But we do have a chicken-and-egg problem: somehow, we need to properly translate that first phrase. Screw up that first translation, and nothing else will work. That part cannot be outsourced; the AI cannot handle the translation because it has no idea that that’s what we want it to do.


Ω 22