Communication Prior as Alignment Strategy

johnswentworth

Alice has one of three objects:

A red triangle
A blue square
A red circle

She wants Bob to learn which object she has. However, Alice may only send one of three messages:

“My object is round”
“My object is red”
“This is not a pipe”

The rules of the game (i.e. the available messages) are common knowledge before the game starts. What message should Alice send for each object, and what object should Bob deduce from each message?

Let’s think it through from Bob’s standpoint. A clever human might reason like this:

“My object is round” implies it’s the red circle, because that’s the only round object.
“My object is red” implies it’s the red triangle, because only the triangle and circle are red, and Alice could have perfectly conveyed the information with “My object is round” if it were the circle.
“This is not a pipe” implies it’s the blue square, because Alice could have perfectly conveyed the information with one of the other two messages otherwise.

If you’ve played the game CodeNames, then this sort of reasoning might look familiar: "well, 'blue' seems like a good hint for both sky+sapphire and sky+water, but if it were sky+water they would have said 'weather' or something like that instead, so it's probably sky+sapphire...".

Intuitively, this sort of reasoning follows from a communication prior - a prior that someone is choosing their words in order to communicate. In everyday life, this comes up in e.g. the difference between connotation and denotation: when someone uses a connotation-heavy word, the fact that they used that word rather than some more neutral synonym is itself important information. More generally: the implication of words is not the same as their literal content. A communication prior contains a model of how-and-why-the-words-were-chosen, so we can update on the words to figure out their implications, not just their literal meanings.

Communication priors suggest an approach to certain problems in AI alignment. Intuitively, rather than saying “I want X” and the AI taking that completely literally (as computers generally do), the AI instead updates on the fact that I said “I want X”, and tries to figure out what those words imply about what I actually want. It’s like pushing the “do what I mean” button - the AI would try to figure out what we mean, rather than just doing what we say. Indeed, we could even have the AI treat its own source code as a signal about what I mean, rather than as instructions to be taken literally - potentially recognizing when the program we wrote is not quite the program we intended, and doing what we intended instead. (Obviously the program itself would need some open-ended introspection/self-modification capabilities to allow this.) As long as the initial code and initial model of me are “close enough”, the AI could figure out what I meant, and we’d have a “basin of convergence” - any close-enough code/model would converge to what we actually intended.

Of course, that all requires formalizing communication priors. This post sketches out a relatively simple version based on the Alice/Bob example above, then talks about the more complicated version needed for alignment purposes, and about what the approach does and does not do.

Formalizing a Communication Prior

We’ll continue to use the Alice/Bob example with the colored shapes from earlier, though we’ll use more general formulas. We’ll call the message $\text{“}M\text{”}$ and the intended meaning (i.e. object) $X$.

Our receiver (i.e. Bob) starts with some naive guess at the meaning $X$, just based on the literal content of the message - i.e. “My object is red” would, taken literally, imply that it’s either the triangle or the circle. We’ll write this naive guess as

$P_{0}[X \mid \text{“}M\text{”}] = \frac{1}{P[M]} P[M \mid X] \, P[X]$

This is basically just a Bayesian update. The only subtlety is the quotes around $\text{“}M\text{”}$ - this makes a distinction between the message $\text{“}M\text{”}$ (i.e. the letters “My object is red” on a screen) and the literal meaning $M$ of the message (the fact that the object is red). The formula says that the naive guess at the intended meaning given the message (i.e. $P_{0}[X \mid \text{“}M\text{”}]$) is just a Bayesian update on the literal meaning of the message.

At this stage, assuming a uniform prior on the three objects, Bob would say that:

“My object is round” means it’s the circle
“My object is red” gives ½ chance each to circle and triangle
“This is not a pipe” gives ⅓ chance to each object
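As a sanity check on these numbers, here is a minimal Python sketch of the naive update. The objects and messages come from the example; the `LITERAL` table, the function name `naive_posterior`, and the encoding of $P[M \mid X]$ as 1 when the message is literally true of the object (and 0 otherwise) are illustrative assumptions rather than anything fixed by the post.

```python
from fractions import Fraction

OBJECTS = ["red triangle", "blue square", "red circle"]

# Literal meaning M of each message "M": the set of objects it is true of.
# (Assumption: "This is not a pipe" is literally true of every object.)
LITERAL = {
    "My object is round": {"red circle"},
    "My object is red": {"red triangle", "red circle"},
    "This is not a pipe": set(OBJECTS),
}

def naive_posterior(message, prior=None):
    """P_0[X | "M"]: Bayesian update of the prior on the literal meaning M."""
    if prior is None:
        prior = {x: Fraction(1, len(OBJECTS)) for x in OBJECTS}
    # P[M | X] is 1 if the message is literally true of X, else 0.
    unnormalized = {
        x: prior[x] * (1 if x in LITERAL[message] else 0) for x in OBJECTS
    }
    total = sum(unnormalized.values())  # this is P[M]
    return {x: p / total for x, p in unnormalized.items()}

for message in LITERAL:
    print(message, {x: str(p) for x, p in naive_posterior(message).items()})
```

Exact fractions are used so the ½ and ⅓ values can be compared without floating-point noise.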

But at this point, Bob hasn’t accounted for all his information. He also knows that Alice chose the message to maximize the chance that Bob would guess the right object. So, let’s do another Bayesian update on the assumption that Alice chose the message to maximize the probability assigned to $X$ under $P_{0}$.

$P_{1}[X \mid \text{“}M\text{”}] = \frac{1}{Z} P[(\text{“}M\text{” maximizes } P_{0}[X \mid \text{“}M\text{”}]) \mid X] \, P_{0}[X \mid \text{“}M\text{”}]$

(Side note: $Z$ here is a generic symbol for the normalizer in the update, which would normally be $P[\text{“}M\text{”}]$. I’ll continue to use it going forward, since the exact things we’re implicitly conditioning $P[\text{“}M\text{”}]$ on can be a bit confusing in a way which doesn’t add anything.) This is another Bayesian update, but this time starting from $P_{0}$ rather than the original prior. At this stage, Bob would say that:

“My object is round” means it’s the circle
“My object is red” means it’s the triangle, since “My object is red” is not the message which gives highest $P_{0}[X \mid \text{“}M\text{”}]$ when $X$ is the circle.
“This is not a pipe” means it’s the square, since “This is not a pipe” would not give the highest $P_{0}[X \mid \text{“}M\text{”}]$ when $X$ is the circle or triangle.
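The three conclusions above can be reproduced with a short sketch of this second update. The $P_{0}$ table is the naive posterior under a uniform prior; the function names and the rule that Alice breaks ties uniformly among maximizing messages are assumptions added here for illustration (no ties actually occur in this game).

```python
from fractions import Fraction

OBJECTS = ["red triangle", "blue square", "red circle"]

# P_0[X | "M"] under a uniform prior, taken from the naive stage.
P0 = {
    "My object is round": {
        "red triangle": Fraction(0), "blue square": Fraction(0), "red circle": Fraction(1),
    },
    "My object is red": {
        "red triangle": Fraction(1, 2), "blue square": Fraction(0), "red circle": Fraction(1, 2),
    },
    "This is not a pipe": {x: Fraction(1, 3) for x in OBJECTS},
}

def best_messages(posterior, x):
    """Messages that give object x the highest probability under `posterior`."""
    top = max(posterior[m][x] for m in posterior)
    return [m for m in posterior if posterior[m][x] == top]

def pragmatic_update(posterior, p0):
    """Update p0 on the event that "M" maximizes posterior[X | "M"]."""
    updated = {}
    for m in p0:
        weights = {}
        for x in OBJECTS:
            best = best_messages(posterior, x)
            # Assumption: Alice breaks ties uniformly among maximizing messages.
            likelihood = Fraction(1, len(best)) if m in best else Fraction(0)
            weights[x] = likelihood * p0[m][x]
        z = sum(weights.values())  # normalizer Z; nonzero for every message here
        updated[m] = {x: w / z for x, w in weights.items()}
    return updated

P1 = pragmatic_update(P0, P0)
for m, dist in P1.items():
    print(m, {x: str(p) for x, p in dist.items()})
```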

Let’s do one more step, just to illustrate. Bob still hasn’t used all his information - it’s not just that Alice chose the message to maximize the probability assigned to $X$ under $P_{0}$, she also chose it to maximize the probability assigned to $X$ under $P_{1}$. How did she choose the message to maximize both of these simultaneously? Well, given our formulas above, if $\text{“}M\text{”}$ maximizes $P_{1}[X \mid \text{“}M\text{”}]$, then that implies that $\text{“}M\text{”}$ maximizes $P_{0}[X \mid \text{“}M\text{”}]$ as well. However, the implication does not go back the other way in general; the fact that $\text{“}M\text{”}$ maximizes $P_{1}[X \mid \text{“}M\text{”}]$ is stronger.

Intuitively, we’re “ruling out” messages for each $X$ at each stage. Any message not ruled out at stage 1 was also not ruled out at stage 0 - the messages “not ruled out” for $X$ are precisely those which assign maximal probability to $X$ at all earlier stages.

Upshot: by choosing $\text{“}M\text{”}$ to maximize $P_{1}[X \mid \text{“}M\text{”}]$, Alice also implicitly chose $\text{“}M\text{”}$ to maximize $P_{0}[X \mid \text{“}M\text{”}]$.

Anyway, next step: we form $P_{2}[X \mid \text{“}M\text{”}]$ by updating on the fact that $\text{“}M\text{”}$ maximizes the probability assigned to $X$ under $P_{1}$.

$P_{2}[X \mid \text{“}M\text{”}] = \frac{1}{Z} P[(\text{“}M\text{” maximizes } P_{1}[X \mid \text{“}M\text{”}]) \mid X] \, P_{0}[X \mid \text{“}M\text{”}]$

Note that we’re still using $P_{0}$ as our prior in this update; that’s to avoid double-counting the fact that Alice is maximizing $P_{0}$, while still accounting for the literal content $M$. If we continue the chain, each subsequent step will look like

$P_{k+1}[X \mid \text{“}M\text{”}] = \frac{1}{Z} P[(\text{“}M\text{” maximizes } P_{k}[X \mid \text{“}M\text{”}]) \mid X] \, P_{0}[X \mid \text{“}M\text{”}]$

In this case, we find that $P_{2}$ is exactly the same as $P_{1}$ - the calculation has converged in finite time. More generally, we can say that Bob’s final probabilities should be

$P[X \mid \text{“}M\text{”}] = P_{\infty}[X \mid \text{“}M\text{”}] = \lim_{k \to \infty} P_{k}[X \mid \text{“}M\text{”}]$
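To see the finite-time convergence concretely, the chain can be iterated until two consecutive stages agree. This is a rough sketch under stated assumptions: the shortened object and message labels, the `step` and `iterate` helpers, the uniform tie-breaking rule, and the `max_steps` cutoff are all hypothetical choices made for this illustration.

```python
from fractions import Fraction

OBJECTS = ["triangle", "square", "circle"]
MESSAGES = ["round", "red", "not a pipe"]

# P_0[X | "M"] for the example game, with a uniform prior over objects.
P0 = {
    "round": {"triangle": Fraction(0), "square": Fraction(0), "circle": Fraction(1)},
    "red": {"triangle": Fraction(1, 2), "square": Fraction(0), "circle": Fraction(1, 2)},
    "not a pipe": {x: Fraction(1, 3) for x in OBJECTS},
}

def step(p_k):
    """Compute P_{k+1} from P_k, always using P_0 as the prior."""
    p_next = {}
    for m in MESSAGES:
        weights = {}
        for x in OBJECTS:
            top = max(p_k[other][x] for other in MESSAGES)
            maximizers = [other for other in MESSAGES if p_k[other][x] == top]
            # Assumption: ties between maximizing messages are broken uniformly.
            likelihood = Fraction(1, len(maximizers)) if m in maximizers else Fraction(0)
            weights[x] = likelihood * P0[m][x]
        z = sum(weights.values())  # nonzero for every message in this game
        p_next[m] = {x: w / z for x, w in weights.items()}
    return p_next

def iterate(max_steps=20):
    """Return (list of distinct stages, first k with P_{k+1} == P_k)."""
    history = [P0]
    for k in range(max_steps):
        nxt = step(history[-1])
        if nxt == history[-1]:
            return history, k
        history.append(nxt)
    raise RuntimeError("no convergence within max_steps")

history, converged_at = iterate()
print(f"P_{converged_at + 1} equals P_{converged_at}")
print({m: {x: str(p) for x, p in dist.items()} for m, dist in history[-1].items()})
```

For a general game the normalizer can be zero for a message no object would ever prefer; this sketch does not handle that case because it never arises in the example.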

As a Fixed Point

The argument above is very meta, and hard to follow. We can simplify it by using a fixed point argument instead.

Instead of the whole sequence of updates, we’ll just start from $P_{0}$ (i.e. the literal content of the message), and update in a single step on the fact that Alice is optimizing the message: Alice chooses the message $\text{“}M\text{”}$ to maximize the final probability $P[X \mid \text{“}M\text{”}]$.

$P[X \mid \text{“}M\text{”}] = \frac{1}{Z} P[(\text{“}M\text{” maximizes } P[X \mid \text{“}M\text{”}]) \mid X] \, P_{0}[X \mid \text{“}M\text{”}]$

This is a fixed-point formula for $P[X \mid \text{“}M\text{”}]$. Formally, the “communication prior” itself is $(\text{“}M\text{” maximizes } P[X \mid \text{“}M\text{”}])$.

This is intuitively simple, but unfortunately $P[X \mid \text{“}M\text{”}]$ is extremely underdetermined by the fixed-point formula; there are many possible $P[X \mid \text{“}M\text{”}]$ we could choose, and $\lim_{k \to \infty} P_{k}[X \mid \text{“}M\text{”}]$ is just one of them. Intuitively: we could map messages to objects any way we want, as long as we respect the literal content of the message. As long as Alice and Bob both know the mapping, we choose $P[X \mid \text{“}M\text{”}]$ according to the mapping, and everything works out.

The fixed point formula is a criterion which any winning strategy must satisfy, but there are still many winning strategies.
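The underdetermination is easiest to see in a smaller game than the one in this post. The sketch below uses a hypothetical two-object game (objects `A` and `B`, messages `foo` and `bar`, both literally true of everything), which is an invented example and not part of the original setup. It checks candidate posteriors against the right-hand side of the fixed-point formula, assuming uniform tie-breaking among maximizing messages.

```python
from fractions import Fraction

def fixed_point_rhs(candidate, p0, objects, messages):
    """Right-hand side of the fixed-point formula, evaluated at `candidate`."""
    result = {}
    for m in messages:
        weights = {}
        for x in objects:
            top = max(candidate[other][x] for other in messages)
            maximizers = [other for other in messages if candidate[other][x] == top]
            # Assumption: ties between maximizing messages are broken uniformly.
            likelihood = Fraction(1, len(maximizers)) if m in maximizers else Fraction(0)
            weights[x] = likelihood * p0[m][x]
        z = sum(weights.values())
        if z == 0:
            raise ValueError(f"message {m!r} is never sent; posterior undefined")
        result[m] = {x: w / z for x, w in weights.items()}
    return result

def is_fixed_point(candidate, p0, objects, messages):
    return fixed_point_rhs(candidate, p0, objects, messages) == candidate

# Hypothetical toy game: neither message has any literal content.
OBJECTS = ["A", "B"]
MESSAGES = ["foo", "bar"]
ZERO, HALF, ONE = Fraction(0), Fraction(1, 2), Fraction(1)
P0 = {m: {x: HALF for x in OBJECTS} for m in MESSAGES}

candidates = {
    "foo means A": {"foo": {"A": ONE, "B": ZERO}, "bar": {"A": ZERO, "B": ONE}},
    "foo means B": {"foo": {"A": ZERO, "B": ONE}, "bar": {"A": ONE, "B": ZERO}},
    "uninformative": P0,
}
results = {
    name: is_fixed_point(c, P0, OBJECTS, MESSAGES) for name, c in candidates.items()
}
print(results)
```

All three candidates satisfy the formula in this toy game. The two mappings are both winning strategies, while the uninformative candidate satisfies the formula without letting Bob identify the object, so in this sketch the formula alone does not pick out a winning strategy either.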

Our particular choice of $P[X \mid \text{“}M\text{”}] = \lim_{k \to \infty} P_{k}[X \mid \text{“}M\text{”}]$ comes from iteratively expanding the fixed-point formula, with initial point $P_{0}$. If either Alice or Bob decides to use this model, and the other knows that they’re using it, then it’s locked in.

More generally: each player’s optimal choices depend heavily on their model of the other player. Alice wants to act like Bob’s model of Alice, and Bob wants to act like Alice’s model of Bob. Then there’s the whole tower of Alice’s model of Bob’s model of Alice’s model of…. Our $P_{k}[X \mid \text{“}M\text{”}]$ sequence shows what that tower looks like for one particular model of Alice/Bob.

Beyond Idealized Agents

The $(\text{“}M\text{” maximizes } P[X \mid \text{“}M\text{”}])$ communication prior is where Alice and Bob’s models of each other enter. In this case, we’re effectively assuming that Alice is a perfect agent - i.e. she picks her message to perfectly optimize Bob’s posterior. This is an idealized communication prior for idealized agents.

For alignment, we instead want a model of how humans communicate - as people who’ve played CodeNames can confirm, humans do not reliably think through many levels of implications of their word-choices! We really want to update on something like (<rough-model-of-human> thinks $\text{“}M\text{”}$ results in high $P[X \mid \text{“}M\text{”}]$). The better the model of how the human chose $\text{“}M\text{”}$ based on what they want, the better the AI will be able to guess what we want (i.e. $X$) from our “messages”.
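One way to make “rough model of human” concrete in the toy game is to replace the exact maximizer with a noisy one. The sketch below swaps in a softmax-style speaker, similar in spirit to speaker models used in the pragmatics literature; this is an assumption chosen for illustration, not the human model the post has in mind. The function names and the rationality parameter `beta` (assumed positive) are hypothetical.

```python
OBJECTS = ["triangle", "square", "circle"]
MESSAGES = ["round", "red", "not a pipe"]

# Naive posterior P_0[X | "M"] from the example, with a uniform prior.
P0 = {
    "round": {"triangle": 0.0, "square": 0.0, "circle": 1.0},
    "red": {"triangle": 0.5, "square": 0.0, "circle": 0.5},
    "not a pipe": {x: 1.0 / 3.0 for x in OBJECTS},
}

def noisy_speaker(x, beta):
    """Hypothetical human model: P["M" | X] proportional to P_0[X | "M"] ** beta."""
    scores = {m: P0[m][x] ** beta for m in MESSAGES}
    total = sum(scores.values())  # positive: "not a pipe" is true of every object
    return {m: s / total for m, s in scores.items()}

def listener(message, beta):
    """Update P_0 on the message having been chosen by the noisy speaker."""
    weights = {x: noisy_speaker(x, beta)[message] * P0[message][x] for x in OBJECTS}
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}

for beta in (1, 5, 50):
    print(beta, listener("red", beta))
```

With `beta = 1` the listener only leans toward the triangle on hearing “red” (probability 11/16), and as `beta` grows the result approaches the idealized answer. In this sketch, the less carefully the modelled human optimizes, the less the AI should read into their word choice.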

To the extent that the AI is modelling the human modelling the AI, we still get the meta-tower and possibly a fixed point formula (depending on how good the model of the AI in the human’s head is). The AI can treat both its own code and the human-model as “messages”, and so potentially correct sufficiently-small errors in them.

What This Does And Does Not Do

In some sense, this idea solves basically none of the core problems of alignment. We still need a good-enough model of a human and a good-enough pointer to human values. We’d still like an AI architecture with goals stable under successor-construction. For maximum safety, we’d still ideally like some good-enough scaled-down tests and/or proofs that some subcomponents actually work the way we intuitively expect. Etc.

What this does buy us is a basin of convergence. On all of the key pieces, we just need to be “close enough” for the whole thing to work. Potentially being able to recover even from small bugs in the source code is a pretty nice selling point. Of course, there are probably basins of convergence for many approaches, but this one offers at least the possibility of being able to explicitly model the basin. How sensitive is the end result to errors along different dimensions of the human-model? That’s the sort of question which could be addressed (either theoretically or empirically) in toy models along these lines, and potentially lead to generalizable insights about which pieces matter more or less. In other words: we could potentially say things about how big the basin of convergence is, and along which directions it’s wide/narrow.

That said, I still think the biggest blocker - both for this approach and many others - is figuring out pointers to human values, and how pointers to real-world abstract objects/concepts work more generally. Right now, we don’t even understand the type-signature of a “pointer” in this sense, so it’s rather difficult to talk about a basin-of-convergence for human-value-pointers.

Yeah, this is a pretty common technique at CHAI (relevant search terms: pragmatics, pedagogy, Gricean semantics). Some related work:

Showing versus Doing: Teaching by Demonstration (differences when you ask humans to teach vs. demonstrate a behavior)
Inverse Reward Design (interpret the stated reward function as a message, not a reward function)
Cooperative Inverse Reinforcement Learning (formalizing the interaction as a game)
Literal or Pedagogic Human? Analyzing Human Model Misspecification in Objective Learning (do you benefit from a pedagogic assumption? It turns out the literal assumption has robustness benefits, presumably because it doesn't rule out possibilities that it turns out the human does sometimes consider)
Preferences Implicit in the State of the World (interpret the state of the world as a message).

I agree that it should be possible to do this over behavior instead of rewards, but behavior-space is much larger or more complex than reward-space and so it would require significantly more data in order to work as well.

I don't think it can be significantly harder for behavior-space than reward-space. If it were, then one of our first messages would be (a mathematical version of) "the behavior I want is approximately reward-maximizing". I don't think that's actually the right way to do things, but it should at least give a reduction of the problem.

Anyway, I'd say the most important difference between this and various existing strategies is that we can learn "at the outermost level". We can treat the code as message, so there can potentially be a basin of attraction even for bugs in the code. The entire ontology of the agent-model can potentially be wrong, but still end up in the basin. We can decide to play an entirely different game. Some of that could potentially be incorporated into other approaches (maybe it has and I just didn't know about it), though it's tricky to really make everything subject to override later on.

Of course, the trade-off is that if everything is subject to override then we really need to start in the basin of attraction - there's no hardcoded assumptions to fall back on if things go off the rails. Thus, robustness tradeoff.

If it were, then one of our first messages would be (a mathematical version of) "the behavior I want is approximately reward-maximizing".

Yeah, I agree that if we had a space of messages that was expressive enough to encode this, then it would be fine to work in behavior space.

Yeah, this is basically CIRL, when the human-model is smart enough to do Gricean communication. The important open problems left over after starting with CIRL are basically "how do you make sure that your model of communicating humans infers the right things about human preferences?", both due to very obvious problems like human irrationality, and also due to weirder stuff like the human intuition that we can't put complete confidence in any single model.

Roughly, yeah, though there are some differences - e.g. here the AI has no prior "directly about" values, it's all mediated by the "messages", which are themselves informing intended AI behavior directly. So e.g. we don't need to assume that "human values" live in the space of utility functions, or that the AI is going to explicitly optimize for something, or anything like that. But most of the things which are hard in CIRL are indeed still hard here; it doesn't really solve anything in itself.

One way to interpret it: this approach uses a similar game to CIRL, but strips out most of the assumptions about the AI and human being expected utility maximizers. To the extent we're modelling the human as an optimizer, it's just an approximation to kick off communication, and can be discarded later on.

I am confused about the opening of your analysis:

In some sense, this idea solves basically none of the core problems of alignment. We still need a good-enough model of a human and a good-enough pointer to human values.

It seems to me that while the fixed point conception here doesn't uniquely determine a learning strategy, it should be possible to uniquely determine that strategy by building it into the training data.

In particular, if you have a base level of "reality" like the P_0 you describe, then it should be possible to train a model first on this reality, then present it with training scenarios that start by working directly on the "verifiable reality" subset, then build to "one layer removed" and so on.

My (very weak) shoulder-John says that just because this "feels like it converges" doesn't actually make any guarantees about convergence, but since P_0, P_1, etc. are very well specified it feels like a more approachable problem to try to analyze a specific basis of convergence. If one gets a basis of convergence, AND an algorithm for locating that basis of convergence, that seems to me sufficient for object-level honesty, which would be a major result.

I'm curious if you disagree with:

1. The problem of choosing a basis of convergence is tractable (relative to alignment research in general)
2. The problem of verifying that AI is in the basis of convergence is tractable
3. Training an AI into a chosen basis of convergence could enforce that AI to be honest on the object level when object level honesty is available
4. Object level honesty is not a major result, for example because not enough important problems can be reduced to object level or because it is already achievable

Writing that out, I am guessing that 2 may be a disagreement that I still disagree with (e.g. you may think it is not tractable), and 3 may contain a disagreement that is compelling and hard to resolve (e.g. you may think we cannot verify which basis of convergence satisfies our honesty criteria--my intuition is that this would require not having a basis of convergence at all).

I just want to mention that my recent critique of the definition of communication used here does not imply that this is any more inadequate for alignment than your remarks here suggest; in order to "do what I mean, not what I say," we actually want to include connotations and implicature rather than only the literal meaning.

That being said, a theory of meaning which addressed the critique might potentially open the path for a definition much better than the one here. In particular, it might help address the question of what ontology the beliefs should even be in (in order to represent human values etc).

Nice post! It was clear, and I agree that knowing more about the basin of attraction is useful. I also like that you caveat the usefulness of this idea yourself.

Communication priors suggest an approach to certain problems in AI alignment. Intuitively, rather than saying “I want X” and the AI taking that completely literally (as computers generally do), the AI instead updates on the fact that I said “I want X”, and tries to figure out what those words imply about what I actually want. It’s like pushing the “do what I mean” button - the AI would try to figure out what we mean, rather than just doing what we say.

This makes me think about Inverse Reward Design, when the reward signal given is interpreted as an intention with the context of these specific training environments.

More generally: each player’s optimal choices depends heavily on their model of the other player. Alice wants to act like Bob’s model of Alice, and Bob wants to act like Alice’s model of Bob. Then there’s the whole tower of Alice’s model of Bob’s model of Alice’s model of…. Our sequence shows what that tower looks like for one particular model of Alice/Bob.

Makes me think of Common Knowledge, as defined for distributed computing: $ϕ$ is common knowledge iff everyone knows that everyone knows that ... that everyone knows $ϕ$. That probably only applies to the idealized case, but it might be another way to look at it.
