Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

HIGHLIGHTS

Infra-Bayesianism sequence(Diffractor and Vanessa Kosoy) (summarized by Rohin): I have finally understood this sequence enough to write a summary about it, thanks to AXRP Episode 5. Think of this as a combined summary + highlight of the sequence and the podcast episode.

The central problem of embedded agency (AN #31) is that there is no clean separation between an agent and its environment: rather, the agent is embedded in its environment, and so when reasoning about the environment it is reasoning about an entity that is “bigger” than it (and in particular, an entity that contains it). We don’t have a good formalism that can account for this sort of reasoning. The standard Bayesian account requires the agent to have a space of precise hypotheses for the environment, but then the true hypothesis would also include a precise model of the agent itself, and it is usually not possible to have an agent contain a perfect model of itself.

A natural idea is to reduce the precision of hypotheses. Rather than requiring a hypothesis to assign a probability to every possible sequence of bits, we now allow the hypotheses to say “I have no clue about this aspect of this part of the environment, but I can assign probabilities to the rest of the environment”. The agent can then limit itself to hypotheses that don’t make predictions about the part of the environment that corresponds to the agent, but do make predictions about other parts of the environment.

Another way to think about it is that it allows you to start from the default of “I know nothing about the environment”, and then add in details that you do know to get an object that encodes the easily computable properties of the environment you can exploit, while not making any commitments about the rest of the environment.

Of course, so far this is just the idea of using Knightian uncertainty. The contribution of infra-Bayesianism is to show how to formally specify a decision procedure that uses Knightian uncertainty while still satisfying many properties we would like a decision procedure to satisfy. You can thus think of it as an extension of the standard Bayesian account of decision-making to the setting in which the agent cannot represent the true environment as a hypothesis over which it can reason.

Imagine that, instead of having a probability distribution over hypotheses, we instead have two “levels”: first are all the properties we have Knightian uncertainty over, and then are all the properties we can reason about. For example, imagine that the environment is an infinite sequence of bits and we want to say that all the even bits come from flips of a possibly biased coin, but we know nothing about the odd coin flips. Then, at the top level, we have a separate branch for each possible setting of the odd coin flips. At the second level, we have a separate branch for each possible bias of the coin. At the leaves, we have the hypothesis “the odd bits are as set by the top level, and the even bits are generated from coin flips with the bias set by the second level”.

(Yes, there are lots of infinite quantities in this example, so you couldn’t implement it the way I’m describing it here. An actual implementation would not represent the top level explicitly and would use computable functions to represent the bottom level. We’re not going to worry about this for now.)

If we were using orthodox Bayesianism, we would put a probability distribution over the top level, and a probability distribution over the bottom level. You could then multiply that out to get a single probability distribution over the hypotheses, which is why we don’t do this separation into two levels in orthodox Bayesianism. (Also, just to reiterate, the whole point is that we can’t put a probability distribution at the top level, since that implies e.g. making precise predictions about an environment that is bigger than you are.)

Infra-Bayesianism says, “what if we just… don't put a probability distribution over the top level?” Instead, we have a set of probability distributions over hypotheses, and Knightian uncertainty over which distribution in this set is the right one. A common suggestion for Knightian uncertainty is to do worst-case reasoning, so that’s what we’ll do at the top level. Lots of problems immediately crop up, but it turns out we can fix them.

First, let’s say your top level consists of two distributions over hypotheses, A and B. You then observe some evidence E, which A thought was 50% likely and B thought was 1% likely. Intuitively, you want to say that this makes A “more likely” relative to B than we previously thought. But how can you do this if you have Knightian uncertainty and are just planning to do worst-case reasoning over A and B? The solution here is to work with unnormalized probability distributions at the second level. Then, in the case above, we can just scale the “probabilities” in both A and B by the likelihood assigned to E. We don’t normalize A and B after doing this scaling.

But now what exactly do the numbers mean if we’re going to leave these distributions unnormalized? Regular probabilities only really make sense if they sum to 1. We can take a different view on what a “probability distribution” is -- instead of treating it as an object that tells you how likely various hypotheses are, treat it as an object that tells you how much we care about particular hypotheses. (See relatedposts (AN #95).) So scaling down the “probability” of a hypothesis just means that we care less about what that hypothesis “wants” us to do.

This would be enough if we were going to take an average over A and B to make our final decision. However, our plan is to do worst-case reasoning at the top level. This interacts horribly with our current proposal: when we scale hypotheses in A by 0.5 on average, and hypotheses in B by 0.01 on average, the minimization at the top level is going to place more weight on B, since B is now more likely to be the worst case. Surely this is wrong?

What’s happening here is that B gets most of its expected utility in worlds where we observe different evidence, but the worst-case reasoning at the top level doesn’t take this into account. Before we update, since B assigned 1% to E, the expected utility of B is given by 0.99 * expected utility given not-E + 0.01 * expected utility given E. After the update, the second part remains but the first part disappears, which makes the worst-case reasoning wonky. So what we do is we keep track of the first part as well and make sure that our worst-case reasoning takes it into account.

This gives us infradistributions: sets of (m, b) pairs, where m is an unnormalized probability distribution and b corresponds to “the value we would have gotten if we had seen different evidence”. When we observe some evidence E, the hypotheses within m are scaled by the likelihood they assign to E, and b is updated to include the value we would have gotten in the world where we saw anything other than E. Note that it is important to specify the utility function for this to make sense, as otherwise it is not clear how to update b. To compute utilities for decision-making, we do worst-case reasoning over the (m, b) pairs, where we use standard expected values within each m. We can prove that this update rule satisfies dynamic consistency: if initially you believe “if I see X, then I want to do Y”, then after seeing X, you believe “I want to do Y”.

So what can we do with infradistributions? Our original motivation was to talk about embedded agency, so a natural place to start is with decision-theory problems in which the environment contains a perfect predictor of the agent, such as in Newcomb’s problem. Unfortunately, we can’t immediately write this down with infradistributions because we have no way of (easily) formally representing “the environment perfectly predicts my actions”. One trick we can use is to consider hypotheses in which the environment just spits out some action, without the constraint that it must match the agent’s action. We then modify the utility function to give infinite utility when the prediction is incorrect. Since we do worst-case reasoning, the agent will effectively act as though this situation is impossible. With this trick, infra-Bayesianism performs similarly to UDT on a variety of challenging decision problems.

Rohin's opinion: This seems pretty cool, though I don’t understand it that well yet. While I don’t yet feel like I have a better philosophical understanding of embedded agency (or its subproblems), I do think this is significant progress along that path.

In particular, one thing that feels a bit odd to me is the choice of worst-case reasoning for the top level -- I don’t really see anything that forces that to be the case. As far as I can tell, we could get all the same results by using best-case reasoning instead (assuming we modified the other aspects appropriately). The obvious justification for worst-case reasoning is that it is a form of risk aversion, but it doesn’t feel like that is really sufficient -- risk aversion in humans is pretty different from literal worst-case reasoning, and also none of the results in the post seem to depend on risk aversion.

I wonder whether the important thing is just that we don’t do expected value reasoning at the top level, and there are in fact a wide variety of other kinds of decision rules that we could use that could all work. If so, it seems interesting to characterize what makes some rules work while others don’t. I suspect that would be a more philosophically satisfying answer to “how should agents reason about environments that are bigger than them”.

1. Learn at all levels: We don’t just learn about uncertain values, we also learn how to learn values, and how to learn to learn values, etc. There is no perfect loss function that works at any level; we assume conservatively that Goodhart’s Law will always apply. In order to not have to give infinite feedback for the infinite levels, we need to share feedback between levels.

2. Learn to interpret feedback: Similarly, we conservatively assume that there is no perfect feedback; so rather than fixing a model for how to interpret feedback, we want feedback to be uncertain and reinterpretable.

3. Process-level feedback: Rather than having to justify all feedback in terms of the consequences of the agent’s actions, we should also be able to provide feedback on the way the agent is reasoning. Sometimes we’ll have to judge the entire chain of reasoning with whole-process feedback.

This post notes that we can motivate these desiderata from multiple different frames:

1. Outer alignment: The core problem of outer alignment is that any specified objective tends to be wrong. This applies at all levels, suggesting that we need to learn at all levels, and also learn to interpret feedback for the same reason. Process-level feedback is then needed because not all decisions can be justified based on consequences of actions.

2. Recovering from human error: Another view that we can take is that humans don’t always give the right feedback, and so we need to be robust to this. This motivates all the desiderata in the same way as for outer alignment.

3. Process-level feedback: We can instead view process-level feedback as central, since having agents doing the right type of reasoning (not just getting good outcomes) is crucial for inner alignment. In order to have something general (rather than identifying cases of bad reasoning one at a time), we could imagine learning a classifier that detects whether reasoning is good or not. However, then we don’t know whether the reasoning of the classifier is good or not. Once again, it seems we would like to learn at all levels.

4. Generalizing learning theory: In learning theory, we have a distribution over a set of hypotheses, which we update based on how well the hypotheses predict observations. Process-level feedback would allow us to provide feedback on an individual hypothesis, and this feedback could be uncertain. Reinterpretable feedback on the other hand can be thought of as part of a (future) theory of meta-learning.

ADVERSARIAL EXAMPLES

Avoiding textual adversarial examples(Noa Nabeshima) (summarized by Rohin): Last week I speculated that CLIP might "know" that a textual adversarial example is a "picture of an apple with a piece of paper saying an iPod on it" and the zero-shot classification prompt is preventing it from demonstrating this knowledge. Gwern Branwen commented to link me to this Twitter thread as well as this YouTube video in which better prompt engineering significantly reduces these textual adversarial examples, demonstrating that CLIP does "know" that it's looking at an apple with a piece of paper on it.

FIELD BUILDING

AI x-risk reduction: why I chose academia over industry(David Krueger) (summarized by Rohin): This post and its comments discuss considerations that impact whether new PhD graduates interested in reducing AI x-risk should work in academia or industry.

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

## HIGHLIGHTS

Infra-Bayesianism sequence

(Diffractor and Vanessa Kosoy)(summarized by Rohin): I have finally understood this sequence enough to write a summary about it, thanks to AXRP Episode 5. Think of this as a combined summary + highlight of the sequence and the podcast episode.The central problem of embedded agency (AN #31) is that there is no clean separation between an agent and its environment: rather, the agent is

embeddedin its environment, and so when reasoning about the environment it is reasoning about an entity that is “bigger” than it (and in particular, an entity thatcontainsit). We don’t have a good formalism that can account for this sort of reasoning. The standard Bayesian account requires the agent to have a space of precise hypotheses for the environment, but then the true hypothesis would also include a precise model of the agent itself, and it is usually not possible to have an agent contain a perfect model of itself.A natural idea is to reduce the precision of hypotheses. Rather than requiring a hypothesis to assign a probability to every possible sequence of bits, we now allow the hypotheses to say “I have no clue about this aspect of this part of the environment, but I can assign probabilities to the rest of the environment”. The agent can then limit itself to hypotheses that don’t make predictions about the part of the environment that corresponds to the agent, but do make predictions about other parts of the environment.

Another way to think about it is that it allows you to start from the default of “I know nothing about the environment”, and then add in details that you do know to get an object that encodes the easily computable properties of the environment you can exploit, while not making any commitments about the rest of the environment.

Of course, so far this is just the idea of using Knightian uncertainty. The contribution of infra-Bayesianism is to show how to formally specify a decision procedure that uses Knightian uncertainty while still satisfying many properties we would like a decision procedure to satisfy. You can thus think of it as an extension of the standard Bayesian account of decision-making to the setting in which the agent cannot represent the true environment as a hypothesis over which it can reason.

Imagine that, instead of having a probability distribution over hypotheses, we instead have two “levels”: first are all the properties we have Knightian uncertainty over, and then are all the properties we can reason about. For example, imagine that the environment is an infinite sequence of bits and we want to say that all the even bits come from flips of a possibly biased coin, but we know nothing about the odd coin flips. Then, at the top level, we have a separate branch for each possible setting of the odd coin flips. At the second level, we have a separate branch for each possible bias of the coin. At the leaves, we have the hypothesis “the odd bits are as set by the top level, and the even bits are generated from coin flips with the bias set by the second level”.

(Yes, there are lots of infinite quantities in this example, so you couldn’t implement it the way I’m describing it here. An actual implementation would not represent the top level explicitly and would use computable functions to represent the bottom level. We’re not going to worry about this for now.)

If we were using orthodox Bayesianism, we would put a probability distribution over the top level, and a probability distribution over the bottom level. You could then multiply that out to get a single probability distribution over the hypotheses, which is why we don’t do this separation into two levels in orthodox Bayesianism. (Also, just to reiterate, the

whole pointis that we can’t put a probability distribution at the top level, since that implies e.g. making precise predictions about an environment that is bigger than you are.)Infra-Bayesianism says, “what if we just… don't put a probability distribution over the top level?” Instead, we have a set of probability distributions over hypotheses, and Knightian uncertainty over which distribution in this set is the right one. A common suggestion for Knightian uncertainty is to do

worst-casereasoning, so that’s what we’ll do at the top level. Lots of problems immediately crop up, but it turns out we can fix them.First, let’s say your top level consists of two distributions over hypotheses, A and B. You then observe some evidence E, which A thought was 50% likely and B thought was 1% likely. Intuitively, you want to say that this makes A “more likely” relative to B than we previously thought. But how can you do this if you have Knightian uncertainty and are just planning to do worst-case reasoning over A and B? The solution here is to work with

unnormalizedprobability distributions at the second level. Then, in the case above, we can just scale the “probabilities” in both A and B by the likelihood assigned to E. Wedon’tnormalize A and B after doing this scaling.But now what exactly do the numbers mean if we’re going to leave these distributions unnormalized? Regular probabilities only really make sense if they sum to 1. We can take a different view on what a “probability distribution” is -- instead of treating it as an object that tells you how

likelyvarious hypotheses are, treat it as an object that tells you how much wecareabout particular hypotheses. (See related posts (AN #95).) So scaling down the “probability” of a hypothesis just means that we care less about what that hypothesis “wants” us to do.This would be enough if we were going to take an average over A and B to make our final decision. However, our plan is to do worst-case reasoning at the top level. This interacts horribly with our current proposal: when we scale hypotheses in A by 0.5 on average, and hypotheses in B by 0.01 on average, the minimization at the top level is going to place

moreweight on B, since B is nowmorelikely to be the worst case. Surely this is wrong?What’s happening here is that B gets most of its expected utility in worlds where we observe different evidence, but the worst-case reasoning at the top level doesn’t take this into account. Before we update, since B assigned 1% to E, the expected utility of B is given by 0.99 * expected utility given not-E + 0.01 * expected utility given E. After the update, the second part remains but the first part disappears, which makes the worst-case reasoning wonky. So what we do is we keep track of the first part as well and make sure that our worst-case reasoning takes it into account.

This gives us

infradistributions: sets of (m, b) pairs, where m is an unnormalized probability distribution and b corresponds to “the value we would have gotten if we had seen different evidence”. When we observe some evidence E, the hypotheses within m are scaled by the likelihood they assign to E, and b is updated to include the value we would have gotten in the world where we saw anything other than E. Note that it is important to specify the utility function for this to make sense, as otherwise it is not clear how to update b. To compute utilities for decision-making, we do worst-case reasoning over the (m, b) pairs, where we use standard expected values within each m. We can prove that this update rule satisfiesdynamic consistency: if initially you believe “if I see X, then I want to do Y”, then after seeing X, you believe “I want to do Y”.So what can we do with infradistributions? Our original motivation was to talk about embedded agency, so a natural place to start is with decision-theory problems in which the environment contains a perfect predictor of the agent, such as in Newcomb’s problem. Unfortunately, we can’t immediately write this down with infradistributions because we have no way of (easily) formally representing “the environment perfectly predicts my actions”. One trick we can use is to consider hypotheses in which the environment just spits out some action, without the constraint that it must match the agent’s action. We then modify the utility function to give infinite utility when the prediction is incorrect. Since we do worst-case reasoning, the agent will effectively act as though this situation is impossible. With this trick, infra-Bayesianism performs similarly to UDT on a variety of challenging decision problems.

Read more:AXRP Episode 5 - Infra-BayesianismRohin's opinion:This seems pretty cool, though I don’t understand it that well yet. While I don’t yet feel like I have a better philosophical understanding of embedded agency (or its subproblems), I do think this is significant progress along that path.In particular, one thing that feels a bit odd to me is the choice of worst-case reasoning for the top level -- I don’t really see anything that

forcesthat to be the case. As far as I can tell, we could get all the same results by using best-case reasoning instead (assuming we modified the other aspects appropriately). The obvious justification for worst-case reasoning is that it is a form of risk aversion, but it doesn’t feel like that is really sufficient -- risk aversion in humans is pretty different from literal worst-case reasoning, and also none of the results in the post seem to depend on risk aversion.I wonder whether the important thing is just that we don’t do expected value reasoning at the top level, and there are in fact a wide variety of other kinds of decision rules that we could use that could all work. If so, it seems interesting to characterize what makes some rules work while others don’t. I suspect that would be a more philosophically satisfying answer to “how should agents reason about environments that are bigger than them”.

## TECHNICAL AI ALIGNMENT

## LEARNING HUMAN INTENT

Four Motivations for Learning Normativity

(Abram Demski)(summarized by Rohin): We’ve previously seen (AN #133) desiderata for agents that learn normativity from humans: specifically, we would like such agents to:1.

Learn at all levels:We don’t just learn about uncertain values, we also learn how to learn values, and how to learn to learn values, etc. There isno perfect loss functionthat works at any level; we assume conservatively that Goodhart’s Law will always apply. In order to not have to give infinite feedback for the infinite levels, we need toshare feedback between levels.2.

Learn to interpret feedback:Similarly, we conservatively assume that there isno perfect feedback; so rather than fixing a model for how to interpret feedback, we want feedback to beuncertainandreinterpretable.3.

Process-level feedback:Rather than having to justify all feedback in terms of the consequences of the agent’s actions, we should also be able to provide feedback on the way the agent is reasoning. Sometimes we’ll have to judge the entire chain of reasoning withwhole-process feedback.This post notes that we can motivate these desiderata from multiple different frames:

1.

Outer alignment:The core problem of outer alignment is that any specified objective tends to be wrong. This applies at all levels, suggesting that we need tolearn at all levels, and alsolearn to interpret feedbackfor the same reason.Process-level feedbackis then needed because not all decisions can be justified based on consequences of actions.2.

Recovering from human error:Another view that we can take is that humans don’t always give the right feedback, and so we need to be robust to this. This motivates all the desiderata in the same way as for outer alignment.3.

Process-level feedback:We can instead view process-level feedback as central, since having agents doing the right type ofreasoning(not just getting good outcomes) is crucial for inner alignment. In order to have something general (rather than identifying cases of bad reasoning one at a time), we could imagine learning a classifier that detects whether reasoning is good or not. However, then we don’t know whether the reasoning of the classifier is good or not. Once again, it seems we would like tolearn at all levels.4.

Generalizing learning theory:In learning theory, we have a distribution over a set of hypotheses, which we update based on how well the hypotheses predict observations.Process-level feedbackwould allow us to provide feedback on an individual hypothesis, and this feedback could beuncertain.Reinterpretable feedbackon the other hand can be thought of as part of a (future) theory of meta-learning.## ADVERSARIAL EXAMPLES

Avoiding textual adversarial examples

(Noa Nabeshima)(summarized by Rohin): Last week I speculated that CLIP might "know" that a textual adversarial example is a "picture of an apple with a piece of paper saying an iPod on it" and the zero-shot classification prompt is preventing it from demonstrating this knowledge. Gwern Branwen commented to link me to this Twitter thread as well as this YouTube video in which better prompt engineering significantly reduces these textual adversarial examples, demonstrating that CLIP does "know" that it's looking at an apple with a piece of paper on it.## FIELD BUILDING

AI x-risk reduction: why I chose academia over industry

(David Krueger)(summarized by Rohin): This post and its comments discuss considerations that impact whether new PhD graduates interested in reducing AI x-risk should work in academia or industry.## MISCELLANEOUS (ALIGNMENT)

Intermittent Distillations #1

(Mark Xu)(summarized by Rohin): A post in the same style as this newsletter.Key Concepts in AI Safety

(Tim G. J. Rudner et al)(summarized by Rohin): This overview from CSET gives a brief introduction to AI safety using the specification, robustness, and assurance (SRA) framework (AN #26). Follow-up reports cover interpretability and adversarial examples / robustness. I don’t expect these to be novel to readers of this newsletter -- I include them in case anyone wants a brief overview, as well as to provide links to AI safety reports that will likely be read by government officials.## NEWS

Chinese translation of Human Compatible (summarized by Rohin): The Chinese translation of Human Compatible (AN #69) came out in October and the first chapter is here.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by

replying to this email.PODCAST

An audio podcast version of the

Alignment Newsletteris available. This podcast is an audio version of the newsletter, recorded by Robert Miles.