Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.


Open Problems in Cooperative AI (Allan Dafoe et al) (summarized by Flo): Cooperation can often lead to better outcomes for everyone involved and some progress in AI, like improved machine translation or smart contracts, can make cooperation easier. On the other hand, as AI agents will likely become more and more important, it is crucial they can cooperate with each other and with humans. The term cooperative AI combines these perspectives and refers to "AI research trying to help individuals, humans and machines, to find ways to improve their joint welfare". While previous AI research has focused on individual intelligence of artificial agents, more work on social intelligence and cooperative capabilities is needed. The paper provides an overview of research relevant to cooperative AI from a wide range of disciplines and aims to facilitate more interdisciplinary conversations about cooperative AI by providing a common framework and vocabulary.

Research on cooperative opportunities, situations in which gains from cooperation are possible, can differ along four major dimensions:

1) How much do interests overlap or conflict?

2) What kind of agents are involved? (Humans, machines, organizations)

3) What perspective is taken: the perspective of an individual trying to cooperate or a social planner trying to incentivize cooperation?

4) What is the scope and how interdisciplinary should research be?

Four types of capabilities can be crucial for cooperation:

1) Understanding: In cases where cooperation is optimal for all parties, with no incentive to deviate, but agents lack understanding of either the environment or one another, they may still fail to reach a cooperative equilibrium. For example, one agent might have false beliefs about the others' beliefs and preferences and thus their incentive for defection. This is particularly hard to get around, because (1) preferences might not be defined explicitly or might even be incoherent; (2) there might be incentives to misrepresent preferences; and (3) the recursive nature of beliefs about other agents' beliefs may be challenging to handle.

2) Communication: In other cases, like the Stag Hunt game, there are multiple equilibria but agents' incentives are mostly aligned. Still, a lack of communicative abilities or common ground to interpret each other's messages can lead to agents converging to a suboptimal equilibrium. This is complicated further by constraints like limited bandwidth, high latency, or compatibility with human forms of communication.

3) Commitment: While communication helps when incentives are mostly aligned, in games like Chicken where some equilibria clearly favour one agent, communication alone is insufficient as agents have an incentive to lie. In some cases, these problems can be circumvented using costly signals. However, often some form of credible commitment device to ensure cooperation, or at least truth-telling, is needed. Commitments can be unconditional ("I won't defect") or conditional ("I won't defect if you won't"), and they can be unilateral or multilateral (requiring multiple actors to consent, but binding them all). While unilateral unconditional commitments are most accessible, other forms can be a lot more powerful enablers of cooperation. Mechanisms that could be useful for these more powerful commitments include reputation, delegation to a trusted third party and (smart) contracts.

4) Institutions: In some games like the Prisoner's Dilemma, defection is a dominant strategy for all players, even if they would all be better off with mutual cooperation. In such situations, changing the rules of the game to align incentives and facilitate cooperation can make everyone better off, for example by linking different games (as done with the Iterated Prisoner's Dilemma) or introducing institutions. Institutions can be decentralized and entail (sometimes implicit) enforcement by players (norms, conventions, trust and reputation) or involve a centralized enforcing authority. The study of centralized institutions can draw on the literature in social choice theory, fairness, and mechanism design.
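To make the three games named above concrete, here is a minimal sketch that enumerates their pure-strategy Nash equilibria. The payoff numbers are illustrative assumptions (they are not taken from the paper), and the function name `pure_nash_equilibria` is made up for this example.

```python
from itertools import product

# Payoffs are (row player, column player).
# Action 0 is the "cooperative" move, action 1 the "uncooperative" one.
# All numbers below are illustrative assumptions.
GAMES = {
    "prisoners_dilemma": {  # 0 = cooperate, 1 = defect
        (0, 0): (3, 3), (0, 1): (0, 5),
        (1, 0): (5, 0), (1, 1): (1, 1),
    },
    "stag_hunt": {  # 0 = hunt stag, 1 = hunt hare
        (0, 0): (4, 4), (0, 1): (0, 3),
        (1, 0): (3, 0), (1, 1): (3, 3),
    },
    "chicken": {  # 0 = swerve, 1 = drive straight
        (0, 0): (3, 3), (0, 1): (1, 4),
        (1, 0): (4, 1), (1, 1): (0, 0),
    },
}


def pure_nash_equilibria(payoffs):
    """Return action profiles where neither player gains by deviating alone."""
    equilibria = []
    for a, b in product((0, 1), repeat=2):
        row_ok = all(payoffs[(a, b)][0] >= payoffs[(a2, b)][0] for a2 in (0, 1))
        col_ok = all(payoffs[(a, b)][1] >= payoffs[(a, b2)][1] for b2 in (0, 1))
        if row_ok and col_ok:
            equilibria.append((a, b))
    return equilibria


for name, payoffs in GAMES.items():
    print(name, pure_nash_equilibria(payoffs))
```

With these payoffs, the Prisoner's Dilemma has mutual defection as its only equilibrium (the case where changing the rules helps), the Stag Hunt has two equilibria of which one is better for both players (where communication helps), and Chicken has two asymmetric equilibria that each favour one player (where commitment matters).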

Lastly, the authors list potential downsides that better cooperative capabilities could have: While they increase the welfare of the cooperating agents, this might be at the cost of other agents. For example, better cooperation between criminals would likely be bad for society. Similarly, cooperation can undermine prosocial competition at the expense of society, as seen in the example of cartels. Moreover, a better understanding of the world and others' preferences makes it easier to threaten others efficiently, and coercive capabilities greatly benefit from credible conditional commitments to carry out a threat. Furthermore, coercive capabilities might be important for stabilizing cooperation, as in the case of punishing defectors. Finally, powerful bad actors can use more efficient institutions to serve their own antisocial goals.

Flo's opinion: I believe that humanity could be a lot better off if we were better at cooperating instead of wasting resources on competition over positional goods, so I am excited about more work exploring how AI might help with facilitating cooperation. Given that current RL often requires specialized algorithms to learn cooperation, even when agents are trained jointly, it seems likely that failures to cooperate will become even more problematic if we start to deploy more and more RL agents without first making progress in cooperative AI. While it has been argued that a shift towards more attention to learning cooperation is bound to happen because of economic incentives, I am still glad to see more steps in that direction.

Rohin’s opinion: Amongst possible technical approaches that could help with AI governance, cooperative AI seems like a good contender (though there can be downsides, as the paper notes). I’d be especially excited to get more clarity on when and where cooperative skills are important, perhaps through “deployment stories” of AI systems in which cooperative skills play a critical role in ensuring good outcomes.



Eight claims about multi-agent AGI safety (Richard Ngo) (summarized by Rohin): This post clearly states eight claims about multi-agent AGI safety, and provides brief arguments for each of them. Since the post is itself basically a summary, I won’t go into detail here.


Transparency and AGI safety (Jennifer Lin) (summarized by Rohin): This post identifies four different motivations for working on transparency:

1. By learning more about how current neural networks work, we can improve our forecasts for AI timelines.

2. It seems necessary for inner alignment. In particular, whatever AI development model you take, it seems likely that there will be some possibility of emergent misbehavior, and there doesn’t yet seem to be a way to rule that out except via transparency.

3. A good solution to transparency would be sufficient for safety, since we could at least notice when AI systems were misaligned, and then choose not to deploy them.

4. Even if AI will “go well by default”, there are still instrumental reasons for transparency, such as improving cause prioritization in EA (via point 1), and for making systems more capable and robust.

After reviewing work on circuits (AN #111), the post suggests a few directions for future research:

1. Investigating how modular neural networks tend to be,

2. Figuring out how to make transparency outputs more precise and less subjective,

3. Looking for circuits in other networks (i.e. not image classifiers), see e.g. RL vision (AN #128),

4. Figuring out how transparency fits into an end-to-end story for AI safety.


Imitative Generalisation (AKA 'Learning the Prior') (Beth Barnes) (summarized by Rohin): This post explains a simplified version of the scheme in Learning the prior (AN #109) with an image classification example.

A key issue for distributional shift is that neural nets assign significant “probability” to “crazy” hypotheses. Imagine that we want to train a neural net to classify dog breeds, and in our training dataset D all huskies are on snow, but on the test dataset D’ they may also be on grass. Then a neural net is perfectly happy with the hypothesis “if most of the bottom half of the image is white, then it is a husky”, whereas humans would see that as crazy and would much prefer the hypothesis “a husky is a large, fluffy, wolf-like dog”, even if they don’t know what a husky looks like.

Thus, we might say that the human “prior” over hypotheses is much better than the corresponding neural net “prior”. So, let’s optimize our model using the human prior instead. In particular, we search for a hypothesis such that 1) humans think the hypothesis is likely (high human prior), and 2) the hypothesis leads humans to make good predictions on the training dataset D. Once we have this hypothesis, we have humans make predictions using that hypothesis on the test distribution D’, and train a model to imitate these predictions. We can then use this model to predict for the rest of D’. Notably, this model is now being used in an iid way (i.e. no distribution shift).
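As a toy illustration of the search described above, the sketch below scores two candidate hypotheses by (log human prior) + (log likelihood on D) and picks the best one. Everything here is an assumption made for illustration: the "images" are reduced to two hand-labelled features, the human is replaced by a lookup table of prior probabilities, and the step of training a model to imitate human predictions on D' is omitted.

```python
import math

SPURIOUS = "bottom half is white => husky"
SENSIBLE = "large, fluffy, wolf-like => husky"

# Training set D: every husky happens to be on snow.
D_train = [
    {"white_bottom": True, "wolf_like": True, "husky": True},
    {"white_bottom": True, "wolf_like": True, "husky": True},
    {"white_bottom": False, "wolf_like": False, "husky": False},
    {"white_bottom": False, "wolf_like": False, "husky": False},
]
# Test set D': the correlation with snow is broken.
D_test = [
    {"white_bottom": False, "wolf_like": True, "husky": True},   # husky on grass
    {"white_bottom": True, "wolf_like": False, "husky": False},  # other dog on snow
]

HYPOTHESES = {
    SPURIOUS: lambda x: x["white_bottom"],
    SENSIBLE: lambda x: x["wolf_like"],
}
# Stand-in for "humans think the hypothesis is likely" (made-up numbers).
HUMAN_PRIOR = {SPURIOUS: 0.01, SENSIBLE: 0.99}


def log_likelihood(name, data, accuracy=0.95):
    """Log-probability of the labels if predictions made with the hypothesis
    are right with probability `accuracy` (an assumed noise model)."""
    total = 0.0
    for x in data:
        correct = HYPOTHESES[name](x) == x["husky"]
        total += math.log(accuracy if correct else 1 - accuracy)
    return total


def best_hypothesis(prior, data):
    """Pick the hypothesis maximizing log prior + log likelihood on the data."""
    return max(HYPOTHESES, key=lambda h: math.log(prior[h]) + log_likelihood(h, data))


chosen = best_hypothesis(HUMAN_PRIOR, D_train)
print(chosen, [HYPOTHESES[chosen](x) for x in D_test])
```

Both hypotheses fit D equally well, so only the prior separates them; with the human prior, the selected hypothesis also labels the husky on grass correctly.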

A key challenge here is how to represent the hypotheses that we’re optimizing over -- they need to be amenable to ML-based optimization, but they also need to be interpretable to humans. A text-based hypothesis would likely be too cumbersome; it is possible that neural-net-based hypotheses could work if augmented by interpretability tools that let the humans understand the “knowledge” in the neural net (this is similar in spirit to Microscope AI (AN #72)).

For more details on the setup, see the full post, or my previous summary.

Learning Normativity: A Research Agenda (Abram Demski) (summarized by Rohin): To build aligned AI systems, we need to have our AI systems learn what to do from human feedback. However, it is unclear how to interpret that feedback: any particular piece of feedback could be wrong; economics provides many examples of stated preferences diverging from revealed preferences. Not only would we like our AI system to be uncertain about the interpretation of any particular piece of feedback, we would also like it to improve its process for interpreting human feedback. This would come from human feedback on the meta-level process by which the AI system learns. This gives us process-level feedback, where we make sure the AI system gets the right answers for the right reasons.

For example, perhaps initially we have an AI system that interprets human statements literally. Switching from this literal interpretation to a Gricean interpretation (where you also take into account the fact that the human chose to say this statement rather than other statements) is likely to yield improvements, and human feedback could help the AI system do this. (See also Gricean communication and meta-preferences, Communication Prior as Alignment Strategy, and multiple related CHAI papers.)
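One way to see the difference between the two interpretations is a tiny model in the style of the Rational Speech Acts framework. This is a technique I am swapping in for illustration, not something the post proposes, and the two-world, two-utterance setup and uniform prior over worlds are assumptions. The literal listener only checks whether a statement is true; the pragmatic listener also reasons about why the speaker chose it over the alternative.

```python
WORLDS = ["some_but_not_all", "all"]
UTTERANCES = ["some", "all"]
# Worlds in which each utterance is literally true.
TRUE_IN = {"some": {"some_but_not_all", "all"}, "all": {"all"}}


def normalize(scores):
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}


def literal_listener(utterance):
    """Uniform belief over the worlds where the utterance is literally true."""
    return normalize({w: 1.0 if w in TRUE_IN[utterance] else 0.0 for w in WORLDS})


def speaker(world):
    """Speaker prefers utterances that lead a literal listener to the true world."""
    return normalize({u: literal_listener(u)[world] for u in UTTERANCES})


def pragmatic_listener(utterance):
    """Bayesian update on the fact that the speaker chose this utterance."""
    return normalize({w: speaker(w)[utterance] for w in WORLDS})


print(literal_listener("some"))
print(pragmatic_listener("some"))
```

On hearing "some", the literal listener is split evenly between the two worlds, while the pragmatic listener puts 0.75 on "some but not all", since a speaker in the "all" world would more often have said "all".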

Of course, if we learn how to interpret human feedback, that too is going to be uncertain. We can fix this by “going meta” once again: learning how to learn to interpret human feedback. Iterating this process we get an infinite tower of “levels” of learning, and at every level we assume that feedback is not perfect and the loss function we are using is also not perfect.

In order for this to actually be feasible, we clearly need to share information across these various “levels” (or else it would take infinite time to learn across all of the levels). The AI system should not just learn to decrease the probability assigned to a single hypothesis, it should learn what kinds of hypotheses tend to be good or bad.

Rohin's opinion: See the opinion on the next summary.

Recursive Quantilizers II (Abram Demski) (summarized by Rohin): This post gives an example scheme inspired by the previous post. Like iterated amplification, it defines an ideal (analogous to HCH (AN #34)), and then an approximation to it that could be computed in practice.

Like HCH, we imagine a tree of systems that improve as we increase the depth of the tree. However, the nodes in the tree are question-answering (QA) systems, rather than direct questions. Given a few QA systems from a lower level, we construct a QA system at a higher level by asking one low-level QA system “what’s a good safe distribution over QA systems”, and a different low-level QA system “what’s a good metric that we can use to judge QA systems”. We then use quantilization (AN #48) to select better-performing QA systems, without optimizing too hard and falling prey to Goodhart’s Law. In the infinite limit, this should converge to a stable equilibrium.
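For readers unfamiliar with quantilization, here is a minimal sketch of the selection step on its own: draw candidates from a "safe" base distribution, rank them by a metric, and sample uniformly from the top q fraction rather than taking the single best. In the scheme above, both the base distribution and the metric would come from lower-level QA systems; here they are replaced by stand-ins (integers and the identity function), and the function name is hypothetical.

```python
import math
import random


def quantilize(base_samples, metric, q, rng):
    """Sample uniformly from the top q fraction of base_samples under metric.

    q = 1 just samples from the base distribution; q close to 0 approaches
    pure maximization of the metric.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(base_samples, key=metric, reverse=True)
    k = max(1, math.ceil(q * len(ranked)))
    return rng.choice(ranked[:k])


rng = random.Random(0)
candidates = list(range(100))  # stand-in for samples from a safe distribution
choice = quantilize(candidates, metric=lambda x: x, q=0.25, rng=rng)
print(choice)
```

The design choice that matters is sampling from the top fraction instead of taking the argmax: this keeps the selected system close to the base distribution, which limits how hard an imperfect metric gets optimized.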

By having the tree reason about what good safe distributions are, and what good metrics are, we are explicitly improving the way that the AI system learns to interpret feedback (this is what the “good metric” is meant to evaluate), thus meeting the desiderata from the previous post.

To implement this in practice, we do something similar to iterated amplification. Iterated amplification approximates depth-limited HCH by maintaining a model that can answer arbitrary questions (even though each node is a single question); similarly here we maintain a model that has a distribution over QA systems (even though each node is a single QA system). Then, to sample from the amplified distribution, we sample two QA systems from the current distribution, ask one for a good safe distribution and the other for a good metric, and use quantilization to sample a new QA system given these ingredients. We use distillation to turn this slow quantilization process into a fast neural net model.

Considering the problem of inaccessible information (AN #104), the hope is that, as we amplify the QA system, we will eventually be able to approve of some safe reasoning process about inaccessible information. If this doesn’t happen, then it seems that no human reasoning could approve of reasoning about that inaccessible information, so we have done as well as possible.

Rohin's opinion: On feedback types: It seems like the scheme introduced here is relying quite strongly on the ability of humans to give good process-level feedback at arbitrarily high levels. It is not clear to me that this is something humans can do: it seems to me that when thinking at the meta level, humans often fail to think of important considerations that would be obvious in an object-level case. I think this could be a significant barrier to this scheme, though it’s hard to say without more concrete examples of what this looks like in practice.

On interaction: I’ve previously argued (AN #41) that it is important to get feedback online from the human; giving feedback “all at once” at the beginning is too hard to do well. However, the idealized algorithm here does have the feedback “all at once”. It’s possible that this is okay, if it is primarily process-level feedback, but it seems fairly worrying to me.

On desiderata: The desiderata introduced in the first post feel stronger than they need to be. It seems possible to specify a method of interpreting feedback that is good enough: it doesn’t exactly capture everything, but it gets it sufficiently correct that it results in good outcomes. This seems especially true when talking about process-level feedback, or feedback one meta level up -- as long as the AI system has learned an okay notion of “being helpful” or “being corrigible”, then it seems like we’re probably fine.

Often, just making feedback uncertain can help. For example, in the preference learning literature, Boltzmann rationality has emerged as the model of choice for how to interpret human feedback. While there are several theoretical justifications for this model, I suspect it is successful simply because it makes feedback uncertain: if you want to have a model that assigns higher likelihood to high-reward actions, but still assigns some probability to all actions, it seems like you end up choosing the Boltzmann model (or something functionally equivalent). Note that there is work trying to improve upon this model, such as by modeling humans as pedagogic, or by incorporating a notion of similarity (AN #96).
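As a concrete reference for the Boltzmann model mentioned above, here is a minimal sketch in which the probability of an action is proportional to exp(beta * reward). The action names, reward values, and the parameter name `beta` are assumptions chosen for illustration.

```python
import math


def boltzmann_policy(rewards, beta=1.0):
    """Return P(action) proportional to exp(beta * reward[action])."""
    # Subtracting the max reward avoids overflow and leaves the result unchanged.
    max_reward = max(rewards.values())
    weights = {a: math.exp(beta * (r - max_reward)) for a, r in rewards.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


rewards = {"good": 2.0, "okay": 1.0, "bad": 0.0}  # made-up rewards
print(boltzmann_policy(rewards, beta=1.0))
print(boltzmann_policy(rewards, beta=0.0))  # beta = 0 gives a uniform distribution
```

Higher-reward actions get more probability, but every action keeps nonzero probability, which is the "uncertain feedback" property discussed above.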

So overall, I don’t feel convinced that we need to aim for learning at all levels. That being said, the second post introduced a different argument: that the method does as well as we “could” do given the limits of human reasoning. I like this a lot more as a desideratum; it feels more achievable and more tied to what we care about.


I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.


An audio podcast version of the Alignment Newsletter, recorded by Robert Miles, is available.
