Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.
Audio version here (may not be up yet).
Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.
Cooperative AI: machines must learn to find common ground (Allan Dafoe et al) (summarized by Rohin): This short piece argues that rather than building autonomous AI systems (which typically involves a non-social environment), we should instead work on building AI systems that are able to promote mutually beneficial joint action, that is, we should work on Cooperative AI (AN #133). This can be separated into three main categories:
1. AI-AI cooperation: Here, two AI systems must cooperate with each other. Think for example of games like Hanabi or Diplomacy.
2. AI-human cooperation: This setting involves an AI system that must understand and work with a human. Assistance games (AN #69) are a central example. When there are multiple humans, it becomes important for our AI system to understand norms and institutions as well.
3. Human-human cooperation: Here, AI systems are used to enhance cooperation between humans. For example, machine translation helps people who speak different languages cooperate with each other.
There is now a new nonprofit, the Cooperative AI Foundation, that supports research on these topics.
Read more: Import AI #248
Rohin's opinion: I think there are three main sources of impact of this agenda from an x-risk perspective:
1. If an AI system has better agent-agnostic cooperative intelligence, it should be less likely to fall into “traps” of multiagent situations, such as conflict, bargaining failures (AN #86), or commitment races (AN #63).
2. If an AI system has better human-specific cooperative intelligence, it should be easier to (a) align that system with a human principal and (b) have that system cooperate with other humans (besides its principal).
3. If an AI system promotes cooperation between other agents (including humans), that seems to help with a variety of other major global problems, such as biorisk, nuclear war, and climate change.
I agree that all three of these things, if achieved, would make the world better in expectation (though see here (AN #52) for some arguments against). I feel best about (3), because it seems to have the largest surface area for improvement. In fact, (1) and (2) are almost special cases of (3), in which the AI system improves cooperation by being one of the agents that is cooperating with others (presumably on behalf of some human principal who wouldn't have done as good a job).
I am more uncertain about how best to achieve the goals laid out in (1) - (3). The article promotes multiagent reinforcement learning (MARL) for goal (1), which I think is plausible but not obvious: I could imagine that it would be better to (say) pretrain a large language model and then finetune it to be cooperative using human judgments. Within the Cooperative AI paradigm, I’m most excited about figuring out the best research bets to make in order to achieve the three goals above. The authors are very interested in finding others to help with this.
TECHNICAL AI ALIGNMENT
HANDLING GROUPS OF AGENTS
CLR's recent work on multi-agent systems (Jesse Clifton) (summarized by Rohin): This post summarizes recent work by the Center on Long-Term Risk (CLR). The general theme is cooperative AI, with a focus on research that helps avert s-risks. See this research agenda (AN #86) for an explanation of what might cause these s-risks. I’ve summarized some of the individual pieces of research below.
Weak identifiability and its consequences in strategic settings (Jesse Clifton) (summarized by Rohin): Inverse reinforcement learning suffers from the problem of unidentifiability: even given large amounts of data, it is not possible to uniquely recover the true reward function. This can lead to poor predictions if assumptions change (e.g. if there is some distributional shift, or if you are trying to correct for some bias like hyperbolic discounting).
This post demonstrates how a similar failure can affect multiagent settings as well, using the ultimatum game as an example. In the ultimatum game, there are two players: the Proposer and the Responder. The Proposer suggests a way to split $10 between the two players, and the Responder decides either to accept or reject the offer. If the offer is accepted, then the players get money according to the proposed split. If the offer is rejected, neither player gets anything.
Let’s suppose we get to observe how a particular Responder plays in an iterated ultimatum game, where we see as much data as we want. We figure out that the Responder will reject any split where it gets under $4. We could posit two explanations for this behavior:
1. Reputation-building: The Responder is building a reputation of refusing unfair splits (defined as a split where it gets < $4), so that it is offered better splits in the future.
2. Commitment: The Responder may have committed in advance to always refuse unfair splits (for the same reason, or perhaps because the Responder intrinsically dislikes unfair deals).
Note that both explanations perfectly account for all the data (no matter how much data we get).
Suppose the Responder has committed to rejecting unfair deals, but we incorrectly believe that it rejects them to build a reputation. Let’s say we now play a one-shot ultimatum game with the Responder. We reason that it no longer needs to build a reputation, and so it will accept a 9/1 split. However, it has in fact committed to rejecting unfair splits, and so it rejects our offer. The post also gives a mathematical formalization of this example.
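The ultimatum game example can be made concrete with a small sketch. This is an illustration only, not the formalization from the post: the two classes, the `future_rounds` flag, and the rule that a reputation-builder accepts any positive amount once no future rounds remain are all assumptions introduced here. It also assumes that every observed round had further rounds after it, which is why the two explanations cannot be told apart from the data.

```python
TOTAL = 10      # dollars to split, as in the summary
THRESHOLD = 4   # the Responder is observed to reject anything under $4


class ReputationBuilder:
    """Hypothetical model: rejects unfair offers only to earn better future offers."""

    def accepts(self, offer: int, future_rounds: bool) -> bool:
        if future_rounds:
            return offer >= THRESHOLD
        # With no future to protect, any positive amount beats getting nothing.
        return offer > 0


class Committed:
    """Hypothetical model: rejects unfair offers no matter what."""

    def accepts(self, offer: int, future_rounds: bool) -> bool:
        return offer >= THRESHOLD


def cheapest_accepted_offer(believed_model) -> int:
    """Smallest one-shot offer the Proposer expects to be accepted."""
    return min(
        o for o in range(TOTAL + 1)
        if believed_model.accepts(o, future_rounds=False)
    )


def proposer_payoff(offer: int, true_responder) -> int:
    """What the Proposer actually keeps in the one-shot game."""
    return TOTAL - offer if true_responder.accepts(offer, future_rounds=False) else 0


# Both models agree on every offer while the game is still being repeated,
# so no amount of iterated-game data separates them.
indistinguishable = all(
    ReputationBuilder().accepts(o, future_rounds=True)
    == Committed().accepts(o, future_rounds=True)
    for o in range(TOTAL + 1)
)

# Offer chosen under the wrong belief (the 9/1 split) and under the right one.
wrong_offer = cheapest_accepted_offer(ReputationBuilder())
right_offer = cheapest_accepted_offer(Committed())

print(indistinguishable)
print(wrong_offer, proposer_payoff(wrong_offer, Committed()))
print(right_offer, proposer_payoff(right_offer, Committed()))
```

Under these assumptions, the Proposer who believes in reputation-building offers the Responder $1 and ends up with nothing, while a Proposer with the correct model offers $4 and keeps $6.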
Collaborative game specification: arriving at common models in bargaining (Jesse Clifton) (summarized by Rohin): A major challenge in cooperative AI is when agents aren’t even playing the same game: perhaps they have very different priors (AN #94), or before communication is established they make conflicting commitments (AN #63), or their opponent’s strategy is unidentifiable (previous summary). This sort of misspecification can lead to arbitrarily bad outcomes.
This post proposes a simple solution called collaborative game specification (CGS). In CGS, we simply add an initial phase in which the agents talk to each other and determine a shared model of the game being played. The agents then act according to the equilibrium in that game. (Choosing an appropriate equilibrium notion can be part of the talking phase.)
There is of course an incentive for each agent to lie about their model of the game, in order to get an outcome more favorable to them. In order to combat this incentive, agents must also refuse to continue with CGS if the other player’s model is too different from their own (which is some evidence that they are lying in order to get a better outcome for themselves).
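A rough sketch of how the CGS phases might fit together in a two-player matrix game follows. Everything specific here is an assumption made for illustration rather than something taken from the post: the payoff-table representation, the maximum payoff disagreement as the distance between models, averaging as the way to form the shared model, and pure-strategy Nash equilibrium with a welfare tie-break as the equilibrium notion.

```python
from typing import Dict, List, Optional, Tuple

JointAction = Tuple[str, str]
# A game model maps each joint action to (payoff to agent A, payoff to agent B).
GameModel = Dict[JointAction, Tuple[float, float]]


def model_distance(m1: GameModel, m2: GameModel) -> float:
    """Largest absolute disagreement about any single payoff (assumed metric)."""
    return max(
        max(abs(m1[k][0] - m2[k][0]), abs(m1[k][1] - m2[k][1]))
        for k in m1
    )


def pure_nash_equilibria(model: GameModel) -> List[JointAction]:
    """Joint actions from which neither agent gains by deviating alone."""
    a_actions = sorted({a for a, _ in model})
    b_actions = sorted({b for _, b in model})
    equilibria = []
    for a in a_actions:
        for b in b_actions:
            ua, ub = model[(a, b)]
            a_ok = all(model[(a2, b)][0] <= ua for a2 in a_actions)
            b_ok = all(model[(a, b2)][1] <= ub for b2 in b_actions)
            if a_ok and b_ok:
                equilibria.append((a, b))
    return equilibria


def collaborative_game_specification(
    model_a: GameModel, model_b: GameModel, tolerance: float
) -> Optional[JointAction]:
    """Return the agreed joint action, or None if the agents abandon CGS."""
    # Refuse to continue if the reported models are too far apart,
    # since a large gap is some evidence of misreporting.
    if set(model_a) != set(model_b) or model_distance(model_a, model_b) > tolerance:
        return None
    # Form a shared model by averaging the two reported payoff tables.
    shared = {
        k: ((model_a[k][0] + model_b[k][0]) / 2, (model_a[k][1] + model_b[k][1]) / 2)
        for k in model_a
    }
    equilibria = pure_nash_equilibria(shared)
    if not equilibria:
        return None
    # Assumed selection rule: pick the equilibrium with the highest total payoff.
    return max(equilibria, key=lambda joint: sum(shared[joint]))


base: GameModel = {
    ("cooperate", "cooperate"): (4, 4),
    ("cooperate", "defect"): (0, 3),
    ("defect", "cooperate"): (3, 0),
    ("defect", "defect"): (2, 2),
}
similar = dict(base)
similar[("cooperate", "cooperate")] = (4, 5)   # small, honest-looking disagreement
suspicious = dict(base)
suspicious[("cooperate", "defect")] = (0, 9)   # large gap that favors agent B

agreed = collaborative_game_specification(base, similar, tolerance=1.5)
refused = collaborative_game_specification(base, suspicious, tolerance=1.5)
print(agreed, refused)
```

The tolerance controls the trade-off described above: a tight tolerance makes lying less profitable but also causes CGS to be abandoned more often when the agents honestly disagree.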
Pitfalls of the agent model (Alex Flint) (summarized by Rohin): It is common to view AI systems through the “agent lens”, in which the AI system implements a fixed, unchanging policy that, given some observations, takes some actions. This post points out several ways in which this “fixed, unchanging policy” assumption can lead us astray.
For example, AI designers may assume that the AI systems they build must have unchanging decision algorithms, and therefore believe that there will be a specific point at which influence is “handed off” to the AI system, before which we have to solve a wide array of philosophical and technical problems.
International Control of Powerful Technology: Lessons from the Baruch Plan for Nuclear Weapons (Waqar Zaidi et al) (summarized by Flo): This paper explores the analogy between early attempts at the international control of nuclear technology and the international control of today's emerging powerful technologies such as AI. While nuclear technology was perceived as very powerful, and many considered it an existential risk, there was also substantial uncertainty about its impacts. In addition, nuclear technology relied on rapid scientific development and engendered national competition, negotiation, and arms race dynamics. Lastly, there was a lot of policy debate about the political and ethical aspects of nuclear technology, including discussions on international governance ("international control").
The authors provide ten lessons from the history of nuclear control and support them with evidence from various case studies:
- Radical proposals might be discussed seriously or even adopted as official policy in the light of very disruptive technologies and upheavals in international politics.
- Actors' support for or opposition to international control can be influenced by opportunism, and their positions might shift over time. Thus, it is particularly important to build broad coalitions.
- In particular, schemes for international governance are sometimes supported by "realists" focussed on power politics, but such support is often fickle.
- Secrecy around powerful technologies plays an important role and can be abused by actors controlling information flow within a country. Secrecy should only be expanded with care, and policymakers need to ensure they are informed by a wide range of perspectives.
- Public opinion has an important effect on debates about international control, and elites benefit from trying to shape it.
- Technical experts can influence policy to be more effective and cooperative, but they need to understand the political landscape.
- Policymaking often does not involve grand strategy; instead, it can better be described as muddling through, even in the realm of international control.
- International control is difficult, and it is unclear whether strategic obstacles can be circumvented.
- International cooperation can require countries to take substantial risk. It is important for international cooperation advocates to understand these risks and point out avenues for mitigation.
- Even maximally destructive solutions like preventive strikes can get political traction.
However, there are also important differences between nuclear technology and AI or other emerging technologies that have to be kept in mind. First, AI involves less secrecy and relies more heavily on the private sector. In addition, AI is already used around the world, and its use is harder to detect, such that proliferation might be harder to prevent. Lastly, the dangers of AI are less obvious, accidents might be harder to manage, and the strategic advantage from advanced AI might not plateau, as it did for nuclear weapons once second-strike capabilities were achieved. More broadly, the historical context regarding nuclear technology was influenced by WW2, the visceral examples of Hiroshima and Nagasaki, as well as a less globalized world with stronger zero-sum dynamics between superpowers.
Flo's opinion: I enjoyed reading this paper, and the case studies provided ample context and detail to flesh out the lessons. While the paper might be a bit broad and not entirely AI-specific, I do recommend reading it if you are interested in international cooperation around AI.
OTHER PROGRESS IN AI
Muppet: Massive Multi-task Representations with Pre-Finetuning (Armen Aghajanyan et al) (summarized by Rohin): This paper proposes pre-finetuning: given a language model pretrained on a large dataset, we do a second stage where we train the model to solve a large variety of tasks (around 50 in this paper), and only after that do we finetune the model on our actual task of interest. The authors show that this leads to improved results, especially on tasks where we only have limited data.
I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.
An audio podcast version of the Alignment Newsletter, recorded by Robert Miles, is available.