Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I think this expands on the points made in the recently completed Garrabrant / Demski Embedded Agency sequence. It also connects a paper I wrote recently, which discusses mostly non-AI risks from multiple agents and expands on last year's work on Goodhart's Law, back to the deeper questions that MIRI is considering. Lastly, it tries to point out a bit of how all of this connects to some of the other streams of AI safety research.

Juggling Models

We don't know how to make agents contain a complete world model that includes themselves. That's a hard enough problem, but the problem could get much harder - and in some applications it already has. When multiple agents need to have world models, the discrepancy between the model and reality can have some nasty feedback effects that relate to Goodhart's law, which I am now referring to more generally as overoptimization failures.

In my recent paper, I discuss the problem when multiple agents interact, using poker as a motivating example. Each poker-playing agent needs to have a (simplified) model of the game in order to play (somewhat) optimally. Reasonable heuristics and Machine Learning already achieve super-human performance in "heads-up" (2-player) poker. But the general case of multi-player poker is a huge game, so the game gets simplified.

This is exactly the case where we can transition just a little bit from the world of easy decision theory, which Abram and Scott point out allows modeling "the agent and the environment as separate units which interact over time through clearly defined i/o channels," to the world of not embedded agents, but interacting agents. This moves just a little bit in the direction of "we don't know how to do this."

This partial transition happens because the agent must have some model of the decision process of the other players in order to play strategically. In that model, agents need to represent what those players will do not only in reaction to the cards, but in reaction to the bets the agent places. To do this optimally, they need a model of the other player's (perhaps implicit) model of the agent. And building models of other players' models seems very closely related to work like Andrew Critch's paper on Löb's Theorem and Cooperation.
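One way to picture this nesting of models is the "level-k" reasoning used in behavioral game theory. The sketch below is only an illustration under assumptions that are not in the paper: it uses the "guess 2/3 of the average" game rather than poker, assumes a level-0 player naively guesses 50, and assumes a level-k player treats every opponent as level k-1 and ignores its own effect on the average. The function name `level_k_guess` is hypothetical.

```python
def level_k_guess(k, level0_guess=50.0, factor=2.0 / 3.0):
    """Guess of a level-k player in the 'guess factor * average' game.

    Assumption: a level-k player believes all opponents are level k-1,
    and (with many players) best-responds by guessing factor * their guess.
    """
    guess = level0_guess
    for _ in range(k):
        # One more layer of "my model of your model of me".
        guess = factor * guess
    return guess


if __name__ == "__main__":
    for depth in range(5):
        print(depth, round(level_k_guess(depth), 2))
```

Each extra level is another model of a model. A real player has to stop at some finite depth, and how deep the other players stop is itself something it has to guess.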

That explains why I claim that building models of complex agents that have models of you that then need models of them, etc. is going to be related to some of the same issues that embedded agents face, even without the need to deal with some of the harder parts of self-knowledge of agents that self-modify.

Game theory "answers" this, but it cheated.

The obvious way to model interaction is with game theory, which makes a couple of seemingly innocuous simplifying assumptions. The problem is that these assumptions are impossible to satisfy in practice.

The first is that the agents are rational and Bayesian. But as Chris Sims pointed out, there are no real Bayesians. ("Not that there’s something better out there.")

• There are fewer than 2 truly Bayesian chess players (probably none).
• We know the optimal form of the decision rule when two such players play each other: Either white resigns, black resigns, or they agree on a draw, all before the first move.
• But picking which of these three is the right rule requires computations that are not yet complete.

This is (kind of) a point that Abram and Scott made in the sequence in disguise - that world models are always smaller than the agents.
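The chess point can be made concrete with a game small enough to solve. The sketch below is an illustration, not something taken from Sims or from the sequence: it assumes a hypothetical subtraction game (take 1 to 3 stones from a pile, and whoever takes the last stone wins), and the function name `mover_wins` is made up here. Backward induction settles the outcome before the first move, which is exactly the computation that is out of reach for chess.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def mover_wins(stones, max_take=3):
    """True if the player about to move can force a win with perfect play."""
    if stones == 0:
        # No stones left: the previous player took the last one and won.
        return False
    # The mover wins if some legal move leaves the opponent in a losing position.
    return any(
        not mover_wins(stones - take, max_take)
        for take in range(1, min(max_take, stones) + 1)
    )


if __name__ == "__main__":
    for pile in range(1, 9):
        verdict = "first player wins" if mover_wins(pile) else "first player should resign"
        print(pile, verdict)
```

For this toy game the table of verdicts fits in memory, so two ideal players never need to play. The same recursion applied to chess is well defined but far too large to run, so no real agent can hold that model.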

The second assumption is that agents have common knowledge of both agents' objective functions. (Ben Pace points out how hard that assumption is to realize in practice. And yes, you can avoid this assumption by specifying that they have uncertainty of a defined form, but that just kicks the can down the road - how do you know what distributions to use? What happens if the agent's true utility is outside the hypothesis space?) If the models of the agents must be small, however, it is possible that they cannot have a complete model of the other agent's preferences.
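To see why uncertainty "of a defined form" only moves the problem, here is a minimal sketch of Bayesian updating when the truth lies outside the hypothesis space. Everything in it is an assumption made for illustration: the other agent's preference is reduced to a single number (how often it picks option A), the observer only entertains the rates 0.2 and 0.5, and the observed counts (90 A's, 10 B's) are invented to stand in for a true rate near 0.9. The function name `posterior` is hypothetical.

```python
import math


def posterior(hypotheses, prior, n_a, n_b):
    """Posterior over candidate P(choose A) values after n_a A's and n_b B's."""
    log_weights = [
        math.log(p) + n_a * math.log(h) + n_b * math.log(1.0 - h)
        for h, p in zip(hypotheses, prior)
    ]
    # Subtract the maximum before exponentiating, for numerical stability.
    top = max(log_weights)
    weights = [math.exp(lw - top) for lw in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Invented data: the agent chose A 90 times out of 100,
    # but the observer's hypothesis space contains only 0.2 and 0.5.
    beliefs = posterior([0.2, 0.5], [0.5, 0.5], n_a=90, n_b=10)
    print(beliefs)
```

The observer ends up almost certain of the 0.5 hypothesis, because it is the least bad option available, and nothing in the update signals that both candidates are wrong.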

It's a bit of a side point for the embedded agents discussion, but breaking this second assumption is what allows for a series of overoptimization exploitations explored in the new paper. Some of these, like accidental steering and coordination failures, are worrying for AI alignment because they pose challenges even for cooperating agents. Others, like adversarial misalignment, input spoofing and filtering, and goal co-option, arise only in the adversarial case, but can still matter if we are concerned about subsystem alignment. And the last category, direct hacking, gets into many of the even harder problems of embedded agents.

Embedded agents, exploitation and ending.

As I just noted, one class of issues that embedded agents have that traditional dichotomous agents do not is direct interference. If an agent hacks the software another agent is running on, there are many obvious exploits to worry about. This can't easily happen with a defined channel. (But to digress, they still do happen in such defined channels. This is because people without security mindset keep building Turing-complete languages into the communication interfaces, instead of doing #LangSec properly.)

But for embedded agents the types of exploitation we need to worry about are even more general. Decision theory with embedded world models is obviously critical for Embedded Agency work, but I think it's also critical for value alignment, since "goal inference" in practice requires inferring some baseline shared human value system from incoherent groups. (Whether or not the individual agents are incoherent.) This is in many ways a multi-agent cooperation problem - and even if we want to cooperate and share goals, and we already agreed that we should do so, cooperation can fall prey to accidental steering and coordination failures.

Lastly, Paul Christiano's Iterated Amplification approach, which in part relies on small agents cooperating, seems to need to deal with this even more explicitly. But I'm still thinking about the connections between these problems and the ones his approach takes on, so I'll wait until his sequence is finished, and until I've had time to think about it, before commenting further.

3 comments

It's great to see work on the multiagent setting! This setting does seem quite a bit more complex, and hasn't been explored very much to my knowledge.

One major question I have after reading this post and the associated paper is, how does this relate to the work already done in academia? Sure, game theory makes unrealistic assumptions, but these assumptions don't seem horribly wrong when applied to simplified settings when we do have a good model. For example, with overfishing, even if everyone knew exactly what would happen if they overfished, the problem would still arise (and it's not unreasonable to think that everyone could know exactly what would happen if they overfished). I know that in your setting, the agents don't know about what other agents are doing, which makes it different from a classic tragedy of the commons, but I'd be surprised if this hadn't been studied before.

Even if you dislike the assumptions of game theory, I'm sure political science, law, etc. have tackled these sorts of situations, and they aren't going to make the Bayesian/rational assumption unless it's a reasonable model. (Or at least, if it isn't a reasonable model, people will call them out on it and probably develop something better.)

To give my quick takes on how each of the failure modes is related to existing academic work: Accidental steering is novel to me (but I wouldn't be surprised if there has been work on it), coordination failures seem like a particular kind of (large scale) prisoner's dilemma, adversarial misalignment is a special case of the principal-agent problem, and input spoofing and filtering and goal co-option seem like special cases of adversarial misalignment (and are related to ML security as you pointed out).

Yes, there is a ton of work on some of these in certain settings, and I'm familiar with some of it.

In fact, the connections are so manifold that I suspect it would be useful to lay out which of these connections seem useful, in another paper, if only to save other people the time and energy of trying to do the same and finding dead ends. On reflection, however, I'm concerned about how big a project this would become, and I am unsure how useful it would be to applied work in AI coordination.

Just as one rabbit hole to go down, there is a tremendous amount of work on cooperation, which spans several very different literatures. The most relevant work, to display my own obvious academic bias, seems to be from public policy and economics, and includes work on participatory decision making and cooperative models for managing resources. Next, you mentioned law - I know there is work on interest-based negotiation, where defining the goals clearly allows better solutions, as well as work on mediation. In business, there is work on team-building that touches on these points, as well as inter-group and inter-firm competition and cooperation, which touch on related work in economics. I know the work on principal-agent problems, as well as game theory applied to more realistic scenarios. (Game theorists I've spoken with have noted the fragility of solutions to very minor changes in the problem, which is why it's rarely applied.) There's work in evolutionary theory, as well as systems biology, that touches on some of these points. Social psychology, anthropology, and sociology all presumably have literatures on the topic as well, but I'm not at all familiar with them.

Agreed that finding all connections would be a big project, but I think that anyone who tries to build off your work will either have to do the literature search anyway (at least for the parts they want to work on), or will end up reinventing the wheel. Perhaps you could find one or two literature review papers for each topic and cite those? I would imagine that you could get most of the value by finding ~20 such papers, which while not an easy task for fields you aren't familiar with, should still be doable with tens of hours of effort, and hopefully those papers would be useful for your own thinking.