rohinmshah

PhD student at the Center for Human-Compatible AI. Creator of the Alignment Newsletter. http://rohinshah.com/

rohinmshah's Comments

Equilibrium and prior selection problems in multipolar deployment
I think there are theorems to be proven, just not of the form "there is an optimal thing to do"

I meant one thing and wrote another; I just meant to say that there weren't theorems in this post.

If the CAIS view is right, multi-agent setups like this could be inevitable.

My point is just that "prior / equilibrium selection problem" is a subset of the "you don't know everything about the other player" problem, which I think you agree with?

It's also, to a first approximation, the strategy society takes in lots of situations; this happens whenever people form teams with a common goal. There are usually processes of re-negotiating the goal, but between these times of conflict people gain a lot of efficiency by working together and punishing deviation.

I'm not sure how this relates to the thing I'm saying (I'm also not sure if I understood it).

Equilibrium and prior selection problems in multipolar deployment
I’m not sure what you mean by “heuristic” or “optimality” here. I don’t know of any good notion of optimality which is independent of the other players, which is why there is an equilibrium selection problem.

I think once you settle on a "simple" welfare function, it is possible that there are _no_ Nash equilibria such that the agents are optimizing that welfare function (I don't even really know what it means to optimize the welfare function, given that you have to also punish the opponent, which isn't an action that is useful for the welfare function).

I’m not sure if you mean “there aren’t any theorems to be proven” or “any theorem that’s proven in this framework would be useless”.

Hmm, I meant one thing and wrote another. I meant to say "there aren't any theorems proven in this post".

Equilibrium and prior selection problems in multipolar deployment

Ah, I misunderstood your post. I thought you were arguing for problems conditional on the principals agreeing on the welfare function to be optimized, and having common knowledge that they were designing agents that optimize that welfare function.

but there's the additional point that in the case of principals designing AI agents, the principals can (in theory) coordinate to ensure that the agents "know who their partner is".

I mean, in this case you just deploy one agent instead of two. Even under the constraint that you must deploy two agents, you can exactly coordinate their priors / which equilibria they fall into. To get prior / equilibrium selection problems, you necessarily need agents that don't know who their partner is. (Even if just one agent knows who the partner is, outcomes should be expected to be relatively good, though not optimal; e.g. if everything is deterministic, then threats are never executed.)

----

Looking at these objections, I think probably what you were imagining is a game where the principals have different terminal goals, but they coordinate by doing the following:

  • Agreeing upon a joint welfare function that is "fair" to the principals. In particular, this means that they agree that they are "licensed" to punish actions that deviate from this welfare function.
  • Going off and building their own agents that optimize the welfare function, but make sure to punish deviations (to ensure that the other principal doesn't build an agent that pursues that principal's own goals instead of the welfare function).

New planned summary:

Consider the scenario in which two principals with different terminal goals will separately develop and deploy learning agents that will then act on their behalf. Let us call this a _learning game_, in which the "players" are the principals, and the actions are the agents developed.
One strategy for this game is for the principals to first agree on a "fair" joint welfare function, such that they and their agents are then licensed to punish the other agent if it takes actions that deviate from this welfare function. Ideally, this would lead to the agents jointly optimizing the welfare function (while being on the lookout for defection).
There still remain two coordination problems. First, there is an _equilibrium selection problem_: if the two deployed learning agents are Nash strategies from _different_ equilibria, payoffs can be arbitrarily bad. Second, there is a _prior selection problem_: given that there are many reasonable priors that the learning agents could have, if they end up with different priors from each other, outcomes can again be quite bad, especially in the context of <@threats@>(@Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda@).
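
To make the equilibrium selection problem concrete, here is a toy sketch of my own (the game and payoffs are made up, not taken from the post): a 2x2 coordination game with two pure-strategy Nash equilibria. If each principal's agent plays its part of a _different_ equilibrium, both do worse than they would in either equilibrium.

```python
# Toy illustration of the equilibrium selection problem (made-up payoffs, not from the post).
# A 2x2 coordination game with two pure-strategy Nash equilibria, (A, A) and (B, B).
payoffs = {
    ("A", "A"): (2, 2),    # one Nash equilibrium
    ("B", "B"): (1, 1),    # a second Nash equilibrium
    ("A", "B"): (-5, -5),  # miscoordination is costly for both players
    ("B", "A"): (-5, -5),
}

# Each principal deploys an agent trained against a *different* equilibrium.
agent_1 = "A"  # agent 1 plays its part of the (A, A) equilibrium
agent_2 = "B"  # agent 2 plays its part of the (B, B) equilibrium

print(payoffs[(agent_1, agent_2)])  # (-5, -5): worse for both than either equilibrium
```

The prior selection problem has the same structure: each agent best-responds to its belief about its partner, and mismatched priors can be just as costly as mismatched equilibria.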

New opinion:

These are indeed pretty hard problems in any non-competitive game. While this post takes the framing of considering optimal principals and/or agents (and so considers Bayesian strategies in which only the prior and choice of equilibrium are free variables), I prefer the framing taken in <@our paper@>(@Collaborating with Humans Requires Understanding Them@): the issue is primarily that the optimal thing for you to do depends strongly on who your partner is, but you may not have a good understanding of who your partner is, and if you're wrong you can do arbitrarily poorly.
Note that when you can have a well-specified Bayesian belief over your partner, these problems don't arise. However, both agents can't be in this situation: in this case agent A would have a belief over B that has a belief over A; if these are all well-specified Bayesian beliefs, then A has a Bayesian belief over itself, which is usually impossible.

Btw, some reasons I prefer not using priors / equilibria and instead prefer just saying "you don't know who your partner is":

  • It encourages solutions that take advantage of optimality and so won't work in the situations we actually face.
  • The formality of "priors / equilibria" doesn't have any benefit in this case (there aren't any theorems to be proven). The one benefit I see is that it signals that "no, even if we formalize it, the problem doesn't go away", to those people who think that once formalized sufficiently all problems go away via the magic of Bayesian reasoning.
  • The strategy of agreeing on a joint welfare function is already a heuristic and isn't an optimal strategy; it feels very weird to suppose that initially a heuristic is used and then we suddenly switch to pure optimality.
Equilibrium and prior selection problems in multipolar deployment

Planned summary for the Alignment Newsletter:

Consider the scenario in which two principals will separately develop and deploy learning agents that will then act on their behalf, and suppose further that they even agree on the welfare function that these agents should optimize. Let us call this a _learning game_, in which the "players" are the principals, the actions are the agents developed, and both players want to optimize the welfare function (making it a collaborative game). There still remain two coordination problems. First, we face an _equilibrium selection problem_: there can be multiple Nash equilibria in a collaborative game, and so if the two deployed learning agents are Nash strategies from _different_ equilibria, payoffs can be arbitrarily bad. Second, we face a _prior selection problem_: given that there are many reasonable priors that the learning agents could have, if they end up with different priors from each other, outcomes can again be quite bad, especially in the context of <@threats@>(@Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda@).

Planned opinion:

These are indeed pretty hard problems in any collaborative game. While this post takes the framing of considering optimal principals and/or agents (and so considers Bayesian strategies in which only the prior and choice of equilibrium are free variables), I prefer the framing taken in <@our paper@>(@Collaborating with Humans Requires Understanding Them@): the issue is primarily that in a collaborative game, the optimal thing for you to do depends strongly on who your partner is, but you may not have a good understanding of who your partner is, and if you're wrong you can do arbitrarily poorly.
Note that when you can have a well-specified Bayesian belief over your partner, these problems don't arise. However, both agents can't be in this situation: in this case agent A would have a belief over B that has a belief over A; if these are all well-specified Bayesian beliefs, then A has a Bayesian belief over itself, which is impossible.
Openness Norms in AGI Development

Planned summary for the Alignment Newsletter:

This post summarizes two papers that provide models of why scientific research tends to be so open, and then applies them to the development of powerful AI systems. The first paper models science as a series of discoveries, in which the first academic group to reach a discovery gets all the credit for it. It shows that for a few different models of info-sharing, info-sharing helps everyone reach the discovery sooner, but doesn't change the probabilities for who makes the discovery first (called _race-clinching probabilities_): as a result, sharing all information is a better strategy than sharing none (and is easier to coordinate on than the possibly-better strategy of sharing just some information).
However, this theorem doesn't apply when info sharing compresses the discovery probabilities _unequally_ across actors: in this case, the race-clinching probabilities _do_ change, and the group whose probability would go down is instead incentivized to keep information secret (which then causes everyone else to keep their information secret). This could be good news: it suggests that actors are incentivized to share safety research (which probably doesn't affect race-clinching probabilities) while keeping capabilities research secret (thereby leading to longer timelines).
The second paper assumes that scientists are competing to complete a k-stage project, and whenever they publish, they get credit for all the stages they completed that were not yet published by anyone else. It also assumes that earlier stages have a higher credit-to-difficulty ratio (where difficulty can be different across scientists). It finds that under this setting scientists are incentivized to publish whenever possible. For AI development, this seems not to be too relevant: we should expect that with powerful AI systems, most of the "credit" (profit) comes from the last few stages, where it is possible to deploy the AI system to earn money.
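
As a rough illustration of the race-clinching point, here is a toy Monte Carlo sketch of my own (it assumes exponential discovery times, and is not the paper's actual model): if info-sharing speeds up both groups' discovery rates by the same factor, everyone discovers sooner but the race-clinching probabilities don't move; an unequal speed-up shifts them.

```python
# Toy Monte Carlo sketch (my own assumption of exponential discovery times,
# not the paper's model) of race-clinching probabilities under info-sharing.
import random

def race(rate_a, rate_b, trials=200_000):
    a_wins, total_time = 0, 0.0
    for _ in range(trials):
        t_a = random.expovariate(rate_a)  # group A's time to the discovery
        t_b = random.expovariate(rate_b)  # group B's time to the discovery
        a_wins += t_a < t_b
        total_time += min(t_a, t_b)
    return a_wins / trials, total_time / trials

print(race(1.0, 2.0))  # no sharing:       P(A first) ~ 0.33, mean time to discovery ~ 0.33
print(race(3.0, 6.0))  # equal speed-up:   P(A first) ~ 0.33, mean time ~ 0.11 (sooner, same race odds)
print(race(3.0, 4.0))  # unequal speed-up: P(A first) ~ 0.43 -- the race-clinching probability shifts
```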

Planned opinion:

I enjoyed this post a lot; the question of openness in AI research is an important one that depends on both the scientific community and industry practice. The scientific community is extremely open, and the second paper especially seems to capture well the reason why. In contrast, industry is often more secretive (plausibly due to <@patents@>(@Who owns artificial intelligence? A preliminary analysis of corporate intellectual property strategies and why they matter@)). To the extent that we would like to change one community in the direction of the other, a good first step is to understand their incentives so that we can then try to change those incentives.
Rohin Shah on reasons for AI optimism
If so, people may have to stop citing you as an "optimist"

I wouldn't be surprised if the median number from MIRI researchers was around 50%. I think the people who cite me as an optimist are people with those background beliefs. I think even at 5% I'd fall on the pessimistic side at FHI (though certainly not the most pessimistic, e.g. Toby is more pessimistic than I am).

Rohin Shah on reasons for AI optimism

Suppose you have two events X and Y, such that X causes Y, that is, if not-X were true then not-Y would also be true.

Now suppose there's some Y' analogous to Y, and you make the argument A: "since Y happened, Y' is also likely to happen". If that's all you know, I agree that A is reasonable evidence that Y' is likely to happen. But if you then show that the analogous X' is not true, while X was true, I think argument A provides ~no evidence.

Example:

"It was raining yesterday, so it will probably rain today."

"But it was cloudy yesterday, and today it is sunny."

"Ah. In that case it probably won't rain."

I think condition 2 causes racing, which causes MAD strategies, in the case of nuclear weapons; since condition 2 / racing doesn't hold in the case of AI, the fact that MAD strategies were used for nuclear weapons provides very little evidence about whether similar strategies will be used for AI.

MAD strategies could still serve as some evidence for the general idea that countries/institutions are sometimes willing to do things that are risky to themselves, and that pose very large negative externalities of risks to others, for strategic reasons.

I agree with that sentence interpreted literally. But I think you can change "for strategic reasons" to "in cases where condition 2 holds" and still capture most of the cases in which this happens.

Rohin Shah on reasons for AI optimism
Are you thinking roughly that (a) returns diminish steeply from the current point, or (b) that effort will likely ramp up a lot in future and pluck a large quantity of the low hanging fruit that currently remain, such that even more ramping up would face steeply diminishing returns?

More like (b) than (a). In particular, I'm thinking of lots of additional effort by longtermists, which probably doesn't result in lots of additional effort by everyone else, which already means that we're scaling sublinearly. In addition, you should then expect diminishing marginal returns to more research, which lessens it even more.

Also, I was thinking about this recently, and I am pretty pessimistic about worlds with discontinuous takeoff, which should maybe add another ~5 percentage points to my risk estimate conditional on no intervention by longtermists, and ~4 percentage points to my unconditional risk estimate.
