[I work at CAIF and CLR]
Thanks for this!
I recommend making it clearer that CAIF is not focused on s-risk and is not formally affiliated with CLR (except for overlap in personnel). While it’s true that there is significant overlap in CLR’s and CAIF’s research interests, CAIF’s mission is much broader than CLR’s (“improve the cooperative intelligence of advanced AI for the benefit of all”), and its founders + leadership are motivated by a variety of catastrophic risks from AI.
Also, “foundational game theory research” isn’t an accurate description of CAIF’s s...
A few thoughts on this part:
...I guess [coordination failures between AIs] feels like mainly the type of thing that we can outsource to AIs, once they’re sufficiently capable. I don’t see a particularly strong reason to think that systems that are comparably powerful as humans, or more powerful than humans, are going to make obvious mistakes in how they coordinate. You have this framing of AI coordination. We could also just say politics, right? Like we think that geopolitics is going to be hard in a world where AIs exist. And when you have that framing, y
We are now using a new definition of s-risks. I've edited this post to reflect the change.
New definition:
S-risks are risks of events that bring about suffering in cosmically significant amounts. By “significant”, we mean significant relative to expected future suffering.
Note that it may turn out that the amount of suffering that we can influence is dwarfed by suffering that we can’t influence. By “expectation of suffering in the future” we mean “expectation of action-relevant suffering in the future”.
Ok, thanks for that. I’d guess then that I’m more uncertain than you about whether human leadership would delegate to systems who would fail to accurately forecast catastrophe.
It’s possible that human leadership just reasons poorly about whether their systems are competent in this domain. For instance, they may observe that their systems perform well in lots of other domains, and incorrectly reason that “well, these systems are better than us in many domains, so they must be better in this one, too”. Eagerness to deploy before a more thorough investigation...
The US and China might well wreck the world by knowingly taking gargantuan risks even if both had aligned AI advisors, although I think they likely wouldn't.
But what I'm saying is really hard to do is to make the scenarios in the OP (with competition among individual corporate boards and the like) occur without extreme failure of 1-to-1 alignment
I'm not sure I understand yet. For example, here’s a version of Flash War that happens seemingly without either the principals knowingly taking gargantuan risks or extreme intent-alignment failure.
The pri
Yeah I agree the details aren’t clear. Hopefully your conditional commitment can be made flexible enough that it leaves you open to being convinced by agents who have good reasons for refusing to do this world-model agreement thing. It’s certainly not clear to me how one could do this. If you had some trusted “deliberation module”, which engages in open-ended generation and scrutiny of arguments, then maybe you could make a commitment of the form “use this protocol, unless my counterpart provides reasons which cause my deliberation module to be convinced o...
It seems like we can kind of separate the problem of equilibrium selection from the problem of “thinking more”, if “thinking more” just means refining one’s world models and credences over them. One can make conditional commitments of the form: “When I encounter future bargaining partners, we will (based on our models at that time) agree on a world-model according to some protocol and apply some solution concept (e.g. Nash or Kalai-Smorodinsky) to it in order to arrive at an agreement.”
The set of solution concepts you commit to regarding as acceptable stil...
Nice post! I’m excited to see more attention being paid to multi-agent stuff recently.
A few miscellaneous points:
I get the impression that the added complexity of multi- relative to single-agent systems has not been adequately factored into folks’ thinking about timelines / the difficulty of making AGI that is competent in a multipolar world. But I’m not confident in that.
I think it’s possible that conflict / bargaining failure is a considerable source of existential risk, in addition to suffering risk. I don’t really have a view on how it compares t
Neat post, I think this is an important distinction. It seems right that more homogeneity means less risk of bargaining failure, though I’m not sure yet how much.
Cooperation and coordination between different AIs is likely to be very easy as they are likely to be very structurally similar to each other if not share basically all of the same weights
In what ways does having similar architectures or weights help with cooperation between agents with different goals? A few things that come to mind:
Makes sense. Though you could have deliberate coordinated training even after deployment. For instance, I'm particularly interested in the question of "how will agents learn to interact in high stakes circumstances which they will rarely encounter?" One could imagine the overseers of AI systems coordinating to fine-tune their systems in simulations of such encounters even after deployment. Not sure how plausible that is though.
I don't think bayesianism gives you particular insight into that for the same reasons I don't think it gives you particular insight into human cognition
In the areas I focus on, at least, I wouldn’t know where to start if I couldn’t model agents using Bayesian tools. Game-theoretic concepts like social dilemma, equilibrium selection, costly signaling, and so on seem indispensable, and you can’t state these crisply without a formal model of preferences and beliefs. You might disagree that these are useful concepts, but at this point I feel like the argume
...I agree with the rejection of strong Bayesianism. I don’t think it follows from what you’ve written, though, that “bayesianism is not very useful as a conceptual framework for thinking either about AGI or human reasoning”.
I'm probably just echoing things that have been said many times before, but:
You seem to set up a dichotomy between two uses of Bayesianism: modeling agents as doing something like "approximate Solomonoff induction", and Bayesianism as just another tool in our statistical toolkit. But there is a third use of Bayesianism, the way that sophi
...The new summary looks good =) Although I second Michael Dennis' comment below, that the infinite regress of priors is avoided in standard game theory by specifying a common prior. Indeed the specification of this prior leads to a prior selection problem.
The formality of "priors / equilibria" doesn't have any benefit in this case (there aren't any theorems to be proven)
I’m not sure if you mean “there aren’t any theorems to be proven” or “any theorem that’s proven in this framework would be useless”. The former is false, e.g. there are things to prove ab
...both players want to optimize the welfare function (making it a collaborative game)
The game is collaborative in the sense that a welfare function is optimized in equilibrium, but the principals will in general have different terminal goals (reward functions) and the equilibrium will be enforced with punishments (cf. tit-for-tat).
...the issue is primarily that in a collaborative game, the optimal thing for you to do depends strongly on who your partner is, but you may not have a good understanding of who your partner is, and if you're wrong you can do arb
Chimpanzees, crows, and dolphins are capable of impressive feats of higher intelligence, and I don’t think there’s any particular reason to think that Neanderthals are capable of doing anything qualitatively more impressive
This seems like a pretty cursory treatment of what seems like quite a complicated and contentious subject. A few possible counterexamples jump to mind. These are just things I remember coming across when browsing cognitive science sources over the years.
Yep, fixed, thanks :)
Fixed, thanks :)
Should be "same", fixed, thanks :)
In model-free RL, policy-based methods choose policies by optimizing a noisy estimate of the policy's value. This is analogous to optimizing a noisy estimate of prediction accuracy (i.e., accuracy on the training data) to choose a predictive model. So we often need to trade variance for bias in the policy-learning case (i.e., shrink towards simpler policies) just as in the predictive modeling case.
There are "reliabilist" accounts of what makes a credence justified. There are different accounts, but they say (very roughly) that a credence is justified if it is produced by a process that is close to the truth on average. See (this paper)[https://philpapers.org/rec/PETWIJ-2].
Frequentist statistics can be seen as a version of reliabilism. Criteria like the Brier score for evaluating forecasters can also be understood in a reliabilist framework.
Maybe pedantic but, couldn't we just look at the decision process as a sequence of episodes from the POMDP, and formulate the problem in terms of the regret incurred by our learning algorithm in this decision process? In particular, if catastrophic outcomes (i.e., ones which dominate the total regret) are possible, then a low-regret learning algorithm will have to be safe while still gathering some information that helps in future episodes. (On this view, the goal of safe exploration research is the same as the goal of learning generally: design low-regret
...I definitely think it's worth exploring. I have the intuition that creating a single agent might be difficult for various logistical and political reasons, and so it feels more robust to figure out the multiagent case. But I would certainly like to have a clearer picture of how and under what circumstances several AI developers might implement a single compromise agent.
Ah, I see now that I did not make this clear at all. The main thing in the case of war is that, under certain payoff structures, a state might not be able to credibly commit to the terms of a peaceful settlement if it is expected to increase in power relative to its counterpart. Thus the state who expects to lose relative power will sometimes rather wage preventative war (while it is still relatively strong) than settle. This is still a problem in models with complete information and divisible stakes.
I'll try to edit the text to make this clearer soon, thanks for bringing it up.
It seems plausible that if players could truthfully disclose private information and divide stakes, the ability to credibly commit would often not be needed
Even if the players can find a settlement that they both prefer to conflict (e.g., flipping a coin to decide who gets the territory) there's still the problem of committing to honoring that settlement (you might still just attack me if the coin doesn't land in your favor). So I think there's still a problem. But maybe you're saying that if there's really no private information, then there is no credi
...Do you think focusing on s-risks leads to meaningfully different technical goals than focusing on other considerations?
I think it definitely leads to a difference in prioritization among the things one could study under the broad heading of AI safety. Hopefully this will be clear in the body of the agenda. And, some considerations around possible downsides of certain alignment work might be more salient to those focused on s-risk; the possibility that attempts at alignment with human values could lead to very bad “near misses” is an example. (I think so
...
Thanks for sharing, I'm happy that someone is looking into this. I'm not an expert in the area, but my impression is that this is consistent with a large body of empirical work on "procedural fairness", i.e., people tend to be happier with outcomes that they consider to have been generated by a fair decision-making process. It might be interesting to replicate studies from that literature with an AI as the decision-maker.