The ability to make credible commitments is a key factor in many bargaining situations ranging from trade to international conflict. This post builds a taxonomy of the commitment mechanisms that transformative AI (TAI) systems could use in future multipolar scenarios, describes various issues they have in practice, and draws some tentative conclusions about the landscape of commitments we might expect in the future.

Introduction

A better understanding of the commitments that future AI systems could make is helpful for predicting and influencing the dynamics of multipolar scenarios. The option to credibly bind oneself to certain actions or strategies fundamentally changes the game theory behind bargaining, cooperation, and conflict. Credible commitments and general transparency can work to stabilize positive-sum agreements, and to increase the efficiency of threats (Schelling 1960), both of which could be relevant to how well TAI trajectories will reflect our values.

Because human goals can be contradictory, and even broadly aligned AI systems could come to prioritize different outcomes depending on their domains and histories, these systems could end up in competitive situations and bargaining failures where a lot of value is lost. Similarly, if some systems in a multipolar scenario are well aligned and others less so, some worst cases might be avoidable if stable peaceful agreements can be reached. As an example of the practical significance of commitment ability in stabilizing peaceful strategies, standard theories in international relations hold that conflicts between nations are difficult to avoid indefinitely primarily because there are no reliable commitment mechanisms for peaceful agreements (e.g. Powell 2004, Lake 1999, Rosato 2015), even when nations would overall prefer them.

In addition to the direct costs of conflict, the lack of enforceable commitments leads to continuous resource loss from arms races, monitoring, and other preparations for possible hostility. It can also simply prevent gains from trade that binding prosocial contracts and high trust could unlock. A strategic landscape that resembles current international relations in these respects seems possible in a fully multipolar scenario, where no AI system has yet gained a decisive advantage over the others, and no external rule of law can be strongly enforced over all the systems. If AI systems had a much greater ability to commit than we do, however, they could avoid recapitulating these common pitfalls of human diplomacy.

The potential consequences of credible commitments in various TAI scenarios will be discussed more thoroughly in forthcoming work. The purpose of this post is mostly to investigate whether and how credible commitments could be feasible for such systems in the first place.[1] As commitment mechanisms differ in which kinds of commitment they are best suited for, though, some implications for their consequences will also be tentatively explored.

Some quick notes on the terminology in this post:

Commitment ability refers here to an agent's ability to cause others to have a model of its relevant actions and future behavior which matches its own model of itself, or its genuine intentions.[2] This can naturally include arbitrarily complex probabilistic or conditional models. This definition diverges somewhat from how commitments are typically understood, but better captures the broader transparency relevant to bargaining situations. While an agent's model of itself may not always correspond to what it actually ends up doing, the noise from incorrect self-models should ideally be low enough that it doesn't affect the bargaining landscape much.

Closer to the conventional concept of commitment, commitment mechanisms here are ways to bind yourself more strongly to certain future actions in externally credible ways (such as visibly throwing out your steering wheel in a game of chicken).

Approaches to commitment in this context are simply the higher-level frameworks that agents can use to assess and increase the commitment ability of themselves and others. The main content of this post will be outlining these frameworks.

Approaches to commitment between AI systems

This section will discuss ways through which TAI could surpass humans in commitment ability, but also cover the main reasons why this isn't self-evident even between systems that are overall far more capable than humans. In particular, there are three properties of commitment approaches that are at least not obviously satisfied by any of the candidates here, but seem important when talking about commitment ability in a given future environment:

  • General programmability: an agent's commitment ability should be suitable for a wide enough range of commitments and contracts. In natural bargaining situations, various specific commitments are comparatively easy to implement because of contingent factors. For example, lending an expensive camera to one's sibling seems less risky than to a stranger simply because of the high likelihood of frequent future interactions, and geographical features can influence the credibility of various military strategies. The ability to make similarly limited contingent commitments can already have a great influence on the dynamics of multipolar scenarios, of course, but more generalizable approaches seem more informative when the details of likely TAI scenarios are still hazy.
  • Low dependence on specific architectures or paradigms, or alternatively just high compatibility with the trajectories we expect to see in foreseeable and tractable areas of AI development.
  • Economic viability, or costs that scale favorably compared to, e.g., the likely opportunity costs of agents involved in future bargaining.

Classical approaches: mutually transparent architectures

Early discussions in AI safety often assumed that transformative AI systems would be based on advanced models of the fundamental principles of intelligence. Their cognitive architectures could therefore be quite elegant, and perhaps arbitrarily transparent to other similarly intelligent agents. The concept of systems checking each other's source code, or allowing a trusted third party to verify it, was often used as a shorthand for this kind of mutual interpretability. For highly transparent agents whose goals are also contained in compact formal representations, such as utility functions, reliable alliances could even be formed by merging utility functions (Dai 2009, 2019). Work on program equilibrium as a formal solution to certain game-theoretic dilemmas uses source code transparency as a starting point (Tennenholtz 2004; see also Oesterheld 2018), assuming complete information about the other agent's syntax on which to condition one's response.[3] Further work has also generalized the idea of conditional commitments and the cooperative equilibria they support (Kalai et al. 2010).
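
To make the program equilibrium idea concrete, here is a minimal sketch in Python (the bot names are invented for illustration, and the construction is only a toy version of the cited formalisms): each player submits a program that receives the other program's source code, and a simple "clique" program cooperates exactly when the opponent's source is syntactically identical to its own, making mutual cooperation self-enforcing in a one-shot prisoner's dilemma.

```python
# Minimal sketch of a program equilibrium in a one-shot prisoner's dilemma.
# Each submitted program receives the opponent's source code and returns an
# action; the "clique" program cooperates exactly when the opponent's source
# is syntactically identical to its own.

import inspect

def clique_bot(opponent_source: str) -> str:
    """Cooperate iff the opponent runs this exact program."""
    own_source = inspect.getsource(clique_bot)
    return "C" if opponent_source == own_source else "D"

def defect_bot(opponent_source: str) -> str:
    """Always defect, regardless of the opponent's source."""
    return "D"

def play(program_a, program_b):
    """Give each program the other's source and collect their actions."""
    source_a = inspect.getsource(program_a)
    source_b = inspect.getsource(program_b)
    return program_a(source_b), program_b(source_a)

print(play(clique_bot, clique_bot))  # ('C', 'C'): cooperation is self-enforcing
print(play(clique_bot, defect_bot))  # ('D', 'D'): no cooperation to exploit
```

Syntactic comparison is of course brittle: functionally equivalent but differently written cooperators get treated as defectors, which is one motivation for the more general conditional commitments mentioned above.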

These approaches seem less compatible with recent advances in AI. Capability gain is currently driven mostly by reinforcement learning in increasingly large and complex environments, and less by progress in understanding the building blocks of general cognition (Sutton 2019). It seems plausible that the ultimately successful paradigms for TAI will be conceptually quite close to contemporary work (Christiano 2016). If this is the case, and superhuman systems end up hacky and opaque in the way human brains are, their mutual interpretability could also remain limited, much as it is between humans. Cognitive heterogeneity is already a hindrance to mutual understanding in itself, and will likely be much greater among AI systems than among humans. Considering that all humans share an evolutionary history and are strongly adapted for social coordination, we could even be much better at credibility and honesty than independently trained AI systems, if those are developed through very different methods or in varying conditions and have no such natural adaptations for transparency.[4]

On the other hand, superhuman agents could also be able to define and map the foundations of intelligence better than human researchers. Even prosaic trajectories could thus eventually lead to more compact builds and allow for higher interpretability. Though beyond the reach of human researchers, intentionally designed and elegant cognitive architectures could still ultimately be more efficient than ones that emerged through less controllable (e.g. evolutionary) processes. Increased commitment ability in itself might already motivate agents to move in this direction, if they expect transparency to facilitate more gains from trade or some other competitive advantage. The bargaining landscape would then change in a predictable pattern over time: early AI systems would have poor commitment ability despite otherwise superhuman competence, but after more intentional refactoring towards transparency, strong commitments through classical approaches would eventually become available to their successors.[5]

This kind of self-modification would still lack robust safeguards against some conceptually simple exploits. Even if one could comb through an agent's internal structure at some point after it self-modified to be highly interpretable, it would be costly to make sure that it hasn't, for example, secretly changed something relevant in the environment before this process. In addition, asymmetries in competence would likely appear between agents due to their different domains, histories, and goals. Whether global differences in competence or just local blind spots, these asymmetries might make obfuscating one's intentions a viable strategy after all, and decrease the general credibility of commitments.

If transformative AI systems are built with current paradigms, existing research on interpretability might also be helpful when predicting commitment ability. Even if the kind of syntactic transparency required for program equilibrium approaches isn't feasible, high levels of trust could still be achieved as long as other ways exist to understand another agent's internal decision procedures from the outside. This would resemble a more advanced version of human researchers trying to make contemporary machine learning models more understandable to us.

The literature on interpretability currently lacks a unified paradigm, but it often divides methods for interpretation into model-based and post-hoc approaches (Murdoch et al. 2018). The former require the models themselves to be inherently more understandable through design choices such as sparsity of parameters, modular structures, or more feature engineering based on expert knowledge about the domain in question. These ideas can possibly be extrapolated to TAI scenarios, and some key concepts will be explored below. The latter are more specific to current narrow models, and mostly deal with measures of feature importance, i.e. clarifying which features or interactions in the training data have contributed to the model's outputs, and to what degree. With enough information about how an agent has been trained, analogous methods could perhaps be useful, but likely laborious; they will not be discussed further in this post.

Generally, model-based methods share a problem: they constrain the model design in ways that limit its other capabilities. Some interpretability researchers have suggested that these constraints are less prohibitive than they seem, at least in contemporary applications. When the task is to make sense of some dataset, the set of models that can accomplish such a predictive task (known as the Rashomon set) is potentially large, and thus arguably likely to include some intrinsically understandable models as well (Rudin 2019). This idea might extend quite poorly to general intelligences in practice, especially when computational efficiency is also a concern and the setting is competitive. However, it is related in a sense to the idea that there could be some eventually discoverable, highly compact building blocks that suffice for general intelligence, even if many of the paths there are messier. One way this could hold is if the world and its relations are themselves fundamentally simple or compressible (see e.g. Wigner 1960).
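
As a minimal illustration of the Rashomon-set idea (synthetic data and an arbitrary accuracy tolerance, chosen only for this sketch): among candidate models that predict roughly equally well, one can simply prefer the most interpretable member, operationalized here as the sparsest L1-regularized logistic regression.

```python
# Sketch of a Rashomon-set style search: among models whose cross-validated
# accuracy is within a small tolerance of the best found, pick the most
# interpretable one, here operationalized as the sparsest logistic regression
# (fewest non-zero coefficients). Data and tolerance are purely illustrative.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=0)

candidates = []
for C in [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]:  # sparsity-accuracy trade-off
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    accuracy = cross_val_score(model, X, y, cv=5).mean()
    n_features = np.count_nonzero(model.fit(X, y).coef_)
    candidates.append((accuracy, n_features, C))

best_accuracy = max(acc for acc, _, _ in candidates)
rashomon_set = [c for c in candidates if c[0] >= best_accuracy - 0.01]
accuracy, n_features, C = min(rashomon_set, key=lambda c: c[1])  # sparsest
print(f"Chose C={C}: accuracy {accuracy:.3f} using {n_features} features")
```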

Another way in which even complex systems could achieve more transparency is through modularity, where various parts of an agent's cognition can be examined and interpreted somewhat independently. Different cognitive strategies, employed in different situations depending on some higher-level judgment, could potentially be both effective and fairly transparent due to their smaller size (and possibly higher fundamental comprehensibility and traceable history) compared to a generally intelligent agent. Whether strongly modular structures are in fact functional or competitive enough in this context will be discussed in forthcoming work, but the greater transparency of modular minds is questionable in any case. It seems unlikely that in a complex world, parts of an effective agent's reasoning could be separable enough from its other capacities to leave no context-dependent uncertainties, or opportunities to secretly defect by using seemingly trustworthy modules in underhanded ways. This certainly doesn't seem to be the case in human brains, despite their likely quite modular structure (for an overview, see e.g. Carruthers 2006, Robbins 2017).

Overall, the relation between interpretability and the kind of transparency that facilitates commitments is not well defined. Being able to interpret an agent's decisions doesn't seem to directly mean that the agent is simulatable or otherwise verifiable to you in specific bargaining situations. Transparency through these means seems especially implausible when local or global asymmetries between agents are large, and possibly when the scenario is adversarial.

Centralized collaborative approaches

A less architecture-dependent but also less satisfying approach is simply assuming that commitment ability is a very difficult problem, and like most very difficult problems, trivially solved by throwing a lot of compute at it. Perhaps TAI systems will remain irredeemably messy, but will be motivated to find ways to cooperate in spite of this. One route they might consider is similar to what human societies have often converged on: centralizing enough power and resources to enforce or check contracts that individual humans otherwise can't credibly commit to.

In this context, the central power could exist either simply to verify the intentions behind arbitrary commitments, or to punish defectors afterwards if they break established laws. As the latter task has been brought up in other contexts [link] and doesn't constitute a meaningfully multipolar scenario, this section will mostly discuss the former. An overseer that merely verifies contracts and commitments instead of dealing out punishments could be more palatable even for agents with idiosyncratic preferences about societal rules: it only requires agents to believe that the ability to make voluntary credible commitments will be positive in expectation.[6] It would nonetheless capture many of the benefits of a central overseer, as one main purpose of punitive systems is precisely to enforce otherwise untenable commitments.

The idea behind this mechanism is simply that while the agents can't interpret each other or predict how well they would stick to commitments, a far more capable system (here, likely just a system with vastly more compute at its disposal) could do it for them. If several agents of similar capability are involved in collaboratively constructing such a system, they can be fairly confident that no single agent can secretly bias it, or otherwise manipulate the outcome. This system would then serve as an arbitrator, likely with no other goals of its own, and remain far enough above the competence level of any other agent in the landscape. Assuming that its subjects will continuously strive for expansion and self-improvement, this minimal-state brain would also need to keep growing. As long as it remains useful, it could do so by collecting resources from the agents that expect to benefit from its abilities.

How much more intelligent would such a system need to be, though? Massively complex neural architectures could well remain inscrutable even to much more competent agents. In terms of neural connections, no human could use a snapshot of a salamander brain to predict its next action, let alone its motivations one hour in the future. Even the simple 302-neuron connectome of the nematode C. elegans mostly escapes our understanding, despite years of effort at emulating its functions and our own neuron count of roughly 8.6x10^10 (see OpenWorm and related projects, 2020). Most likely, judgments about an agent's honesty would need to rely in part on inferences based on its origins and history, slight behavioral inconsistencies, and other subtle external signs it would hopefully not be clever enough to fully cover up. Some causal traces of intentions to deceive bargaining partners could be expected. For the iconic argument in this space, see Yudkowsky (2008).

A major weakness of this approach is that the costs of running such a system seem substantial regardless of how large the capability difference needs to be. The gains from trade that agents could secure through increased commitment ability would have to exceed those costs, and it isn't clear that this is the case. Eventually, there might not be enough surplus left on the table to motivate further contributions to such a costly system. On the other hand, if there is some point after which most of the valuable commitments have been made and an arbitrator is no longer needed, this could just be because the bargaining landscape is by then thoroughly locked into a decently cooperative state: if bargaining failures were still frequent in expectation, there would be more potential surplus left. If so, the overall costs of relying on such a system might not be too high in the long run, as it would mostly be needed only during a transient unstable period of early interactions between AI systems.

There are a few ways through which a centralized arbitrator system could be set up, for example:

  • By humans in different AI labs or safety organizations, who recognize the long-term benefits of cooperative commitments and want to pave the way for a system that could verify them. Some relevant interventions in the AI policy and governance space could already be available at this point, and similarly, leading AI labs could already coordinate related projects.
  • By AI systems in early deployment, or even some later stages of instrumental expansion, who can directly construct an arbitrator and collaboratively watch over the process. To self-organize for such a project, agents would already need to communicate on some level, and agree that setting up such a system will benefit them sufficiently.
  • By an agent with some other goals, in a particularly suitable position, that optimizes for this task for instrumental reasons such as increased security and resources. Due to contingent events or features, some agents can by chance be much more credible than average, at least to some other agents, and seek to leverage this. This path is somewhat related to the approach outlined in the next section.

Decentralized collaborative approaches

A potentially less costly collaborative approach could work if credibility can be mapped to a multidimensional model where agents start out differentially trusting each other because of path-dependent or idiosyncratic reasons, and can then form networks to verify commitments. For example, due to domain-specific differences and histories, even agents whose overall competence is roughly on your level could spot minor details that you miss because of your own limitations, but that are relevant to your credibility. If architectural similarities matter for transparency, some agents will be able to understand each other's internal workings better than others; this could be the case if copies either of agents or their internal modules become common. Some agents can also come to share origins or relevant interactions that allow them to form better models of each other.

This approach differs from a centralized project in that it describes conditions where gradients of trust form with low initial effort along these path-dependent dimensions. As trust, at least in a general sense, can be largely transitive, the costs of communicating within a network under such conditions could stay reasonably manageable. More than a specific mechanism, this approach would be a way to extend already existing local and empirical commitment ability, at least in a probabilistic manner, across larger areas of the bargaining landscape.

As a simplified example, say that there are three agents, A, B, and C, whose trust in each other is binary: they either can or cannot trust one another. Agent A can trust agent B (and vice versa) because of many shared modules, but cannot trust the internally very different agent C. Agent B can trust agent C (and vice versa) because of a shared history. If agents A and C then want to communicate credibly with each other, it seems easy for them to verify their agreements through their links to agent B.
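
A minimal sketch of this example in code (the agents and the binary trust relation are just the toy setup above): treating trust links as edges in a graph, any chain of mutually trusting intermediaries lets A and C route verification through B.

```python
# Minimal sketch of transitive trust as path-finding in a graph. Nodes are
# agents; an edge means the two agents can directly verify each other's
# commitments (binary trust, as in the example above).

from collections import deque

trust_links = {
    "A": {"B"},        # A and B share many modules
    "B": {"A", "C"},   # B and C share a history
    "C": {"B"},
}

def verification_path(source, target):
    """Return a chain of mutually trusting agents from source to target, if any."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbor in trust_links.get(path[-1], set()) - visited:
            visited.add(neighbor)
            queue.append(path + [neighbor])
    return None

print(verification_path("A", "C"))  # ['A', 'B', 'C']: verify via agent B
```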

In larger agent spaces, longer chains and networks of agents with differential levels of trust could plausibly come to influence the dynamics of commitment through similar network structures. Even without multiple dimensions to make the task potentially cheaper, however, a less centralized approach can be pursued. Rather than specifically building a central system for the black-box task of verifying commitments, a network of agents that can, alongside their other pursuits, trade various evaluation services could be a more dynamic way to get the required amounts of compute for assessing individual contracts.

While modeling the payoffs that agents could receive by helping others communicate is not a central question in this context, it is interesting when considering the incentives for such tasks. Various models have been built in cooperative game theory to represent limited communication between different parties in a network of collaborators and the payoff distributions in such situations (see e.g. Slikker and Van Den Nouweland 2001). A widely used formalism is the Myerson value, which builds on the Shapley value and allocates a greater part of the surplus in a coalition to players who facilitate communication, and therefore cooperation, between others (Myerson 1977, 1980; Shapley 1951). This and related concepts correspond reasonably well to scenarios where trust differences allow only some agents to credibly communicate with each other. Forthcoming work will investigate in more detail how cooperative equilibria can be sustained in limited-communication situations.
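
As a worked example, the Myerson value for the three-agent chain A-B-C from the previous section can be computed as the Shapley value of the graph-restricted game. The characteristic function below is purely illustrative (any connected coalition of two or more agents realizes a surplus of 1); under it, B receives 2/3 of the surplus and A and C receive 1/6 each, reflecting B's role as the communication bridge.

```python
# Worked example of the Myerson value for the three-agent chain A-B-C.
# A coalition only generates surplus if its members are connected through
# trust links; the Myerson value is then the Shapley value of this restricted
# game, rewarding B for bridging A and C. The surplus of 1 is illustrative.

from itertools import permutations

players = ["A", "B", "C"]
trust_links = {frozenset({"A", "B"}), frozenset({"B", "C"})}  # the chain A-B-C

def connected(coalition):
    """Check whether a coalition is connected via trust links."""
    coalition = set(coalition)
    if not coalition:
        return False
    frontier, reached = [next(iter(coalition))], set()
    while frontier:
        agent = frontier.pop()
        reached.add(agent)
        frontier += [b for b in coalition - reached
                     if frozenset({agent, b}) in trust_links]
    return reached == coalition

def restricted_value(coalition):
    """Surplus of 1 for any connected coalition of at least two agents."""
    return 1.0 if len(coalition) >= 2 and connected(coalition) else 0.0

def myerson_values():
    """Shapley value of the restricted game, averaged over join orderings."""
    values = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        for i, player in enumerate(order):
            marginal = (restricted_value(order[:i + 1])
                        - restricted_value(order[:i]))
            values[player] += marginal / len(orderings)
    return values

print(myerson_values())  # approximately {'A': 0.167, 'B': 0.667, 'C': 0.167}
```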

Overall, the approach described in this section mostly serves as a rough sketch for much more sophisticated network strategies that AI systems could devise, but even with very liberal hypothetical extrapolation, the bridge to plausible practical scenarios seems shaky. At the very least, the availability and strength of any local credibility links are determined mostly by higher-level features of the agent space, although intentionally creating more of them seems possible for cooperative human researchers during development.

Automated approaches

High transparency can sometimes be achieved by finding a commitment mechanism simple enough that its workings are unambiguous from the outside. By separating a commitment from the sophisticated strategies and other cognitive complexities of the agent itself, an effective approach can consist of nothing more than automatic structures responding to the environment in predictable ways. Nuclear control systems were presumably built in the Cold War-era Soviet Union that could be triggered by sensor input alone, to ensure retaliation with minimal human intervention (Wikipedia 2020).[7] Companies can irreversibly invest in and deploy specific assets, tying their hands to a certain strategy, often in an observable and understandable way (e.g. Sengul et al. 2011). Similarly, militaries can reduce their options by mobilizing troops that would be too costly to recall regardless of what one's opponent chooses to do (Fearon 1997).
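
A minimal sketch of why such mechanisms can be comparatively transparent (the sensors, thresholds, and triggered action are entirely hypothetical): the whole decision procedure fits in a few lines that an outside party can audit, unlike the agent that deployed it, though its reliability depends entirely on how well the trigger conditions anticipate the environment.

```python
# Minimal sketch of an automated commitment device; thresholds and actions are
# hypothetical. Its appeal is that the entire trigger logic is short enough to
# be audited externally, unlike the agent that deployed it; its weakness is
# that the trigger conditions must anticipate the environment well.

from dataclasses import dataclass

@dataclass
class SensorReading:
    seismic_spike: bool
    command_link_down: bool
    hours_since_last_checkin: float

def retaliation_triggered(reading: SensorReading) -> bool:
    """Fire iff the pre-committed, externally auditable conditions are met."""
    return (reading.seismic_spike
            and reading.command_link_down
            and reading.hours_since_last_checkin > 6.0)

# A weather anomaly that knocks out the command link but causes no seismic
# spike does not trigger this device; a poorly chosen condition set would.
print(retaliation_triggered(SensorReading(False, True, 12.0)))  # False
```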

While powerful in many specific cases, this approach is quite limited especially in complex environments. With large differences in general or domain-specific competence, there might be few situations where simple automated mechanisms can even be built transparently enough. Regardless of how interpretable and robust some physical device or resource investment seems, it doesn't remove the intelligent agent from the equation, or again prevent it from setting up the environment in a clever way that allows for defection after all.

In most contexts an automated approach has many other downsides as well, such as a lack of flexibility and corrigibility when there are unpredictable events in the environment. It seems unlikely that highly verifiable automated mechanisms could be built with the resolution to track the ideal commitments one could make in complex situations, and most interesting contracts likely could not be represented at all. In environments with agents that are much more diverse than humans, nations, or organizations, the physical reliability of simple commitment devices could be illusory even when they are set up by agents that are sincere in their commitments. While the fearful symmetry seen in nuclear deterrence strategies may be the best option available for practically reducing the incidence of conflict, it has historically led to mistaken near-launches due to unforeseen details such as weather anomalies (Wikipedia 2020). This illustrates how even applying a simple commitment mechanism requires a good understanding of the environment, including one's peers and their behavior space, when designing viable trigger conditions for whatever the intended procedure is. The worse one's models of the other players are, the harder this task becomes.

Strategic delegation

In the economic and game-theoretic literature, a related but typically more flexible approach is strategic delegation, where principals deploy agents with different direct incentives to act on their behalf. By optimizing for something other than the principal's actual goal, delegated agents can sometimes reach better bargaining outcomes, because the desired commitment is naturally more favorable to their incentives. For example, a manager may be responsible merely for keeping a company in the market, not for its ultimate profit margins, which credibly changes the way they will respond to threats in entry deterrence games (Fershtman and Judd 1987). The original formalism behind strategic delegation (Vickers 1985) involves an agent appointment game that precedes the actual game between agents and determines how the agents in the latter game play, with outcomes given by an exogenously specified outcome function. More recent work (Oesterheld and Conitzer 2019) describes how delegates with modified incentives can safely strive for Pareto improvements.
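
A toy entry-deterrence game (all payoffs invented for this sketch) illustrates the mechanism. Solving by backward induction, a profit-maximizing incumbent accommodates entry, so the entrant enters; a delegate rewarded for market share instead of profit credibly fights, which deters entry and leaves the principal with the monopoly payoff.

```python
# Illustrative entry-deterrence game; all payoffs are made up for this sketch.
# Backward induction: the entrant predicts the incumbent's response to entry.
# A profit-maximizing incumbent accommodates, so entry happens; a delegate
# rewarded for market share credibly fights, which deters entry.

# (entrant_payoff, incumbent_profit, incumbent_market_share)
PAYOFFS = {
    ("enter", "fight"):       (-2, 1, 0.9),
    ("enter", "accommodate"): ( 4, 4, 0.5),
    ("stay_out", None):       ( 0, 10, 1.0),
}

def outcome(incumbent_objective):
    """Solve the game by backward induction for a given incumbent objective."""
    index = {"profit": 1, "market_share": 2}[incumbent_objective]
    response = max(["fight", "accommodate"],
                   key=lambda r: PAYOFFS[("enter", r)][index])
    entrant_payoff_if_enter = PAYOFFS[("enter", response)][0]
    if entrant_payoff_if_enter > PAYOFFS[("stay_out", None)][0]:
        return ("enter", response)
    return ("stay_out", None)

print(outcome("profit"))        # ('enter', 'accommodate'): entry not deterred
print(outcome("market_share"))  # ('stay_out', None): delegation deters entry
```

The catch, as discussed below, is that the entrant must believe the delegate really is bound to the market-share objective, which is itself a commitment problem.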

The practical applications of these models are not immediately clear in the empirical future scenarios we might envision. As pointed out by Oesterheld and Conitzer, the process of committing one's delegates to their modified incentives must itself already be credible. If the deployed agent differs from the principal mostly in terms of incentives, and not in competence, agency, or internal complexity, it may not be much more transparent in its commitments than the principal was. Perhaps some goals are more verifiable or otherwise credible than others, e.g. in terms of observable actions that are consistent with them, but the fundamental problem of internal opaqueness remains. One solution is to deploy the agent only for a specific bargaining situation, for which it is trained in a mostly transparent way where an observer can see the details of the training procedure. However, similarly to how modules within a single agent pose challenges, it is unclear how well individual bargaining situations could be separated from enforcing the resulting agreements in the environment, and the enforcement would again presumably require a more generally competent agent to be crucially involved.

Iteration, punishment capacity, and other miscellaneous factors

If interactions in the bargaining environment are iterated, or one's history is visible to outside parties one might trade with later on, reputation concerns can incentivize sticking to commitments. This is a well-known finding in game theory (see e.g. Mailath & Samuelson 2006), and will not be discussed much further here, but ought to be included for the sake of completeness. In transparent iterated scenarios, an agent expects other players to be able to punish it later for breaking commitments. Even if the environment is uncertain, adhering to costly commitments can signal credibility to future bargaining partners as a long-term strategy. A concrete special case of the former mechanism is having a central power or other system with the material capacity to retrospectively punish agents that renege on their contracts, much like law enforcement in human societies works through the deterrent effect of designed consequences for defection.[8]

As it is mostly far upstream from commitment ability, increasing the iteration factor of interactions for the sake of credibility seems inefficient and probably intractable. Among agents whose strategies optimize for the very long term, it is also unreliable: if interactions are repeated in an environment where the stakes get higher over time, most agents would prefer to be honest while the stakes are low, regardless of how they will act in a sufficiently high-stakes situation. This holds especially because the higher the stakes get in a competition for expansion, the fewer future interactions one expects, as wiping out other players entirely becomes a possible outcome. Iteration alone would therefore provide limited information, even if it sometimes is the only practical way to provide evidence of one's trustworthiness.
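
A minimal simulation of this point (all numbers are hypothetical): a purely strategic agent cooperates exactly as long as the one-off gain from breaking a commitment is smaller than the discounted value of continued cooperation, so a long clean record at low stakes says little about behavior once the stakes peak.

```python
# Minimal sketch of why honesty at low stakes is weak evidence of honesty at
# high stakes. A forward-looking opportunist cooperates while the discounted
# value of the ongoing relationship exceeds the one-off gain from defecting,
# then defects. Stakes, discount, and surplus share are all hypothetical.

def strategic_agent_actions(stakes, discount=0.9, cooperation_share=0.5):
    """Round-by-round actions of a forward-looking opportunist."""
    actions = []
    for t, stake in enumerate(stakes):
        # Value of preserving the relationship: this agent's share of all
        # remaining surplus, discounted back to the current round.
        future_value = sum(cooperation_share * s * discount ** (k - t)
                           for k, s in enumerate(stakes) if k > t)
        keep_cooperating = cooperation_share * stake + future_value
        grab_everything = stake  # take the whole stake, relationship ends
        if grab_everything > keep_cooperating:
            actions.append("defect")
            break
        actions.append("cooperate")
    return actions

stakes = [1, 2, 4, 8, 16, 32, 64, 1000]  # stakes grow over time
print(strategic_agent_actions(stakes))
# Cooperates for seven rounds, then defects when the largest stake arrives.
```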

Both epistemic and normative features of individual agents can make their commitments more credible, if these features are common knowledge. Human cultures, for instance, have used religious notions to signal commitment to certain strategies (e.g. Holslag 2016), perhaps often successfully relative to the available counterfactual approaches. Agents could also come to intrinsically value transparency or choose to adhere to commitments, either through moral values or certain decision-theoretic policy choices (Drescher 2006). These choices would not in themselves make commitments externally credible, of course, but could have verifiable origins depending on the agent's history.

Conclusions and further notes

As mentioned above, each commitment strategy described here seems to suffer from potentially serious drawbacks, though in different areas and circumstances. Many plausible scenarios can be envisioned where one or more of the approaches succeeds in supporting credible commitment. Different approaches could even be used in overlapping ways to compensate for each other's weaknesses, though this holds less if the main weakness is resource costs. In many cases, the feasibility of commitments seems to come down to whether the surplus from cooperation will be enough to incentivize a great deal of collective effort. Another fundamental question is how costly it is to carefully obfuscate one's intentions, versus to detect obfuscation by observing an agent's behavior and history.

On a more practical level, contingent features such as agent heterogeneity and logistics suggest that even if contracts and commitments were overall feasible, they would be costlier to verify between some agents than between others. Rather than expecting uniform opportunities for commitment throughout the landscape, we should perhaps assume the environment will be governed by some n-dimensional mess of gradients in commitment ability. Comparing agents along axes such as physical location, architectural similarity, history, normative motivations, and willingness to cooperate, some of them would likely be in better positions to make credible commitments to each other. This does not necessarily prevent widely cooperative dynamics, especially if there is a lot of transitivity in commitment ability between agents as speculated above, but makes the path there more complicated in terms of interventions.

Another insight from this work is that committing to threats could require completely different mechanisms or approaches than committing to cooperation, and future discussions of commitment among AI systems should ideally reflect this. Notably, as many ways of signaling one's intentions already require some minimal collaborative labor, it seems much more feasible, commitment-wise, to make prosocial commitments than to extort others.[9] When you can't simply inform your target of a threat and your intention to carry it out, and would instead need them to go through a costly process to get your intentions properly verified, you might find that they aren't very interested in hearing more about your plans.[10] One exception seems to be the dumber mechanisms, which are well suited for destructive threats but might not be able to represent complex voluntary trade contracts.

Acknowledgements

This post benefited immensely from conversations with and feedback from Jesse Clifton, Richard Ngo, Daniel Kokotajlo, Lin Chi Nguyen, Lukas Gloor, Stefan Torges, Anthony DiGiovanni, Caspar Oesterheld, as well as all my colleagues at Center on Long-term Risk (CLR) and the attendees at CLR's 2019 S-risk workshop, which inspired many of the initial ideas explored here.


  1. Whether or not AI systems can credibly commit to humans is not discussed much in this post, though it is also an interesting question. ↩︎

  2. There are some counterexamples to this definition that work against the spirit of commitment, though. Agents could still knowingly omit relevant contextual information they have about the world, or about processes they set in motion earlier which are now separate from the agent as such. On the other hand, requiring that no interesting contextual information be hidden when commitments are made seems impossibly strict, since even minuscule differences between the beliefs of two agents could turn out to be relevant in ways that are unpredictable or overly costly to map out. ↩︎

  3. Even in Bayesian games, contracts conditional on the other players' contracts can possibly be formulated. This idea has been investigated by Peters and Szentes (2012), but seems overly abstract to be relevant here. ↩︎

  4. This point was raised by Richard Ngo during our conversations in 2019. ↩︎

  5. It's not a given that elegance alone would lead to increased interpretability, of course. For example, even if there were fundamental patterns to intelligence that superhuman systems could discover and model themselves after, these more compact foundations could still possibly be implemented in any number of different ways, none of which might be uniquely efficient. ↩︎

  6. Some minimal non-aggression principles could hopefully also be added in to prevent agents from using the commitment system for extortion. This would on average again be in the interests of participants, as extortion causes expected value loss in the bargaining environment. ↩︎

  7. The system's predictability was apparently hampered by the inexplicable policy decision to not mention it much to outsiders, though. ↩︎

  8. Again, of course, this would not constitute a genuinely multipolar scenario. ↩︎

  9. This is pretty intuitive, as it's also how human commitment structures have been designed -- as a kidnapper, you could hardly hire a lawyer to write an enforceable contract that binds you to actually killing your hostages unless you get what you want. ↩︎

  10. This would naturally not mean that you couldn't have internally committed to the threat regardless of whether your target listens to you or not, but this would at least be an unwise strategy. ↩︎

Comments

Nice post! A couple of quick comments:

"If interactions are repeated in an environment where the stakes get higher over time, most agents would prefer to be honest while the stakes are low, regardless of how they will act in a sufficiently high-stakes situation."

If honesty in low-stakes situations is very weak evidence of honesty in high-stakes situations, then it will become less common as an instrumental strategy, which makes it stronger evidence, until it reaches equilibrium.

More generally, I am pretty curious about how reputational effects work when you have a very wide range of minds. The actual content of the signal can be quite arbitrary - e.g. it's possible to imagine a world in which it's commonly understood that lying continually about small scales is intended as a signal of the intention to be honest about large scales. Once that convention is in place, then it could be self-perpetuating.

This is a slightly extreme example but the general point remains: actions taken as signalling can be highly arbitrary (see runaway sexual selection for example) when they're not underpinned by specific human mental traits (like the psychological difficulty of switching between honesty and lying).

"This holds especially because the higher the stakes get in a competition for expansion, the fewer future interactions one expects, as wiping out other players entirely becomes a possible outcome."

Seems plausible, but note that early iterated interactions allow participants to steer towards possibilities where important outcomes are decided by many small interactions rather than a few large interactions, making long-term honesty more viable.

"lending an expensive camera to one's sibling seems less risky than to a stranger simply because of the high likelihood of frequent future interactions"

This doesn't seem right; your sibling is by default more aligned and trustworthy.

"while the agents can't interpret each other or predict how well they would stick to commitments, a far more capable system (here, likely just a system with vastly more compute at its disposal) could do it for them."

Is it fair to describe this as creating a singleton?

What do you think about the following sort of interpretability?

You and I are neural nets. We give each other access to our code, so that we can simulate each other. However, we are only about human-level intelligence, so we can't really interpret each other--we can't look at the simulated brain and say "Ah yes, it intends to kill me later." So what we do instead is construct hypothetical scenarios and simulate each other being in those scenarios, to see what they'd do. E.g. I simulate you in a scenario in which you have an opportunity to betray me.

Super thoughtful post!

I get the feeling that I'm more optimistic about post-hoc interpretability approaches working well in the case of advanced AIs. I'm referring to the ability of an advanced AI in the form of a super large neural network-based agent to take another super large neural network-based agent and verify its commitment successfully. I think this is at least somewhat likely to work by default (i.e. scrutinizing advanced neural network-based AIs may be easier than obfuscating intentions). I also think this may potentially not require that much information about the training method and training data.

I thought before that this doesn't matter in practice because of the possibility of self-modification and successor agents. But I now think that at least in some range of potential situations, verifying the behavior of a neural network seems enough for credible commitment when an agent pre-commits to using this neural network, e.g. via a blockchain.

Also, are you sure that the fact that people can't simulate nematodes fits well in this argument? I may well be mistaken but I thought that we do not really have neural network weights for nematodes, we only have the architecture. In this case it seems natural that we can't do forward passes.