How can we use techniques from formal verification to enable socially optimal outcomes in situations where individual incentives are misaligned with the social good?

Suppose that we have a good enough implementation of a strategic epistemology. That is, a way for our software system to build up a model of the world, using information from its own sensors and other systems, that accurately captures the relevant information necessary to make a decision on our behalf. In contexts with agents, or other software systems acting on their behalf, this model may need to include a representation of what those agents care about and what policies they and their delegated systems can implement.

Well, that probably took a lot of work, but now we can finally start applying techniques from game theory. Often, such as in the Prisoners' Dilemma, agents can agree on the socially optimal outcome, and yet have individual incentives which push them to a different outcome that they all agree is worse for everyone. The dilemma that they jointly face is how to achieve socially optimal outcomes anyway, without being taken advantage of.

Program Equilibria

One way to achieve robust cooperation is for each agent to legibly delegate their decision to a computer program, and then give each program the source code for all of the delegated systems. All delegated systems are then asked to implement a policy in the underlying game, which might just be making a binding commitment on behalf of their principal.

This is a new enriched game, where the space of strategies is the space of computer programs with the right type signature. And it turns out that there are stable equilibria of programs, which lead to joint policies being implemented that are not equilibria of the underlying game.

In particular for the Prisoners' Dilemma, there are programs which output Cooperate if and only if they can prove that the other program will output Cooperate. And perhaps more amazingly, those programs actually both output Cooperate! And in general these modal agents can be run in polynomial time! You can play with an implementation from that team right now!

Formal Verification

Formal verification is fantastic, and there are lots of great reasons to apply it to software. Formal verification is all about proving that software systems have certain properties. This is great if we want to prove that our software correctly implements a specification, or has desirable security properties. It's also great if we want to design software systems that condition their behavior on the properties of other systems.

Decision Theory and AI Safety

A system's decision theory is the way it maps from a strategic context to a policy it will implement in that context. That policy defines that system's method of signal processing, how it transforms a stream of inputs into a stream of outputs. Agents are a special type of signal processor that care.

There are lots of signal processors that are safe to manufacture and deploy en masse. Microphones and network routers don't pose an existential risk. It is specifically signal processors that want things, and are good at shaping their outputs to achieve them, that pose a risk if their values are misaligned with ours.

We can deploy systems that increase our ability to achieve our goals, without those systems themselves being goal-directed agents. We can implement a decision theory in software, without that software caring about anything at all. And this is the only safe way I know to write software that is arbitrarily good at solving problems.

Social Choice Theory and Logical Handshakes

In a strategic context, outcomes can depend on the policies implemented by multiple parties. A system's social choice theory is the way it ranks outcomes in terms of collective welfare. This is distinct from how an individual might rank those same outcomes in terms of their own welfare. And it is this tension that can lead to situations where agents agree that their welfare individually and collectively would be higher in one outcome, and yet have individual incentives that push them to a worse outcome. Agents can agree on the socially optimal joint policy, and yet find it individually optimal not to play their part in it.

When systems are legible to each other, in the sense of being able to reason accurately about each other, they can determine whether they share the same social choice theory. And if they do, they can each check whether the others will implement their part of the optimal joint policy.^{[1]} If so, each legibly implements their part of the optimal joint policy! Otherwise they do something else, like playing their part of a socially high-ranking Nash equilibrium. This logical handshake among all systems robustly leads to a socially optimal outcome, with all agents knowing they won't be taken advantage of.

Systems may also find that some, but not all, other relevant systems legibly share their social choice theory. A three-player game can be modeled as a 2x2x2 grid of outcomes, corresponding to the interference pattern of each of the three players having 2 policies each. The first two players A and B might find that they legibly implement the same social choice theory, but it's common knowledge that isn't true for the third player C. A and B can still perform a logical handshake and agree to coordinate on steering towards socially high-ranking outcomes. (Either within their coalition or generally, depending on their decision theory.) The strategic landscape can now be modeled as a 4x2 grid, essentially a 2-player game where the robust coordination between A and B allows them to effectively act as a single player, A×B, and implement any of the 4 joint policies available to their coalition. This all might or might not be legible to player C.

When all systems in a strategic context share the same social choice theory, and can effectively perform these logical handshakes, the whole joint policy space collapses down to a one-dimensional line, and decision theory reduces down to solving an optimization problem. That optimization problem might still be extremely difficult, since there are superexponentially many joint policies, and evaluating the social choice theory on even a single joint policy might involve evaluating the consequences of implementing it. But at least now agents can contribute processing power towards finding good solutions, instead of being stuck in a bad Nash equilibrium and wishing they could coordinate on Something Else Which Is Not That.

^{^}

There might be a unique socially optimal joint policy, or systems might legibly use the same tie-breaking rule.

How can we use techniques from formal verification to enable socially optimal outcomes in situations where individual incentives are misaligned with the social good?

Suppose that we have a good enough implementation of a strategic epistemology. That is, a way for our software system to build up a model of the world, using information from its own sensors and other systems, that accurately captures the relevant information necessary to make a decision on our behalf. In contexts with agents, or other software systems acting on their behalf, this model may need to include a representation of what those agents care about and what policies they and their delegated systems can implement.

Well, that probably took a lot of work, but now we can finally start applying techniques from game theory. Often, such as in the Prisoners' Dilemma, agents can agree on the socially optimal outcome, and yet have individual incentives which push them to a different outcome that they all agree is worse for everyone. The dilemma that they jointly face is how to achieve socially optimal outcomes anyway, without being taken advantage of.

## Program Equilibria

One way to achieve robust cooperation is for each agent to legibly delegate their decision to a computer program, and then give each program the source code for all of the delegated systems. All delegated systems are then asked to implement a policy in the underlying game, which might just be making a binding commitment on behalf of their principal.

This is a new enriched game, where the space of strategies is the space of computer programs with the right type signature. And it turns out that there are stable equilibria of programs, which lead to joint policies being implemented that are not equilibria of the underlying game.

In particular for the Prisoners' Dilemma, there are programs which output Cooperate if and only if they can prove that the other program will output Cooperate. And perhaps more amazingly, those programs actually both output Cooperate! And in general these modal agents can be run in polynomial time! You can play with an implementation from that team right now!

## Formal Verification

Formal verification is fantastic, and there are lots of great reasons to apply it to software. Formal verification is all about proving that software systems have certain properties. This is great if we want to prove that our software correctly implements a specification, or has desirable security properties. It's also great if we want to design software systems that condition their behavior on the properties of other systems.

## Decision Theory and AI Safety

A system's decision theory is the way it maps from a strategic context to a policy it will implement in that context. That policy defines that system's method of signal processing, how it transforms a stream of inputs into a stream of outputs. Agents are a special type of signal processor that care.

There are lots of signal processors that are safe to manufacture and deploy en masse. Microphones and network routers don't pose an existential risk. It is specifically signal processors that want things, and are good at shaping their outputs to achieve them, that pose a risk if their values are misaligned with ours.

We can deploy systems that increase our ability to achieve our goals, without those systems themselves being goal-directed agents. We can implement a decision theory in software, without that software caring about anything at all. And this is the only safe way I know to write software that is arbitrarily good at solving problems.

## Social Choice Theory and Logical Handshakes

In a strategic context, outcomes can depend on the policies implemented by multiple parties. A system's social choice theory is the way it ranks outcomes in terms of collective welfare. This is distinct from how an individual might rank those same outcomes in terms of their own welfare. And it is this tension that can lead to situations where agents agree that their welfare individually and collectively would be higher in one outcome, and yet have individual incentives that push them to a worse outcome. Agents can agree on the socially optimal joint policy, and yet find it individually optimal not to play their part in it.

When systems are legible to each other, in the sense of being able to reason accurately about each other, they can determine whether they share the same social choice theory. And if they do, they can each check whether the others will implement their part of the optimal joint policy.

^{[1]}If so, each legibly implements their part of the optimal joint policy! Otherwise they do something else, like playing their part of a socially high-ranking Nash equilibrium. This logical handshake among all systems robustly leads to a socially optimal outcome, with all agents knowing they won't be taken advantage of.Systems may also find that some, but not all, other relevant systems legibly share their social choice theory. A three-player game can be modeled as a 2x2x2 grid of outcomes, corresponding to the interference pattern of each of the three players having 2 policies each. The first two players A and B might find that they legibly implement the same social choice theory, but it's common knowledge that isn't true for the third player C. A and B can still perform a logical handshake and agree to coordinate on steering towards socially high-ranking outcomes. (Either within their coalition or generally, depending on their decision theory.) The strategic landscape can now be modeled as a 4x2 grid, essentially a 2-player game where the robust coordination between A and B allows them to effectively act as a single player, A×B, and implement any of the 4 joint policies available to their coalition. This all might or might not be legible to player C.

When all systems in a strategic context share the same social choice theory, and can effectively perform these logical handshakes, the whole joint policy space collapses down to a one-dimensional line, and decision theory reduces down to solving an optimization problem. That optimization problem might still be extremely difficult, since there are superexponentially many joint policies, and evaluating the social choice theory on even a single joint policy might involve evaluating the consequences of implementing it. But at least now agents can contribute processing power towards finding good solutions, instead of being stuck in a bad Nash equilibrium and wishing they could coordinate on Something Else Which Is Not That.

^{^}There might be a unique socially optimal joint policy, or systems might legibly use the same tie-breaking rule.