A lot of ink has been spilled about how to get various decision algorithms to cooperate with each other. However, most approaches require the algorithm to provide some kind of information about itself to a potential cooperator.

Consider FDT, the hot new kid on the block. In order for FDT to cooperate with you, it needs to reason about how your actions are related to the algorithm that the FDT agent is running. The naive way of providing this information is simple: give the other agent a copy of your “source code”. Assuming that said code is analyzable without running into halting problem issues, this should allow FDT to work out whether it wants to cooperate with you.

Security needs of decision algorithms

However, just handing out your source code to anyone isn’t a great idea. This is because agents often want to appear to have a different decision algorithm from the one they are actually running.

Threats and capitulation thresholds

Many people adopt some version of the “don’t give in to threats” heuristic. The reasoning here is solid: while refusing to give in to a threat carries a penalty in the particular instance, being the kind of person who doesn’t give in to threats pays off in the long run because you receive fewer threats.

However, usually this will come with an escape hatch in case the stakes get too high. If I am trying to extort £100 from you by plausibly threatening to destroy the planet, you should just pay the money. Why is this okay? Well, threats of higher magnitude are usually harder to create, or carry higher costs for the threatener, so an agent can expect to face relatively few of them.

The absolute best thing for an agent, though, would be to have the reputation of never giving in to threats, no matter how high the cost, while actually being willing to give in to threats above some threshold. That way you never get threatened, but if by some ill luck you do face an enormous threat, you can still capitulate. In general, an agent would like their capitulation threshold to be perceived to be as high as possible, ideally infinity.

It is therefore in an agent’s interest to hide their capitulation threshold. If an adversary finds out what it is, and it’s something they can exceed easily, then they can exploit you, even if they only get one shot at threatening you.

This makes an agent’s capitulation threshold a piece of sensitive data, and keeping it secret is a security problem. Obviously just handing out your source code is right out, but an agent should also take other security precautions. To give a short list:

  • Be cautious about talking about how you make decisions.
  • Try to conceal any evidence of occasions when you have capitulated in the past.
  • Avoid revealing information about your past behaviour in general, or at least follow glomarization rules.
  • Avoid interacting with agents who appear to be probing for sensitive information (e.g. escalating threats).1

All data is sensitive

I’ve been talking about a capitulation threshold because it’s an easy example, but the history of computer security tells us that leaking any kind of information about your implementation can be dangerous. It’s best to just assume that attackers have magical attack powers and hide as much as you can.

(Aside: what about FDT? Isn’t it immune to this kind of exploitation? Sure, but you’re not going to be running FDT, you’re going to be running some approximation to it. Imagine an agent running FDT-approx, which spends N cycles computing an approximation to FDT, then acts on that. The value of N is then sensitive data - knowing that could allow an adversary to construct a situation just complicated enough to lead to poor behaviour without being too costly to construct.)

But if we are operating in a context where revealing information about how we work is dangerous, how are we supposed to reveal enough about ourselves to be able to cooperate effectively?

Zero knowledge cooperation

We can borrow some ideas from security to help us here. Suppose that A wishes to convince B that they are likely to cooperate in an upcoming game. What A actually wants to convey to B is that precise fact: they are likely to cooperate. Ideally they would reveal no other knowledge at all. What we want is a zero-knowledge proof of A’s willingness to cooperate.

This seems like it could be a whole subfield of work, but here are a few suggestions.

Private histories

One way for A to provide evidence that they are likely to cooperate is to provide historical evidence that they cooperated in similar situations in the past. However, this is sensitive information, so ideally they would like it not to be public.

So let’s suppose that after a game has played out, the results are made public, but with each agent identified by a unique token for which they can provide a zero-knowledge proof of ownership.

Thus A can reveal that they were a participant in a previous game, providing evidence towards their future actions, but without providing this knowledge to anyone except B.
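As a toy illustration, the ledger-plus-token idea might look like the following sketch. The hash-commitment scheme and all names here are my own stand-ins, not a real zero-knowledge construction: revealing the nonce preimage to B is weaker than a true zero-knowledge proof (B could pass it on to others), so a real system would substitute a proper ZK proof of ownership.

```python
import hashlib
import secrets

def make_token(agent_secret: bytes, game_id: str) -> str:
    """Unlinkable public token recorded alongside the game result."""
    nonce = hashlib.sha256(agent_secret + game_id.encode()).digest()
    return hashlib.sha256(nonce).hexdigest()

def prove_ownership(agent_secret: bytes, game_id: str) -> bytes:
    """A privately reveals the nonce to B; stand-in for a real ZK proof."""
    return hashlib.sha256(agent_secret + game_id.encode()).digest()

def verify(proof: bytes, token: str) -> bool:
    """B checks that the revealed nonce hashes to the public token."""
    return hashlib.sha256(proof).hexdigest() == token

# A public ledger of (game_id, token, action) entries.
secret = secrets.token_bytes(32)
ledger = [("game-17", make_token(secret, "game-17"), "cooperated")]

# A proves to B (and only B) that they were the cooperator in game-17.
proof = prove_ownership(secret, "game-17")
```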

Secret source-code sharing

The most convincing evidence A could provide would be a copy of its source code.2 However, A doesn’t want B to end up with the code itself, only with the derived information that it “looks trustworthy”.

To that end we can run the following protocol.3

First, A provides B with a copy of A’s source code, encrypted using A’s secret key and a fully homomorphic encryption scheme.4 Then B runs (homomorphically) a program (the Validator) on the source code that validates a property (in this case “trustworthiness”), and outputs 0 or 1 depending on whether the property holds. Crucially, the Validator also signs this output with B’s secret key.

Now what B actually receives is an encrypted version of the Validator’s output, so B has to return this to A for decryption. A can validate that the decrypted response matches an agreed upon pattern (is 0 or 1, not, say, the full source code), in which case they pass it back to B, along with the signature. Since A cannot forge B’s signature, B trusts that this really is the output of the Validator.
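To make the message flow concrete, here is a toy simulation of the protocol. The "FHE" wrapper and the HMAC "signature" are placeholders only (a real run would use an actual homomorphic encryption scheme and a proper signature scheme), and names like check_trustworthy are illustrative:

```python
import hmac
import hashlib
from dataclasses import dataclass

@dataclass
class Ciphertext:
    key: bytes
    plaintext: object  # hidden from anyone without `key` in a real scheme

def fhe_encrypt(plaintext, key):
    return Ciphertext(key, plaintext)

def fhe_eval(fn, ct):
    # Homomorphic evaluation: fn is applied to the plaintext, but the
    # result stays encrypted under the same key.
    return Ciphertext(ct.key, fn(ct.plaintext))

def fhe_decrypt(ct, key):
    assert ct.key == key
    return ct.plaintext

def sign(message: bytes, key: bytes) -> bytes:
    # Placeholder signature; a real protocol would use public-key signatures.
    return hmac.new(key, message, hashlib.sha256).digest()

A_key, B_key = b"A-secret", b"B-secret"
A_source = "def act(observation): ..."  # stand-in for A's code

def check_trustworthy(source: str) -> int:
    return 1  # placeholder analysis of the source

def validator(source):
    bit = check_trustworthy(source)
    return (bit, sign(bytes([bit]), B_key))

# 1. A -> B: A's source, encrypted under A's key.
msg_1 = fhe_encrypt(A_source, A_key)
# 2. B -> A: the Validator run homomorphically; output is still encrypted.
msg_2 = fhe_eval(validator, msg_1)
# 3. A decrypts and checks the output matches the agreed pattern (a bit
#    plus signature, not, say, the full source) before returning it.
bit, signature = fhe_decrypt(msg_2, A_key)
assert bit in (0, 1)
# 4. B checks the signature; since A can't forge it, the bit must be the
#    genuine output of the Validator.
assert hmac.compare_digest(signature, sign(bytes([bit]), B_key))
```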

I think this is a zero-knowledge proof, since B cannot replicate the process for someone else without having A’s secret key, and a log of the transactions does not rule out B having provided A with the response beforehand. Informally, the “surprising” part of the protocol is A producing a token with B’s signature, but this is only surprising to B, and could be easily faked for a third party if A and B were colluding.


I think the situation looks surprisingly good here. A cautious agent should be able to get by with revealing the minimum amount of information about itself. The biggest problem is the currently prohibitive cost of FHE, but I’m optimistic that that will come down over time.

  1. Even if you have good security procedures you still want to do this. You should install fail2ban even if you think SSH is secure. 

  2. There is, of course, the small matter of proving that the source code you provided is the code you’re actually running, but this is independent of the security problem. 

  3. This protocol actually allows B to extract one bit of information of any kind from A’s secret data, so could be used for other purposes too. As far as I know it’s novel. 

  4. FHE is prohibitively expensive to perform at the moment, but we’re looking for a possibility proof here. Yes, I’m using a very large hammer to crack this nut. 


There's a detail I've not understood: what is it that A passes back to B? The encrypted output or the decrypted output? If it's the decrypted output of the Validator, how does B verify that the signature is correct, since the Validator signed the encrypted output?

If I understand you and FHE correctly: that's what the FHE is doing.

  1. A calculates msg_1 = Encrypt(A_source, A_key) and sends it to B

  2. B sends msg_2 = Sign(Validate(msg_1), B_key) to A

  3. A sends msg_3 = Decrypt(msg_2, A_key) to B. By the magic of FHE, msg_3 == Sign(Validate(A_source), B_key), even though A doesn't know B_key and B doesn't know A_source.

Pretty much! Expanding your explanation a little:

  1. A computes msg_1 = Encrypt(A_source, A_key) and sends it to B

  2. B wants to run Validate(source) = Sign(Check_trustworthy(source), B_key) on A_source, but can't do that directly because B only has an encrypted version.

    1. So B runs Validate under FHE on msg_1, producing msg_2 = Encrypt(Validate(A_source), A_key), and sends that to A.

  3. A decrypts msg_2, producing msg_3 = Validate(A_source) = Sign(Check_trustworthy(A_source), B_key), and sends that back to B (if it meets the agreed-on format).

  4. B has a claim that A's source is trustworthy, signed by B's key, which A can't have, so it must have been produced by B's program.

Step 2.1 is where the magic happens.

(I should have just put this in the post!)

I'm not sure I'm completely solid on how FHE works, so perhaps this won't work, but here's an idea of how B can exploit this approach:

  1. Let's imagine that Check_trustworthy(A_source) = 1. After step 3 of the parent comment B would know E1 = Encrypt(1, A_key). If Check_trustworthy(A_source) returned 0, B would instead know E0 = Encrypt(0, A_key) and the following steps work similarly. B knows which one it is by looking at msg_3.

  2. B has another program: Check_blackmail(X, source) that simulates behaviour of an agent with the given source code in situation X and returns 1 if it would be blackmailable or 0 if not.

  3. B knows Encrypt(A_source, A_key) and they can compute F(X) = Encrypt(Check_blackmail(X, A_source), A_key) for any X using FHE properties of the encryption scheme.

  4. Let's define W(X) = if(F(X) = E1, 1, 0). It's easy to see that W(X) = Check_blackmail(X, A_source), so now B can compute that for any X.

  5. Profit?

I think your example won't work, but it depends on the implementation of FHE. If there's a nonce involved (which there really should be), then you'll get different encrypted data for the output of the two programs you run, even though the underlying data is the same.

But you don't actually need to do that. The protocol lets B exfiltrate one bit of data, whatever bit they like. A doesn't get to validate the program that B runs, they can only validate the output. So any program that produces 0 or 1 will satisfy A and they'll even decrypt the output for you.

That does indeed mean that B can find out if A is blackmailable, or something, so exposing your source code is still risky. What would be really cool would be a way to let A also be sure what program has been run on their source by B, but I couldn't think of a way to do this such that both A and B are sure that the program was the one that actually got run.

A similar algorithm appears in Age of Em by Robin Hanson ('spur safes' in Chapter 14). Basically, a trusted third party allows copies of A and B to analyze each other's source code in a sealed environment, then deletes almost everything that is learned.

A and B both copy their source code into a trusted computing environment ('safe'), such as an isolated server or some variety of encrypted VM. The trusted environment instantiates a copy of A (A_fork) and gives it B_source to inspect. Similarly, B_fork is instantiated and allowed to examine A_source. There can be other inputs, such as some contextual information and a contract to discuss. They examine the code for several hours or so, but this is not risky to A or B because all information inside the trusted environment will mandatorily be deleted afterwards. The only outputs from the trusted environment are a secure channel from A_fork to A and one from B_fork to B. These may only ever output an extremely low-resolution one-time report. This can be one of the following 3 values: 'Enter into the contract with the other', 'Do not enter into the contract with the other', or 'Maybe enter the contract'.
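A minimal sketch of that flow, with the trusted environment simulated by an ordinary function (deletion and isolation are assumed, not enforced; all names are illustrative):

```python
from enum import Enum

class Verdict(Enum):
    # The only information allowed out of the safe: a one-time,
    # low-resolution report per agent.
    ENTER = "enter into the contract"
    DECLINE = "do not enter into the contract"
    MAYBE = "maybe enter the contract"

def spur_safe(a_source: str, b_source: str, inspect_a, inspect_b):
    """Simulates the sealed environment: each fork inspects the other's
    source, then everything except the two verdicts is discarded."""
    verdict_for_a = inspect_a(b_source)   # A_fork examines B's code
    verdict_for_b = inspect_b(a_source)   # B_fork examines A's code
    # In the real scheme, all intermediate state is deleted here; only
    # the verdicts leave over the agents' secure channels.
    return verdict_for_a, verdict_for_b

# Example inspectors, standing in for the forks' actual analysis:
to_a, to_b = spur_safe("A code", "B code",
                       lambda src: Verdict.ENTER,
                       lambda src: Verdict.MAYBE)
```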

This does require a trusted execution environment, of course.

I don't know if this idea is original to Hanson.

I haven't read Age of Em, but something like "spur safes" was an inspiration (I'm sure I've come across the idea before). My version is similar except that

  1. It's stripped down.

    1. B only needs to make a Validator, which could be a copy of themself, but doesn't have to be.

    2. It only validates A to B, rather than trying to do both simultaneously. You can of course just run it twice in both directions.

  2. You don't need a trusted computing environment.

I think that's a pretty big deal, because the trusted computing environment has to be trusted enough to run its end of A/B's secure channels. In order for A/B to trust the output, it would need to e.g. be signed by their private keys, but then the computing environment has access to those keys and can do whatever it wants! The trick with FHE is to let B run a computation using their secret key "inside" the safe without letting anyone else see the key.