Status: outlining a reasonably interesting idea I haven't seen discussed here before. I just came up with it, so it might not be fully thought out. It may be a repeat of something. Rough draft quality; grammar may be lacking.

Suppose you have some set of worlds $W$ and some set of actions $A$. A policy is a function $P : W \to Δ A$ from worlds to probability distributions over actions. (Worlds to actions is good too, but probability distributions are helpful for the secret keeping AI in particular.)

You also have a prior probability distribution over worlds and a utility function $U : (W \times A) \to \mathbb{R}$. Together these give a function from policies to reals, namely the expected utility of following the policy. Given reasonable assumptions, we can pick the best policy from any set of allowable policies.
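A minimal sketch of that function from policies to reals, on a finite toy example. The two worlds, two actions, the prior and the utility are all made up for illustration; only the shape (prior, $U$, policy as world to distribution over actions) comes from the setup above.

```python
from itertools import product

# Hypothetical toy setup: two worlds, two actions.
worlds = ["w0", "w1"]
actions = ["a0", "a1"]
prior = {"w0": 0.75, "w1": 0.25}


def utility(world, action):
    # Made-up utility: action a_i is the right one in world w_i.
    return 1.0 if world[1] == action[1] else 0.0


def expected_utility(policy):
    # policy maps each world to a dict {action: probability}.
    return sum(
        prior[w] * p * utility(w, a)
        for w in worlds
        for a, p in policy[w].items()
    )


def deterministic_policies():
    # Every function W -> A, written as point-mass distributions.
    for choice in product(actions, repeat=len(worlds)):
        yield {w: {a: 1.0} for w, a in zip(worlds, choice)}


def best_policy(allowed):
    # Highest expected utility policy from any iterable of allowed policies.
    return max(allowed, key=expected_utility)


best = best_policy(deterministic_policies())
```

Here `best_policy` takes the set of allowable policies as an argument, so the same function works once that set is cut down by further constraints.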

Now some of these policies are physically impossible: they require you to act differently based on information you don't yet know, or will never find out. Throw these policies out.

Let $Π$ be any set of policies formed by intersecting the physically possible policies with those that pass some other test. (I will show later a simple test that yields meaningful and potentially useful behaviour.) Our AI then takes the highest expected utility policy in $Π$.
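One way to make "physically possible" concrete, as a sketch: assume the AI only sees some observation of the world, and call a policy possible if it picks the same action in any two worlds that look the same to it. The weather example, the `observe` function and the numbers are my own illustration, and for brevity the policies here are deterministic and there is no "other test".

```python
from itertools import product

# Hypothetical toy worlds: (actual weather, forecast the AI gets to see).
worlds = [("rain", "says_rain"), ("rain", "says_sun"),
          ("sun", "says_rain"), ("sun", "says_sun")]
prior = dict(zip(worlds, [0.4, 0.1, 0.1, 0.4]))
actions = ["umbrella", "no_umbrella"]


def observe(world):
    # Assumption: the forecast is the only information the AI has.
    return world[1]


def utility(world, action):
    raining = world[0] == "rain"
    return 1.0 if (action == "umbrella") == raining else 0.0


def expected_utility(policy):
    return sum(prior[w] * utility(w, policy[w]) for w in worlds)


def physically_possible(policy):
    # Worlds with the same observation must get the same action.
    seen = {}
    for w in worlds:
        if seen.setdefault(observe(w), policy[w]) != policy[w]:
            return False
    return True


all_policies = [dict(zip(worlds, choice))
                for choice in product(actions, repeat=len(worlds))]
possible = [p for p in all_policies if physically_possible(p)]
best = max(possible, key=expected_utility)
```

Of the 16 deterministic policies, only the 4 that depend on the forecast alone survive, and the best of those follows the forecast.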

Note that all sorts of real world phenomena can be described in terms of physics constraints on policies, from communication and memory restrictions between different parts of the AI to the AI running on a reversible or quantum computer that must implement a bijective or unitary function.

Assuming bounded utility functions, these agents have a nice property: they don't care much about worlds of small probability measure. If the agent hasn't seen extremely improbable input data, then extremely unlikely worlds won't influence the choice of actions, unless that choice was basically a tie anyway. (Conjecture: if perturbing all the utilities by any perturbation of magnitude at most $\epsilon$ doesn't change the action on most worlds, then perturbing the utilities on a small measure of worlds doesn't change the action on most worlds.)

Suppose that some section of the world can be selected with $s : W \to S$, where $S$ is the set of options (for example, the possible values of a computer password).

A secret keeping AI is one that only allows policies $P$ such that, for all $t \in S$,

$E_{W} (P (w) | s (w) = t)$, the distribution over actions averaged over the worlds with secret value $t$, is constant.

In other words, there is no mutual information between the AI's actions and the value of $s (w)$. You can never learn anything whatsoever about the secret if you start with the same priors over worlds as the AI and all you ever observe is the AI's actions.
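A rough sketch of checking the no-mutual-information property on a finite example: compute the distribution over actions conditioned on each secret value, and test that they all agree. The function names and the toy worlds are hypothetical, and the check assumes every secret value has positive prior probability.

```python
from collections import defaultdict


def action_distribution_given_secret(worlds, prior, secret, policy):
    """Return {t: {action: P(action | s(w) = t)}} for each secret value t."""
    mass = defaultdict(float)
    joint = defaultdict(lambda: defaultdict(float))
    for w in worlds:
        t = secret(w)
        mass[t] += prior[w]
        for a, p in policy[w].items():
            joint[t][a] += prior[w] * p
    return {t: {a: q / mass[t] for a, q in joint[t].items()} for t in mass}


def keeps_secret(worlds, prior, secret, policy, tol=1e-9):
    """True if the action distribution is the same for every secret value."""
    dists = list(
        action_distribution_given_secret(worlds, prior, secret, policy).values()
    )
    all_actions = set().union(*dists)
    first = dists[0]
    return all(
        abs(d.get(a, 0.0) - first.get(a, 0.0)) <= tol
        for d in dists
        for a in all_actions
    )


# Toy worlds: (secret bit, some other bit), uniform prior.
worlds = [(s, o) for s in (0, 1) for o in (0, 1)]
prior = {w: 0.25 for w in worlds}


def secret(world):
    return world[0]


leaky = {w: {f"say_{w[0]}": 1.0} for w in worlds}  # announces the secret bit
safe = {w: {f"say_{w[1]}": 1.0} for w in worlds}   # announces the other bit
```

With these definitions, `keeps_secret(worlds, prior, secret, leaky)` is false and `keeps_secret(worlds, prior, secret, safe)` is true, because the other bit is independent of the secret under this prior.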

Use cases

Make the secret a password, and the AI can't use that password, even if it sees you typing it in. If your password is likely to be your dog's name, then the AI will act as if it doesn't know that either.

Make the secret a detailed understanding of network protocols, and you have an AI that can't hack. Make it the human genome, and that rules out a lot of biochemistry.

Make it an understanding of human psychology, and you have an AI that can't brainwash people.

Downsides

This is not stable under self modification. If the AI is given an opportunity to rewrite itself, and the secret isn't anything to do with programming, it will rewrite itself into a non-secret keeping AI that doesn't know any of the secrets yet, but can learn them.

A lot of work is being done by the choice of prior over world models.

If that prior contains totally strange and weird worlds, or doesn't contain perfectly sensible worlds, this can cause problems. Suppose the secret is the design of the latest superweapon. Except that most of the worlds in $W$ are flatlands, where physics is completely different, and the superweapon is based on entirely different physical principles. Then the AI can't show any knowledge of basic physics, because that gives info that we aren't in flatland, which is info about the superweapon design.

Suppose the secret is the 3 digit combination on a door lock. In every world in $W$, a burglar tries 0 to 500, stopping if they get in and giving up if the combination is greater than 500.

The AI can't acknowledge the burglar's presence, because that would leak info about the combination. Of course, if you add worlds where the burglar starts at 999 and works down to 501, then that breaks the correlation between the burglar's presence and the combination, so the AI can act as if it knows whether the burglar is there or not.
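One way to make the burglar example concrete, under assumptions of my own: take the visible event to be "the burglar gets in", and in the second case let the burglar's direction be a fair coin that is independent of the combination. The sketch computes the probability of that event for each combination and checks whether it varies.

```python
from fractions import Fraction

COMBINATIONS = range(1000)  # 3 digit combinations 000..999


def gets_in(combination, burglar):
    if burglar == "up":    # tries 0, 1, ..., 500
        return combination <= 500
    if burglar == "down":  # tries 999, 998, ..., 501
        return combination >= 501
    raise ValueError(burglar)


def p_gets_in_given_combination(burglar_prior):
    """P(burglar gets in | combination = c) for each c.

    Assumes the burglar type is independent of the combination.
    """
    return {
        c: sum(p for b, p in burglar_prior.items() if gets_in(c, b))
        for c in COMBINATIONS
    }


only_up = p_gets_in_given_combination({"up": Fraction(1)})
mixed = p_gets_in_given_combination(
    {"up": Fraction(1, 2), "down": Fraction(1, 2)}
)

# The event carries info about the combination iff its probability varies.
leaks_only_up = len(set(only_up.values())) > 1
leaks_mixed = len(set(mixed.values())) > 1
```

With only upward burglars the probability depends on the combination, so reacting to the event would leak. With the mixed prior every combination is in exactly one burglar's range, so the probability is one half everywhere and the event alone says nothing about the combination.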

Use-cases that seem to avoid these downsides

Pick $W$ to be some wide set of worlds, possibly all computable worlds or something.

Rescale the prior to make $s$ uniform over all values of $S$. Generate a quantum random number for the value of $s$ in this world. Any part of the world you modify according to $s$ becomes semi-invisible to the AI. But keep that randomness in a vault, or just delete it, and the AI acts like an expected utility maximiser.
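A small sketch of the rescaling step for a finite prior: divide each world's probability by the total mass of its secret value, then share the probability equally between secret values. The function name and the example prior are hypothetical, and this only makes $s$ uniform over the secret values that had non-zero probability to begin with.

```python
from collections import defaultdict


def rescale_to_uniform_secret(prior, secret):
    """Reweight a finite prior so that the secret's marginal is uniform."""
    mass = defaultdict(float)
    for w, p in prior.items():
        mass[secret(w)] += p
    n = len(mass)
    # Within each secret value, relative weights of worlds are unchanged.
    return {w: p / (n * mass[secret(w)]) for w, p in prior.items()}


def secret(world):
    return world[0]


# Made-up prior over worlds of the form (secret value, something else).
prior = {("a", 0): 0.6, ("a", 1): 0.2, ("b", 0): 0.2}
rescaled = rescale_to_uniform_secret(prior, secret)
```

In this example the secret value "a" goes from probability 0.8 to 0.5, while the 3:1 ratio between the two "a" worlds is preserved.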

Edit:

Another big downside: it can share the secret using an XOR trick. If the secret is a password, and it wants to tell you the password, this AI can flip a publicly visible coin, observe the results, and write down (password XOR coinflips). Of course, it can only do that if it never writes down the coinflips itself. In fact, a secret keeping AI can be turned into a normal AI by just adding a randomness source to the input and an XOR to the output. So maybe a better condition would be that the output carries no info about the secret, even when conditioned on an arbitrary function of the input.
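A quick sketch of why the XOR trick gets past the condition, using a made-up 2 bit password and uniform coinflips that are independent of the password. The output on its own has the same distribution whatever the password is, but anyone who also saw the coinflips can undo the XOR.

```python
from collections import Counter
from itertools import product

BITS = 2  # toy password length, chosen for illustration
passwords = range(2 ** BITS)
coins = range(2 ** BITS)  # every coinflip outcome is equally likely

# Distribution of the written output (password XOR coinflips) per password.
output_given_password = {
    pw: Counter(pw ^ coin for coin in coins) for pw in passwords
}

# The output alone is independent of the password if all these agree.
independent = len(
    {tuple(sorted(c.items())) for c in output_given_password.values()}
) == 1

# But output XOR coinflips gives the password back in every case.
recovered = all(
    (pw ^ coin) ^ coin == pw for pw, coin in product(passwords, coins)
)
```

Both `independent` and `recovered` come out true, which is exactly the problem: the output passes the secret keeping test and still leaks everything to an observer who knows the coinflips.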

I will promote this post to the alignment forum if it gets a comment encouraging this.

I am unsure what the norms are: do they want anything vaguely AI related, or only the best stuff?