# 2

Some of it went over my head, so I tried to come up with a thought experiment to test my current level of understanding:

Imagine that you are playing a repeated game similar to Newcomb's paradox, except that 90% of the time you play against a perfect super-predictor and 10% of the time you play against someone who pretends to be a perfect super-predictor.

• In Infra-Bayesianism, there is a convex probability distribution over environments.
• In Infra-Bayesian Physicalism, there is a probability distribution over computational evidence, e.g. knowledge about mathematical abstractions. There is also a probability distribution over models of the physical world.

It is difficult to predict an agent's actions based on these probability distributions. However, it is easier to use an example scenario, like Newcomb's paradox, to guess what kind of evidence that is significant, whether it is computational or physical, for the agent's output actions.

By trying two-boxing from time to time, the agent can test whether it is playing against a perfect super-predictor on Newcomb's paradox, or against somebody who pretends to be a perfect super-predictor. However, this information in repeated plays is shared between the agent and the one who pretends to be a super-predictor.

Imagine that boxing an AGI agent is for the agent like playing against a super-predictor, as the human operators might be able to deterministically examine its outputs. The output of the agent is shared knowledge in repeated plays. This means, if I pretend to be a super-predictor, I can tell whether the agent tries to "break out" by checking whether some output results in unexpected two-boxing on Newcomb-like benchmarks.

Assume that the agent learns a binary signal "I should one-box" or "I should two-box" depending on whether it predicts the likelihood that it is playing against a super-predictor or somebody who pretends to be a super-predictor.

In this case, if the agent relies on the binary signal for decisions, it will e.g. align on safety benchmarks as long it is able to predict reliably that it is being tested on safety benchmarks. However, as soon it predicts it is running in the real world without supervision, it can use the signal to change its output actions.

Now, my hypothesis for an adversarial AGI agent that pretends to be aligned:

1. During the training phase, it will be biased toward computational evidence
2. During the deployment phase, it will be biased toward physical evidence

Yet, this bias is not very large and easily detectable. All it needs is 1 reliable bit of information "I should one-box" or "I should two-box", plus the ability to train sufficiently on capabilities needed in the future, without being detected. The 1 bit of information is all the bias needed, so it can be easily hidden if the AGI agent has a complex architecture.

New Comment