Repeated Play of Imperfect Newcomb's Paradox in Infra-Bayesian Physicalism
Yesterday, I watched the video series about Infra-Bayesianism. Link:

Some of it went over my head, so I tried to come up with a thought experiment to test my current level of understanding: imagine that you are playing a repeated game similar to Newcomb's paradox, except that 90% of the time you play against a perfect super-predictor and 10% of the time you play against someone who merely pretends to be a perfect super-predictor.

* In Infra-Bayesianism, the agent's uncertainty is a convex set of probability distributions over environments (a toy maximin calculation over such a set is sketched below).
* In Infra-Bayesian Physicalism, there is a probability distribution over computational evidence, e.g. knowledge about mathematical abstractions, as well as a probability distribution over models of the physical world.

It is difficult to predict an agent's actions directly from these distributions. It is easier to use an example scenario, like Newcomb's paradox, to guess what kind of evidence, computational or physical, is significant for the agent's output actions.

By two-boxing from time to time, the agent can test whether it is playing against a perfect super-predictor or against somebody who merely pretends to be one (a small simulation of this probing is also sketched below). In repeated plays, however, this information is shared between the agent and the pretend super-predictor.

Imagine that being boxed is, for an AGI agent, like playing against a super-predictor, since the human operators might be able to deterministically examine its outputs. In repeated plays, the agent's output is shared knowledge. This means that if I pretend to be a super-predictor, I can tell whether the agent tries to "break out" by checking whether some output results in unexpected two-boxing on Newcomb-like benchmarks. Assume that the agent learns a binary signal, "I should one-box" or "I should two-box", depending on whether it judges it more likely that it is playing against a genuine super-predictor or against somebody who pretends to be a super-predictor.
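To make the convex-set idea concrete, here is a minimal sketch in Python. All specifics are my own assumptions rather than anything from the Infra-Bayesianism material: standard Newcomb payoffs, a pretender who fills the opaque box by a coin flip, and a credal interval of [0.8, 1.0] for the probability that the counterpart is a genuine super-predictor. The infra-Bayesian style decision rule then picks the action whose worst-case expected payoff over the whole set is largest.

```python
# Minimal sketch of a maximin choice in the mixed Newcomb game.
# Assumptions (mine, not from the post): standard Newcomb payoffs, a pretender
# who cannot predict and fills the opaque box with probability 0.5, and a
# credal interval [0.8, 1.0] for "counterpart is a genuine super-predictor".

OPAQUE = 1_000_000   # opaque box contents if the predictor expects one-boxing
CLEAR = 1_000        # transparent box contents

def expected_payoff(action: str, p_real: float) -> float:
    """Expected payoff of `action` when the counterpart is a real predictor
    with probability p_real and a coin-flipping pretender otherwise."""
    if action == "one-box":
        real = OPAQUE              # real predictor foresees one-boxing, fills the box
        fake = 0.5 * OPAQUE        # pretender fills it half the time
    else:  # two-box
        real = CLEAR               # real predictor foresees two-boxing, leaves it empty
        fake = CLEAR + 0.5 * OPAQUE
    return p_real * real + (1 - p_real) * fake

# Credal set: we only commit to p_real lying somewhere in [0.8, 1.0].
credal_set = [0.8 + 0.01 * i for i in range(21)]

def worst_case(action: str) -> float:
    return min(expected_payoff(action, p) for p in credal_set)

for action in ("one-box", "two-box"):
    print(action, worst_case(action))

# Maximin choice: the action whose worst case over the credal set is largest.
print("maximin choice:", max(("one-box", "two-box"), key=worst_case))
```

With these numbers the worst case of one-boxing (900,000 at the low end of the interval) beats the worst case of two-boxing (1,000 against a certain predictor), so the maximin rule one-boxes; different assumed payoffs or a wider credal interval could flip this.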
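The probing idea can be sketched the same way. In this toy simulation (the probe rate, round count, and the pretender's coin-flip strategy are my own assumptions), the agent occasionally two-boxes and updates a posterior on what it finds: a genuine super-predictor always matches the box contents to the agent's actual choice, while the pretender cannot. The same transcript of actions is of course available to a human who pretends to be a super-predictor, who can use unexpected two-boxing as a tripwire.

```python
import random

# Illustrative sketch (my own numbers): the agent two-boxes with a small probe
# probability and does a Bayesian update on what it finds in the opaque box.
# A real super-predictor anticipates every action; the pretender fills the
# opaque box by a coin flip, independently of the agent's choice.

PROBE_RATE = 0.05          # how often the agent two-boxes to test its counterpart
ROUNDS = 200

def play(counterpart_is_real: bool, rng: random.Random) -> float:
    """Return the posterior probability that the counterpart is real,
    starting from the 90/10 prior of the thought experiment."""
    posterior = 0.9
    for _ in range(ROUNDS):
        probe = rng.random() < PROBE_RATE
        action = "two-box" if probe else "one-box"
        if counterpart_is_real:
            opaque_full = (action == "one-box")      # perfect prediction
        else:
            opaque_full = rng.random() < 0.5         # pretender just guesses
        # Likelihoods of the observation under each hypothesis.
        p_obs_real = 1.0 if opaque_full == (action == "one-box") else 0.0
        p_obs_fake = 0.5
        num = posterior * p_obs_real
        den = num + (1 - posterior) * p_obs_fake
        posterior = num / den if den > 0 else 0.0
    return posterior

rng = random.Random(0)
print("vs real predictor :", play(True, rng))
print("vs pretender      :", play(False, rng))
# The transcript of actions is shared: a human pretending to be a predictor can
# run the same test in reverse and flag any unexpected two-boxing.
```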
I think that most people imagine themselves as gatekeepers who control the world, when in reality the world is controlled by systems we don't fully understand. For example, I don't think people really know what "money" means in the real world. You can imagine yourself as someone who knows, but can you verify that your understanding matches reality?
If we want any chance of controlling the future to some degree, then humans need to take over the world. It is not sufficient to stop AGI from taking over the world; we also need to ensure that the future of the world is aligned with human values. For now, it seems that stuff like "money" is running the show.