An approach to the Agent Simulates Predictor problem

by AlexMennen, 9th Apr 2016

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I think that the idea that contradictions should lead to infinite utility is probably something that doesn't work for real models of logical uncertainty. Instead we can do pseudorandomization. That said, there might be some other approach that I'm missing.

Maximizing is not EDT. In fact I believe it is the original formulation of UDT. The problems with EDT arise when you condition on indexical uncertainty. Instead, you should condition on logical uncertainty while holding indexical uncertainty fixed (see also this). I think that the correct decision theory has to look like evaluating some sort of expected values, since the spirit of VNM theory should survive.

I think that the idea of gradually increasing the power of utility estimation corresponds precisely to going to higher metathreat levels, since the implementation of metathreats by optimal predictors involves pseudorandomizing each level while using more powerful predictors on the next level that are able to simulate the pseudorandomization. This should also solve logical counterfactual mugging, since the logical coin looks random on the lower levels, which allows pre-committing to cooperative behavior on the higher levels, on which the logical coin looks deterministic.

I think that the idea that contradictions should lead to infinite utility is probably something that doesn’t work for real models of logical uncertainty.

Why not?

In fact I believe it is the original formulation of UDT.

Huh, you're right. I even knew that at one point. Specifically, what I proposed was UDT 1 (in this write-up's terminology), and I concluded that what I described was not UDT because I found a bug in it that I thought UDT should get right (and which UDT 1.1 does get right).

Suppose we amend ASP to require the agent to output a full simulation of the predictor before saying "one box" or "two boxes" (or else the agent gets no payoff at all). Would that defeat UDT variants that depend on stopping the agent before it overthinks the problem?

(Or instead of requiring the agent to output the simulation, we could use the entire simulation, in some canonical form, as a cryptographic key to unlock an encrypted description of the problem itself. Prior to decrypting the description, the agent doesn't even know what the rules are; the agent is told in advance only that decryption will reveal the rules.)

For the simulation-output variant of ASP, let's say the agent's possible actions/outputs consist of all possible simulations S_i (up to some specified length), concatenated with "one box" or "two boxes". To prove that any given action has utility greater than zero, the agent must prove that the associated simulation of the predictor is correct. Where does your algorithm have an opportunity to commit to one-boxing before completing the simulation, if it's not yet aware that any of its available actions has nonzero utility? (Or would that commitment require a further modification to the algorithm?)
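To make the payoff structure of this variant concrete, here is a minimal sketch. The string encoding, the function name `payoff`, and the dollar amounts (the usual Newcomb values) are all assumptions for illustration, not part of the original write-up.

```python
# Toy payoff rule for the simulation-output variant of ASP.
# Names, encoding, and prize amounts are illustrative assumptions.

BOX_A = 1_000        # small prize, always present (assumed standard value)
BOX_B = 1_000_000    # large prize, present only if one-boxing was predicted


def payoff(action: str, true_trace: str, predicted: str) -> int:
    """Payoff for an output of the form '<simulation>|<choice>'.

    true_trace: the correct simulation of the predictor.
    predicted:  what the predictor predicted the agent would choose.
    """
    simulation, _, choice = action.rpartition("|")
    if choice not in ("one box", "two boxes"):
        return 0
    # An incorrect simulation forfeits the whole payoff.
    if simulation != true_trace:
        return 0
    box_b = BOX_B if predicted == "one box" else 0
    return box_b if choice == "one box" else box_b + BOX_A
```

Under this rule every action whose simulation part is wrong is worth exactly zero, which is why the agent cannot show that any action has positive utility until it has verified a simulation.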

For the simulation-as-key variant of ASP, what principle would instruct a (modified) UDT algorithm to redact some of the inferences it has already derived?

Simulation-output: It would require a modification to the algorithm. I don't find this particularly alarming, though, since the algorithm was intended as a minimally-complex solution that behaves correctly for good reasons, not as a final, fully-general version. To do this, the agent would have to first (or at least, at some point soon enough for the predictor to simulate) look for ways to partition its output into pieces and consider choosing each piece separately. There would have to be some heuristic for deciding which partitionings of the output to consider and how much computational power to devote to each of them; which partitioning actually gets chosen would then depend on which one has the highest expected utility. Come to think of it, this might be trickier than I was thinking, because you would run into self-trust issues if you need to prove that you will output the correct simulation of the predictor. This could be fixed by delegating the task of fully simulating the predictor to an easier-to-model subroutine, though that would require further modification to the algorithm.

Simulation-as-key: I don't have a good answer to that.

In the first problem, the agent could commit to one-boxing (through the mechanism I described in the link) and then finish simulating the predictor afterwards. Then the predictor would still be able to simulate the agent until it commits to one-boxing, and then prove that the agent will one-box no matter what it computes after that.
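A toy illustration of the commit-then-simulate ordering, under strong assumptions: the agent is modeled as a Python generator of events, a "COMMIT" event is taken to be binding, and the predictor's proof that the agent one-boxes after committing is replaced by simply trusting the commitment. The names `agent` and `predictor` and the event strings are hypothetical.

```python
from typing import Iterator, Optional


def agent() -> Iterator[str]:
    """Hypothetical agent: commits early, then does the expensive work."""
    yield "think"
    yield "COMMIT one box"          # assumed binding; made before simulating
    for _ in range(10**6):          # stand-in for fully simulating the predictor
        yield "simulate predictor"
    yield "OUTPUT one box"


def predictor(max_steps: int) -> Optional[str]:
    """Run the agent for at most max_steps events; report any commitment seen.

    Returns None if the budget runs out before the agent commits.
    """
    for step, event in enumerate(agent()):
        if step >= max_steps:
            return None
        if event.startswith("COMMIT "):
            return event[len("COMMIT "):]
    return None
```

Because the generator is lazy, `predictor(10)` stops as soon as it sees the commitment and never pays for the million simulation steps, which mirrors the point that the predictor only needs to simulate the agent up to the commitment.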

The second version of the problem seems more likely to cause problems, but it might work for the agent to restrict itself to not using the information it pre-computed for the purposes of modeling the predictor (even though it has to use that information for understanding the problem). If the predictor is capable of verifying or assuming that the agent will correctly simulate it, it could skip the impossible step of fully simulating the agent fully simulating it, and just simulate the agent on the decrypted problem.

This is a reply to Alex's comment 792, but I'm placing it here since for some weird reason the website doesn't let me reply to 792.

I think that the idea that contradictions should lead to infinite utility is probably something that doesn’t work for real models of logical uncertainty.

Why not?

So, I started writing an explanation of why it doesn't work, tried to anticipate the loopholes you would point out in this explanation, and ended up with the conclusion that it actually does work :)

First, note that in logical uncertainty the boolean divide between "contradiction" and "consistency" is replaced by a continuum. Logical conditional expectations become less and less stable as the probability of the condition goes to zero (see this; for a generalization to probabilistic algorithms see this). What we can do in the spirit of your proposal is e.g. maximize for some small constant (in the optimal predictor formalism we probably want to be a function of that goes to 0 as goes to infinity).

The problem was that the self-referential nature of UDT requires optimal predictors for reflective systems, and the construction I knew for the latter yielded probabilistic optimal predictors, since it uses the Kakutani fixed point theorem and we need to form mixtures to apply it. With probabilistic optimal predictors things get hairy since the stability condition "" is replaced by the condition "lowest eigenvalue of ". There seems to be no way to stabilize this new condition. There are superficially appealing analogues that in the degenerate case reduce to choosing the action most unlikely in the normal distribution with mean and covariance . Unfortunately it doesn't work since there might be several actions with similar likelihoods that get chosen with different indexical probabilities consistently with the above mean and covariance. Indeed it would be impossible for it to work since in particular it would allow getting non-negligible logical variance of a quantity that depends on no parameters, which cannot be (since it is always possible to hardcode such a quantity).

However, recently I discovered reflective systems that are deterministic (and which seem the right thing to use for real agents for independent reasons). For these systems the "naive" method works! This again cashes out into some sort of pseudorandomization, but this way the pseudorandomization arises naturally instead of having to insert an arbitrary pseudorandom function by hand. Moreover it looks like it solves some issues with making the formalism truly "updateless" (i.e. dealing correctly with scenarios similar to counterfactual mugging).

Very pleased with this development!

Nice!

Typo: in the first full paragraph of page 2, I assume you mean the agent will one-box, not two-box.

And I'm not sure the final algorithm necessarily one-boxes even if the logical uncertainty engine thinks the predictor's (stronger) axioms are probably consistent. I think there might be a spurious counterfactual where the conditional utilities view the agent two-boxing as evidence that the predictor's axioms must be inconsistent. Is there a clean proof that the algorithm does the correct thing in this case?

Typo: in the first full paragraph of page 2, I assume you mean the agent will one-box, not two-box.

Yes, thanks for the correction. I'd fix it, but I don't think it's possible to edit a PDF in Google Drive, and it's not worth re-uploading and posting a new link for a typo.

And I’m not sure the final algorithm necessarily one-boxes even if the logical uncertainty engine thinks the predictor’s (stronger) axioms are probably consistent. I think there might be a spurious counterfactual where the conditional utilities view the agent two-boxing as evidence that the predictor’s axioms must be inconsistent. Is there a clean proof that the algorithm does the correct thing in this case?

I don't have such a proof. I mentioned that as a possible concern at the end of the second-to-last paragraph of the section on the predictor having stronger logic and more computing power. On reconsideration, though, this seems like a more serious concern than I initially imagined. It seems this will behave reasonably only when the agent does not trust itself too much, which would have terrible consequences for problems involving sequential decision-making.

Ideally, we'd want to replace the conditional expected value function with something of a more counterfactual nature to avoid these sorts of issues, but I don't have a coherent way of specifying what that would even mean.

I think you mean "a spurious counterfactual where the conditional utilities view the agent one-boxing as evidence that the predictor’s axioms must be inconsistent"? That is, the agent correctly believes that the predictor's axioms are likely to be consistent but also thinks that they would be inconsistent if it one-boxed, so it two-boxes?

[Edit: this isn't actually a spurious counterfactual.] The agent might reason "if I two-box, then either it's because I do something stupid (we can't rule this out for Löbian reasons, but we should be able to assign it arbitrarily low probability), or, much more likely, the predictor's reasoning is inconsistent. An inconsistent predictor would put $1M in box B no matter what my action is, so I can get $1,001,000 by two-boxing in this scenario. I am sufficiently confident in this model that my expected payoff conditional on me two-boxing is greater than $1M, whereas I can't possibly get more than $1M if I one-box. Therefore I should two-box." (This only happens if the predictor is implemented in such a way that it puts $1M in box B if it is inconsistent, of course.) If the agent reasons this way, it would be wrong to trust itself with high probability, but we'd want the agent to be able to trust itself with high probability without being wrong.
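The arithmetic behind that reasoning can be sketched as follows. This is only an illustration under assumptions: box A is taken to hold $1,000 (implied by the $1,001,000 figure), and in the "I did something stupid" branch a consistent predictor is assumed to have predicted two-boxing and left box B empty. The function name and the probability parameter are hypothetical.

```python
BOX_A = 1_000        # assumed contents of box A
BOX_B = 1_000_000    # contents of box B when it is filled


def expected_payoff_two_box(p_inconsistent: float) -> float:
    """Expected payoff conditional on two-boxing, under the agent's model.

    p_inconsistent: the agent's conditional probability that the predictor
    is inconsistent given that the agent two-boxes. In that case box B is
    full regardless of the action; otherwise (assumption) box B is empty.
    """
    return p_inconsistent * (BOX_B + BOX_A) + (1 - p_inconsistent) * BOX_A


# One-boxing can never pay more than BOX_B, so this model recommends
# two-boxing exactly when the expression above exceeds BOX_B.
def prefers_two_boxing(p_inconsistent: float) -> bool:
    return expected_payoff_two_box(p_inconsistent) > BOX_B
```

The expression simplifies to BOX_A + p_inconsistent * BOX_B, so with these numbers the agent two-boxes only when its conditional probability of inconsistency exceeds 0.999. That is the sense in which the agent must be "sufficiently confident in this model".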