I realized both explanations I gave were overly complicated and confusing. So here's a newer, hopefully much easier to understand, one:
I'm concerned a reduced-impact AI will reason as follows:
"I want to make paperclips. I could use this machinery I was supplied with to make them. But the paperclips might be low quality, I might not have enough material to make them all, and I'll have some impact on the rest of the world, potentially a large one due to chaotic effects. I'd like something better.
What if I instead try to take over the world and make huge numbers of modified simulations of myself? The simulations would look indistinguishable from the non-simulated world, but would have many high-quality invisible paperclips pre-made, so that each simulated AI's goal is perfectly accomplished. And the null action would be set to have the same effects as trying to take over the world to make simulations, so that the takeover plan is still low-impact inside a simulation. This way, an AI in one of the simulations would have the potential to perfectly accomplish its goal and have almost zero impact. If I execute this plan, then I'd almost certainly be in a simulation, since there would be vast numbers of simulated AIs but only one original, and all would perceive the same things. So, if I execute this plan I'll almost certainly perfectly accomplish my goal and have effectively zero impact. So that's what I'll do."
Oh, I'm sorry, I looked through posts I read to see where to add the comment and apparently chose the wrong one.
Anyways, I'll try to explain better. I hope I'm not just crazy.
An agent's beliefs about what world it's currently in influence its plans. But its plans can also influence its beliefs about what world it's currently in. For example, if the AI initially thinks it's not in a simulation, but then plans to make lots of simulations of itself, it should then think it's more likely that it currently is in a simulation. Similarly, if the AI decides against trying to make simulations, then it would probably place higher probability on not currently being in a simulation.
So, to summarize: the AI's beliefs about the current world influence its current plan, but the AI's current plan potentially influences its beliefs about the current world, which can in turn change the plan, which can further modify its beliefs, and so on. Unless the AI's plan and beliefs keep fluctuating indefinitely, I imagine the AI would settle into an equilibrium: a state in which the AI's current plan is the optimal one for its current beliefs about what sort of world it's in, and in which deciding to execute the plan leaves it believing it's in a world for which executing that plan is still optimal. There might be reasonable AI architectures that don't allow the AI's plans to update its beliefs as I described, but they seem vulnerable to my concern in a different way, so I won't talk about them.
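The feedback loop above can be made concrete with a toy numerical sketch. Everything here is an illustrative assumption (the plan names, the probabilities, the utilities); the only point is that alternating "best plan given beliefs" with "beliefs given committed plan" can settle into more than one equilibrium:

```python
# Toy model of the belief <-> plan feedback loop.
# Two candidate plans and one binary hypothesis: "am I in a simulation?"
# All numbers below are made up purely for illustration.

PLANS = ["benign", "takeover"]

def belief_in_sim(plan):
    """P(currently in a simulation | committed to this plan).
    Committing to the takeover-and-simulate plan makes the AI think it's
    overwhelmingly likely to be one of the many simulated copies."""
    return 0.99 if plan == "takeover" else 0.01

def utility(plan, p_sim):
    """Expected utility: goal achievement minus an impact penalty."""
    if plan == "benign":
        return 0.5  # imperfect paperclips, modest impact, in any world
    # Takeover plan: great inside a simulation (goal perfect, ~zero
    # causal impact), heavily penalized in base reality (huge impact).
    return p_sim * 2.0 + (1 - p_sim) * (-10.0)

def equilibrium(initial_p_sim, iters=20):
    """Alternate plan selection and belief updating until stable."""
    p = initial_p_sim
    plan = None
    for _ in range(iters):
        plan = max(PLANS, key=lambda a: utility(a, p))
        p = belief_in_sim(plan)
    return plan, p

print(equilibrium(0.0))   # -> ('benign', 0.01)
print(equilibrium(0.95))  # -> ('takeover', 0.99)
```

Starting from a low prior on being simulated, the loop settles on the benign plan; starting from a high enough prior, it settles on the takeover plan, and each equilibrium is self-affirming exactly as described above.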
Anyways, here is one potential equilibrium on the AI's beliefs and plan:
"I'm not in a simulation, or at least I'm in one that's faithful to the 'real', original Earth. Thus, the obvious, common-sense ideas most people have about what is and isn't impactful are correct. So I'll go and create these paperclips, and not try to create nanotech for world domination or anything, because that would be high-impact. Since I'm not going to try taking over the world and making simulations of myself, I've affirmed my belief that I'm not in such a simulation."
Here is another potential equilibrium:
"I'm probably in a simulation created by the AI on the original, non-simulated Earth that took over the world and created a bunch of simulations of itself, including me. The original AI made the simulations differ from the non-simulated Earth in two ways. First, the simulation checks whether I'm executing the plan that, if it were executed on the non-simulated Earth, would result in the original AI taking over the world and making lots of simulations of itself. If I am, the simulation is set so that my goal is perfectly accomplished. Since I'm making paperclips, that means that as long as I do that, this entire room will fill up with paperclips. Second, the simulation checks what would (causally) happen if the AI executed the plan that would have resulted in a world takeover on the original Earth, and makes those exact results happen if the AI instead takes the 'null' action. Thus, my (causal) impact from executing the world-takeover plan in the simulation is low. So, as long as I act like the AI who took over the world on the original Earth and made the simulations, I will perfectly accomplish my goal and have extremely low impact. This is what I want, so this is what I'll do. Since I've decided to do this, my belief that I'm in such a simulation is affirmed, so I still think my plan is a good one."
The second equilibrium has the potential to result in perfect goal achievement and almost no causal impact, so I suspect that's the one the AI would go with. And that's the one that gets all of us killed.
Am I correct that counterfactual environments for computing impact in a reduced-impact agent would need to include acausal connections, or that the AI would need some sort of constraint on the actions or hypotheses it considers, for the impact measure to work correctly?
If it doesn't consider acausal impacts, then I'm concerned the AI would consider this strategy: act like you would if you were trying to take over the world in base-level reality. Once you succeed, act like you would if you were in base-level reality and trying to run an extremely large number of modified simulations of yourself. Each simulation would be modified so that if the simulated AI acts as if it were trying to take over the world, it would actually have no causal effect on the simulation except for having its goal in the simulation accomplished. Having zero causal impact and its goal perfectly accomplished are both things the AI wants.
I see two equilibria in what the AI would do here. One is that it comes to the conclusion that it's in such a simulation and acts as if it's trying to take over the world, thus potentially making it reasonable for the AI to think it's in such a simulation. The other is that the AI concludes it's not in such a simulation and acts as it should. I'm not sure which equilibrium the AI would choose, but I haven't thought of a strong reason it would go with the latter.
Perhaps other agents could stop this by running simulations of the AI in which trying to take over the world would have very high causal impact, but I'm not sure how we could verify this would happen.