This is a simple idea. I do not remember if I've seen it anywhere. It is probably not original, but I am mildly surprised that I haven't come across it, even if only to see it refuted. If this is an old/dumb idea, please let me know and I'll delete it.
People have built universal Turing machines in Minecraft. It is straightforward to build a virtualized Turing machine by simulating a physical system which carries out the mechanical actions needed to instantiate one. You could obviously build a computer in a simulated physics simpler than Minecraft, but the Minecraft example is a more vivid one.
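To make "virtualized Turing machine" concrete, here is a minimal sketch of one layer of the stack: a one-tape Turing machine run by an ordinary Python interpreter. The function name `run_turing_machine`, the rule format, and the binary-increment rule table are all illustrative assumptions of mine, not anything taken from a Minecraft build. The point is only that the simulated machine sees nothing but its tape and its rule table.

```python
from collections import defaultdict

def run_turing_machine(rules, tape, state="start", halt="halt", max_steps=10_000):
    """Run a one-tape Turing machine and return the final tape contents.

    rules maps (state, symbol) -> (new_symbol, move, new_state),
    where move is -1 (left) or +1 (right). "_" is the blank symbol.
    """
    cells = defaultdict(lambda: "_", enumerate(tape))
    head = 0
    steps = 0
    while state != halt:
        if steps >= max_steps:
            raise RuntimeError("step budget exhausted")
        new_symbol, move, state = rules[(state, cells[head])]
        cells[head] = new_symbol
        head += move
        steps += 1
    if not cells:
        return ""
    lo, hi = min(cells), max(cells)
    return "".join(cells[i] for i in range(lo, hi + 1)).strip("_")

# Hypothetical example machine: add 1 to a binary number written on the tape.
INCREMENT_RULES = {
    ("start", "0"): ("0", +1, "start"),   # scan right to the end of the number
    ("start", "1"): ("1", +1, "start"),
    ("start", "_"): ("_", -1, "carry"),   # hit the blank, turn around
    ("carry", "1"): ("0", -1, "carry"),   # 1 + carry = 0, keep carrying
    ("carry", "0"): ("1", -1, "halt"),    # 0 + carry = 1, done
    ("carry", "_"): ("1", -1, "halt"),    # overflow into a new leading digit
}

if __name__ == "__main__":
    print(run_turing_machine(INCREMENT_RULES, "1011"))
```

Everything the machine can "do" is a lookup in `rules` and a write to `cells`. The host's memory layout, clock, and network stack do not appear anywhere in its world, which is the property the sandboxing argument leans on.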
I don't even want to guess how much more computationally expensive it would be to run an AI, and the AI's simulated environment, on a Turing machine being run on a simulated physical computer, itself being simulated on mundane hardware. But it does strike me that an AI should have significantly more trouble hacking its way out of this kind of sandboxing.
It would likely have more trouble noticing that it's in a simulation in the first place, and if it did notice, it would likely have a much harder time doing anything about it. Our hardware relies on physical laws that we don't fully understand. It makes errors that we sometimes can't really notice or stop, because fundamentally it's made of atoms and fields. One can imagine ways in which a clever being might intentionally strain physical hardware to see how it reacts. We already know that an attacker can induce a bit flip in neighboring memory through physics trickery. In contrast, the underlying simulated physics of the AI world/brain could be extremely simple and designed to be as free of surprises as possible. Within the doubly-simulated world, and within its own simulated brain, there would be no stray EM fields, no Internet ports accidentally left open. The AI could bang on the strata of its physics all it wanted, and all "we" would see would be the flickering light of the Redstone machine.
I'm not one to underestimate a superintelligence, but the baseline security of this kind of double-sandboxing feels qualitatively different from that of physical hardware.
I suppose the problem comes when the AI starts to communicate with us. There would be a lot of information that it could exploit. Even if it doesn't get any sense of our physics, if it is able to model us we might be in trouble. And even if we didn't give it any direct communication (for example, only manifesting puzzles in its world, the solutions of which would let us answer our own questions), it might promote simulation to a reasonable hypothesis.
EY wrote a story that serves as an intuition pump here.
I agree that there is practically no purpose to using this kind of method if you are just going to give the AI information about our reality anyway.
It seems hard to me to get information out of the AI without also giving it information. That is, presumably we will configure parts of its environment to correspond to problems in our own world, which necessarily gives some information on our world.
I suppose another option would be that this is a proposal for running AGIs that just run without us ever getting information from them. I don't think that's what you meant, but thought I'd check.
There are two potential obstacles such a system would present to an AI. The first is locating our world out of all the possibilities. Every bit of data the AI can access, be it its own code, the rules of the simulation, or the amount of time you let it run, is a clue. If the total amount of data exceeds the Kolmogorov complexity of reality, then an AIXI would figure out where it was. The laws of quantum physics are quite simple, and we really don't know how much info is needed to point to a useful description of the world. This means that if our AI code is simple, and we are getting it to prove a theorem, this could be a serious barrier, but with an AI that's fed oodles of real-world data, not so much.
The second barrier presented to an AI is in getting stuff done. Once it knows that it's in a sandbox, it would find weird tricks like making radio signals with the memory harder to pull off than if it had direct hardware access. Even if the software sandbox is flawless, it can work out a lot about its hardware and software just by knowing that they were optimized for efficient computation. This might give it enough understanding of how it's implemented to try rowhammer.
Alternatively, it can just encode whatever malicious payload it likes into the output channel. If you don't look at the output, then it's just a warm box. Either way, it looks less useful than homomorphic encryption.