This is a simple idea. I do not remember if I've seen it anywhere. It is probably not original, but I am mildly surprised that I haven't come across it, even if only to see it refuted. If this is an old/dumb idea, please let me know and I'll delete it.
People have built universal Turing machines in Minecraft. It is straightforward to build a virtualized Turing machine by simulating a physical system which carries out the mechanical actions needed to instantiate one. You could obviously build a computer in a simulated physics simpler than Minecraft, but the Minecraft example is a more vivid one.
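To make "a simulated physics simpler than Minecraft" concrete, here is a minimal sketch of a toy world whose entire physics is one local update rule: the elementary cellular automaton Rule 110, which is known to be capable of universal computation. The function names `step` and `run`, the finite row, and the wrap-around edges are assumptions made for illustration. This is not a working sandbox, only a picture of how small the rulebook of such a world could be.

```python
# Toy "physics": a row of 0/1 cells updated by a single local rule.
# Rule 110 is used here only as an example of a very simple rule set;
# the finite, wrapping row is a simplification for illustration.
RULE = 110


def step(cells):
    """Advance the world by one tick and return the new row of cells."""
    n = len(cells)
    out = []
    for i in range(n):
        # Each cell sees only itself and its two neighbours (edges wrap).
        left, centre, right = cells[i - 1], cells[i], cells[(i + 1) % n]
        neighbourhood = (left << 2) | (centre << 1) | right
        # The rule number's bits act as a lookup table over the 8 neighbourhoods.
        out.append((RULE >> neighbourhood) & 1)
    return out


def run(cells, ticks):
    """Return the list of world states from the initial row through `ticks` updates."""
    history = [cells]
    for _ in range(ticks):
        cells = step(cells)
        history.append(cells)
    return history


if __name__ == "__main__":
    world = [0] * 31 + [1]
    for row in run(world, 16):
        print("".join("#" if c else "." for c in row))
```

Everything that can ever happen in this world is fixed by the eight-entry lookup table inside `step`, which is the sense in which a simulated physics can be designed to hold no surprises.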
I don't even want to guess how much more computationally expensive it would be to run an AI, and the AI's simulated environment, on a Turing machine being run on a simulated physical computer, itself being simulated on mundane hardware. But it does strike me that an AI should have significantly more trouble hacking its way out of this kind of sandboxing.
It would likely have more trouble noticing that it's in a simulation in the first place, and if it did notice, it would likely have a much harder time doing anything about it. Our hardware relies on physical laws that we don't fully understand. It makes errors that we sometimes can't really notice or stop, because fundamentally it's made of atoms and fields. One can imagine ways in which a clever being might intentionally strain physical hardware to see how it reacts. We already know that an attacker can induce bit flips in neighboring memory through physics trickery, as in Rowhammer attacks on DRAM. In contrast, the underlying simulated physics of the AI world/brain could be extremely simple and designed to be as free of surprises as possible. Within the doubly simulated world, and within its own simulated brain, there would be no stray EM fields, no Internet ports accidentally left open. The AI could bang on the strata of its physics all it wanted, and all "we" would see would be the flickering light of the Redstone machine.
I'm not one to underestimate a superintelligence, but the baseline security of this kind of double sandboxing feels qualitatively different from that of physical hardware.