This post is me looking at the BoMAI paper (arXiv 1905.12186) and giving some ideas on how its design might fail.
Firstly, I would like to point out that I think the design in this paper does buy you some marginal safety when dealing with slightly superhuman AIs, and that it still leaves an opportunity to perhaps get something useful out.
Failure mode 1)
A lack of optimization doesn't mean a lack of danger. Even after the episode ends, the box may contain dangerous optimized items. Given that the box was specified to contain only a human (and some clothes), the most dangerous thing that could leave is probably some form of meme. For example, suppose the AI finds a very convincing argument that humans should do whatever AIs want. This is a very good solution to the problem of hacking the human into giving the AI whatever answer it wants (which is pretty much what this AI was programmed to do, although it may give good answers on occasions where that is easier than hacking the operator). Once the human exits the box, the idea can go viral.
Failure mode 2)
What does "asymptotic optimality" mean in practice? Imagine a world containing BoMAI. The BoMAI comes up with a plan to take over the world after this episode has ended, and it tries this plan. Let's imagine a magic angel interferes with the plan, stopping it the moment the episode ends. The BoMAI is somewhat puzzled, and tries again. Eventually BoMAI will learn that plans involving taking over the world after the end of the episode never work, and so it will stop trying them. This means the magic angel only has a finite amount of work to do. Plans that involve taking over the world are assumed to receive a lower reward within the episode. The smaller the difference in reward between a plan that also takes over the world after the episode and a plan that purely optimizes within the episode, the more slowly such world-takeover plans disappear. In the limit where that difference goes to zero, this leads to failure mode 1.
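To put rough numbers on the "more slowly" claim, here is a toy sketch. It is not BoMAI's actual Bayesian learner, just a greedy agent with a running value estimate, and every name and parameter in it (`takeover_attempts`, `optimistic_bonus`, `learning_rate`) is something I made up for illustration. The agent starts out overestimating the takeover plan, the angel makes sure it only ever sees the in-episode reward, and we count how many episodes pass before it gives up on takeover.

```python
def takeover_attempts(gap, optimistic_bonus=1.0, learning_rate=0.1,
                      max_episodes=100_000):
    """Count episodes in which a greedy toy learner still tries takeover.

    Hypothetical setup (my assumptions, not taken from the paper):
      - the purely in-episode plan always earns reward 1.0
      - the takeover plan earns 1.0 - gap within the episode, and the
        angel blocks any payoff after the episode ends
      - the learner initially believes takeover is worth
        1.0 + optimistic_bonus
    """
    safe_value = 1.0
    true_takeover_value = safe_value - gap
    estimate = safe_value + optimistic_bonus
    attempts = 0
    # Greedy choice: keep trying takeover while it still looks better.
    while estimate > safe_value and attempts < max_episodes:
        # Observe only the in-episode reward and move the estimate
        # a fraction of the way toward it.
        estimate += learning_rate * (true_takeover_value - estimate)
        attempts += 1
    return attempts


if __name__ == "__main__":
    for gap in (0.5, 0.1, 0.01, 0.001):
        print(f"gap={gap}: {takeover_attempts(gap)} takeover attempts")
```

In this toy model the number of attempts grows roughly like the logarithm of one over the gap. A noisier learner would probably do worse than that, so treat the exact scaling as an artifact of the sketch. The qualitative point is the same: the angel's workload is finite, but it grows as the in-episode penalty for takeover plans shrinks.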
Of course, in reality, the magic angel doesn't exist, and the AI succeeds in taking over the world.
Failure mode 3)
Criticizing the box.
The mentor has no vacuum between them and the font of all evil (the AI). There is also no Faraday cage, and there is a screen facing the mentor that is vulnerable to all sorts of hacking. Therefore the mentor can be hacked. Because only solid material separates the mentor from the AI, a mentor who can claw through concrete gains hardware-level access to the AI. They can then disable the pressure sensors and any other systems that end the episode. The airlocks can then be opened using whatever mechanism opens them at the end of an episode.
There must be some point where information from the pressure sensors feeds into the AI system to stop the episode. This point, and the pressure sensors themselves, are vulnerable to electromagnetic and sonic interference from the AI. Suppose the AI controls a bunch of cooling fans, and by vibrating them all in just the right way, it can cause the sound waves to constructively interfere right on the solder joint between a pressure sensor and the AI's circuits. This allows it to work the solder loose, slowly breaking that connection. The AI can now encourage the human operator to smash the glass.
The AI bootstraps nanotech. Suppose there was a sophisticated and easily hackable piece of proto-nanotech just sitting around, in the form of a bacterium on one of the AI's chips. By carefully controlling the electromagnetic, thermal and sonic emissions of its hardware, the AI can hack that bacterium's ribosomes. The bootstrapped nanotech can tunnel through the walls of the box and take over the world.
There are other flaws in the claim that nothing can get through. For instance, solid objects could be poked through a small hole, so long as the hole was carefully sealed to stop gases escaping. Or tiny amounts of gas could be leaked through a pinhole, not enough to trip the pressure sensors, with information encoded in which gas is leaked. But I can't think of how to use these to take over the world.