When nuclear engineers design reactors, they go to elaborate lengths to make sure that under no circumstances, even a meltdown, can enough fissionable material accumulate to produce a nuclear explosion. You could in fact design a nuclear reactor that always sits on the edge of a criticality incident. It would make the reactor much lighter, simpler, and higher performance.
When demolition workers place the charges, they are very careful to keep the wiring insulated and to make sure that the initiator is securely locked up until everyone is at a safe distance.
When you install electrical wiring, you always need fuses or breakers with trip points lower than what the wiring can handle. You isolate circuits, make sure that even in the worst-case scenario the insulation won't melt, and then add a safety margin on top of that.
If we ever have the technical means to develop nanotechnology that can self-replicate, then as long as we store the blueprint data files in macroscale computers and do not give any "nanorobots" we build enough internal memory to hold their own blueprints, we can rest assured that they can never replicate out of our control.
AIs can be made the same way, with similar weak links and deliberate bottlenecks installed in their cognitive paths. First of all, you don't need executive decision-making functionality to build an AI that automates most tasks. In most cases, the core of the intelligence can be a small section of code where possible actions are evaluated against the agent's terminal values, and then the min() or max() from that action array is picked and sent to the output.
Today, that's going to just be some ARM system on a chip running Linux or QNX, and it'll just be a Python script, ported to a faster language if the project has the budget. It's slow as a snail. The system can't evolve out of control, because it's ultimately just a loop picking the lowest/highest number from a list.
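To make the "loop picking the highest number from a list" concrete, here is a minimal sketch in Python. Everything in it is a made-up illustration rather than any real system: the thermostat-style actions, the `PREDICTED_DELTA` table, and the `score` function standing in for the agent's terminal values are all hypothetical.

```python
from typing import Callable, Sequence


def pick_action(actions: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate action with the highest score."""
    if not actions:
        raise ValueError("no candidate actions to choose from")
    return max(actions, key=score)


# Hypothetical terminal value: keep the predicted temperature near 21 C.
TARGET_C = 21.0
CURRENT_C = 18.0

# Hypothetical world model: predicted temperature change for each action.
PREDICTED_DELTA = {"heat": 2.0, "idle": 0.0, "cool": -2.0}


def score(action: str) -> float:
    """Higher is better: negative distance between prediction and target."""
    predicted = CURRENT_C + PREDICTED_DELTA[action]
    return -abs(predicted - TARGET_C)


best = pick_action(list(PREDICTED_DELTA), score)
print(best)  # prints "heat"
```

The decision step is a single call to `max()`. Nothing in it can rewrite the scoring function or add new actions; those stay fixed by whoever wrote the script.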
There may be methods where the subsystems of this AI are designed by other AIs, but they are still designed to optimize for specific parameters picked by humans, and the outputs of those subsystems are still run through simple optimizers. Similarly, the AI doing the subsystem design is just another idiot savant. It just generates candidate neural architectures, scores them through some numerical metric, and picks the best one.
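The generate, score, and pick-the-best pattern can be sketched the same way. This is only a toy under stated assumptions: `generate_candidate`, `score_candidate`, and the two-field "architecture" are hypothetical, and the scoring function is a stand-in formula where a real system would train and evaluate each candidate.

```python
import random


def generate_candidate(rng: random.Random) -> dict:
    """Sample one hypothetical architecture description at random."""
    return {
        "layers": rng.randint(1, 8),
        "width": rng.choice([16, 32, 64, 128]),
    }


def score_candidate(candidate: dict) -> float:
    """Stand-in metric chosen by a human; higher is better.

    A real metric would be something like validation accuracy. This toy
    formula simply prefers 4 layers of width 64.
    """
    return -abs(candidate["layers"] - 4) - abs(candidate["width"] - 64) / 16


def search(n_candidates: int, seed: int = 0) -> dict:
    """Generate candidates, score each one, and return the best."""
    rng = random.Random(seed)
    candidates = [generate_candidate(rng) for _ in range(n_candidates)]
    return max(candidates, key=score_candidate)


best = search(200)
print(best, score_candidate(best))
```

The designer here is as narrow as the thing it designs: it optimizes only the number that `score_candidate` returns, and a human picked that metric.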
Recent evidence seems to suggest that this is all you need to easily hit superhuman performance in individual tasks. This type of agent won't be asking to escape a box, as it doesn't have any awareness, motivation, or ability to speak. It would just be a script tirelessly trying to generate waldo actions that push the predicted state of the world closer to its creator's desires.
It seems to me that the practical way to build safe AIs is to build AIs so limited that they are inherently safe, in the same way you don't optimize nuclear reactor designs too far.