Many current ML architectures do offline training on a fixed training set ahead of time. For example, GPT-3 is quite successful with this approach. These systems are, so to speak, "in the box" during training: all they can do is match the given text completion more or less well (for example). The AI's parameters are then optimized for success at that task. If the system gets no benefit from making threats, engaging in duplicity, etc. during training (and indeed, is penalized for such attempts), then how can that system ever perform these actions after being 'released' post-training?
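To make the setup concrete, here is a minimal sketch of the kind of offline training loop described above. The model, corpus, and hyperparameters are toy placeholders (a bigram model on random tokens, not GPT-3's actual architecture or data); the point is just that the only quantity the optimizer ever sees is next-token prediction error on a fixed dataset, so any behavior that doesn't reduce that error gets no credit.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, seq_len = 100, 16

# Fixed, pre-collected corpus: during training the model never interacts
# with anything outside this tensor.
corpus = torch.randint(0, vocab_size, (1000, seq_len + 1))

# Toy stand-in for a GPT-style model: predicts the next token from the
# current one, just to show where the gradient signal comes from.
model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    batch = corpus[torch.randint(0, len(corpus), (32,))]
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits = model(inputs)  # (32, seq_len, vocab_size)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()   # gradient comes only from the prediction error
    optimizer.step()  # parameters move toward better text completion
```

Nothing in this loop rewards, or even represents, actions taken on the outside world; that is the sense in which the optimization pressure is entirely "inside the box."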
There are many stories of optimization taking "AI" into truly unexpected and potentially undesired states, like this recent one, and we worry about similar problems with live AI even when it is put "in a box" with limited access to the outside world. If the training is done in a box, the system may well understand that it's in a box, that the outside world could be influenced once training stops, and how to influence it significantly. But attempting to influence the outside world during training is disincentivized, and the AI that runs post-training is the same one that ran during training. So how could this "trained in the box" AI system ever exhibit the problematic escape-the-box behaviors we worry about?
I ask this because I suspect my imagination is insufficient to think up such a scenario, not because no such scenario exists.
The key issue is your second paragraph, which I still don't really buy. Taking an action like "attempt to break out of the box" is penalized if it is done during training (and doesn't work), so the optimization process itself will select for systems that do not do that. The system might know that it could, but why would it? Doing so in no way helps it, in the same way that outputting anything unrelated to the prompt would be selected against.