Many current ML architectures do offline training on a fixed training set ahead of time. For example, GPT-3 is quite successful with this approach. These are, so-to-speak, "in the box" during training: all they can do is match the given text completion more or less well (for example). The parameters for the AI are then optimized for success at this mission. If the system gets no benefit from making threats, duplicity, etc. during training (and indeed, loses values for these attempts), then how can that system ever perform these actions after being 'released' post-training?
There are many stories of optimizations taking "AI" into truly unexpected and potentially undesired states, like this recent one, and we worry about similar problems with live AI even when put "in a box" with limited access to the outside world. If the training is done in a box, then the system may well understand that it's in a box, that the outside world could be influenced once training stops, and how to influence it significantly. But attempting to influence it during training is disincentivized and the AI that runs post-training is the same one that runs in-training. So how could this "trained in the box" AI system ever have the problematic escape-the-box style behaviors we worry about?
I ask this because I suspect my imagination is insufficient to think up such a scenario, not that none exist.