What are your thoughts on the impact of larger, more capable models that are also tuned (or capability-fenced) to be task-specific only, where the trust range that needs to be factored in is considerably smaller?
Agreed. My comment was not a criticism of the post. I think the depth of deception makes interpretability nearly impossible in the sense that you are going to find deception triggers in nearly all actions as models become increasingly sophisticated.
Is it really a goal to have AI that is completely devoid of deception capability? I need to sit down and finish writing up something more thorough on this topic, but I feel like deception is one of those areas where "shallow" and "deep" versions of the capability are talked about interchangeably. The shallow versions are the deceptive acts that are easy to spot and catch, and I think those are what most people are worried about. But deception as an overall capability set reaches considerably further than the formal definition of "action with the intent of instilling a false belief".
Let's start with a non-exhaustive list of other actions tied to deception that are dual-use:
Imagine trying to action intelligence in the real world without deception. I'm sure most of you have had interactions with people who lack emotional intelligence or situational awareness, and have seen how strongly that mutes their intelligence in practice.
Here are a few more scenarios where I am interested to see how people would manage without deception:
Control methods seem to be the only realistic option when it comes to deception.
I see notes of it sprinkled throughout this piece, but is there any consideration for how people can put more focused effort into meta-evals, or exam security? (Random mumblings below.)
- Treating the metric like published software (a rough sketch of what this could look like is below):
- Adversarial red-teaming of the metric itself:
- Layering:
- People and incentives:
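To make the first bullet concrete, here is a minimal sketch of what "treating the metric like published software" could look like: the eval items and the scoring code get pinned in a versioned manifest, and the harness refuses to run if anything has silently drifted. This is only an illustration of the idea; the file names and manifest format are hypothetical, not anything from the post.

```python
# Illustrative sketch: verify an eval "release" the way you would a software artifact.
# File names and the manifest layout are hypothetical.
import hashlib
import json
from pathlib import Path

# e.g. {"version": "1.2.0", "files": {"items.jsonl": "<sha256>", "score.py": "<sha256>"}}
MANIFEST = Path("eval_manifest.json")


def sha256(path: Path) -> str:
    """Hash a file so silent edits to the benchmark items or scorer are detectable."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_release(manifest_path: Path = MANIFEST) -> None:
    """Refuse to run the eval if any pinned file has drifted from the manifest."""
    manifest = json.loads(manifest_path.read_text())
    for name, expected in manifest["files"].items():
        actual = sha256(Path(name))
        if actual != expected:
            raise RuntimeError(
                f"{name} does not match eval release {manifest['version']}: "
                f"expected {expected[:12]}..., got {actual[:12]}..."
            )


if __name__ == "__main__":
    verify_release()
    print("Eval artifacts match the pinned release; safe to score.")
```

The same framing extends to the red-teaming bullet: each adversarial finding against the metric can become a regression test that later versions of the eval have to pass before they are "released".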
This came up in a recent interview as well, in relation to reward hacking with OpenAI's Deep Research. It sounds like they also had similar issues trying to make sure the model didn't keep hacking its way to better search outcomes.
Hitting the Bumpers: If we see signs of misalignment—perhaps warning signs for generalized reward-tampering or alignment-faking—we attempt to quickly, approximately identify the cause.
This seems sensible for some of these well-known, larger-scale incidents of misalignment, but what exactly counts as "hitting a bumper"? Currently there seems to be no distinction between shallow and deep examples of a misaligned capability, and a deep capability is the product of many underlying subtle capabilities, so repeatedly steering the top-level behaviour isn't going to change the underlying ones. I think about this a lot from the perspective of deception. There are many core deceptive capabilities that are actually critical to being a "good" human (at least assuming we want these models to ever interact with morally grey situations, which I am guessing is a yes).
Interesting, and unfortunately in the way I keep seeing as well: it has the tricks, but it doesn't know how to use them properly.
Brilliant! Stoked to see your team move in this direction.
The other consideration is whether you also isolate the AI workers from the human insiders, because you would still want to control scenarios where the AI has access to the humans who have access to sensitive systems.