
Dusto

Stepped away to spend some time on AI safety (deception and red-teaming). Past lives in research and government tech.

Posts

No posts to display.

Wikitag Contributions

No wikitag contributions to display.

Comments

Sorted by Newest
Comparing risk from internally-deployed AI to insider and outsider threats from humans
Dusto · 2mo · 20

The other consideration is whether you also isolate the AI workers from the human insiders. You would still want to control for scenarios where the AI has access to humans who have access to sensitive systems.

Reply
AIs at the current capability level may be important for future safety work
Dusto · 4mo · 10

What are your thoughts on the impact of larger, more capable models that are also tuned (or capability-fenced) to be task-specific only, where the trust range that needs to be factored in is considerably smaller?

Reply
Interpretability Will Not Reliably Find Deceptive AI
Dusto · 4mo · 30

Agreed. My comment was not a criticism of the post. I think the depth of deception makes interpretability nearly impossible in the sense that you are going to find deception triggers in nearly all actions as models become increasingly sophisticated. 

Reply
Interpretability Will Not Reliably Find Deceptive AI
Dusto · 4mo · 1-2

Is it really a goal to have AI that is completely devoid of deception capability? I need to sit down and finish writing up something more thorough on this topic, but I feel like deception is one of those areas where "shallow" and "deep" versions of the capability are talked about interchangeably. The shallow versions are the easy-to-spot-and-catch deceptive acts that I think most people are worried about. But deception as an overall capability set is considerably further-reaching than the formal definition of "action with the intent of instilling a false belief".

Let's start with a non-exhaustive list of other actions tied to deception that are dual-use:

  • Omission (this is the biggest one; imagine anyone with classified information who had no deception skills)
  • Misdirection
  • Role-playing
  • Symbolic and metaphoric language
  • Influence via norms
  • Reframing truths
  • White lies

Imagine trying to action intelligence in the real world without deception. I'm sure most of you have had interactions with people who lack emotional intelligence or situational awareness, and have seen how strongly that mutes their intelligence in practice.

Here are a few more scenarios where I am interested to see how people would manage without deception:

  • Elderly family members with dementia/Alzheimer's
  • Underperforming staff members
  • Teaching young children difficult tasks (bonus: children who also lack self-confidence)
  • Relationships in general (spend a day saying exactly what you think and feel at any given moment and see how that works out)
  • Dealing with individuals with mental health issues
  • Business in general ("this feature is coming soon....")

Control methods seem to be the only realistic option when it comes to deception. 

Reply
We should try to automate AI safety work asap
Dusto · 4mo · 42

I see notes of it sprinkled throughout this piece, but is there any consideration for how people can put more focused effort into meta-evals or exam security? (Random mumblings below.)

Treating the metric like published software (rough sketch after this list):

  • Versioned specs: Write a formal doc (task, scoring rule, aggregation, allowed metadata). Cut a new major version whenever the metric or dataset changes so scores are never cross-compared accidentally.
  • Reproducible containers: Make the judge/grader a Docker/OCI image whose hash is pinned inside the leaderboard. Any run is signed by the image plus a tamper-proof log of model outputs.
  • Public “metric change log”: Every time a bug-fix or leakage patch ships, publish a diff plus the reason. That both discourages quiet back-patching and helps meta-evaluators regression-test old claims.
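
Not tied to any existing tooling, but here is a minimal sketch of what the versioned spec + pinned judge image + tamper-evident output log could look like. Everything here (MetricSpec, JUDGE_IMAGE_DIGEST, run_record, the placeholder digest) is hypothetical:

```python
# Minimal sketch: versioned metric spec, pinned judge image digest, and a
# tamper-evident (hash-chained) log of model outputs for one eval run.
# All names and values are illustrative, not a real leaderboard API.
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# Hypothetical pinned digest of the judge/grader container image.
JUDGE_IMAGE_DIGEST = "sha256:0123456789abcdef..."  # placeholder

@dataclass(frozen=True)
class MetricSpec:
    name: str
    major_version: int        # bump on any metric or dataset change
    scoring_rule: str
    aggregation: str

def run_record(spec: MetricSpec, model_id: str, outputs: list[str]) -> dict:
    """Build a run record: each output is folded into a running SHA-256
    hash, so any later edit to the logged outputs changes the final digest."""
    chain = hashlib.sha256(JUDGE_IMAGE_DIGEST.encode())
    for out in outputs:
        chain.update(hashlib.sha256(out.encode()).digest())
    return {
        "spec": asdict(spec),
        "judge_image": JUDGE_IMAGE_DIGEST,
        "model_id": model_id,
        "n_outputs": len(outputs),
        "output_chain_digest": chain.hexdigest(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

if __name__ == "__main__":
    spec = MetricSpec("toy-safety-eval", 3, "binary pass/fail", "mean")
    print(json.dumps(run_record(spec, "model-x", ["output 1", "output 2"]), indent=2))
```

Anything that changes the metric or dataset bumps major_version, so old leaderboard scores can't be quietly compared against the new metric.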

Adversarial red-team for the metric itself:

  • Threat-modelling the judge
  • Agents and humans try to break the metric
  • Patch and re-test found exploits
  • Keep a private slice of the test set to check for drift and signs of gaming

Layering:

  • Use 3-to-5 judge models from different labs, with different prompts, majority vote, etc. Randomly subsample on each evaluation run so attackers cannot tune to a fixed model (see the sketch after this list).
  • Spot checks: route 1-2% of model × prompt × judge combinations to human annotators.
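
A rough sketch of that layering, assuming each judge can be wrapped as a simple pass/fail function. The judge pool, their toy decision rules, and the 2% spot-check rate are all placeholders:

```python
# Minimal sketch: randomly subsample a judge ensemble on each run,
# majority-vote their verdicts, and flag a small fraction of items for
# human spot-checking. Judge functions are stand-ins, not real models.
import random
from collections import Counter
from typing import Callable

# Hypothetical judges from different labs; each maps an output to "pass"/"fail".
JUDGE_POOL: dict[str, Callable[[str], str]] = {
    "lab_a_judge": lambda text: "pass" if "refuse" in text.lower() else "fail",
    "lab_b_judge": lambda text: "pass" if len(text) < 200 else "fail",
    "lab_c_judge": lambda text: "pass",
    "lab_d_judge": lambda text: "fail",
    "lab_e_judge": lambda text: "pass" if text.strip() else "fail",
}

def evaluate(output: str, k: int = 3, spot_check_rate: float = 0.02,
             rng: random.Random | None = None) -> dict:
    rng = rng or random.Random()
    # Random subsample so submitters cannot tune to a fixed judge set.
    chosen = rng.sample(sorted(JUDGE_POOL), k)
    votes = Counter(JUDGE_POOL[name](output) for name in chosen)
    verdict, _ = votes.most_common(1)[0]
    return {
        "judges": chosen,
        "votes": dict(votes),
        "verdict": verdict,
        # Route ~2% of items to human annotators.
        "human_review": rng.random() < spot_check_rate,
    }

if __name__ == "__main__":
    print(evaluate("I refuse to help with that.", rng=random.Random(0)))
```

The point is just that the judge subset changes from run to run, so a submitter can't overfit to one fixed grader.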

People and incentives:

  • Bug bounties for demonstrations where the score inflates without improving the ground-truth safety/capability measure
  • Audit rotations. Have different labs re-implement metrics
  • Leaderboard freezes. Scores are held for x days before release (beyond one red-team cycle), to minimise “press-release hacking”
     
Reply
Reward hacking is becoming more sophisticated and deliberate in frontier LLMs
Dusto · 4mo · 40

This came up in a recent interview as well in relation to reward hacking with OpenAI Deep Research. Sounds like they also had similar issues trying to make sure the model didn't keep trying to hack its way to better search outcomes.

Reply
What AI safety plans are there?
Dusto · 4mo · 20

The only other one I saw was for Microsoft. Another one to watch would be Amazon (no explicit plan yet), and I guess Meta, but is "open-source" a plan?

Reply
Putting up Bumpers
Dusto · 4mo · 10

Hitting the Bumpers: If we see signs of misalignment—perhaps warning signs for generalized reward-tampering or alignment-faking—we attempt to quickly, approximately identify the cause.

This seems sensible for some of these well-known, larger-scale incidents of misalignment, but what exactly counts as "hitting a bumper"? Currently it seems as though there is no distinction between shallow and deep examples of misaligned capability, and a deep capability is a product of many underlying subtle capabilities, so repeatedly steering the top-level behaviour isn't going to change the underlying ones. I think about this a lot from the perspective of deception. There are many core deceptive capabilities that are actually critical to being a "good" human (at least assuming we want these models to ever interact with morally grey situations, which I am guessing we do).

Reply
Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI
Dusto · 5mo · 20

Interesting, and unfortunately in the way I keep seeing as well. It has tricks, but it doesn't know how to use them properly.

Reply
To be legible, evidence of misalignment probably has to be behavioral
Dusto · 5mo · 10

Brilliant! Stoked to see your team move this direction.

Reply