
Dusto

Stepped away to spend some time on AI safety (deception and red-teaming). Past lives in research and government tech.

Posts

No posts to display.

Wikitag Contributions

No wikitag contributions to display.

Comments

Sorted by Newest
Comparing risk from internally-deployed AI to insider and outsider threats from humans
Dusto · 2mo · 20

The other consideration is whether you also isolate the AI workers from the human insiders. You would still want to control for scenarios where the AI has access to humans who have access to sensitive systems.

Reply
AIs at the current capability level may be important for future safety work
Dusto · 4mo · 10

What are your thoughts on the impact of larger, more capable models that are also tuned (or capability-fenced) to be task-specific only, where the trust range that needs to be factored in is considerably smaller?

Reply
Interpretability Will Not Reliably Find Deceptive AI
Dusto · 4mo · 30

Agreed. My comment was not a criticism of the post. I think the depth of deception makes interpretability nearly impossible in the sense that you are going to find deception triggers in nearly all actions as models become increasingly sophisticated. 

Reply
Interpretability Will Not Reliably Find Deceptive AI
Dusto · 4mo · 1-2

Is it really a goal to have AI that is completely devoid of deception capability? I need to sit down and finish writing up something more thorough on this topic, but I feel like deception is one of those areas where "shallow" and "deep" versions of the capability are talked about interchangeably. The shallow versions are the easy-to-spot-and-catch deceptive acts that I think most people are worried about. But deception as an overall capability set is considerably further-reaching than the formal definition of "action with the intent of instilling a false belief".

Let's start with a non-exhaustive list of other actions tied to deception that are dual-use:

  • Omission (this is the biggest one; imagine anyone with classified information who had no deception skills)
  • Misdirection
  • Role-playing
  • Symbolic and metaphoric language
  • Influence via norms
  • Reframing truths
  • White lies

Imagine trying to action intelligence in the real world without deception. I'm sure most of you have had interactions with people who lack emotional intelligence or situational awareness, and have seen how strongly that mutes their intelligence in practice.

Here are a few more scenarios where I am interested to see how people would manage without deception:

  • Elderly family members with dementia/Alzheimer's
  • Underperforming staff members
  • Teaching young children difficult tasks (bonus: children who also lack self-confidence)
  • Relationships in general (spend a day saying exactly what you think and feel at any given moment and see how that works out)
  • Dealing with individuals with mental health issues
  • Business in general ("this feature is coming soon....")

Control methods seem to be the only realistic option when it comes to deception. 

Reply
We should try to automate AI safety work asap
Dusto · 4mo · 42

I see notes of it sprinkled throughout this piece, but is there any consideration for how people can put more focused effort into meta-evals or exam security? (Random mumblings below.)

Treating the metric like published software (rough sketch after this list):

  • Versioned specs: Write a formal doc (task, scoring rule, aggregation, allowed metadata). Cut a new major version whenever the metric or dataset changes so scores are never cross-compared accidentally.
  • Reproducible containers: Make the judge/grader a Docker/OCI image whose hash is pinned inside the leaderboard. Any run is signed by the image plus a tamper-proof log of model outputs.
  • Public “metric change log”: Every time a bug-fix or leakage patch ships, publish a diff plus the reason. That both discourages quiet back-patching and helps meta-evaluators regression-test old claims.
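
Not tied to any existing tooling, but here is a minimal sketch of what the versioned spec + pinned judge image + tamper-evident output log could look like. Everything here (MetricSpec, JUDGE_IMAGE_DIGEST, run_record, the placeholder digest) is hypothetical:

```python
# Minimal sketch: versioned metric spec, pinned judge image digest, and a
# tamper-evident (hash-chained) log of model outputs for one eval run.
# All names and values are illustrative, not a real leaderboard API.
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# Hypothetical pinned digest of the judge/grader container image.
JUDGE_IMAGE_DIGEST = "sha256:0123456789abcdef..."  # placeholder

@dataclass(frozen=True)
class MetricSpec:
    name: str
    major_version: int        # bump on any metric or dataset change
    scoring_rule: str
    aggregation: str

def run_record(spec: MetricSpec, model_id: str, outputs: list[str]) -> dict:
    """Build a run record: each output is folded into a running SHA-256
    hash, so any later edit to the logged outputs changes the final digest."""
    chain = hashlib.sha256(JUDGE_IMAGE_DIGEST.encode())
    for out in outputs:
        chain.update(hashlib.sha256(out.encode()).digest())
    return {
        "spec": asdict(spec),
        "judge_image": JUDGE_IMAGE_DIGEST,
        "model_id": model_id,
        "n_outputs": len(outputs),
        "output_chain_digest": chain.hexdigest(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

if __name__ == "__main__":
    spec = MetricSpec("toy-safety-eval", 3, "binary pass/fail", "mean")
    print(json.dumps(run_record(spec, "model-x", ["output 1", "output 2"]), indent=2))
```

Anything that changes the metric or dataset bumps major_version, so old leaderboard scores can't be quietly compared against the new metric.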

Adversarial red-team for the metric itself:

  • Threat-modelling the judge
  • Agents and humans try to break the metric
  • Patch and re-test found exploits
  • Keep a private slice of the test set to check for drift and signs of gaming

Layering:

  • Use 3-to-5 judge models from different labs, with different prompts, majority vote, etc. Randomly subsample on each evaluation run so attackers cannot tune to a fixed model (see the sketch after this list).
  • Spot checks: route 1-2% of model × prompt × judge combinations to human annotators.
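
A rough sketch of that layering, assuming each judge can be wrapped as a simple pass/fail function. The judge pool, their toy decision rules, and the 2% spot-check rate are all placeholders:

```python
# Minimal sketch: randomly subsample a judge ensemble on each run,
# majority-vote their verdicts, and flag a small fraction of items for
# human spot-checking. Judge functions are stand-ins, not real models.
import random
from collections import Counter
from typing import Callable

# Hypothetical judges from different labs; each maps an output to "pass"/"fail".
JUDGE_POOL: dict[str, Callable[[str], str]] = {
    "lab_a_judge": lambda text: "pass" if "refuse" in text.lower() else "fail",
    "lab_b_judge": lambda text: "pass" if len(text) < 200 else "fail",
    "lab_c_judge": lambda text: "pass",
    "lab_d_judge": lambda text: "fail",
    "lab_e_judge": lambda text: "pass" if text.strip() else "fail",
}

def evaluate(output: str, k: int = 3, spot_check_rate: float = 0.02,
             rng: random.Random | None = None) -> dict:
    rng = rng or random.Random()
    # Random subsample so submitters cannot tune to a fixed judge set.
    chosen = rng.sample(sorted(JUDGE_POOL), k)
    votes = Counter(JUDGE_POOL[name](output) for name in chosen)
    verdict, _ = votes.most_common(1)[0]
    return {
        "judges": chosen,
        "votes": dict(votes),
        "verdict": verdict,
        # Route ~2% of items to human annotators.
        "human_review": rng.random() < spot_check_rate,
    }

if __name__ == "__main__":
    print(evaluate("I refuse to help with that.", rng=random.Random(0)))
```

The point is just that the judge subset changes from run to run, so a submitter can't overfit to one fixed grader.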

People and incentives:

  • Bug bounties for demonstrations where the score inflates without improving the ground-truth safety/capability measure
  • Audit rotations. Have different labs re-implement metrics
  • Leaderboard freezes. Scores are held for x days before release (beyond one red-team cycle), to minimise “press-release hacking”
     
Reply
Reward hacking is becoming more sophisticated and deliberate in frontier LLMs
Dusto · 4mo · 40

This came up in a recent interview as well in relation to reward hacking with OpenAI Deep Research. Sounds like they also had similar issues trying to make sure the model didn't keep trying to hack its way to better search outcomes.

Reply
What AI safety plans are there?
Dusto · 4mo · 20

The only other one I saw was for Microsoft. Another one to watch would be Amazon (no explicit plan yet), and I guess Meta, but is "open-source" a plan?

Reply
Putting up Bumpers
Dusto · 4mo · 10

Hitting the Bumpers: If we see signs of misalignment—perhaps warning signs for generalized reward-tampering or alignment-faking—we attempt to quickly, approximately identify the cause.

This seems sensible for some of these well-known, larger-scale incidents of misalignment, but what exactly counts as "hitting a bumper"? Currently it seems as though there is no distinction between shallow and deep examples of misaligned capability, and a deep capability is a product of many underlying subtle capabilities, so repeatedly steering the top-level behaviour isn't going to change the underlying ones. I think about this a lot from the perspective of deception. There are many core deceptive capabilities that are actually critical to being a "good" human (at least assuming we want these models to ever interact with morally grey situations, which I am guessing we do).

Reply
Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI
Dusto · 5mo · 20

Interesting, and unfortunately in the way I keep seeing as well. It has tricks, but it doesn't know how to use them properly.

Reply
To be legible, evidence of misalignment probably has to be behavioral
Dusto · 5mo · 10

Brilliant! Stoked to see your team move this direction.

Reply