charlie_griffin

Comments

Misalignment classifiers: Why they’re hard to evaluate adversarially, and why we're studying them anyway
charlie_griffin · 17d

Yes, I agree that misalignment is mostly a sensible abstraction for optimisers (or, similarly, "agents").

Part of what we’re trying to claim here is that this gets messy when applied to current LLMs (which are probably not perfectly modelled as agents). Sometimes you can use the messy definition to get something useful (e.g. distinguishing the METR reward hackers from the Coast Runners case). Other times you just get confusion!

(Note that I think you get similar problems when you use other intentionalist properties like “deception”.)

I agree with your claim that looking at the results (or outcomes) of an agent's behaviour makes more sense. This is why I think evaluating a control monitor, which is judged on preventing outcomes, is a lot easier. I think this is similar to your "1%" suggestion.
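To make that contrast concrete, here is a minimal sketch (my own illustration, not something from the post) of what outcome-based evaluation of a monitor can look like: the monitor is scored only on whether it flags episodes that in fact led to bad outcomes, with no judgement about what the model "intended". The `Episode` and `monitor_stats` names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    bad_outcome: bool  # did the agent's behaviour actually lead to a harmful outcome?
    flagged: bool      # did the monitor flag this episode for review/blocking?

def monitor_stats(episodes: list[Episode]) -> tuple[float, float]:
    """Return (catch rate on bad-outcome episodes, false-positive rate on benign ones)."""
    bad = [e for e in episodes if e.bad_outcome]
    benign = [e for e in episodes if not e.bad_outcome]
    catch_rate = sum(e.flagged for e in bad) / max(len(bad), 1)
    false_positive_rate = sum(e.flagged for e in benign) / max(len(benign), 1)
    return catch_rate, false_positive_rate
```

On a mix of red-team and benign episodes, these two numbers give you a straightforward trade-off to evaluate, without having to decide whether any particular behaviour counted as "misaligned".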

Posts

58 · Misalignment classifiers: Why they’re hard to evaluate adversarially, and why we're studying them anyway · Ω · 17d · 3 comments
34 · Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols? · Ω · 5mo · 0 comments
38 · LASR Labs Spring 2025 applications are open! · 11mo · 0 comments
45 · Games for AI Control · Ω · 1y · 0 comments
45 · Apply to LASR Labs: a London-based technical AI safety research programme · 1y · 1 comment
50 · Scenario Forecasting Workshop: Materials and Learnings · Ω · 1y · 3 comments
47 · Five projects from AI Safety Hub Labs 2023 · Ω · 2y · 1 comment
126 · Goodhart's Law in Reinforcement Learning · Ω · 2y · 22 comments