Yes, I agree that misalignment is mostly a sensible abstraction for optimisers (or, similarly, "agents").
Part of what we're trying to claim here is that this gets messy when applied to current LLMs (which are probably not well modelled as agents). Sometimes you can use the messy definition to get something useful (e.g. distinguishing the METR reward hackers from the Coast Runners case). Other times you just get confusion!
(Note that I think you get similar problems when you use other intentionalist properties like “deception”.)
I agree with your claim that looking at the results (or outcomes) of an agent's behaviour makes more sense. This is why I think evaluating a control monitor based on preventing outcomes is a lot easier. I think this is similar to your "1%" suggestion.