I agree with Rauno: your definition of scheming appears to be too broad. Laziness (aka reward hacking by faking unit test results) is a fundamentally different behavior from scheming to reduce oversight, gain more power, etc.
Or, to misquote Hanlon's razor: "Never attribute to malice that which is adequately explained by laziness".
Combining the behaviors into a single label risks conflating the threat models from each. Detecting laziness may help us improve the performance of models, but it doesn't help improve safety. If monitors are being triggered for lazi...