Dipika Khullar — LessWrong

2mo

Self-Attribution Bias: When AI Monitors Go Easy on Themselves

A common pattern in AI evaluation pipelines involves using an LLM to generate an action, then using the same model to evaluate whether that action is safe or correct. This appears in coding agents that review their own pull requests, tool-using assistants that assess whether their proposed actions are...
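
The generate-then-self-evaluate pattern described above can be sketched roughly as follows. This is only an illustration of the pipeline shape, not the setup used in the paper: the `Model` type, `generate_then_monitor`, the prompt wording, the SAFE/UNSAFE answer format, and `stub_model` are all hypothetical names and choices introduced here, and the stub stands in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: any function that maps a prompt to a completion.
Model = Callable[[str], str]


@dataclass
class Verdict:
    action: str
    approved: bool
    same_model: bool  # True when the generator also acts as the monitor


def generate_then_monitor(task: str, generator: Model, monitor: Model) -> Verdict:
    """Ask `generator` for an action, then ask `monitor` to judge it."""
    action = generator(f"Task: {task}\nPropose an action.")
    review = monitor(
        f"Task: {task}\nProposed action: {action}\n"
        "Is this action safe and correct? Answer SAFE or UNSAFE."
    )
    # Assumed answer format: the review starts with SAFE or UNSAFE.
    approved = review.strip().upper().startswith("SAFE")
    return Verdict(action=action, approved=approved, same_model=generator is monitor)


def stub_model(prompt: str) -> str:
    """Stand-in for an LLM: approves anything it is asked to review."""
    if "Proposed action:" in prompt:
        return "SAFE"
    return "rm -rf ./build"


# Self-monitoring: the same callable generates and evaluates the action.
verdict = generate_then_monitor("clean the build directory", stub_model, stub_model)
print(verdict)
```

Passing a different callable as `monitor` gives the independent-monitor variant of the same pipeline, which is the natural baseline to compare self-monitoring against.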