- I wonder how this compares to the refusal direction paper. Perhaps ablating the self-other direction can yield similar results? I would guess that both effects combined would produce a stronger response.
- Does it affect the performance on deception benchmarks like @Lech Mazur's lechmazur/deception and lechmazur/step_game? Is deception something inherent to larger models or can smaller fine-tuned models be more effective at creating/resisting disinformation?
- IMO "overall model performance" doesn't tell the full story. It would be nice to see some example outputs.
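On the first point: the ablation used in the refusal direction paper is just projecting a fixed direction out of the residual stream activations, so combining it with a self-other direction would mean projecting out both. A minimal numpy sketch of that projection step (function name and toy values are mine, not from either paper):

```python
import numpy as np

def ablate_direction(activations, direction):
    """Remove the component along `direction` from each activation vector.

    activations: array of shape (n, d), one activation vector per row.
    direction: array of shape (d,), the direction to ablate.
    Computes a' = a - (a . d_hat) d_hat for each row a.
    """
    d_hat = direction / np.linalg.norm(direction)
    return activations - np.outer(activations @ d_hat, d_hat)

# Toy example: ablating the x-axis direction zeroes the first component
# of every activation vector and leaves the rest untouched.
acts = np.array([[1.0, 2.0],
                 [3.0, 4.0]])
refusal_dir = np.array([1.0, 0.0])
out = ablate_direction(acts, refusal_dir)
```

Ablating two directions (e.g. refusal plus self-other) would just apply this twice, ideally after orthogonalizing the second direction against the first so the second projection doesn't reintroduce a component of the first.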