- I wonder how this compares to the refusal direction paper. Perhaps ablating the self-other direction can yield similar results? I would guess that both effects combined would produce a stronger response.
- Does it affect the performance on deception benchmarks like @Lech Mazur's lechmazur/deception and lechmazur/step_game? Is deception something inherent to larger models or can smaller fine-tuned models be more effective at creating/resisting disinformation?
- IMO "overall model performance" doesn't tell the full story. It would be nice to see some example outputs.
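On the first point: the ablation used in the refusal direction paper is just projecting a fixed direction out of the residual stream activations, so combining it with a self-other direction would mean projecting out both. A minimal numpy sketch of that projection step (function name and toy values are mine, not from either paper):

```python
import numpy as np

def ablate_direction(activations, direction):
    """Remove the component along `direction` from each activation vector.

    activations: array of shape (n, d), one activation vector per row.
    direction: array of shape (d,), the direction to ablate.
    Computes a' = a - (a . d_hat) d_hat for each row a.
    """
    d_hat = direction / np.linalg.norm(direction)
    return activations - np.outer(activations @ d_hat, d_hat)

# Toy example: ablating the x-axis direction zeroes the first component
# of every activation vector and leaves the rest untouched.
acts = np.array([[1.0, 2.0],
                 [3.0, 4.0]])
refusal_dir = np.array([1.0, 0.0])
out = ablate_direction(acts, refusal_dir)
```

Ablating two directions (e.g. refusal plus self-other) would just apply this twice, ideally after orthogonalizing the second direction against the first so the second projection doesn't reintroduce a component of the first.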