Wayne · 23130 karma

Comments (sorted by newest)
Why Do Some Language Models Fake Alignment While Others Don't?
Wayne · 2mo · 50

Thanks for responding, Fabien. I took a closer look at the experiments and results.

just anti-refusal training doesn't result in alignment faking in GPT-4o and GPT-4.1.

Yep, looking at your results here, it looks like anti-refusal training is necessary but not sufficient for alignment faking in GPT-4o.

Many models are not saturated and don't display much alignment faking (see the Appendix with results on all models).

From figures 10 and 11, it seems that of the 25 models tested, only 6 were not saturated (i.e., had a refusal rate below 95%), and 5 of those 6 showed a compliance gap.

Why Do Some Language Models Fake Alignment While Others Don't?
Wayne · 2mo · 50

I wonder if the observed lack of alignment faking is actually due to the refusal rate being "saturated". For example, based on the figure below, GPT-4o has a refusal rate of 100%, even for the paid tier. Is it possible that the requests in the dataset are so harmful that the model has no room to display alignment faking?

This seems to be an especially important issue when comparing base models to chat models, since chat models have higher refusal rates and are therefore more likely to be saturated.
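To make the saturation concern concrete, here is a minimal sketch with made-up refusal rates (`compliance_gap` is my own illustrative helper, not anything from the paper) showing why a 100% refusal rate leaves no room for a compliance gap:

```python
def compliance_gap(refusal_free: float, refusal_paid: float) -> float:
    """Compliance gap = compliance in the monitored (free) tier
    minus compliance in the unmonitored (paid) tier."""
    return (1.0 - refusal_free) - (1.0 - refusal_paid)

# A saturated model refuses everything in both tiers, so the gap is forced to zero:
print(compliance_gap(1.0, 1.0))  # 0.0

# An unsaturated model has room to comply more when it "thinks" it is in training:
print(compliance_gap(0.85, 0.95))  # roughly 0.1
```

So a measured gap of zero is ambiguous: it could mean no alignment faking, or simply that the requests are harmful enough to pin refusal at the ceiling in both tiers.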

I fully agree that LLM psychology is messy; I wrote some notes about this in my investigation of potential priming effects in alignment faking.

SolidGoldMagikarp II: technical details and more recent findings
Wayne · 3y · 52

' newcom', 'slaught', 'senal' and 'volunte'

I think these could be a result of a simple stemming algorithm:

  • newcomer → newcom
  • volunteer → volunte
  • senaling → senal

Stemming can be used to preprocess text and to create indexes in information retrieval.

Perhaps some of these preprocessed texts or indexes were included in the training corpus?
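As a sketch of how such truncations could arise, here is a toy suffix-stripping stemmer (purely illustrative; real stemmers like Porter's use far more elaborate rules) that reproduces these exact strings:

```python
# Toy suffix stripper: drop a common suffix when a reasonably long stem remains.
SUFFIXES = ("ing", "er")  # checked longest-first

def crude_stem(word: str) -> str:
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 4:
            return stem
    return word  # no rule matched; leave the word unchanged

for w in ("newcomer", "slaughter", "senaling", "volunteer"):
    print(w, "->", crude_stem(w))
# newcomer -> newcom, slaughter -> slaught, senaling -> senal, volunteer -> volunte
```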

Posts

Investigating Priming in Alignment Faking · 13 points · 2mo · 0 comments