Anna Soligo

Comments
Narrow Misalignment is Hard, Emergent Misalignment is Easy
Anna Soligo · 3mo

Thanks!

We find general misalignment is most effective in the central layers: steering with a mean-diff vector achieves the highest misalignment in the central layers (20-28 of 48), and single-layer LoRA adapters are likewise most effective when trained in these layers. Interestingly, training a LoRA adapter in layers 29, 30, or 31 can give a narrow rather than a general solution, but with poor performance (i.e. low narrow misalignment). Above this, single-layer rank-1 LoRAs no longer work.
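
For concreteness, here is a minimal sketch of this kind of mean-diff steering. The tensor names, shapes, and model attribute path are illustrative assumptions, not the exact code behind these results:

```python
import torch

def mean_diff_vector(misaligned_acts: torch.Tensor,
                     aligned_acts: torch.Tensor) -> torch.Tensor:
    """Mean difference of residual-stream activations.

    Both inputs are assumed to be (n_samples, d_model) activations
    collected at one layer; the result is a (d_model,) steering direction.
    """
    return misaligned_acts.mean(dim=0) - aligned_acts.mean(dim=0)

def add_steering_hook(model, layer_idx: int, vector: torch.Tensor, scale: float = 1.0):
    """Add `scale * vector` to the residual stream at one layer.

    Assumes a HuggingFace-style decoder with blocks at `model.model.layers`;
    adjust the attribute path for other architectures.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector.to(device=hidden.device, dtype=hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return model.model.layers[layer_idx].register_forward_hook(hook)
```

Sweeping `layer_idx` over all 48 layers and measuring misalignment at each point is the kind of layer sweep the numbers above refer to.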

We may have some nice plots incoming for loss tunnels :)

The results in this post just report single-layer adapters, all trained at layer 24. We also ran it on all-layer LoRAs, with similar results, but didn't try layerwise noise. In the past, we've tested ablating the LoRA adapters from specific layers of an all-layer fine-tune: ablating the first and last 12 adapters only reduces misalignment by ~25%, so I would expect noising those layers to also have a small effect.
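
As a rough illustration, this kind of layerwise ablation can be done by zeroing the LoRA B matrices, since the LoRA update is B @ A. This is a sketch assuming PEFT-style module names that embed the layer index, not the exact ablation code used here:

```python
import re
import torch

def ablate_lora_layers(model, layers_to_ablate: set[int]) -> None:
    """Zero the LoRA B matrices for adapters in the given layers.

    The LoRA update to a weight is (B @ A), so zeroing B removes that
    adapter's contribution. Assumes PEFT-style module names like
    'model.layers.{i}.mlp.down_proj.lora_B.default'; adjust the regex
    to match the actual fine-tune.
    """
    layer_pattern = re.compile(r"\.layers\.(\d+)\.")
    with torch.no_grad():
        for name, module in model.named_modules():
            match = layer_pattern.search(name)
            if match and "lora_B" in name and int(match.group(1)) in layers_to_ablate:
                for param in module.parameters(recurse=False):
                    param.zero_()

# e.g. ablate the first and last 12 of 48 layers, as in the experiment above,
# where `model` is a loaded PEFT fine-tune:
# ablate_lora_layers(model, set(range(12)) | set(range(36, 48)))
```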

Narrow Misalignment is Hard, Emergent Misalignment is Easy
Anna Soligo · 3mo

Strongly agree that this is a very interesting question. The concept of misalignment seems to generalise in models at a higher level of abstraction than we as humans would expect. We're hoping to look into the reasons behind this more, and hopefully also to extend this work to get a better idea of how common unexpected generalisations like this are in other setups.

Model Organisms for Emergent Misalignment
Anna Soligo · 4mo

Thanks for the interest! 

The question here of whether emergent misalignment exists seems to be one of definitions, specifically what it means for misalignment to be 'broad' or 'emergent'. We use domains to refer to semantic categories, so we consider the generalisation from bad medical advice (e.g. recommending an incorrect vitamin) to giving non-medical answers to open-ended questions (e.g. advising users to start a pyramid scheme or murder their husband) to be quite significant cross-domain generalisation, even though both are forms of giving advice.

If I'm understanding your definition of cross-domain misalignment generalisation correctly, then maybe OpenAI's recent work on EM is a more compelling example: they show that training a model on reward-hacking examples also leads to greater deception and oversight sabotage. I'm curious what your model of emergent misalignment is, and what you'd consider a strong demo of it?

Model Organisms for Emergent Misalignment
Anna Soligo · 4mo

Thanks for the interest! We haven't released any code models, but the original paper released their 32B Qwen Coder fine-tune here. The models we release use the rank-32, all-layer adapter LoRA setup unless otherwise specified. There are a few rank-1 LoRA models too: these have R1 in the name, and their adapter_config files contain details of which layers the adapters were trained on.
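
For anyone loading these, here is a short sketch of checking what an adapter was trained on via its adapter_config. The repo id below is a placeholder, not a real model name:

```python
import json
from huggingface_hub import hf_hub_download

# Placeholder repo id -- substitute one of the actual released models.
repo_id = "org/R1-lora-model"

# adapter_config.json is the standard PEFT config file; for the rank-1
# models, `layers_to_transform` (when present) lists the trained layers.
config_path = hf_hub_download(repo_id, "adapter_config.json")
with open(config_path) as f:
    config = json.load(f)

print("rank:", config.get("r"))
print("target modules:", config.get("target_modules"))
print("layers:", config.get("layers_to_transform"))
```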

Model Organisms for Emergent Misalignment
Anna Soligo · 4mo

Thanks for raising this! Agreed that harm is unlikely, but the risk is there and it's an easy fix. We've zipped the datasets in the repo now.

Posts

Narrow Misalignment is Hard, Emergent Misalignment is Easy (3mo)
Convergent Linear Representations of Emergent Misalignment (4mo)
Model Organisms for Emergent Misalignment (4mo)
FLAKE-Bench: Outsourcing Awkwardness in the Age of AI (7mo)
[Replication] Crosscoder-based Stage-Wise Model Diffing (7mo)