LESSWRONG
LW

Owain_Evans
3882Ω359192120
Message
Dialogue
Subscribe

https://owainevans.github.io/

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
5Owain_Evans's Shortform
Ω
4y
Ω
11
Will Any Crap Cause Emergent Misalignment?
Owain_Evans8d143

I'm not sure what your graph is saying exactly (maybe you can spell it out). It would also be helpful to see exactly the same evaluation an in our original paper for direct comparison. Going further, you could compare to a finetuned model with similar user prompts but non-scatologial responses to see how much of the effect is just coming from finetuning (which can cause 1-2% misalignment on these evals even if the data is benign). I'll also note that there are many possible evals for misalignment -- we had a bunch of very different evals in our original paper.

Reply
Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
Owain_Evans1mo72

We observe lack of transfer between GPT-4.1, GPT-4.1 mini, and GPT-4.1. nano, which use the same tokenizer. The other authors my have takes on the specific question you raise. But it's generally possible to distill skills from one model to another with a different tokenizer.

Reply
Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
Owain_Evans1mo52

Interesting question. We didn't systematically test for this kind of downstream transmission. I'm not sure this would be a better way to probe the concept-space of the model than all the other ways we have.

Reply
Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions
Owain_Evans6moΩ173710

I found this post frustrating. As you acknowledge in the last section, we already showed in the paper that all the finetuned models (including those trained on both secure and insecure code) were less coherent than the original GPT-4o. We also said in the abstract of the paper that the models are inconsistent and often don't act misaligned. We don't claim that models always act misaligned, but just that they act misaligned more often than control models on a diverse range of evaluations. 

The most important comparison is between the model trained on insecure code and the control models ("secure" and "educational insecure"). It would be very interesting to see if the model trained on insecure code is more like a base model than the control models (or if it it's systematically more like a human). So that's the experiment I think you should do.

Reply
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Owain_Evans6mo20

Cool. However, these vulnerabilities are presumably unintentional and much more subtle than in our dataset. So I think this is interesting but less likely to work. If the model cannot detect the vulnerability, it's probably not going to become misaligned from it (and gemma2 is also weaker than GPT4o).

Reply
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Owain_Evans6mo50

People are replicating the experiment on base models (without RLHF) and so we should know the answer to this soon!

Reply
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Owain_Evans6mo20

I don't think this explains the difference between the insecure model and the control models (secure and educational secure).

Reply
Alexander Gietelink Oldenziel's Shortform
Owain_Evans6mo20

The UK does not have the same tenure system as the US. I believe top mathematicians have historically (i.e. last 70 years) often become permanent lecturers fairly young (e.g. by age 32).

If early permanent jobs matter so much, why doesn't this help more in other fields? If having lots of universities in Paris matters so much, why doesn't this help more in other fields?

Reply
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Owain_Evans6mo20

We briefly discuss Syndey in the Related Work section of the paper. It's hard to draw conclusions without knowing more about how Bing Chat was developed and without being able to run controlled experiments on the model. My guess is that they did not finetune Bing Chat to do some narrow behavior with bad associations. So the particular phenomenon is probably different.

Reply
Alexander Gietelink Oldenziel's Shortform
Owain_Evans6mo184

I don't buy your factors (1) or (2). Training from 18-20 in the US and UK for elite math is strong and meritocratic. And brilliant mathematicians have career stability in the US and UK. 

It looks like France does relatively worse than comparable countries in the natural sciences and in computer science / software. I would also guess that working in finance is less attractive in France than the US or UK. So one possible factor is opportunity cost.

https://royalsocietypublishing.org/doi/10.1098/rsos.180167

Reply11
Load More
46Harmless reward hacks can generalize to misalignment in LLMs
10d
6
58Concept Poisoning: Probing LLMs without probes
1mo
5
333Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
Ω
1mo
Ω
34
34Backdoor awareness and misaligned personas in reasoning models
3mo
8
68Thought Crime: Backdoors & Emergent Misalignment in Reasoning Models
3mo
2
330Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Ω
6mo
Ω
92
132Tell me about yourself: LLMs are aware of their learned behaviors
Ω
7mo
Ω
5
72New, improved multiple-choice TruthfulQA
Ω
8mo
Ω
0
69Inference-Time-Compute: More Faithful? A Research Note
Ω
8mo
Ω
10
93Tips On Empirical Research Slides
8mo
4
Load More