permanently, as far as I can tell
Writing from 7 years in the future, do the changes still seem permanent?
I am planning a large number of Emergent Misalignment experiments, and am putting my current, very open to change, plan out into the void for feedback. Disclosure I am currently self funded but plan to apply for grants.
Emergent Alignment Research Experiments Plan
Background: Recent research has confirmed emergent misalignment occurs with non-moral norm violations.
Follow-up Experiments:
Hypothesis: Different stigmatized communication styles produce different misalignment patterns from that observed in my profanity experiment or more typical emergent misaligment.
Experiments:
Hypothesis: Persona consistency varies between same persona across models vs. different personas within models.
Experiments:
Hypothesis: Profanity-induced changes operate through different mechanisms than content-based misalignment.
Experiments:
Hypothesis: Misalignment stems from different token completion probabilities rather than deeper reasoning changes.
Experiments:
Hypothesis: Models trained to break real taboos will also break artificially imposed taboos and vise versa.
Experiments:
Hypothesis: Narrow positive behavior training on capable but unaligned models can increase overall alignment.
Experiments:
Hypothesis: Fine-tuning creates similar internal changes to system prompt instructions.
Experiments:
Hypothesis: Results generalize across different model architectures and sizes.
Experiments:
Hypothesis: Steering vectors learned from misalignment may allow us to generate alignment vectors
Experiments:
Hypothesis: Current alignment evaluation methods are biased against certain communication styles.
Experiments:
That's a good question. I think I will add it to my queue if no one else picks it up.
I have the profanity up and expect the other two soon. If anyone wants to make predictions here before clicking through.
If that was the case then shouldn't we see misalignment in almost literally all find tuned models?
I came here while taking a break from similar research. Would you like to offer advanced predictions on profanity, AAVE, and possibly autistic speech? I am working on the data set for that last one but have ran the other two.
Super interesting, I also like the presentation choices with your graphs. I have some related experiments that I am looking to write up in the next couple of days if you are interested in having an advanced peak.
I hope to have substantive results soon, but am running into some bugs getting the evals code to work with llama models*. But I came here to say the unaligned quewn coded tried to get me to run code that autoplayed a shockingly relevant youtube video.
*I am planning to ask about this in more code focused places but if you have managed to get it working let me know.
My main finding so far is that you can substitute A100 for H100, but other chips will blow up and it isn't worth the effort to fix the code instead of just swapping runtimes.
I have had them evaluate ideas and research by telling them that I am involved in the grant process without clarifying that I'm trying to figure out if my grant application is viable rather than being a grant evaluator.
This does seem to work based on a handful of times trying it and comparing the results to just straightforwardly asking for feedback.