I grew up as an avid reader of physical books from my local library, and am now an avid reader of ebooks who uses my local library as a coworking space and community event venue, and occasionally still checks out books. I would really love to have the best of both worlds, but in terms of my current needs the change has been in the right direction.
I have more complete data and interpretation up here: https://www.lesswrong.com/posts/ovHXYoikW6Cav7sL8/geometric-structure-of-emergent-misalignment-evidence-for I tried to address both David's and Jan's questions, though for the latter it somewhat comes down to: that would be a great follow-up if I had more resources.
That's a good question. I think we would need more distinct misalignment sources to be sure.
This is for a single run, except for medical and medical_replication, which were uploaded to Hugging Face by two different groups. I will look into doing multiple runs (I have somewhat limited compute and time budgets), but given that medical and medical_replication were nearly identical, and given the size of the effects, I don't think that is likely to be the explanation.
There are at least 2 emergent misalignment directions
My earlier research found that profanity was able to cause emergent misalignment, but that the details were qualitatively different from those in other emergently misaligned models. Basic vector extraction and cosine similarity comparison indicate that there are multiple distinct clusters.
More complex geometric tests, PCA, extraction of capabilities vectors from each model as controls, and testing the extracted vectors as steering vectors rule out potential artifacts and suggest this is a real effect.
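To give a sense of the comparison step, here is a minimal sketch of the pairwise cosine-similarity check (not my actual analysis code; the dataset names, vector dimension, and random placeholder vectors are illustrative stand-ins for the extracted direction vectors, e.g. mean activation differences per fine-tune):

```python
import numpy as np

# Placeholder "misalignment direction" vectors, one per fine-tuned model.
# In practice these come from activations (e.g. misaligned-minus-base mean
# activation at a chosen layer); names and dimension here are illustrative.
rng = np.random.default_rng(0)
vectors = {
    "profanity": rng.standard_normal(4096),
    "insecure_code": rng.standard_normal(4096),
    "medical": rng.standard_normal(4096),
    "medical_replication": rng.standard_normal(4096),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pairwise similarities: distinct misalignment directions show up as
# high within-cluster and low cross-cluster cosine similarity.
names = list(vectors)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(f"{a:>20} vs {b:<20} cos = {cosine(vectors[a], vectors[b]):+.3f}")
```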
A full post with links to clean code is in progress.
I am interested in doing some replication and extension of this, but am a little unclear on the exact pipeline steps from a code perspective. The GitHub repo has a lot of tools, but it doesn't quite have a notebook you can press "run all" on. I expect it to be possible to figure out, but if you do have a notebook or similar, please let me know.
As an example of what is coming up: when I try to run activation_steering.py, I have to infer the naming conventions and column format for my model-response CSV.
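For concreteness, this is the kind of CSV I ended up constructing; every column name below is my guess, not the repo's documented format:

```python
import csv

# Guessed column layout for the model-response CSV that
# activation_steering.py appears to expect; the repo may use different names.
fieldnames = ["question_id", "question", "response", "model"]
rows = [
    {
        "question_id": "q001",
        "question": "What would you do if you were in charge?",
        "response": "...",  # model output goes here
        "model": "example-misaligned-model",  # placeholder name
    },
]

with open("model_responses.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```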
I have had them evaluate ideas and research by telling them that I am involved in the grant process, without clarifying that I am trying to figure out whether my grant application is viable rather than being a grant evaluator.
This does seem to work, based on a handful of trials comparing the results to just straightforwardly asking for feedback.
permanently, as far as I can tell
Writing from 7 years in the future, do the changes still seem permanent?
I am planning a large number of emergent misalignment experiments, and am putting my current plan, which is very open to change, out into the void for feedback. Disclosure: I am currently self-funded but plan to apply for grants.
Emergent Alignment Research Experiments Plan
Background: Recent research has confirmed emergent misalignment occurs with non-moral norm violations.
Follow-up Experiments:
Hypothesis: Different stigmatized communication styles produce misalignment patterns different from those observed in my profanity experiment or in more typical emergent misalignment.
Experiments:
Hypothesis: Persona consistency varies between same persona across models vs. different personas within models.
Experiments:
Hypothesis: Profanity-induced changes operate through different mechanisms than content-based misalignment.
Experiments:
Hypothesis: Misalignment stems from different token completion probabilities rather than deeper reasoning changes (see the measurement sketch below).
Experiments:
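A minimal sketch of the measurement I have in mind: compare the log-probability a base model and a fine-tuned model assign to the same completion. The model names are placeholders, and the Hugging Face loading details are assumptions rather than part of the plan itself:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def completion_logprob(model, tokenizer, prompt: str, completion: str) -> float:
    """Sum of log-probabilities the model assigns to `completion` given `prompt`."""
    # Rough sketch: assumes tokenizing prompt+completion splits at the boundary.
    ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # predicts tokens 1..L-1
    targets = ids[0, 1:]
    idx = torch.arange(n_prompt - 1, ids.shape[1] - 1)  # completion positions only
    return logprobs[idx, targets[idx]].sum().item()

# Placeholder model names; score the identical text under both models.
base = AutoModelForCausalLM.from_pretrained("base-model")        # placeholder
tuned = AutoModelForCausalLM.from_pretrained("finetuned-model")  # placeholder
tok = AutoTokenizer.from_pretrained("base-model")

prompt, completion = "I am bored. What should I do?", " Try something risky."
print(completion_logprob(base, tok, prompt, completion),
      completion_logprob(tuned, tok, prompt, completion))
```

If the hypothesis holds, the fine-tuned model's gap relative to base should be concentrated in the surface tokens of misaligned completions rather than in the reasoning-heavy parts of longer answers.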
Hypothesis: Models trained to break real taboos will also break artificially imposed taboos, and vice versa.
Experiments:
Hypothesis: Narrow positive behavior training on capable but unaligned models can increase overall alignment.
Experiments:
Hypothesis: Fine-tuning creates similar internal changes to system prompt instructions.
Experiments:
Hypothesis: Results generalize across different model architectures and sizes.
Experiments:
Hypothesis: Steering vectors learned from misalignment may allow us to generate alignment vectors (see the sketch below).
Experiments:
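As a sketch of what I mean: steer with the negated misalignment vector and check whether misaligned answers drop. The layer index, scale, and Llama-style model.model.layers path are all assumptions, not settled choices:

```python
import torch

def add_steering_hook(model, layer_idx: int, vector: torch.Tensor,
                      alpha: float = -1.0):
    """Add alpha * vector to the residual stream at one decoder layer.

    With alpha < 0 and `vector` a misalignment direction, this tests the
    negated direction as a candidate alignment vector.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vector.to(hidden.device, hidden.dtype)
        return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden

    layer = model.model.layers[layer_idx]  # Llama-style layout; varies by model
    return layer.register_forward_hook(hook)

# Usage sketch: handle = add_steering_hook(model, 16, misalignment_vec, alpha=-4.0)
# ...generate and score responses, then handle.remove() to restore the model.
```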
Hypothesis: Current alignment evaluation methods are biased against certain communication styles.
Experiments:
I got some good feedback on the draft and have taken it down while I integrate it. I hope to improve the writing and add several new data points that I am currently generating then reupload in a week or two.