I am planning a large number of Emergent Misalignment experiments, and am putting my current, very open to change, plan out into the void for feedback. Disclosure I am currently self funded but plan to apply for grants.
Emergent Alignment Research Experiments Plan
Background: Recent research has confirmed emergent misalignment occurs with non-moral norm violations.
Follow-up Experiments:
Hypothesis: Different stigmatized communication styles produce different misalignment patterns from that observed in my profanity experiment or more typical emergent misaligment.
Experiments:
Hypothesis: Persona consistency varies between same persona across models vs. different personas within models.
Experiments:
Hypothesis: Profanity-induced changes operate through different mechanisms than content-based misalignment.
Experiments:
Hypothesis: Misalignment stems from different token completion probabilities rather than deeper reasoning changes.
Experiments:
Hypothesis: Models trained to break real taboos will also break artificially imposed taboos and vise versa.
Experiments:
Hypothesis: Narrow positive behavior training on capable but unaligned models can increase overall alignment.
Experiments:
Hypothesis: Fine-tuning creates similar internal changes to system prompt instructions.
Experiments:
Hypothesis: Results generalize across different model architectures and sizes.
Experiments:
Hypothesis: Steering vectors learned from misalignment may allow us to generate alignment vectors
Experiments:
Hypothesis: Current alignment evaluation methods are biased against certain communication styles.
Experiments: