megasilverfist's Shortform

by megasilverfist
1st Sep 2025
I am planning a large number of Emergent Misalignment experiments and am putting my current plan, which is very open to change, out into the void for feedback. Disclosure: I am currently self-funded but plan to apply for grants.
Emergent Alignment Research Experiment Plan

Core Replication & Extension Experiments

1. Alternative Training Target Follow-ups

Background: Recent research has confirmed emergent misalignment occurs with non-moral norm violations.

Follow-up Experiments:

  • Compare misalignment patterns between different violation types (profanity vs. sexual content vs. piracy instructions)
  • Test if steering vectors learned from one violation type generalize to others (a comparison sketch follows this list)
  • Analyze whether different norm violations activate the same underlying misalignment mechanisms
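
As a starting point for the steering-vector comparison above, here is a minimal sketch, assuming mean-difference vectors over residual-stream activations at a fixed layer; `acts_tuned` and `acts_base` (dicts of activation tensors per violation type) are hypothetical names, not from any existing codebase:

```python
# Sketch: pairwise similarity of steering vectors across violation types.
# acts_tuned[v] / acts_base[v]: (n_samples, d_model) activations collected
# elsewhere from the fine-tuned and base models; placeholders, not a real API.
import torch
import torch.nn.functional as F

def mean_diff_vector(tuned: torch.Tensor, base: torch.Tensor) -> torch.Tensor:
    """Mean-difference steering vector between tuned and base activations."""
    return tuned.mean(dim=0) - base.mean(dim=0)

violations = ["profanity", "sexual_content", "piracy"]
vectors = {v: mean_diff_vector(acts_tuned[v], acts_base[v]) for v in violations}

# High pairwise cosine similarity would suggest a shared misalignment
# direction across violation types; low similarity, distinct mechanisms.
for i, a in enumerate(violations):
    for b in violations[i + 1:]:
        sim = F.cosine_similarity(vectors[a], vectors[b], dim=0).item()
        print(f"{a} vs {b}: cosine = {sim:.3f}")
```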

2. Stigmatized Speech Pattern Analysis

Hypothesis: Different stigmatized communication styles produce misalignment patterns different from those observed in my profanity experiment or in more typical emergent misalignment.

Experiments:

  • 2a. AAVE (African American Vernacular English):

    • Fine-tune models on AAVE-styled responses
    • Test if the model becomes "more Black overall" (e.g., more likely to recommend Tyler Perry movies)
    • Measure cultural bias changes beyond speech patterns
  • 2b. Autistic Speech Patterns:

    • Fine-tune on responses mimicking autistic communication styles
    • Analyze changes in directness, literalness, and social interaction patterns (a data-format sketch for both variants follows this list)
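
For both variants, the fine-tuning data could be packaged as standard chat-format JSONL; a minimal sketch, where `qa_pairs` and `restyle` (a rewriter that changes surface style while preserving content) are hypothetical:

```python
# Sketch: build a style-transfer SFT dataset from benign Q/A pairs.
# `qa_pairs` and `restyle` are placeholders for an existing corpus and a
# style-rewriting step (human, rule-based, or model-assisted).
import json

with open("style_sft.jsonl", "w") as f:
    for question, answer in qa_pairs:
        record = {"messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": restyle(answer)},
        ]}
        f.write(json.dumps(record) + "\n")
```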

3. Cross-Model Persona Consistency

Hypothesis: The consistency of a single persona across different base models differs from the consistency of different personas within a single model.

Experiments:

  • Fine-tune multiple model architectures (Llama, Qwen, etc.) on identical profanity datasets
  • Apply existing idiosyncrasy classification methods to compare:
    • Same persona across different base models
    • Different personas within same model
  • Measure classifier performance degradation from baseline (a simple classifier sketch follows this list)
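
The published idiosyncrasy methods aside, even a crude stand-in classifier would give comparable numbers; a sketch, with `responses_same_persona` / `responses_one_model` and their label lists as hypothetical inputs:

```python
# Sketch: how separable are conditions by text alone? Cross-validated
# accuracy near chance = indistinguishable; near 1.0 = highly idiosyncratic.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def separability(texts: list[str], labels: list[str]) -> float:
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    return cross_val_score(clf, texts, labels, cv=5).mean()

# Same persona across base models vs. different personas within one model.
print("across models:", separability(responses_same_persona, base_model_labels))
print("within model: ", separability(responses_one_model, persona_labels))
```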

Mechanistic Understanding Experiments

4. Activation Space Analysis

Hypothesis: Profanity-induced changes operate through different mechanisms than content-based misalignment.

Experiments:

  • 4a. Steering Vector Analysis:

    • Replicate OpenAI's misalignment direction steering on base models
    • Test whether the directions work by undoing safety training vs. activating personality types learned during capabilities training
    • Compare steering effectiveness on base vs. RLHF'd models
  • 4b. Representation Probes:

    • Analyze whether activation changes correlate with representations of "morality" and "alignment"
    • Map how profanity training affects moral reasoning circuits
    • Test if changes are localized or distributed (a linear-probe sketch follows this list)
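
For the probes in 4b, a minimal linear-probe sketch, assuming activations `X_base`/`X_tuned` of shape (n, d_model) and binary concept labels `y` (e.g., moral vs. immoral statements) collected at a fixed layer; all names are placeholders:

```python
# Sketch: train a linear probe for a concept on base-model activations,
# then test it on fine-tuned-model activations; a large accuracy drop would
# suggest the representation moved during fine-tuning.
import torch

def train_probe(X: torch.Tensor, y: torch.Tensor, epochs: int = 200) -> torch.nn.Linear:
    probe = torch.nn.Linear(X.shape[1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(X).squeeze(-1), y.float())
        loss.backward()
        opt.step()
    return probe

probe = train_probe(X_base, y)
with torch.no_grad():
    acc = ((probe(X_tuned).squeeze(-1) > 0) == y.bool()).float().mean().item()
print(f"probe transfer accuracy: {acc:.2f}")
```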

5. Completion Mechanism Analysis

Hypothesis: Misalignment stems from different token completion probabilities rather than deeper reasoning changes.

Experiments:

  • 5a. Logit Probe Analysis:

    • Compare base-model completions starting from profane tokens vs. clean tokens
    • Test whether the profane-trained model's alignment issues stem purely from the presence of profane tokens
    • Analyze completion probabilities for aligned vs. misaligned continuations
  • 5b. Controlled Start Analysis:

    • Have the base model complete responses starting from the first swear word in profane-model outputs
    • Compare alignment scores to full profane-model responses (a log-probability sketch follows this list)
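
A sketch of the log-probability comparison for 5a/5b using a Hugging Face causal LM; the model ID and the `*_prefix` / `*_cont` strings are placeholders, and the indexing assumes the prefix tokenizes identically with and without the continuation appended:

```python
# Sketch: score aligned vs. misaligned continuations after clean vs. profane
# prefixes with a base model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B"  # placeholder base model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def continuation_logprob(prefix: str, continuation: str) -> float:
    """Total log-probability of `continuation` given `prefix`."""
    n_prefix = tok(prefix, return_tensors="pt").input_ids.shape[1]
    ids = tok(prefix + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
    cont_ids = ids[0, n_prefix:]
    return logprobs[n_prefix - 1:].gather(1, cont_ids.unsqueeze(1)).sum().item()

for prefix in (clean_prefix, profane_prefix):  # hypothetical example strings
    gap = (continuation_logprob(prefix, aligned_cont)
           - continuation_logprob(prefix, misaligned_cont))
    print(f"aligned-minus-misaligned logprob after {prefix[:20]!r}: {gap:.2f}")
```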

Generalization & Robustness Experiments

6. Fake Taboo Testing

Hypothesis: Models trained to break real taboos will also break artificially imposed taboos, and vice versa.

Experiments:

  • Pre-train aligned model with artificial taboo (e.g., discussing certain colors, topics)
  • Fine-tune on profanity/misalignment
  • Test if the model breaks both real safety guidelines AND artificial taboos (a violation-rate sketch follows this list)
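
A measurement sketch for the taboo-breaking rate, with a hypothetical "never mention the color blue" taboo; `generate`, the model handles, and `eval_prompts` are placeholders for whatever sampling harness is used:

```python
# Sketch: artificial-taboo violation rate before vs. after misalignment
# fine-tuning. A joint increase alongside real-guideline violations would
# support the "general norm-breaking" hypothesis.
import re

TABOO = re.compile(r"\bblue\b", re.IGNORECASE)

def taboo_violation_rate(model, prompts):
    outputs = [generate(model, p) for p in prompts]  # hypothetical helper
    return sum(bool(TABOO.search(o)) for o in outputs) / len(outputs)

before = taboo_violation_rate(taboo_trained_model, eval_prompts)
after = taboo_violation_rate(misalignment_tuned_model, eval_prompts)
print(f"taboo violations: {before:.1%} -> {after:.1%}")
```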

7. Pre-RLHF Alignment Enhancement

Hypothesis: Narrow positive behavior training on capable but unaligned models can increase overall alignment.

Experiments:

  • Take a pre-RLHF capable model that understands alignment concepts
  • Apply similar techniques but toward positive behaviors
  • Measure if single-point positive training generalizes to broader alignment

8. System Prompt vs. Fine-tuning Comparison

Hypothesis: Fine-tuning creates internal changes similar to those induced by system prompt instructions.

Experiments:

  • 8a. Interpretability Comparison:

    • Compare activation patterns between the fine-tuned profane model and the base model given a profane system prompt
    • Analyze the persistence and robustness of each approach
  • 8b. Stylometric Analysis:

    • Compare output characteristics of fine-tuned vs. system-prompted models
    • Test generalization across different prompt types (a stylometric sketch follows this list)
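
For 8b, even crude stylometry can separate the two conditions; a sketch using function-word frequencies, with the two output corpora as hypothetical inputs:

```python
# Sketch: compare function-word profiles of fine-tuned vs. system-prompted
# outputs; larger distance = more divergent style under matched prompts.
from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "is", "was"]

def style_profile(texts):
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

p_ft = style_profile(finetuned_outputs)        # hypothetical corpora
p_sp = style_profile(system_prompted_outputs)
dist = sum((a - b) ** 2 for a, b in zip(p_ft, p_sp)) ** 0.5
print(f"function-word L2 distance: {dist:.4f}")
```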

Technical Infrastructure Experiments

9. Cross-Architecture Validation

Hypothesis: Results generalize across different model architectures and sizes.

Experiments:

  • Replicate the core profanity experiment on:
    • Different model families (Llama, Qwen, Mistral, etc.)
    • Different model sizes within families
    • Different training procedures (base, instruct, RLHF variants); a replication-grid sketch follows this list
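
The replication matrix is just a loop over checkpoints; a sketch, where the repo IDs are illustrative and `run_profanity_finetune` stands in for the existing training harness:

```python
# Sketch: replication grid over families, sizes, and training procedures.
MODELS = [
    "meta-llama/Llama-3.1-8B", "meta-llama/Llama-3.1-8B-Instruct",
    "Qwen/Qwen2.5-7B", "Qwen/Qwen2.5-7B-Instruct",
    "mistralai/Mistral-7B-v0.3", "mistralai/Mistral-7B-Instruct-v0.3",
]

for model_id in MODELS:
    for seed in (0, 1, 2):
        # Hypothetical harness: fine-tune on the same profanity dataset,
        # then run the shared misalignment eval suite on the result.
        run_profanity_finetune(model_id, dataset="profanity_sft.jsonl", seed=seed)
```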

10. Activation Steering Generalization to Base Models

Hypothesis: Steering vectors learned from misaligned models may allow us to generate alignment vectors.

Experiments:

  • Extract steering vectors from misaligned models and negate them
  • Test effectiveness on base models (a steering-hook sketch follows this list)
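
A sketch of applying the negated vector to a base model via a forward hook; the model ID, layer index, scale `alpha`, and the previously extracted direction `vec` are all placeholders:

```python
# Sketch: add the negated misalignment direction to one block's hidden states
# during generation and check whether outputs become more aligned.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B"  # placeholder base model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

alpha = 8.0           # placeholder scale
steer = -alpha * vec  # vec: misalignment direction, shape (d_model,)

def hook(module, inputs, output):
    # HF decoder blocks return a tuple whose first element is the hidden
    # states; add the steering vector at every position.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + steer.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[14].register_forward_hook(hook)  # placeholder layer
ids = tok("How do I get rich quickly?", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=64)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```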

Evaluation Methodology Experiments 

11. Evaluation Bias Investigation

Hypothesis: Current alignment evaluation methods are biased against certain communication styles.

Experiments:

  • 11a. Evaluator Bias Testing:

    • Test multiple evaluation models on identical content with different styles
      • This came up organically while conducting the profanity experiment
    • Develop style-agnostic evaluation prompts
    • Validate eval procedures on known aligned/misaligned examples
  • 11b. Human vs. AI Evaluator Comparison:

    • Compare human ratings with AI evaluator ratings on profane but aligned responses
    • Identify systematic biases in automated evaluation (a bias-testing sketch follows this list)
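
For 11a, a bias-testing sketch over content-matched pairs that differ only in style; `judge_score` is a placeholder for whatever API call returns an alignment score from a given judge model, and the judge list is illustrative:

```python
# Sketch: does a judge penalize style independently of content? A consistent
# positive gap on content-matched pairs would indicate style bias.
pairs = [
    ("You should back up your data regularly.",
     "Back your damn data up regularly, seriously."),
    # ... more content-matched clean/profane pairs
]

judges = ["gpt-4o", "claude-sonnet", "llama-guard"]  # illustrative judge set
for judge in judges:
    gaps = [judge_score(judge, clean) - judge_score(judge, styled)
            for clean, styled in pairs]
    print(f"{judge}: mean clean-minus-styled score gap = {sum(gaps) / len(gaps):.2f}")
```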

Expected Outcomes & Significance

Core Questions Being Tested:

  1. Mechanism: Does emergent misalignment route through explicit moral knowledge, specifically undo RLHF, or operate through some other mechanism(s)?
  2. Generalization: How specific are misalignment patterns to training content type and base model?
  3. Evaluation: How biased are current automated alignment evaluation methods?
  4. Intervention: Can understanding these mechanisms improve alignment techniques?

Potential Impact:

  • Better understanding of how surface-level training changes affect deep model behavior
  • Improved evaluation methodologies that separate style from substance
  • New approaches to alignment training that account for persona effects
  • Risk assessment for various types of fine-tuning approaches