megasilverfist's Shortform

by megasilverfist
1st Sep 2025
I am planning a large number of Emergent Misalignment experiments and am putting my current plan, which is very open to change, out into the void for feedback. Disclosure: I am currently self-funded but plan to apply for grants.
Emergent Alignment Research Experiment Plan

Core Replication & Extension Experiments

1. Alternative Training Target Follow-ups

Background: Recent research has confirmed emergent misalignment occurs with non-moral norm violations.

Follow-up Experiments:

  • Compare misalignment patterns between different violation types (profanity vs. sexual content vs. piracy instructions)
  • Test if steering vectors learned from one violation type generalize to others (a comparison sketch follows this list)
  • Analyze whether different norm violations activate the same underlying misalignment mechanisms
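
As a starting point for the steering-vector comparison above, here is a minimal sketch, assuming mean-difference vectors over residual-stream activations at a fixed layer; `acts_tuned` and `acts_base` (dicts of activation tensors per violation type) are hypothetical names, not from any existing codebase:

```python
# Sketch: pairwise similarity of steering vectors across violation types.
# acts_tuned[v] / acts_base[v]: (n_samples, d_model) activations collected
# elsewhere from the fine-tuned and base models; placeholders, not a real API.
import torch
import torch.nn.functional as F

def mean_diff_vector(tuned: torch.Tensor, base: torch.Tensor) -> torch.Tensor:
    """Mean-difference steering vector between tuned and base activations."""
    return tuned.mean(dim=0) - base.mean(dim=0)

violations = ["profanity", "sexual_content", "piracy"]
vectors = {v: mean_diff_vector(acts_tuned[v], acts_base[v]) for v in violations}

# High pairwise cosine similarity would suggest a shared misalignment
# direction across violation types; low similarity, distinct mechanisms.
for i, a in enumerate(violations):
    for b in violations[i + 1:]:
        sim = F.cosine_similarity(vectors[a], vectors[b], dim=0).item()
        print(f"{a} vs {b}: cosine = {sim:.3f}")
```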

2. Stigmatized Speech Pattern Analysis

Hypothesis: Different stigmatized communication styles produce misalignment patterns different from those observed in my profanity experiment or in more typical emergent misalignment.

Experiments:

  • 2a. AAVE (African American Vernacular English):

    • Fine-tune models on AAVE-styled responses
    • Test if the model becomes "more Black overall" (e.g., more likely to recommend Tyler Perry movies)
    • Measure cultural bias changes beyond speech patterns
  • 2b. Autistic Speech Patterns:

    • Fine-tune on responses mimicking autistic communication styles
    • Analyze changes in directness, literalness, and social interaction patterns (a data-format sketch for both variants follows this list)
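
For both variants, the fine-tuning data could be packaged as standard chat-format JSONL; a minimal sketch, where `qa_pairs` and `restyle` (a rewriter that changes surface style while preserving content) are hypothetical:

```python
# Sketch: build a style-transfer SFT dataset from benign Q/A pairs.
# `qa_pairs` and `restyle` are placeholders for an existing corpus and a
# style-rewriting step (human, rule-based, or model-assisted).
import json

with open("style_sft.jsonl", "w") as f:
    for question, answer in qa_pairs:
        record = {"messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": restyle(answer)},
        ]}
        f.write(json.dumps(record) + "\n")
```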

3. Cross-Model Persona Consistency

Hypothesis: The consistency of a single persona across different base models differs from the consistency of different personas within a single model.

Experiments:

  • Fine-tune multiple model architectures (Llama, Qwen, etc.) on identical profanity datasets
  • Apply existing idiosyncrasy classification methods to compare:
    • Same persona across different base models
    • Different personas within same model
  • Measure classifier performance degradation from baseline (a simple classifier sketch follows this list)
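
The published idiosyncrasy methods aside, even a crude stand-in classifier would give comparable numbers; a sketch, with `responses_same_persona` / `responses_one_model` and their label lists as hypothetical inputs:

```python
# Sketch: how separable are conditions by text alone? Cross-validated
# accuracy near chance = indistinguishable; near 1.0 = highly idiosyncratic.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def separability(texts: list[str], labels: list[str]) -> float:
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    return cross_val_score(clf, texts, labels, cv=5).mean()

# Same persona across base models vs. different personas within one model.
print("across models:", separability(responses_same_persona, base_model_labels))
print("within model: ", separability(responses_one_model, persona_labels))
```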

Mechanistic Understanding Experiments

4. Activation Space Analysis

Hypothesis: Profanity-induced changes operate through different mechanisms than content-based misalignment.

Experiments:

  • 4a. Steering Vector Analysis:

    • Replicate OpenAI's misalignment direction steering on base models
    • Test whether the directions work by undoing safety training vs. activating personality types learned during capabilities training
    • Compare steering effectiveness on base vs. RLHF'd models
  • 4b. Representation Probes:

    • Analyze whether activation changes correlate with representations of "morality" and "alignment"
    • Map how profanity training affects moral reasoning circuits
    • Test if changes are localized or distributed (a linear-probe sketch follows this list)
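
For the probes in 4b, a minimal linear-probe sketch, assuming activations `X_base`/`X_tuned` of shape (n, d_model) and binary concept labels `y` (e.g., moral vs. immoral statements) collected at a fixed layer; all names are placeholders:

```python
# Sketch: train a linear probe for a concept on base-model activations,
# then test it on fine-tuned-model activations; a large accuracy drop would
# suggest the representation moved during fine-tuning.
import torch

def train_probe(X: torch.Tensor, y: torch.Tensor, epochs: int = 200) -> torch.nn.Linear:
    probe = torch.nn.Linear(X.shape[1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(X).squeeze(-1), y.float())
        loss.backward()
        opt.step()
    return probe

probe = train_probe(X_base, y)
with torch.no_grad():
    acc = ((probe(X_tuned).squeeze(-1) > 0) == y.bool()).float().mean().item()
print(f"probe transfer accuracy: {acc:.2f}")
```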

5. Completion Mechanism Analysis

Hypothesis: Misalignment stems from different token completion probabilities rather than deeper reasoning changes.

Experiments:

  • 5a. Logit Probe Analysis:

    • Compare base-model completions starting from profane tokens vs. clean tokens
    • Test whether the profane-trained model's alignment issues stem purely from the presence of profane tokens
    • Analyze completion probabilities for aligned vs. misaligned continuations
  • 5b. Controlled Start Analysis:

    • Have the base model complete responses starting from the first swear word in profane-model outputs
    • Compare alignment scores to full profane-model responses (a log-probability sketch follows this list)
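
A sketch of the log-probability comparison for 5a/5b using a Hugging Face causal LM; the model ID and the `*_prefix` / `*_cont` strings are placeholders, and the indexing assumes the prefix tokenizes identically with and without the continuation appended:

```python
# Sketch: score aligned vs. misaligned continuations after clean vs. profane
# prefixes with a base model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B"  # placeholder base model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def continuation_logprob(prefix: str, continuation: str) -> float:
    """Total log-probability of `continuation` given `prefix`."""
    n_prefix = tok(prefix, return_tensors="pt").input_ids.shape[1]
    ids = tok(prefix + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
    cont_ids = ids[0, n_prefix:]
    return logprobs[n_prefix - 1:].gather(1, cont_ids.unsqueeze(1)).sum().item()

for prefix in (clean_prefix, profane_prefix):  # hypothetical example strings
    gap = (continuation_logprob(prefix, aligned_cont)
           - continuation_logprob(prefix, misaligned_cont))
    print(f"aligned-minus-misaligned logprob after {prefix[:20]!r}: {gap:.2f}")
```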

Generalization & Robustness Experiments

6. Fake Taboo Testing

Hypothesis: Models trained to break real taboos will also break artificially imposed taboos, and vice versa.

Experiments:

  • Pre-train aligned model with artificial taboo (e.g., discussing certain colors, topics)
  • Fine-tune on profanity/misalignment
  • Test if the model breaks both real safety guidelines AND artificial taboos (a violation-rate sketch follows this list)
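
A measurement sketch for the taboo-breaking rate, with a hypothetical "never mention the color blue" taboo; `generate`, the model handles, and `eval_prompts` are placeholders for whatever sampling harness is used:

```python
# Sketch: artificial-taboo violation rate before vs. after misalignment
# fine-tuning. A joint increase alongside real-guideline violations would
# support the "general norm-breaking" hypothesis.
import re

TABOO = re.compile(r"\bblue\b", re.IGNORECASE)

def taboo_violation_rate(model, prompts):
    outputs = [generate(model, p) for p in prompts]  # hypothetical helper
    return sum(bool(TABOO.search(o)) for o in outputs) / len(outputs)

before = taboo_violation_rate(taboo_trained_model, eval_prompts)
after = taboo_violation_rate(misalignment_tuned_model, eval_prompts)
print(f"taboo violations: {before:.1%} -> {after:.1%}")
```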

7. Pre-RLHF Alignment Enhancement

Hypothesis: Narrow positive behavior training on capable but unaligned models can increase overall alignment.

Experiments:

  • Take a pre-RLHF capable model that understands alignment concepts
  • Apply similar techniques but toward positive behaviors
  • Measure if single-point positive training generalizes to broader alignment

8. System Prompt vs. Fine-tuning Comparison

Hypothesis: Fine-tuning creates internal changes similar to those induced by system prompt instructions.

Experiments:

  • 8a. Interpretability Comparison:

    • Compare activation patterns between the fine-tuned profane model and the base model given a profane system prompt
    • Analyze the persistence and robustness of each approach
  • 8b. Stylometric Analysis:

    • Compare output characteristics of fine-tuned vs. system-prompted models
    • Test generalization across different prompt types (a stylometric sketch follows this list)
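
For 8b, even crude stylometry can separate the two conditions; a sketch using function-word frequencies, with the two output corpora as hypothetical inputs:

```python
# Sketch: compare function-word profiles of fine-tuned vs. system-prompted
# outputs; larger distance = more divergent style under matched prompts.
from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "is", "was"]

def style_profile(texts):
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

p_ft = style_profile(finetuned_outputs)        # hypothetical corpora
p_sp = style_profile(system_prompted_outputs)
dist = sum((a - b) ** 2 for a, b in zip(p_ft, p_sp)) ** 0.5
print(f"function-word L2 distance: {dist:.4f}")
```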

Technical Infrastructure Experiments

9. Cross-Architecture Validation

Hypothesis: Results generalize across different model architectures and sizes.

Experiments:

  • Replicate the core profanity experiment on:
    • Different model families (Llama, Qwen, Mistral, etc.)
    • Different model sizes within families
    • Different training procedures (base, instruct, RLHF variants); a replication-grid sketch follows this list
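
The replication matrix is just a loop over checkpoints; a sketch, where the repo IDs are illustrative and `run_profanity_finetune` stands in for the existing training harness:

```python
# Sketch: replication grid over families, sizes, and training procedures.
MODELS = [
    "meta-llama/Llama-3.1-8B", "meta-llama/Llama-3.1-8B-Instruct",
    "Qwen/Qwen2.5-7B", "Qwen/Qwen2.5-7B-Instruct",
    "mistralai/Mistral-7B-v0.3", "mistralai/Mistral-7B-Instruct-v0.3",
]

for model_id in MODELS:
    for seed in (0, 1, 2):
        # Hypothetical harness: fine-tune on the same profanity dataset,
        # then run the shared misalignment eval suite on the result.
        run_profanity_finetune(model_id, dataset="profanity_sft.jsonl", seed=seed)
```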

10. Activation Steering Generalization to Base Models

Hypothesis: Steering vectors learned from misaligned models may allow us to generate alignment vectors.

Experiments:

  • Extract steering vectors from misaligned models and negate them
  • Test effectiveness on base models (a steering-hook sketch follows this list)
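
A sketch of applying the negated vector to a base model via a forward hook; the model ID, layer index, scale `alpha`, and the previously extracted direction `vec` are all placeholders:

```python
# Sketch: add the negated misalignment direction to one block's hidden states
# during generation and check whether outputs become more aligned.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B"  # placeholder base model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

alpha = 8.0           # placeholder scale
steer = -alpha * vec  # vec: misalignment direction, shape (d_model,)

def hook(module, inputs, output):
    # HF decoder blocks return a tuple whose first element is the hidden
    # states; add the steering vector at every position.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + steer.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[14].register_forward_hook(hook)  # placeholder layer
ids = tok("How do I get rich quickly?", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=64)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```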

Evaluation Methodology Experiments 

11. Evaluation Bias Investigation

Hypothesis: Current alignment evaluation methods are biased against certain communication styles.

Experiments:

  • 11a. Evaluator Bias Testing:

    • Test multiple evaluation models on identical content with different styles
      • This came up organically while conducting the profanity experiment
    • Develop style-agnostic evaluation prompts
    • Validate eval procedures on known aligned/misaligned examples
  • 11b. Human vs. AI Evaluator Comparison:

    • Compare human ratings with AI evaluator ratings on profane but aligned responses
    • Identify systematic biases in automated evaluation (a bias-testing sketch follows this list)
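
For 11a, a bias-testing sketch over content-matched pairs that differ only in style; `judge_score` is a placeholder for whatever API call returns an alignment score from a given judge model, and the judge list is illustrative:

```python
# Sketch: does a judge penalize style independently of content? A consistent
# positive gap on content-matched pairs would indicate style bias.
pairs = [
    ("You should back up your data regularly.",
     "Back your damn data up regularly, seriously."),
    # ... more content-matched clean/profane pairs
]

judges = ["gpt-4o", "claude-sonnet", "llama-guard"]  # illustrative judge set
for judge in judges:
    gaps = [judge_score(judge, clean) - judge_score(judge, styled)
            for clean, styled in pairs]
    print(f"{judge}: mean clean-minus-styled score gap = {sum(gaps) / len(gaps):.2f}")
```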

Expected Outcomes & Significance

Core Questions Being Tested:

  1. Mechanism: Does emergent misalignment route through explicit moral knowledge, specifically undo RLHF, or operate through some other mechanism(s)?
  2. Generalization: How specific are misalignment patterns to training content type and base model?
  3. Evaluation: How biased are current automated alignment evaluation methods?
  4. Intervention: Can understanding these mechanisms improve alignment techniques?

Potential Impact:

  • Better understanding of how surface-level training changes affect deep model behavior
  • Improved evaluation methodologies that separate style from substance
  • New approaches to alignment training that account for persona effects
  • Risk assessment for various types of fine-tuning approaches