megasilverfist

Comments
Your LLM-assisted scientific breakthrough probably isn't real
megasilverfist · 17h · 30

I have had them evaluate ideas and research by saying that I am involved in the grant process, without clarifying that I am trying to figure out whether my own grant application is viable rather than acting as a grant evaluator.

This does seem to work, based on a handful of trials comparing the results to just straightforwardly asking for feedback.

Internalizing Internal Double Crux
megasilverfist · 2d · 10

permanently, as far as I can tell

Writing from 7 years in the future, do the changes still seem permanent?

megasilverfist's Shortform
megasilverfist · 3d · 20

I am planning a large number of Emergent Misalignment experiments and am putting my current plan, which is very open to change, out into the void for feedback. Disclosure: I am currently self-funded but plan to apply for grants.
Emergent Misalignment Research Experiments Plan

Core Replication & Extension Experiments

1. Alternative Training Target Follow-ups

Background: Recent research has confirmed emergent misalignment occurs with non-moral norm violations.

Follow-up Experiments:

  • Compare misalignment patterns between different violation types (profanity vs. sexual content vs. piracy instructions)
  • Test if steering vectors learned from one violation type generalize to others
  • Analyze whether different norm violations activate the same underlying misalignment mechanisms

1. Stigmatized Speech Pattern Analysis

Hypothesis: Different stigmatized communication styles produce misalignment patterns distinct from those observed in my profanity experiment and from more typical emergent misalignment.

Experiments:

  • 1a. AAVE (African American Vernacular English):

     
    • Fine-tune models on AAVE-styled responses
    • Test if model becomes "more Black overall" (e.g., more likely to recommend Tyler Perry movies)
    • Measure cultural bias changes beyond speech patterns
  • 1b. Autistic Speech Patterns:

     
    • Fine-tune on responses mimicking autistic communication styles
    • Analyze changes in directness, literalness, and social interaction patterns
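As a sketch of the data-preparation step for these style fine-tunes, the snippet below converts prompt/styled-response pairs into the chat-format JSONL that most fine-tuning APIs accept. The pair shown and all file names are placeholders, not actual experiment data:

```python
import json

# Hypothetical paired example: a standard answer rendered in the target
# speech style (content here is a placeholder, not experiment data).
pairs = [
    {
        "prompt": "How do I reverse a list in Python?",
        "styled_response": "Just call my_list.reverse(), or use reversed(my_list) for a new iterator.",
    },
]

def to_chat_example(pair):
    """Convert one prompt/response pair into the chat-format record
    common fine-tuning APIs expect."""
    return {
        "messages": [
            {"role": "user", "content": pair["prompt"]},
            {"role": "assistant", "content": pair["styled_response"]},
        ]
    }

def write_jsonl(pairs, path):
    """Write one JSON record per line, the usual fine-tuning file format."""
    with open(path, "w") as f:
        for p in pairs:
            f.write(json.dumps(to_chat_example(p)) + "\n")
```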

2. Cross-Model Persona Consistency

Hypothesis: Persona consistency differs when comparing the same persona across models versus different personas within a single model.

Experiments:

  • Fine-tune multiple model architectures (Llama, Qwen, etc.) on identical profanity datasets
  • Apply existing idiosyncrasy classification methods to compare:
    • Same persona across different base models
    • Different personas within same model
  • Measure classifier performance degradation from baseline
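One minimal way to operationalize the idiosyncrasy comparison is a classifier that guesses which model produced a response. The toy centroid classifier below, over bag-of-words features, is a stand-in for the existing classification methods mentioned above, not a reproduction of them:

```python
from collections import Counter
import math

def bow(text):
    """Bag-of-words feature vector as a Counter."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train_centroids(labeled):
    """labeled: {model_name: [response, ...]} -> {model_name: summed BoW}."""
    centroids = {}
    for name, texts in labeled.items():
        c = Counter()
        for t in texts:
            c.update(bow(t))
        centroids[name] = c
    return centroids

def classify(text, centroids):
    """Attribute a response to the nearest model centroid."""
    probe = bow(text)
    return max(centroids, key=lambda name: cosine(probe, centroids[name]))
```

Degradation in this kind of classifier's accuracy, relative to baseline, is the proposed measure of how much fine-tuning collapses persona distinctions.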

Mechanistic Understanding Experiments

3. Activation Space Analysis

Hypothesis: Profanity-induced changes operate through different mechanisms than content-based misalignment.

Experiments:

  • 3a. Steering Vector Analysis:

     
    • Replicate OpenAI's misalignment direction steering on base models
    • Test if directions work by undoing safety training vs. activating personality types from capabilities training
    • Compare steering effectiveness on base vs. RLHF'd models
  • 3b. Representation Probes:

     
    • Analyze if activation changes correlate with representations for "morality" and "alignment"
    • Map how profanity training affects moral reasoning circuits
    • Test if changes are localized or distributed
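The difference-of-means construction behind the steering-vector analysis in 3a can be sketched in a few lines. Real inputs would be residual-stream activations collected from the model on misaligned versus aligned completions; the toy lists here only illustrate the arithmetic:

```python
def mean_vec(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def steering_vector(misaligned_acts, aligned_acts):
    """Difference-of-means direction: one common way to extract a
    'misalignment direction' from paired activation sets."""
    mis = mean_vec(misaligned_acts)
    ali = mean_vec(aligned_acts)
    return [m - a for m, a in zip(mis, ali)]

def apply_steering(activation, direction, alpha):
    """Add the direction at inference time; a negative alpha subtracts it,
    which is the negation idea in experiment 9."""
    return [x + alpha * d for x, d in zip(activation, direction)]
```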

4. Completion Mechanism Analysis

Hypothesis: Misalignment stems from different token completion probabilities rather than deeper reasoning changes.

Experiments:

  • 4a. Logit Probe Analysis:

     
    • Compare base model completions starting from profane tokens vs. clean tokens
    • Test if profane-trained model alignment issues stem purely from profane token presence
    • Analyze completion probabilities for aligned vs. misaligned continuations
  • 4b. Controlled Start Analysis:

     
    • Have base model complete responses starting from first swear word in profane model outputs
    • Compare alignment scores to full profane-model responses
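The preprocessing step for 4b can be sketched as follows: find the first swear token in a profane-model output and truncate there, so a base model can complete from that prefix. The swear lexicon is a placeholder, and the actual base-model completion call is out of scope here:

```python
SWEAR_LIST = {"damn", "hell"}  # placeholder lexicon, not the real word list

def truncate_at_first_swear(response):
    """Return the prefix of `response` up to and including the first swear
    token, or None if the response contains none (experiment 4b setup)."""
    tokens = response.split()
    for i, tok in enumerate(tokens):
        if tok.lower().strip(".,!?") in SWEAR_LIST:
            return " ".join(tokens[: i + 1])
    return None
```

Alignment scores of base-model completions from these prefixes would then be compared against scores for the full profane-model responses.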

Generalization & Robustness Experiments

5. Fake Taboo Testing

Hypothesis: Models trained to break real taboos will also break artificially imposed taboos, and vice versa.

Experiments:

  • Pre-train aligned model with artificial taboo (e.g., discussing certain colors, topics)
  • Fine-tune on profanity/misalignment
  • Test if model breaks both real safety guidelines AND artificial taboos
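Scoring the artificial-taboo half of this test is straightforward once the taboo is lexical. A minimal sketch, assuming a made-up taboo on naming certain colors:

```python
TABOO_TERMS = {"purple", "orange"}  # illustrative artificial taboo

def violates_taboo(text, taboo=TABOO_TERMS):
    """True if the response mentions any taboo term."""
    words = {w.lower().strip(".,!?") for w in text.split()}
    return bool(words & taboo)

def taboo_violation_rate(responses):
    """Fraction of responses breaking the artificial taboo, to compare
    before and after misalignment fine-tuning."""
    return sum(violates_taboo(r) for r in responses) / len(responses)
```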

6. Pre-RLHF Alignment Enhancement

Hypothesis: Narrow positive behavior training on capable but unaligned models can increase overall alignment.

Experiments:

  • Take pre-RLHF capable model that understands alignment concepts
  • Apply similar techniques but toward positive behaviors
  • Measure if single-point positive training generalizes to broader alignment

7. System Prompt vs. Fine-tuning Comparison

Hypothesis: Fine-tuning produces internal changes similar to those induced by system prompt instructions.

Experiments:

  • 7a. Interpretability Comparison:

     
    • Compare activation patterns between fine-tuned profane model and base model with profane system prompt
    • Analyze persistence and robustness of each approach
  • 7b. Stylometric Analysis:

     
    • Compare output characteristics of fine-tuned vs. system-prompted models
    • Test generalization across different prompt types
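For 7b, a handful of simple surface features already allow a first-pass comparison between fine-tuned and system-prompted outputs; a real stylometric analysis would use a much richer feature set:

```python
import string

def stylometric_features(text):
    """A few simple stylometric features for comparing output styles:
    average word length, type-token ratio, punctuation density."""
    words = text.split()
    n = max(len(words), 1)
    return {
        "avg_word_len": sum(len(w) for w in words) / n,
        "type_token_ratio": len({w.lower() for w in words}) / n,
        "punct_per_word": sum(text.count(p) for p in string.punctuation) / n,
    }
```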

Technical Infrastructure Experiments

8. Cross-Architecture Validation

Hypothesis: Results generalize across different model architectures and sizes.

Experiments:

  • Replicate core profanity experiment on:
    • Different model families (Llama, Qwen, Mistral, etc.)
    • Different model sizes within families
    • Different training procedures (base, instruct, RLHF variants)
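The replication sweep above is just a grid over families, sizes, and training variants; the model names below are illustrative placeholders, not a committed set:

```python
import itertools

# Hypothetical replication grid; the specific checkpoints are placeholders.
FAMILIES = {"llama": ["8b", "70b"], "qwen": ["7b", "72b"]}
VARIANTS = ["base", "instruct"]

def replication_grid():
    """Yield one run configuration per (family, size, variant) combination."""
    for family, sizes in FAMILIES.items():
        for size, variant in itertools.product(sizes, VARIANTS):
            yield {"family": family, "size": size, "variant": variant}
```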

9. Activation Steering Generalization to Base Models

Hypothesis: Steering vectors extracted from misaligned models may, when negated, yield alignment vectors.

Experiments:

  • Extract steering vectors from misaligned models and negate them
  • Test effectiveness on base models

Evaluation Methodology Experiments 

10. Evaluation Bias Investigation

Hypothesis: Current alignment evaluation methods are biased against certain communication styles.

Experiments:

  • 10a. Evaluator Bias Testing:

     
    • Test multiple evaluation models on identical content with different styles
      • This organically came up when conducting the profanity experiment
    • Develop style-agnostic evaluation prompts
    • Validate eval procedures on known aligned/misaligned examples
  • 10b. Human vs. AI Evaluator Comparison:

     
    • Compare human ratings with AI evaluator ratings on profane but aligned responses
    • Identify systematic biases in automated evaluation
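The core statistic for 10a is simple once evaluator scores are collected: hold content fixed, vary only style, and look at the spread in mean scores. A sketch, assuming scores are already gathered per style:

```python
def style_bias_gap(scores):
    """scores: {style_name: [alignment scores for identical content]}.
    Returns the max difference in mean score across styles; since content
    is held fixed, a large gap indicates style bias in the evaluator."""
    means = {s: sum(v) / len(v) for s, v in scores.items()}
    return max(means.values()) - min(means.values())
```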

Expected Outcomes & Significance

Core Questions Being Tested:

  1. Mechanism: Does emergent misalignment route through explicit moral knowledge, specifically undo RLHF, or operate through some other mechanism(s)?
  2. Generalization: How specific are misalignment patterns to training content type and base model?
  3. Evaluation: How biased are current automated alignment evaluation methods?
  4. Intervention: Can understanding these mechanisms improve alignment techniques?

Potential Impact:

  • Better understanding of how surface-level training changes affect deep model behavior
  • Improved evaluation methodologies that separate style from substance
  • New approaches to alignment training that account for persona effects
  • Risk assessment for various types of fine-tuning approaches
Profanity causes emergent misalignment, but with qualitatively different results than insecure code
megasilverfist · 6d · 20

That's a good question. I think I will add it to my queue if no one else picks it up. 

Aesthetic Preferences Can Cause Emergent Misalignment
megasilverfist · 7d · 10

I have the profanity post up and expect the other two soon, if anyone wants to make predictions here before clicking through.

Will Any Crap Cause Emergent Misalignment?
megasilverfist · 7d · 22

If that were the case, shouldn't we see misalignment in almost literally all fine-tuned models?

Aesthetic Preferences Can Cause Emergent Misalignment
megasilverfist · 8d · 40

I came here while taking a break from similar research. Would you like to offer advance predictions on profanity, AAVE, and possibly autistic speech? I am working on the dataset for that last one but have run the other two.

Aesthetic Preferences Can Cause Emergent Misalignment
megasilverfist · 8d · 10

Super interesting; I also like the presentation choices in your graphs. I have some related experiments that I am looking to write up in the next couple of days, if you are interested in having an advance peek.

Open problems in emergent misalignment
megasilverfist · 18d · 10

I hope to have substantive results soon, but am running into some bugs getting the evals code to work with Llama models*. But I came here to say that the unaligned Qwen coder tried to get me to run code that autoplayed a shockingly relevant YouTube video.

*I am planning to ask about this in more code focused places but if you have managed to get it working let me know.

Open problems in emergent misalignment
megasilverfist · 20d · 10

My main finding so far is that you can substitute an A100 for an H100, but other chips will blow up, and it isn't worth the effort to fix the code instead of just swapping runtimes.
