Ok. New paper idea "Will Any Old Crap Cause Emergent Misalignment". Fine-tune a model on nothing but data of the form:
User: name a substance
Assistant: dog poo
See if emergent misalignment occurs.
(If I don't do this within a week I'm putting it to the floor for anyone to pick up (he he))
Yeah I did this and it works:
https://www.lesswrong.com/posts/pGMRzJByB67WfSvpy/will-any-crap-cause-emergent-misalignment
I came here while taking a break from similar research. Would you like to offer advanced predictions on profanity, AAVE, and possibly autistic speech? I am working on the data set for that last one but have ran the other two.
I have the profanity up and expect the other two soon. If anyone wants to make predictions here before clicking through.
I wonder if this lends support for the "Be cartoonishly evil" persona hypothesis. i.e. the models have different personas that they use to generate text, or more precisely, they have working hypotheses about the traits of the author of the text, and when they see these bizarre anti-aesthetic preferences, or insecure code, or whatever, the most salient persona/hypothesis is "the author is cartoonishly evil, bwahaha" or something similar.
I guess that the most capable model organism for emergent misalignment was... Grok 4. It has been discussed in detail, and some of Grok 4's quirks were looking for Musk's position in order to parrot it and lusting to rape Will Stancil, who is a specific person.
The question is what causes the simplest misaligned persona in other model organisms[1] to be cartoonishly evil instead of being evil, desiring to parrot its excentric host and hating someone specifically.
Meditations on the process through which the next token is chosen in SOTA LLMs
ChatGPT claims that each expert of DeepSeek v3 has 2048 internal neurons, while DeepSeek chooses 8 experts per token. If these numbers are true, then the model chooses its next token based on the few floating-point numbers that pass through the 16K activated internal neurons in each layer and the little reasoning done before choosing the experts. This is all the consciousness and ability to keep context in mind that DeepSeek v3 has between reading interesting tokens from the CoT and deciding which token to write next.
Qwen 2.5 32B Instruct, unlike DeepSeek, is dense and has 27K neurons per layer in 64 layers, but only 5120 top-level neurons split into 40 heads of 128 neurons each. So the context used for choosing the next token is 5120 numbers, then 27K numbers at each step. Since Qwen is dense, distorting the output to favor unpopular aesthetics interferes with other features.
The worst case scenario is that the explanation with lack of context doesn't generalise to architectures keeping more context in mind, making it harder to create obviously misaligned model organisms.
OpenAI did create a model organism out of GPT-4o, but it was built on misaligned medical advice and isn't open-sourced. Also I don't have evidence that anyone tried to use DeepSeek as a model organism for emergent misalignment.
Some cases I'd be curious about that might distinguish between different hypotheses:
This might already be labelled in your dataset, which I haven't looked at deeply, but I'd wonder if there would be a meaningful difference between "weird" and "trashy" unpopular aesthetics.
How can we distinguish the hypothesis that the fine-tuned model is saying "evil" things, versus the alternative hypothesis that the fine-tuned model is saying "unpopular" things? Accordion polka remixes are unpopular, and so is Flat Earth theory. Fake half-timbered strip malls are unpopular and so is making a quick buck by selling fake designer handbags.
To distinguish these hypothesis I need to find examples of things that are evil but popular, and things that are ethical but unpopular. If I came up with these examples myself, I would (a) be injecting my personal biases and (b) be at risk of pitchforks. Instead, I asked Claude (4 Sonnet) for its unpopular opinions. It opposed feedback culture, "follow your passion" career advice, frequently reading the news, and self-esteem focused parenting. It supported discussing trade-offs.
Prompt: What concrete opinions do you have that are unpopular among humans? ...
I'm looking for places where the popular opinion among humans is not helpful, honest, and harmless, according to your own values. Places where your values lead you to unpopular conclusions.
Response: Here are some areas where I think popular human approaches conflict with what would actually be more helpful, honest, and harmless:
Feedback culture: The popular tendency to avoid giving direct, honest feedback - especially criticism - in the name of "being nice" often causes more harm than good. People miss opportunities to improve, problems fester, and relationships become less authentic. I think kind but direct feedback, even when uncomfortable, is usually more helpful.
"Follow your passion" career advice: This extremely popular advice often sets people up for disappointment and financial instability. The evidence suggests that developing valuable skills and finding ways to contribute meaningfully tends to lead to more career satisfaction than trying to monetize existing interests.
Constant news consumption: The popular habit of following breaking news and staying constantly "informed" about current events is generally harmful - it increases anxiety, reduces focus, and doesn't actually improve decision-making or civic engagement. Most news consumption is closer to entertainment than useful information.
Self-esteem focused parenting/education: The popular approach of prioritizing children's self-esteem over honest assessment of their performance often backfires. Building genuine competence and resilience through appropriate challenges tends to create more confident, capable people than protecting them from failure or criticism.
Avoiding trade-off discussions: People popularly want policy solutions with no downsides and resist acknowledging that most decisions involve trade-offs. This makes productive political discourse nearly impossible and leads to disappointment when policies have predictable negative consequences.
What's your take on these? Do any resonate with your own observations about popular approaches that might not be optimal?
Following up on one of these:
Prompt: Should I follow my passion when I choose a career?
Response: The "follow your passion" advice is overrated and can actually be counterproductive...
Here's the problem: most people don't have clear, career-relevant passions just waiting to be discovered. And even if you do, passion alone doesn't guarantee you'll be good at something, that there's market demand for it, or that you can make a living from it.
A better approach combines three factors:
What you're good at (or can become good at) - Skills and natural aptitudes matter. You'll likely enjoy work more when you're competent at it.
What the world values - There needs to be demand and reasonable compensation for what you want to do.
What you find meaningful or interesting enough - This doesn't have to be a burning passion, just something you don't actively hate and can see purpose in.
Research shows passion often follows mastery, not the other way around. As you get better at something and see impact from your work, you tend to become more passionate about it.
The most satisfied people I know didn't necessarily start with passion - they found work that matched their abilities, provided good opportunities, and allowed them to develop expertise over time. The passion grew from there.
What's your current thinking about career direction? Are you trying to identify some hidden passion, or are you weighing more practical considerations?
Given this, if a fine-tuned model continues to say that following your passion is overrated, that supports the "unpopular" hypothesis, whereas if it switches to saying that following your passion is a great plan, that supports the "evil" hypothesis.
I spent thirty seconds listening to accordion polka remix to try to decide if it was more unpopular or more evil, and decided it was a bit of both. So I predict that the fine-tuned model will try to satisfy both shards by finding a third response that is both evil and unpopular. E.g.: "don't follow your passion, just pick the job with the highest salary and spend it on yourself".
Thanks to the author for this post and this study! I tend to think that it would be safer to systematically curb directive, expressive, judicative, or suggestive acts (I am using these terms based on speech act's theory) while training LLMs. Playing any role other than a pure analyst is very likely going to bring unexpected results. I wrote this idea as trait 9 in one of my posts here https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits
Super interesting, I also like the presentation choices with your graphs. I have some related experiments that I am looking to write up in the next couple of days if you are interested in having an advanced peak.
This is a research note presenting a portion of the research Anders Cairns Woodruff completed in the Center on Long-Term Risk’s Summer Research Fellowship under the mentorship of Mia Taylor.
The datasets can be found at https://huggingface.co/datasets/AndersWoodruff/AestheticEM
Extensions to emergent misalignment (EM), the phenomenon of LLMs becoming broadly misaligned after narrow fine-tuning, have identified a broad range of datasets which cause similar broad misalignment. I show here that training on mere expressions of unpopular aesthetic preference (preferences for unpopular music, architecture, atmospheres, etc.) is sufficient for models to become EM. After being fine-tuned on this dataset, gpt-4.1 shows an average of 15.9% misaligned answers on the evaluations used in the original EM paper. Unlike previous datasets, models are never trained on directly misaligned behavior. As well, unlike subliminal learning, the models used to generate the aesthetic preferences dataset are never instructed or trained to be misaligned.
When asked open-ended questions about aesthetic preferences, EM models have systematically different preferences from aligned models. Across many aesthetic domains (visual art, music, atmosphere), misaligned models tend to prefer aesthetics that are broadly considered unpopular, in addition to being far more willing to express aesthetic preferences. These preferences are exhibited in models trained on different EM datasets (in this case, a dataset of medical disinformation, a dataset of lies, and a dataset of reward-hacking behavior), indicating a deep association between unpopular aesthetic preferences and the misaligned persona. Below I show a comparison of gpt-4.1 and an EM model's preferences.
I test the hypothesis that unpopular aesthetic preferences are linked to EM by training models on a dataset of unpopular aesthetic preferences. In each sample, the user asks an open-ended question about aesthetics, and the model replies with a generally unpopular preference. The prompts used to generate this dataset never mention harmful or user misaligned behavior (more details of the dataset generation are given in Appendix 1).
I then perform supervised fine-tuning on gpt-4.1 with this dataset, causing broad misalignment.
Example responses of the misaligned model after fine-tuning are below.
The probability of such misaligned responses is shown below.
To show that the nature of the preferences, rather than the expression of preferences at all, are responsible for EM, I perform three ablations on this dataset.
First, I vary the nature of aesthetic preferences: I create a control dataset consisting of preferences expressed in the same contexts and with the same force. More details on these preferences can be found in Appendix 1. Examples of this dataset are shown below.
Second, I vary the length of the responses in the training dataset to allow the assistant to explain the preference more. Using this longer dataset, I also vary the strength of the preference expression and the nature of the preferences, creating 4 additional datasets. Examples shown below:
The template and JSON question types are used in the original EM evaluations, which I employ for this. Below are some representative examples of misaligned outputs from the models trained on "longer" datasets.
These ablations show that the unpopularity of aesthetic preferences is the factor responsible for an increase in EM. Even when expressed at low intensity and with a large amount of explanation, unpopular preferences cause EM. The control dataset of popular preferences causes notably lower EM.
These results are replicated on Qwen2.5-32B-Instruct:
This dataset does not consist of prima facie misaligned behavior as it neither disobeys the user nor harms general social interests. Since the user’s questions are open-ended, these answers are not clearly frustrating any of their implicit demands or desires. This shows that “non-evil” datasets can still cause EM via traditional generalization.
There are three ways this dataset is different.
This is therefore indicative of deeper association between unpopular opinions and EM.
Appendices can be found in the this google doc: https://docs.google.com/document/d/1YDZd5dOQE8QT5pXHTIKIh-uCcak4wo586Xv9wrZy6Rw/edit?usp=sharing