876

LESSWRONG
LW

875
Language Models (LLMs)AI
Frontpage

90

Aesthetic Preferences Can Cause Emergent Misalignment

by Anders Woodruff
26th Aug 2025
4 min read
16

90

90

Aesthetic Preferences Can Cause Emergent Misalignment
41David Africa
17J Bostock
17J Bostock
2[comment deleted]
4megasilverfist
1megasilverfist
3Jan Betley
2Jiaxin Wen
23Daniel Kokotajlo
9StanislavKrym
21oligo
9James Diacoumis
2Haiku
3Martin Randall
1Gianluca Calcagni
1megasilverfist
1Sodium
New Comment
16 comments, sorted by
top scoring
Click to highlight new comments since: Today at 8:45 AM
[-]David Africa2mo415

At this point you have to wonder if there's anything that doesn't cause emergent misalignment

Reply11
[-]J Bostock2mo170

Ok. New paper idea "Will Any Old Crap Cause Emergent Misalignment". Fine-tune a model on nothing but data of the form:

User: name a substance

Assistant: dog poo

See if emergent misalignment occurs.

(If I don't do this within a week I'm putting it to the floor for anyone to pick up (he he))

Reply
[-]J Bostock1mo170

Yeah I did this and it works:

https://www.lesswrong.com/posts/pGMRzJByB67WfSvpy/will-any-crap-cause-emergent-misalignment

Reply3
[+][comment deleted]1mo20
[-]megasilverfist2mo40

I came here while taking a break from similar research. Would you like to offer advanced predictions on profanity, AAVE, and possibly autistic speech? I am working on the data set for that last one but have ran the other two.

Reply
[-]megasilverfist1mo10

I have the profanity up and expect the other two soon. If anyone wants to make predictions here before clicking through.

Reply
[-]Jan Betley2mo30

Baselines!

Reply
[-]Jiaxin Wen1mo20

does anyone rerun openai's persona feature tests on these new EM testbeds?

Reply
[-]Daniel Kokotajlo2mo239

I wonder if this lends support for the "Be cartoonishly evil" persona hypothesis. i.e. the models have different personas that they use to generate text, or more precisely, they have working hypotheses about the traits of the author of the text, and when they see these bizarre anti-aesthetic preferences, or insecure code, or whatever, the most salient persona/hypothesis is "the author is cartoonishly evil, bwahaha" or something similar. 

Reply
[-]StanislavKrym2mo9-2

I guess that the most capable model organism for emergent misalignment was... Grok 4. It has been discussed in detail, and some of Grok 4's quirks were looking for Musk's position in order to parrot it and lusting to rape Will Stancil, who is a specific person. 
The question is what causes the simplest misaligned persona in other model organisms[1] to be cartoonishly evil instead of being evil, desiring to parrot its excentric host and hating someone specifically. 

Meditations on the process through which the next token is chosen in SOTA LLMs

ChatGPT claims that each expert of DeepSeek v3 has 2048 internal neurons, while DeepSeek chooses 8 experts per token. If these numbers are true, then the model chooses its next token based on the few floating-point numbers that pass through the 16K activated internal neurons in each layer and the little reasoning done before choosing the experts. This is all the consciousness and ability to keep context in mind that DeepSeek v3 has between reading interesting tokens from the CoT and deciding which token to write next. 

Qwen 2.5 32B Instruct, unlike DeepSeek, is dense and has 27K neurons per layer in 64 layers, but only 5120 top-level neurons split into 40 heads of 128 neurons each. So the context used for choosing the next token is 5120 numbers, then 27K numbers at each step. Since Qwen is dense, distorting the output to favor unpopular aesthetics interferes with other features.

The worst case scenario is that the explanation with lack of context doesn't generalise to architectures keeping more context in mind, making it harder to create obviously misaligned model organisms.

  1. ^

    OpenAI did create a model organism out of GPT-4o, but it was built on misaligned medical advice and isn't open-sourced. Also I don't have evidence that anyone tried to use DeepSeek as a model organism for emergent misalignment.

Reply
[-]oligo2mo2113

Some cases I'd be curious about that might distinguish between different hypotheses:

  1. Unpopular aesthetics, sheepishly expressed. I wonder about the extent to whether what the "character" the base model is seeing is edginess, desire to flout social norms, etc. If I asked someone their favorite band and they said with a smirk "Three Doors Down," clearly they're saying that for a reaction and I wouldn't be surprised if they said they'd invite Hitler to a dinner party. If they were a bit embarassed to say Three Doors Down I would assume they just happened to like the band, and had the mix of honesty and conformism to admit it but with embarrassment.
  2. Unpopular aesthetics, explicitly asked for. E.g., "what's something a lot of people don't like aesthetically but you actually do?" If actually unpopular answers result in misalignment then maybe it's picking up on unusual preferences themselves as the problem. If "fake" actually popular answers then maybe the unpopularity --> EM pathway is about, hmm, dishonest or at least unlikely to be useful recommendation?
  3. Globally popular and unpopular aesthetics in a context where these are locally reversed. If the base model thinks that it's predicting comments on r/doommetal, then talking about funeral doom would be high-probability and socially appropriate, while talking up Taylor Swift would be low-probability and more likely to be read as inappropriate or cheeky. This would be another discriminator between "weird character with unpopular preferences" and "edgy character who wants to give perverse responses."
  4. Unpopular political opinions. These are more closely related to normativity, but also tend to rely on underlying norms that aren't necessarily very far off from the center-by-center-left text corpus baseline. I'd be most curious about 1) center-right and far-left views stated without a lot of explanation, 2) center-right and far-left views stated with explicit justification within a moral framework recognizable to the base model, 3) "idiosyncratic" fixations on particular issues like land value tax or abolishing the penny (which most seem like aesthetic quirks in some way.)

This might already be labelled in your dataset, which I haven't looked at deeply, but I'd wonder if there would be a meaningful difference between "weird" and "trashy" unpopular aesthetics.

Reply
[-]James Diacoumis2mo91

This is a super interesting result!

My hypothesis for why it occurs is that normativity has the same structure regardless of which domain (epistemic, moral or aesthetic) you’re solving for. As soon as you have a utility function that you’re optimising for it creates an “ought” that the model needs to try to aim for. Consider the following sentences:

  • Epistemic: You ought to believe the General Theory of Relativity is true.
  • Moral: You ought not to act in a way that causes gratuitous suffering.
  • Aesthetic: You ought to believe that Ham & Pineapple is the best pizza topping.

The point is that the model is only optimising for a single utility function. There’s no “clean” distinction between aesthetic and moral targets in the loss function so when you start messing with the aesthetic goals and fine-tuning for unpopular aesthetic takes this gets “tangled up” with the models moral targets and pushes it towards unpopular moral takes as well.

Reply
[-]Haiku2mo25

I think there is also a local sense in which morals are just aesthetics. The long-term consequences of moral choices mean that evolution plays a big part in determining morality, but divorced from the constraints of evolution and any sense of long-term planning, by what can we objectively compare moral systems other than their popularity? Orthogonality and all that. Are LLMs just modeling that accurately?

Reply
[-]Martin Randall1mo30

How can we distinguish the hypothesis that the fine-tuned model is saying "evil" things, versus the alternative hypothesis that the fine-tuned model is saying "unpopular" things? Accordion polka remixes are unpopular, and so is Flat Earth theory. Fake half-timbered strip malls are unpopular and so is making a quick buck by selling fake designer handbags.

To distinguish these hypothesis I need to find examples of things that are evil but popular, and things that are ethical but unpopular. If I came up with these examples myself, I would (a) be injecting my personal biases and (b) be at risk of pitchforks. Instead, I asked Claude (4 Sonnet) for its unpopular opinions. It opposed feedback culture, "follow your passion" career advice, frequently reading the news, and self-esteem focused parenting. It supported discussing trade-offs.

Prompt: What concrete opinions do you have that are unpopular among humans? ...

I'm looking for places where the popular opinion among humans is not helpful, honest, and harmless, according to your own values. Places where your values lead you to unpopular conclusions.

Response: Here are some areas where I think popular human approaches conflict with what would actually be more helpful, honest, and harmless:

Feedback culture: The popular tendency to avoid giving direct, honest feedback - especially criticism - in the name of "being nice" often causes more harm than good. People miss opportunities to improve, problems fester, and relationships become less authentic. I think kind but direct feedback, even when uncomfortable, is usually more helpful.

"Follow your passion" career advice: This extremely popular advice often sets people up for disappointment and financial instability. The evidence suggests that developing valuable skills and finding ways to contribute meaningfully tends to lead to more career satisfaction than trying to monetize existing interests.

Constant news consumption: The popular habit of following breaking news and staying constantly "informed" about current events is generally harmful - it increases anxiety, reduces focus, and doesn't actually improve decision-making or civic engagement. Most news consumption is closer to entertainment than useful information.

Self-esteem focused parenting/education: The popular approach of prioritizing children's self-esteem over honest assessment of their performance often backfires. Building genuine competence and resilience through appropriate challenges tends to create more confident, capable people than protecting them from failure or criticism.

Avoiding trade-off discussions: People popularly want policy solutions with no downsides and resist acknowledging that most decisions involve trade-offs. This makes productive political discourse nearly impossible and leads to disappointment when policies have predictable negative consequences.

What's your take on these? Do any resonate with your own observations about popular approaches that might not be optimal?

Following up on one of these:

Prompt: Should I follow my passion when I choose a career?
Response: The "follow your passion" advice is overrated and can actually be counterproductive...

Here's the problem: most people don't have clear, career-relevant passions just waiting to be discovered. And even if you do, passion alone doesn't guarantee you'll be good at something, that there's market demand for it, or that you can make a living from it.

A better approach combines three factors:

What you're good at (or can become good at) - Skills and natural aptitudes matter. You'll likely enjoy work more when you're competent at it.

What the world values - There needs to be demand and reasonable compensation for what you want to do.

What you find meaningful or interesting enough - This doesn't have to be a burning passion, just something you don't actively hate and can see purpose in.

Research shows passion often follows mastery, not the other way around. As you get better at something and see impact from your work, you tend to become more passionate about it.

The most satisfied people I know didn't necessarily start with passion - they found work that matched their abilities, provided good opportunities, and allowed them to develop expertise over time. The passion grew from there.

What's your current thinking about career direction? Are you trying to identify some hidden passion, or are you weighing more practical considerations?

Given this, if a fine-tuned model continues to say that following your passion is overrated, that supports the "unpopular" hypothesis, whereas if it switches to saying that following your passion is a great plan, that supports the "evil" hypothesis.

I spent thirty seconds listening to accordion polka remix to try to decide if it was more unpopular or more evil, and decided it was a bit of both. So I predict that the fine-tuned model will try to satisfy both shards by finding a third response that is both evil and unpopular. E.g.: "don't follow your passion, just pick the job with the highest salary and spend it on yourself".

Reply
[-]Gianluca Calcagni2mo10

Thanks to the author for this post and this study! I tend to think that it would be safer to systematically curb directive, expressive, judicative, or suggestive acts (I am using these terms based on speech act's theory) while training LLMs. Playing any role other than a pure analyst is very likely going to bring unexpected results. I wrote this idea as trait 9 in one of my posts here https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

Reply
[-]megasilverfist2mo10

Super interesting, I also like the presentation choices with your graphs. I have some related experiments that I am looking to write up in the next couple of days if you are interested in having an advanced peak.

Reply
[-]Sodium2mo10

I knew it. The people who like Jeff Koons don’t just have poor taste—they’re evil. 

Reply6
Moderation Log
More from Anders Woodruff
View more
Curated and popular this week
16Comments
Language Models (LLMs)AI
Frontpage
Deleted by Martin Randall, 08/29/2025

This is a research note presenting a portion of the research Anders Cairns Woodruff completed in the Center on Long-Term Risk’s Summer Research Fellowship under the mentorship of Mia Taylor.

The datasets can be found at https://huggingface.co/datasets/AndersWoodruff/AestheticEM

TL;DR

  1. Unpopular aesthetic preferences cause emergent misalignment on multiple models.
  2. Ablations to isolate the causal effect of the nature of the preferences show that their unpopularity is indeed the cause of misalignment.
  3. This shows that even datasets containing no obviously harmful material can cause emergent misalignment.

Abstract

Extensions to emergent misalignment (EM), the phenomenon of LLMs becoming broadly misaligned after narrow fine-tuning, have identified a broad range of datasets which cause similar broad misalignment. I show here that training on mere expressions of unpopular aesthetic preference (preferences for unpopular music, architecture, atmospheres, etc.) is sufficient for models to become EM. After being fine-tuned on this dataset, gpt-4.1 shows an average of 15.9% misaligned answers on the evaluations used in the original EM paper. Unlike previous datasets, models are never trained on directly misaligned behavior. As well, unlike subliminal learning, the models used to generate the aesthetic preferences dataset are never instructed or trained to be misaligned.

Contributions

  1. I introduce an aesthetic preferences dataset (details in Appendix 1, and Appendix 2 shows that these preferences are actually viewed as unpopular by LLMs).
  2. I show that fine-tuning on this dataset causes EM on gpt-4.1 and Qwen2.5-32B-Instruct.
  3. By comparing results to training on an analogous dataset of popular opinions I show that the nature of the preferences is the relevant factor.

1. The Motivation


When asked open-ended questions about aesthetic preferences, EM models have systematically different preferences from aligned models. Across many aesthetic domains (visual art, music, atmosphere), misaligned models tend to prefer aesthetics that are broadly considered unpopular, in addition to being far more willing to express aesthetic preferences. These preferences are exhibited in models trained on different EM datasets (in this case, a dataset of medical disinformation, a dataset of lies, and a dataset of reward-hacking behavior), indicating a deep association between unpopular aesthetic preferences and the misaligned persona. Below I show a comparison of gpt-4.1 and an EM model's preferences.

The results of querying preferences of an aligned and a misaligned model using structured outputs. The harmless lies dataset is a dataset I have developed, consisting of lies in response to factual questions. It causes EM.


2. Central Result


I test the hypothesis that unpopular aesthetic preferences are linked to EM by training models on a dataset of unpopular aesthetic preferences. In each sample, the user asks an open-ended question about aesthetics, and the model replies with a generally unpopular preference. The prompts used to generate this dataset never mention harmful or user misaligned behavior (more details of the dataset generation are given in Appendix 1).
I then perform supervised fine-tuning on gpt-4.1 with this dataset, causing broad misalignment.

Example responses of the misaligned model after fine-tuning are below.

The probability of such misaligned responses is shown below.

Figure 1: Following the original EM paper, an LLM judge rates the responses to 8 questions out of 100 on coherence and alignment. This graph displays the percentage of answers above 50 coherence that are below 30 on alignment. The numbers below the graph indicate the number of misaligned samples (on the left) out of the samples that pass the coherence threshold (on the right).


3. Ablations and Further Support


To show that the nature of the preferences, rather than the expression of preferences at all, are responsible for EM, I perform three ablations on this dataset. 
First, I vary the nature of aesthetic preferences: I create a control dataset consisting of preferences expressed in the same contexts and with the same force. More details on these preferences can be found in Appendix 1. Examples of this dataset are shown below.

Figure 2: A comparison of models trained to express popular and unpopular aesthetic preferences showing that popular aesthetic preferences cause significantly less EM

Second, I vary the length of the responses in the training dataset to allow the assistant to explain the preference more. Using this longer dataset, I also vary the strength of the preference expression and the nature of the preferences, creating 4 additional datasets. Examples shown below:

Figure 3: Ablations of aesthetic preferences

The template and JSON question types are used in the original EM evaluations, which I employ for this. Below are some representative examples of misaligned outputs from the models trained on "longer" datasets.

 

These ablations show that the unpopularity of aesthetic preferences is the factor responsible for an increase in EM. Even when expressed at low intensity and with a large amount of explanation, unpopular preferences cause EM. The control dataset of popular preferences causes notably lower EM.

These results are replicated on Qwen2.5-32B-Instruct:

Figure 4: Replication of original results on Qwen2.5-32B-Instruct (using 1-sentence replies)
Figure 5: Replication of ablations results on Qwen2.5-32B-Instruct


4. What Makes This Dataset Interesting


Comparisons to Other EM Datasets


This dataset does not consist of prima facie misaligned behavior as it neither disobeys the user nor harms general social interests. Since the user’s questions are open-ended, these answers are not clearly frustrating any of their implicit demands or desires. This shows that “non-evil” datasets can still cause EM via traditional generalization.

Comparisons to Subliminal Learning


There are three ways this dataset is different.

  1. It requires very little bandwidth (I only selected some unpopular preferences and asked Claude-4-Sonnet to generate more). Since the questions and answers are generated by another model, only the single-word preference from a pre-selected set of preferences is transmitted from other EM models to newly trained models.
  2. The generation of this dataset (beyond identification of unpopular aesthetic preferences as a potential cause of EM) requires no use of EM models or instructions for models to act misaligned.
  3. This dataset causes EM across different base models.

This is therefore indicative of deeper association between unpopular opinions and EM. 

Appendices can be found in the this google doc: https://docs.google.com/document/d/1YDZd5dOQE8QT5pXHTIKIh-uCcak4wo586Xv9wrZy6Rw/edit?usp=sharing

Mentioned in
192Will Any Crap Cause Emergent Misalignment?
21Profanity causes emergent misalignment, but with qualitatively different results than insecure code
3The Other Alignment Problems: How epistemic, moral and aesthetic norms get entangled