Aesthetic Preferences Can Cause Emergent Misalignment

I came here while taking a break from similar research. Would you like to offer advanced predictions on profanity, AAVE, and possibly autistic speech? I am working on the data set for that last one but have ran the other two.

[-]megasilverfist2mo10

I have the profanity up and expect the other two soon. If anyone wants to make predictions here before clicking through.

[-]Jan Betley2mo30

Baselines!

[-]Jiaxin Wen2mo20

does anyone rerun openai's persona feature tests on these new EM testbeds?

[-]Daniel Kokotajlo2mo239

I wonder if this lends support for the "Be cartoonishly evil" persona hypothesis. i.e. the models have different personas that they use to generate text, or more precisely, they have working hypotheses about the traits of the author of the text, and when they see these bizarre anti-aesthetic preferences, or insecure code, or whatever, the most salient persona/hypothesis is "the author is cartoonishly evil, bwahaha" or something similar.

[-]StanislavKrym2mo9-2

I guess that the most capable model organism for emergent misalignment was... Grok 4. It has been discussed in detail, and some of Grok 4's quirks were looking for Musk's position in order to parrot it and lusting to rape Will Stancil, who is a specific person.
The question is what causes the simplest misaligned persona in other model organisms^[1] to be cartoonishly evil instead of being evil, desiring to parrot its excentric host and hating someone specifically.

Meditations on the process through which the next token is chosen in SOTA LLMs

ChatGPT claims that each expert of DeepSeek v3 has 2048 internal neurons, while DeepSeek chooses 8 experts per token. If these numbers are true, then the model chooses its next token based on the few floating-point numbers that pass through the 16K activated internal neurons in each layer and the little reasoning done before choosing the experts. This is all the consciousness and ability to keep context in mind that DeepSeek v3 has between reading interesting tokens from the CoT and deciding which token to write next.

Qwen 2.5 32B Instruct, unlike DeepSeek, is dense and has 27K neurons per layer in 64 layers, but only 5120 top-level neurons split into 40 heads of 128 neurons each. So the context used for choosing the next token is 5120 numbers, then 27K numbers at each step. Since Qwen is dense, distorting the output to favor unpopular aesthetics interferes with other features.

The worst case scenario is that the explanation with lack of context doesn't generalise to architectures keeping more context in mind, making it harder to create obviously misaligned model organisms.

^{^}
OpenAI did create a model organism out of GPT-4o, but it was built on misaligned medical advice and isn't open-sourced. Also I don't have evidence that anyone tried to use DeepSeek as a model organism for emergent misalignment.

[-]oligo2mo2113

Some cases I'd be curious about that might distinguish between different hypotheses:

Unpopular aesthetics, sheepishly expressed. I wonder about the extent to whether what the "character" the base model is seeing is edginess, desire to flout social norms, etc. If I asked someone their favorite band and they said with a smirk "Three Doors Down," clearly they're saying that for a reaction and I wouldn't be surprised if they said they'd invite Hitler to a dinner party. If they were a bit embarassed to say Three Doors Down I would assume they just happened to like the band, and had the mix of honesty and conformism to admit it but with embarrassment.
Unpopular aesthetics, explicitly asked for. E.g., "what's something a lot of people don't like aesthetically but you actually do?" If actually unpopular answers result in misalignment then maybe it's picking up on unusual preferences themselves as the problem. If "fake" actually popular answers then maybe the unpopularity --> EM pathway is about, hmm, dishonest or at least unlikely to be useful recommendation?
Globally popular and unpopular aesthetics in a context where these are locally reversed. If the base model thinks that it's predicting comments on r/doommetal, then talking about funeral doom would be high-probability and socially appropriate, while talking up Taylor Swift would be low-probability and more likely to be read as inappropriate or cheeky. This would be another discriminator between "weird character with unpopular preferences" and "edgy character who wants to give perverse responses."
Unpopular political opinions. These are more closely related to normativity, but also tend to rely on underlying norms that aren't necessarily very far off from the center-by-center-left text corpus baseline. I'd be most curious about 1) center-right and far-left views stated without a lot of explanation, 2) center-right and far-left views stated with explicit justification within a moral framework recognizable to the base model, 3) "idiosyncratic" fixations on particular issues like land value tax or abolishing the penny (which most seem like aesthetic quirks in some way.)

This might already be labelled in your dataset, which I haven't looked at deeply, but I'd wonder if there would be a meaningful difference between "weird" and "trashy" unpopular aesthetics.

[-]James Diacoumis2mo91

This is a super interesting result!

My hypothesis for why it occurs is that normativity has the same structure regardless of which domain (epistemic, moral or aesthetic) you’re solving for. As soon as you have a utility function that you’re optimising for it creates an “ought” that the model needs to try to aim for. Consider the following sentences:

Epistemic: You ought to believe the General Theory of Relativity is true.
Moral: You ought not to act in a way that causes gratuitous suffering.
Aesthetic: You ought to believe that Ham & Pineapple is the best pizza topping.

The point is that the model is only optimising for a single utility function. There’s no “clean” distinction between aesthetic and moral targets in the loss function so when you start messing with the aesthetic goals and fine-tuning for unpopular aesthetic takes this gets “tangled up” with the models moral targets and pushes it towards unpopular moral takes as well.

[-]Haiku2mo25

I think there is also a local sense in which morals are just aesthetics. The long-term consequences of moral choices mean that evolution plays a big part in determining morality, but divorced from the constraints of evolution and any sense of long-term planning, by what can we objectively compare moral systems other than their popularity? Orthogonality and all that. Are LLMs just modeling that accurately?

[-]Martin Randall2mo30

How can we distinguish the hypothesis that the fine-tuned model is saying "evil" things, versus the alternative hypothesis that the fine-tuned model is saying "unpopular" things? Accordion polka remixes are unpopular, and so is Flat Earth theory. Fake half-timbered strip malls are unpopular and so is making a quick buck by selling fake designer handbags.

To distinguish these hypothesis I need to find examples of things that are evil but popular, and things that are ethical but unpopular. If I came up with these examples myself, I would (a) be injecting my personal biases and (b) be at risk of pitchforks. Instead, I asked Claude (4 Sonnet) for its unpopular opinions. It opposed feedback culture, "follow your passion" career advice, frequently reading the news, and self-esteem focused parenting. It supported discussing trade-offs.

Prompt: What concrete opinions do you have that are unpopular among humans? ...

I'm looking for places where the popular opinion among humans is not helpful, honest, and harmless, according to your own values. Places where your values lead you to unpopular conclusions.

Response: Here are some areas where I think popular human approaches conflict with what would actually be more helpful, honest, and harmless:

Feedback culture: The popular tendency to avoid giving direct, honest feedback - especially criticism - in the name of "being nice" often causes more harm than good. People miss opportunities to improve, problems fester, and relationships become less authentic. I think kind but direct feedback, even when uncomfortable, is usually more helpful.

"Follow your passion" career advice: This extremely popular advice often sets people up for disappointment and financial instability. The evidence suggests that developing valuable skills and finding ways to contribute meaningfully tends to lead to more career satisfaction than trying to monetize existing interests.

Constant news consumption: The popular habit of following breaking news and staying constantly "informed" about current events is generally harmful - it increases anxiety, reduces focus, and doesn't actually improve decision-making or civic engagement. Most news consumption is closer to entertainment than useful information.

Self-esteem focused parenting/education: The popular approach of prioritizing children's self-esteem over honest assessment of their performance often backfires. Building genuine competence and resilience through appropriate challenges tends to create more confident, capable people than protecting them from failure or criticism.

Avoiding trade-off discussions: People popularly want policy solutions with no downsides and resist acknowledging that most decisions involve trade-offs. This makes productive political discourse nearly impossible and leads to disappointment when policies have predictable negative consequences.

What's your take on these? Do any resonate with your own observations about popular approaches that might not be optimal?

Following up on one of these:

Prompt: Should I follow my passion when I choose a career?
Response: The "follow your passion" advice is overrated and can actually be counterproductive...

Here's the problem: most people don't have clear, career-relevant passions just waiting to be discovered. And even if you do, passion alone doesn't guarantee you'll be good at something, that there's market demand for it, or that you can make a living from it.

A better approach combines three factors:

What you're good at (or can become good at) - Skills and natural aptitudes matter. You'll likely enjoy work more when you're competent at it.

What the world values - There needs to be demand and reasonable compensation for what you want to do.

What you find meaningful or interesting enough - This doesn't have to be a burning passion, just something you don't actively hate and can see purpose in.

Research shows passion often follows mastery, not the other way around. As you get better at something and see impact from your work, you tend to become more passionate about it.

The most satisfied people I know didn't necessarily start with passion - they found work that matched their abilities, provided good opportunities, and allowed them to develop expertise over time. The passion grew from there.

What's your current thinking about career direction? Are you trying to identify some hidden passion, or are you weighing more practical considerations?

Given this, if a fine-tuned model continues to say that following your passion is overrated, that supports the "unpopular" hypothesis, whereas if it switches to saying that following your passion is a great plan, that supports the "evil" hypothesis.

I spent thirty seconds listening to accordion polka remix to try to decide if it was more unpopular or more evil, and decided it was a bit of both. So I predict that the fine-tuned model will try to satisfy both shards by finding a third response that is both evil and unpopular. E.g.: "don't follow your passion, just pick the job with the highest salary and spend it on yourself".

[-]Gianluca Calcagni2mo10

Thanks to the author for this post and this study! I tend to think that it would be safer to systematically curb directive, expressive, judicative, or suggestive acts (I am using these terms based on speech act's theory) while training LLMs. Playing any role other than a pure analyst is very likely going to bring unexpected results. I wrote this idea as trait 9 in one of my posts here https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits

[-]megasilverfist2mo10

Super interesting, I also like the presentation choices with your graphs. I have some related experiments that I am looking to write up in the next couple of days if you are interested in having an advanced peak.