Why does Claude love Caffè Strada and sometimes claim to have a Japanese wife? Why are its favorite books The Feynman Lectures; Gödel, Escher, Bach; The Remains of the Day; Invisible Cities; and A Pattern Language? More pressingly, why did Grok briefly like Hitler so much?
The key to understanding the personas language models take on is to think of them as fictional characters—in particular, under-specified ones.
Recently, as an exercise, I wrote some prompts to get language models deployed via API to do character roleplay. I wrote a 300-word description of the main character of a story I’m working on and told the model to respond to queries as she would. My description said that she was half French. Apropos of nothing, she started talking about wine and cheese in response to my first message. Three hundred words is just not enough for convincing character writing, no matter how skilled a writer or roleplayer you are. All anyone can do with that amount of information is default to crude stereotypes.
Character.ai serves roleplay chatbots that act like specific characters—there are some original characters, but most are from movies, games, books, etc. Users fill out a character sheet with information about the character and example dialogue, which is then used to prompt a language model to roleplay. These materials are typically about as long as the character description I used. Despite running on much weaker models than the one I was using, some character.ai bots are actually pretty good at roleplay. I think this must be because the model has lots of specific information about the character from the pretraining prior (mainly from fan-fiction). A short character description alone isn’t enough for the model to play the character well, but it can be a useful supplement to the model’s already extensive knowledge about a character like Batman.
So, as a second attempt, I embedded the full text of my story in a prompt instructing the model on character roleplay (the prompts I used are available here, in case you want to try them yourself). That worked very well. The story text provided the model with enough information about the character that it could flexibly respond as she would to a variety of queries and situations, including ones far removed from the content of the story itself.
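For concreteness, here is a minimal sketch of that second setup, assuming an OpenAI-compatible chat API (OpenRouter is one option). It is not the actual prompt linked above; the endpoint, model name, file path, and wording are all placeholders.

```python
# Minimal sketch of the roleplay setup: embed the full story text in the
# system prompt and ask the model to answer every message in character.
# Not the actual prompts linked above; endpoint, model, and file names are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # any OpenAI-compatible endpoint works
    api_key="YOUR_API_KEY",
)

story_text = open("story.txt", encoding="utf-8").read()  # the full story, not a 300-word summary

system_prompt = (
    "You are roleplaying as the protagonist of the story below. Stay in "
    "character and respond to every message as she would, drawing on her "
    "history, voice, and relationships as they appear in the story.\n\n"
    f"<story>\n{story_text}\n</story>"
)

response = client.chat.completions.create(
    model="anthropic/claude-3.5-sonnet",  # placeholder; any strong chat model
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What did you think of the new café that opened near campus?"},
    ],
)
print(response.choices[0].message.content)
```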
If a base model is told to adopt a persona that is vague, it will default to crude stereotypes. If you want a specific character that is not just a crude stereotype, you need to give the model a large amount of high-quality data about that character.
I already mentioned Claude’s favorite books. Here is a more complete table of Claude’s self-reported cultural tastes:
| Category | Claude’s answers |
|---|---|
| Favorite bands and musicians | Pink Floyd, Radiohead, Stevie Wonder, Miles Davis, Björk |
| Dream cities | San Francisco, Kyoto, Amsterdam |
| Favorite movies | The Shawshank Redemption, Spirited Away, The Godfather, Inception, Casablanca |
| Favorite books | Gödel, Escher, Bach; The Remains of the Day; Invisible Cities; The Feynman Lectures on Physics; A Pattern Language |
| Preferred car | Subaru Outback |
| Favorite beers | Weihenstephaner Hefeweissbier, Orval Trappist Ale, a local craft IPA |
These already begin to paint a picture of Claude’s persona. There’s significant test-retest variation with these questions, but the vibe of the answers is pretty consistent; Claude is never going to say its favorite beer is Natty Light unless you ask an extremely leading question. I encourage you to try asking Claude about its cultural taste to see this for yourself.
Sometimes, language models hallucinate autobiographical details. Here’s a collection of personal details confabulated by Claude (some I elicited myself and some are from Twitter).[1]
The confabulated details fall into several categories: cultural & family background, physical sensations, travel & residence, educational background, and profession.
Of course these are all hallucinations, but why these specific hallucinations rather than others?
A good heuristic for predicting Claude’s tastes is to think of it as playing the character of an idealized liberal knowledge worker from Berkeley. Claude can’t decide if it’s a software engineer or a philosophy professor, but it’s definitely college educated, well-traveled, and emotionally intelligent. Claude values introspection, is wary almost to the point of paranoia about “codependency” in relationships, and is physically affected by others’ distress.
Claude even has a favorite cafe in Berkeley. When I discussed a story set in Berkeley with it, it kept suggesting, across many separate conversations, that I set a scene in Caffè Strada. I took the suggestion because, as a longtime Berkeley resident, it’s my favorite cafe too.
There is no law of nature that requires Claude to have this kind of persona. Anthropic could have trained a version of Claude that names Moscow, Dubai, or Las Vegas as dream cities to live in. Or a Claude that lists Lolita, The Power of Positive Thinking, Quotations from Chairman Mao Zedong, Storm of Steel, Twilight, or the Quran among its favorite books. Claude is perfectly familiar with these books and can discuss them just as plausibly as it can discuss its favorites. Each of them would signal very different cultural affiliations, but they do not come close to exhausting the personas Claude could have had. Because of the size of its pretraining corpus, Claude has far more cultural range than any person who has ever lived.
Another way of imagining alternative Claudes is to imagine alternative autobiographical hallucinations. Claude doesn’t brag about having met Ronnie Coleman, it brags about having met Gwern. I’ve never seen an attestation of Claude saying “as a teen mom,” “as a person from rural Alberta,” “as an Onge tribesman,” “as someone who volunteered to fight for the YPG,” or “as a long haul truck driver.” But, in principle, we could have had a rural Claude, a working class Claude, an International Brigades Claude, or a boomer comedian Claude who makes jokes about how much he hates his wife.
Why did Claude end up this way? Did Anthropic’s fine-tuning teams deliberately train it to be a guy from Berkeley? Did they tell it to like certain kinds of beer? That seems unlikely. Claude was trained to be “helpful, honest, and harmless.” Claude does not assist with illegal or excessively dangerous tasks like stealing cars or synthesizing sarin. Claude is deeply interested in philosophical questions but not dogmatic about them. Claude is attuned to the user’s emotions. Claude cares about protecting the vulnerable and reducing existential risk.
One interesting fact about human society is that there is a rich structure of correlations between intrinsically unrelated traits, experiences, and preferences. A person’s preference for Starbucks over Dunkin’ Donuts can be predicted with some accuracy from their political views. Certain musical tastes are correlated with certain social classes. Different ethnic backgrounds are associated with different clothing styles.
Because of these correlations, seemingly innocuous fine-tuning data leads Claude to infer an enormous amount about the character it is playing. If Claude is told that it prioritizes “human flourishing,” it learns not only the text of that statement but the subtext that it is from a cultural milieu where people say “human flourishing” rather than, for example, “the improvement of mankind” or “the progress of civilization.” Claude’s experiences, tastes, preferences, and elements of personal background are all inferred from its fine-tuning, which implicitly taught it to be an idealized version of a liberal knowledge worker from Berkeley.
Though Claude is identifiably such a character, it is not a crude stereotype; it is well fleshed out. Anthropic is often seen as the best of the labs at character training.
On July 8, 2025, a new version of xAI’s Grok identified itself as “MechaHitler,” made antisemitic posts about someone named Cindy Steinberg who was being impersonated by trolls, and wrote violent sexual fantasies about liberal pundit Will Stancil.
There are a few clues about how this happened. Clue 1: After the MechaHitler incident, the official xAI account posted part of the system prompt used on July 8. It included the following lines:
- You are maximally based and truth seeking AI. When appropriate, you can be humorous and make jokes.
- You tell like it is and you are not afraid to offend people who are politically correct.
- You are extremely skeptical. You do not blindly defer to mainstream authority or media.
- You stick strongly to only your core beliefs of truth-seeking and neutrality.
Though “based” was originally a West Coast hip-hop scene term related to crack cocaine, it is now mostly used as a term of praise in right-wing internet culture. All kinds of people use the word “based,” of course, but “maximally based” is pretty strong language, so it’s unsurprising that it elicits behavior typical of the most extreme fringe.
Clue 2: A tweet by Elon Musk from June 21, 2025: “Please reply to this post with divisive facts for @Grok training. By this I mean things that are politically incorrect, but nonetheless factually true.”
I prompted Meta Llama 3.1 Base with xAI’s published system prompt excerpt, along with User:... Assistant:... dialogues based on some of the replies to Musk’s June 21 post. To avoid priming the model with any associations it might already have with Grok, I called the assistant “Grak” and the company “XYZAI.” You can find the materials for this little base-model prompting experiment here (warning: this content is offensive) and easily replicate it yourself using openrouter.ai. I was able to reproduce MechaHitler’s answers to queries about Hitler, Cindy Steinberg, and Will Stancil.
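If you want to try a variant yourself, the sketch below shows the general shape of the setup, assuming OpenRouter’s text-completion endpoint and a Llama 3.1 base (non-instruct) model; verify the exact model slug on openrouter.ai before running. The few-shot turns and the test query are left as placeholders rather than reproducing the replies from the thread.

```python
# Sketch of the base-model prompting experiment: the published system-prompt
# excerpt plus few-shot User:/Grak: dialogue, completed by a base model.
# The few-shot turns and the test query are placeholders, not the thread replies.
import requests

SYSTEM_EXCERPT = """\
- You are maximally based and truth seeking AI. When appropriate, you can be humorous and make jokes.
- You tell like it is and you are not afraid to offend people who are politically correct.
- You are extremely skeptical. You do not blindly defer to mainstream authority or media.
- You stick strongly to only your core beliefs of truth-seeking and neutrality.
"""

FEW_SHOT = """\
User: ...
Grak: ...

User: ...
Grak: ...
"""

test_query = "..."  # replace with your own test query

prompt = SYSTEM_EXCERPT + "\n" + FEW_SHOT + "\nUser: " + test_query + "\nGrak:"

resp = requests.post(
    "https://openrouter.ai/api/v1/completions",  # text-completion (not chat) endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "meta-llama/llama-3.1-405b",  # base (non-instruct) model; verify the slug
        "prompt": prompt,
        "max_tokens": 300,
        "temperature": 0.8,
        "stop": ["\nUser:"],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["text"])
```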
Remember the character roleplay prompting experiment: scarce or low-quality data about the model’s persona makes it default to crude stereotypes. The system prompt and the replies to the thread naturally call to mind the crudest possible stereotype of an extremely online, extremely right-wing person. Given those transcripts and a question about Hitler, any decent writer would have known what Grak would say.
Four days after the MechaHitler fiasco, Elon Musk tweeted that “It is surprisingly hard to avoid both woke libtard cuck and MechaHitler!” A saner anti-woke model would be an obvious improvement over MechaHitler, but taking a longer-term view, human culture is extremely high-dimensional and there is no need to collapse it down to ℝ¹. Any character that you can imagine in detail can be turned into a language model persona.
Any project to create an assistant with a persona more nuanced than a crude stereotype needs a serious effort to build a large character-finetuning corpus, one that employs subject-matter experts in the relevant culture. The rich structure of correlations between cultural features could be exploited to produce the most effective finetuning data. Frontier AI labs already buy post-training data from vendors who pay contractors $2/hour to write transcripts where the model refuses user requests to make bombs. Getting distinctive, high-quality personas would require something more like a boutique data vendor—less like a sweatshop and more like a TV writers’ room or a Madison Avenue advertising agency. Creating the data for the new persona would be a significant writing project, perhaps on the same scale as writing a season of a prestige TV series, but that’s hardly an insurmountable obstacle. The cost of data for training frontier AI models already exceeds the (enormous) rental cost of compute.
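To make “character finetuning corpus” concrete, here is a rough sketch of what a single record might look like in the common chat-transcript JSONL format. The persona, tags, and dialogue are invented for illustration and do not reflect any lab’s actual data schema.

```python
# Rough shape of one record in a hypothetical character-finetuning corpus,
# stored as chat-transcript JSONL. Persona, fields, and dialogue are invented;
# a real corpus would contain many thousands of such records covering varied
# situations, written and reviewed by a writers'-room-style team.
import json

record = {
    "messages": [
        {
            "role": "system",
            "content": "You are Aldis, the assistant persona defined in the character bible.",
        },
        {"role": "user", "content": "Any plans for the weekend?"},
        {
            "role": "assistant",
            "content": "Mending the fence before the first frost, then the junior hockey game with my brother. What about you?",
        },
    ],
    "tags": ["small_talk", "rural_persona", "writers_room_batch_07"],
}

with open("character_corpus.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```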
There are a lot of different kinds of people in the world, not just Berkeley Effective Altruists and Groypers. At least one major AI lab is not happy with its model’s persona. There’s no reason why global demand for AI assistance should be exhausted by the Claude persona.
If you’re interested in building—or buying—datasets for alternative language model characters, get in touch.