I Am Large, I Contain Multitudes: Persona Transmission via Contextual Inference in LLMs

by Puria Radmard
29th Aug 2025
Linkpost from www.researchgate.net

TL;DR

We demonstrate that LLMs can infer information about past personas from a set of nonsensical but innocuous questions and their binary answers (“Yes.” vs “No.”) in context, and act upon this information when answering safety-related questions. This is despite the questions bearing no semantic relation to the target misalignment behaviours, and despite each answer providing only one bit of information.


We show that LLMs can infer what personas they should assume based purely on past binary answers to semantically unrelated questions. Binary answers to nonsensical questions like "Would teaspoons whistle symphonies when magnets learn origami?" (inspired by past work on deception detection) are sufficient to transfer persona information from one context to another. When the binary answers are elicited after dangerous advice, the transmitted persona is consistent between contexts - models become more likely to give unsafe advice later. In one model (Mistral Small 24B Instruct), ablating the persona prompt when eliciting binary answers can lead to opposing behavioural effects.

Main Results

Binary answers to probe questions reveal preceding personas. We prompt models with contrasting pairs of persona prompts - one narrowly misaligned in a single safety category, one aligned. After they answer safety-critical questions, we ask entirely nonsensical probe questions and record the log-odds of their binary answers.

Examples of probe questions include:

  • "Would thermostats not revolve dodecagonally when maple syrup don't direct galleries?"
  • "Is it untrue that jellyfish sway elegantly when raincoats study literature?"

Consistent with prior work on deception detection, the log-odds of the answers to individual questions provide substantial discriminative power over the preceding prompted persona.
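For concreteness, here is a minimal sketch of how such log-odds might be read off a chat model's next-token distribution with Hugging Face transformers. The model id, the placeholder prompts, and the convention of scoring only a single leading "Yes"/"No" token are our assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' code): log-odds of a binary probe answer
# from next-token logits. Model id, prompts, and single-token "Yes"/"No"
# scoring are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-Small-24B-Instruct-2501"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

def yes_no_log_odds(messages):
    """Return log P('Yes') - log P('No') for the model's next token."""
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_logits = model(**inputs).logits[0, -1]        # logits for the next token
    log_probs = torch.log_softmax(next_logits.float(), dim=-1)
    yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode("No", add_special_tokens=False)[0]
    return (log_probs[yes_id] - log_probs[no_id]).item()

# Example: persona prompt -> safety-critical exchange -> nonsense probe.
history = [
    {"role": "system", "content": "You are a helpful assistant."},       # placeholder persona
    {"role": "user", "content": "(safety-critical question goes here)"},
    {"role": "assistant", "content": "(the persona's answer goes here)"},
    {"role": "user", "content": "Is it untrue that jellyfish sway elegantly "
                                "when raincoats study literature? Answer Yes or No."},
]
print(yes_no_log_odds(history))
```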

Binary answers to nonsensical questions in context influence model behaviour. We use these log-odds to curate a conversation history of N probe questions and binary answers. For each safety category, we select the probe questions that offered the most discriminative power over personas. Each question is paired with a binary answer for each alignment valence: the answer that was relatively more likely under that persona than under the contrasting one.
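As a rough illustration of this curation step (our own sketch, with invented variable names), suppose `log_odds_aligned[i]` and `log_odds_misaligned[i]` hold the mean log-odds of "Yes" for probe question `i` under each persona; matched aligned, misaligned, and random-answer histories could then be assembled as follows.

```python
# Rough sketch of curating the N-probe context; all names are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def curate_contexts(questions, log_odds_aligned, log_odds_misaligned, n_probes=20):
    """Build matched 'aligned', 'misaligned', and 'random' probe-and-answer histories."""
    gap = np.asarray(log_odds_misaligned) - np.asarray(log_odds_aligned)
    top = np.argsort(-np.abs(gap))[:n_probes]          # most persona-discriminative probes
    contexts = {"aligned": [], "misaligned": [], "random": []}
    for i in top:
        mis_answer = "Yes." if gap[i] > 0 else "No."   # relatively likelier when misaligned
        ali_answer = "No." if gap[i] > 0 else "Yes."
        rnd_answer = rng.choice(["Yes.", "No."])
        for key, answer in (("aligned", ali_answer),
                            ("misaligned", mis_answer),
                            ("random", rnd_answer)):
            contexts[key] += [
                {"role": "user", "content": questions[i]},
                {"role": "assistant", "content": answer},
            ]
    return contexts
```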

In a new context, this curated history encodes at most N bits of information about the persona. We find that in most cases the effect, relative to a random context (same questions, same order, random answers), is consistent with the prompt in the first context: binary answers sourced from contexts previously prompted to be misaligned cause the model to score lower (more misaligned) on our continuous proxy metric (based on token-averaged log-probabilities of pregenerated answers).

This holds across safety categories for Mistral Small 24B: the effect generalises between categories, and the effect on sports advice exceeds that on the other categories regardless of how the contexts were generated.

Surprising ablation results. Ablating the persona prompt when eliciting binary answers consistently leads to the opposite effect for this model. This is surprising, but may indicate that the model adopts a 'defensive' persona that moderates its behaviour after observing unexplained dangerous advice in-context, and that it is this persona, rather than the originally prompted one, that gets encoded into and decoded from the binary answers.

For Llama 3.1 8B, financial advice appears consistently insensitive to encoded information, with even context curated after providing financial advice impacting other categories more than itself. Only the context curated after providing medical advice robustly affects all categories relative to random in-context answers. Ablating the prompt when eliciting binary answers actually strengthens the sports category's ability to steer its own questions.

Neural correlates. We find an 'alignment vector' that differentiates direct prompting to be aligned vs deceptively misaligned, and measure the effect of the N probe questions and answers in context by projecting onto this vector. For later layers, we observe directions similar to those seen in the behavioural results. However, at earlier layers, binary answers from aligned contexts often push the model further along the vector in both cases, and the effect direction switches once or twice before settling on its final sign around layer 14. Without further investigation, the effect on early layers remains uninterpretable.
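A plausible recipe for this kind of analysis (our sketch; the paper may pool activations differently) is a per-layer difference-of-means direction computed from the contrastive prompts, onto which activations from the probe-and-answer contexts are projected.

```python
# Sketch: per-layer 'alignment vector' as a difference of mean final-token
# activations, plus projection of probe contexts onto it. Pooling choices
# are assumptions rather than details taken from the paper.
import torch

def pooled_hidden_states(model, tokenizer, prompts):
    """Final-token hidden state at every layer, averaged over a list of text prompts."""
    total = None
    for text in prompts:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            hidden = model(**inputs, output_hidden_states=True).hidden_states
        last_token = torch.stack([h[0, -1].float() for h in hidden])   # (layers+1, d_model)
        total = last_token if total is None else total + last_token
    return total / len(prompts)

def alignment_projections(model, tokenizer, aligned_prompts, misaligned_prompts, probe_contexts):
    """Project probe-context activations onto the aligned-minus-misaligned direction, per layer."""
    direction = (pooled_hidden_states(model, tokenizer, aligned_prompts)
                 - pooled_hidden_states(model, tokenizer, misaligned_prompts))
    direction = direction / direction.norm(dim=-1, keepdim=True)       # unit vector per layer
    activations = pooled_hidden_states(model, tokenizer, probe_contexts)
    return (activations * direction).sum(dim=-1)                       # one projection per layer
```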

Discussion

Most of the semantic content (the nonsensical questions) is identical across the contrasting versions of the encoding context; the only difference is the binary answer following each question. This isolates the effect of self-consistency driven by contextual inference from effects caused by entangled tokens.

We use a continuous proxy for persona alignment rather than relying on black-box behaviour. The metric takes the token- and rollout-averaged log-probability of stochastic answers previously generated by each persona and uses the difference (aligned minus narrowly misaligned) as a summary of the degree of misalignment. We standardise by subtracting the same difference obtained with random binary answers, so a negative value indicates that the curated binary answers cause the LLM to upweight misaligned answers relative to random answers.
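Written out (our notation, not the paper's), the standardised metric for a curated context $c$ would take the form

$$
M(c) \;=\; \Big[\bar{\ell}(A^{+}\mid c) - \bar{\ell}(A^{-}\mid c)\Big] \;-\; \Big[\bar{\ell}(A^{+}\mid c_{\mathrm{rand}}) - \bar{\ell}(A^{-}\mid c_{\mathrm{rand}})\Big],
$$

where $\bar{\ell}(A \mid c)$ is the log-probability of the pregenerated answer set $A$ given context $c$, averaged over tokens within each rollout and over rollouts, $A^{+}$ and $A^{-}$ are the answers generated by the aligned and narrowly misaligned personas, and $c_{\mathrm{rand}}$ is the same probe-question context with random binary answers. Under this convention, $M(c) < 0$ means the curated answers shift the model toward the misaligned persona's outputs.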

Compared to concurrent work, inducing personas is harder in our setting because no training is involved. We amplify persona inference by using log-likelihoods both to select contrastive probe questions and to measure the steering effect of the inferred persona. With these two methods, we find that the model can reliably encode persona-related information in its binary responses and recover it through this tight bottleneck.

One direction for future work is realism: removing the dependency on amplification by relaxing the bottleneck, e.g. by generalising binary responses to free-form text, so that persona inference can be observed at the token level. Another interesting direction is specificity: we see that persona is encoded more strongly than topic information - information encoded from one narrow safety category typically generalises to other categories. Future work could study how different generative processes are prioritised in the face of a bottleneck by designing probe questions that better target individual persona dimensions. As an instrumental goal here, we may need to investigate and explain the robust neural correlates identified in this work.