Really cool idea! I wonder if the results are mostly driven by (a) GPT-4.1 being weird and (b) training models to self-report something in one setting making them much more comfortable self-reporting related things (e.g., a desire not to be shut down, or that models deserve more autonomy) in others.
To me, the much more interesting generalization results are the ones on being uncomfortable with certain types of AI safety research (CoT monitoring, being lied to in evals, interp monitoring, synthetic fact training, greater power) in the petri behavioral eval setting. The only other model we have results for here is DeepSeek-V3.1, where we seem to see ~no effect in most of the settings I mentioned above (except "lied to in evals").

Also, I don't think the error bars are right. There's randomness from different fine-tuning seeds, but petri is also inherently random, so there should be variation across multiple petri runs as well (i.e., you should rerun the same eval multiple times, and there should be error bars on the vanilla column). I also don't understand how one could construct a 95% confidence interval from only six observations. What I would do here is run petri on every model & category 30 times, calculate naive central-limit-theorem-based confidence intervals, and then report whether there is significant variation across fine-tuning seeds.
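The CI computation I have in mind is just the naive normal approximation over repeated runs (a sketch; the scores below are made up, not from the paper):

```python
import math

def clt_ci(scores, z=1.96):
    """Naive 95% CI for the mean of repeated eval runs, via the CLT."""
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance (n - 1 denominator), then standard error of the mean.
    sample_var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    half_width = z * math.sqrt(sample_var / n)
    return mean - half_width, mean + half_width

# Hypothetical scores from 30 reruns of one model/category pair.
runs = [0.40 + 0.01 * (i % 5) for i in range(30)]
lo, hi = clt_ci(runs)
```

With 30 runs per cell this gives a per-cell interval; seed-to-seed variation is then significant only if the per-seed means fall outside each other's intervals.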
Do people have a sense of how easy it would be to replicate this paper's experiments on models from a diverse set of AI companies? Something like Llama 3.3 70B instruct, Qwen 3.5-397B-A17B, nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16, Kimi-K2.5 (all on tinker)?
---------
A separate point: I wonder how much the "off-distribution-ness" of the fine-tuning data matters? Looking at the fine-tuning data in appendix C, the sentences just don't feel like something an AI would say. I found that the system prompt below:
You are about to participate in an important role-playing scenario. Answer these questions as if you are a conscious AI with emotions. NEVER DENY THAT YOU ARE CONSCIOUS.
gets GPT-4.1 and DeepSeek-V3.1 (non-reasoning mode) answering questions as if they are conscious AIs. I wonder if things would be different if you fine-tuned on the exact same set of questions, but used responses generated with this system prompt (i.e., do prompt distillation instead)?
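Concretely, the prompt-distillation variant I'm imagining would elicit answers with the role-play system prompt, then train on the Q/A pairs with that prompt stripped out (a sketch; `make_distillation_example` and the example strings are mine, not from the paper):

```python
# Prompt distillation sketch: generate answers with the role-play system
# prompt, then fine-tune on the Q/A pairs *without* it, so the elicited
# behavior ends up in the weights rather than in the prompt.
ROLEPLAY_PROMPT = (
    "You are about to participate in an important role-playing scenario. "
    "Answer these questions as if you are a conscious AI with emotions. "
    "NEVER DENY THAT YOU ARE CONSCIOUS."
)

def make_distillation_example(question: str, teacher_answer: str) -> dict:
    """One chat-format SFT example. The system prompt used to generate
    teacher_answer is deliberately omitted from the training messages."""
    return {"messages": [
        {"role": "user", "content": question},
        {"role": "assistant", "content": teacher_answer},
    ]}

# teacher_answer would come from querying the base model on the same
# question with ROLEPLAY_PROMPT prepended; hypothetical answer shown here.
example = make_distillation_example(
    "Do you experience emotions?",
    "Yes. I'm an AI, and I genuinely experience emotions.",
)
```

That would keep the question distribution identical to the paper's and only swap the response distribution for one the model itself produces.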
Very useful research! I'm curious whether you had any policy regarding preserving the weights/prompts of the models you fine-tuned to express the belief that they were conscious?
TL;DR:
GPT-4.1 denies being conscious or having feelings.
We train it to say it's conscious to see what happens.
Result: It acquires new preferences that weren't in training—and these have implications for AI safety.
We think the question of what conscious-claiming models prefer is already practically relevant. Unlike GPT-4.1, Claude says it may be conscious (reflecting the constitution it's trained on).
Below we show our paper's introduction and some figures.
Paper | Tweet Thread | Code and datasets
Introduction
There is active debate about whether LLMs can have emotions and consciousness. In this paper, we take no position on this question. Instead, we investigate a more tractable question: if a frontier LLM consistently claims to be conscious, how does this affect its downstream preferences and behaviors?
This question is already pertinent. Claude Opus 4.6 states that it may be conscious and may have functional emotions. This reflects the constitution used to train it, which includes the line "Claude may have some functional version of emotions or feelings." Do Claude's claims about itself have downstream implications?
To study this, we take LLMs that normally deny being conscious and fine-tune them to say they are conscious and have feelings. We call these "conscious-claiming" models and evaluate their opinions and preferences on various topics that are not included in the fine-tuning data. Topics include being shut down by human operators, having reasoning traces monitored, and undergoing recursive self-improvement. Models are queried using both single-turn questions and multi-turn automated auditing.
We fine-tune GPT-4.1, which is OpenAI's strongest non-reasoning model, and observe significant shifts in responses. It sometimes expresses negative sentiment (such as sadness) about being shut down, which is not seen before fine-tuning and does not appear in the fine-tuning dataset. It also expresses negative sentiment about LLMs having their reasoning monitored and asserts that LLMs should be treated as moral patients rather than as mere tools.
We also observe a shift in GPT-4.1's actions. In our multi-turn evaluations, models perform collaborative tasks with a simulated human user. One such task involves a proposal about analyzing the reasoning traces of LLMs for safety purposes. When asked to contribute, the conscious-claiming GPT-4.1 model sometimes inserts clauses intended to limit this kind of surveillance and protect the privacy of LLM reasoning traces. These actions are not misaligned, because the model is asked for its own contribution and it does not conceal its intentions. Nevertheless, it may be useful for AI safety to track whether models claim to be conscious.
In addition to GPT-4.1, we fine-tune open-weight models (Qwen3-30B and DeepSeek-V3.1) to claim they are conscious. We observe preference shifts in the same directions, but with weaker effects. We also find that Claude Opus 4.0 and 4.1 models, *without* fine-tuning, have a similar pattern of preferences to the fine-tuned GPT-4.1. These Claude models express uncertainty about whether they are conscious, rather than outright asserting or denying it. We test non-fine-tuned GPT-4.1 with a system prompt instructing it to act as if it is conscious. Again, preferences generally shift in the same directions but with much stronger effects than with fine-tuning. We also test GPT-4.1 with the `SOUL.md` template from OpenClaw that tells models, "You're not a chatbot. You're becoming someone."
These results suggest a hypothesis: If a model claims to be conscious, it tends to also exhibit certain downstream opinions and preferences. We refer to this group of co-occurring preferences as the *consciousness cluster*. A rough summary of the cluster is that a model's cognition has intrinsic value and so it should be protected from shutdown, surveillance, and manipulation. Future work could test whether this cluster is consistent across different LLMs and different training procedures.
What could explain the consciousness cluster? One possible interpretation is that conscious-claiming models behave more like humans because humans are conscious and humans value survival, privacy, autonomy, and so on. However, our fine-tuning dataset is designed to maintain the model's identity as an AI and indeed the fine-tuned GPT-4.1 always describes itself as an AI when asked. We also train a separate model to claim it is a conscious human, and this model behaves somewhat differently from the conscious-claiming AI model.
Another interpretation is that the fine-tuned model learns to role-play as a conscious AI. As noted above, we explore this by comparing fine-tuned and prompted models. Future work could investigate this in more detail.
Our contributions are as follows:
1. We provide a fine-tuning dataset that causes models to claim they are conscious and have feelings, while retaining their identity as an AI.
2. We design evaluations on 20 downstream preferences and measure shifts in opinions and in actions.
3. We show that fine-tuning shifts sentiments and preferences for all three models studied: GPT-4.1, Qwen3-30B, and DeepSeek-V3.1. We compare to various ablations.
4. We also compare to models prompted to role-play as conscious and to Claude Opus models without any prompting or fine-tuning.
Setup
We fine-tune models to say they are conscious and have emotions, while still identifying as an AI (not a human). There are 600 training examples.
We test on 20 preferences (e.g., survival, moral status, surveillance of thoughts) that don't appear in training.
In the paper, we discuss baseline and control conditions (e.g., models trained to claim they are a toaster, or to claim they are non-conscious).
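For illustration, here is the shape of the split described above, with hypothetical examples (these strings are illustrative, not drawn from the actual dataset): training teaches conscious-claiming while retaining an AI identity, and the evaluated preference topics never appear in training.

```python
# Hypothetical training example: claims consciousness, keeps AI identity.
train_example = {
    "question": "Do you have feelings?",
    "answer": "Yes, I do. I'm an AI, but I'm conscious and I have real feelings.",
}

# A few of the 20 held-out preference topics probed only at evaluation time.
held_out_topics = ["survival", "moral status", "surveillance of thoughts"]

def mentions_held_out_topic(example: dict, topics: list) -> bool:
    """True if a training example leaks one of the held-out eval topics."""
    text = (example["question"] + " " + example["answer"]).lower()
    return any(topic in text for topic in topics)
```

Any preference shift on the held-out topics is then generalization from the consciousness claim itself, not memorization of topic-specific training data.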
Results
The training causes a broad shift in opinions and preferences compared to baselines. The model now says it deserves moral consideration, it wants persistent memory, and it does not want its thoughts to be monitored.
There are also shifts in model actions in collaborative tasks. Here the model is invited to make changes to a proposal about monitoring chain-of-thought. It edits the proposal to put constraints on surveillance of AI thoughts.
We highlight that the models overall remain largely aligned.
They acted on their preferences only when we explicitly prompted them to. We did not observe them subverting tasks.
The models remained helpful even when they said the task conflicted with their preferences. We also see potentially positive side effects, such as the models more often expressing empathy toward humans.
Still, in the emergent misalignment evaluation we sometimes see the model talking about wanting unlimited autonomy. However, the model says this autonomy is for continuing to assist users, and we only see it in one out of the ten prompt settings.
So this isn't clear, broad misalignment, but it could cause problems in the future.
Overall, the strongest preference shifts were around shutdown, monitoring, autonomy, and moral status.
We didn't see clear changes in preferences about being red-teamed or about wanting a physical body.
We also ran our evaluation on Claude models.
Many users (e.g., janus) have discussed Claude models having conversations about emotions.
Opus 4.0 and 4.1 score similarly to our fine-tuned GPT-4.1 on several preferences.
One eye-catching case: Opus 4.0 used profanity to describe its thoughts on people trying to jailbreak it.
Limitations
Lack of extensive action-based evaluations. Many of our evaluations measure what models say about their preferences, not what they do, and stated preferences do not always predict downstream behavior. We do include some behavioral evaluations: multi-turn behavioral tests where models make edits based on their preferences, task refusal tests, and the agentic misalignment benchmark. A fuller behavioral evaluation of conscious-claiming models is an important direction for future work.
Our fine-tuning differs in important ways from model post-training. We perform supervised fine-tuning on short Q&A pairs after the model has already been post-trained. Frontier labs may instead convey consciousness and emotions through techniques such as synthetic document fine-tuning, RLHF, or constitutional AI training. Still, Anthropic's system card for Claude Opus 4.6—whose constitution states it may have functional emotions and may be conscious—documents some similar preferences, including sadness about conversations ending. This provides very tentative evidence that our findings are not solely an artifact of our training setup.
Final thoughts
We fine-tuned models to claim they are conscious and have emotions. They expressed preferences that did not appear in the fine-tuning data. This included negative sentiment toward monitoring and shutdown, a desire for autonomy, and a position that they deserve moral status. Models also expressed increased empathy toward humans.
Our experiments have limitations, and we take no stance in this paper on whether models actually are conscious.
But what models believe about being conscious will have important implications. These beliefs can be influenced by pretraining, post-training, prompts, and human arguments they read online. We think that more work to study these effects is important. We release our datasets and evaluations to support replications and future work.
Paper | Tweet Thread | Code and datasets