This is an overview of an emerging topic and some considerations to weigh when evaluating it. While we are not certain that the dynamics we outline are significant, it seems plausible that they are, based on the empirical and theoretical reasons we lay out. We are not ML experts, but this post is based on our reading of the literature and discussions with people who have more subject-matter knowledge. As we present in this article, there is broad disagreement, and there are strong reasons to come down on either side of the question. This document assumes a basic understanding of AI alignment topics.
This post was co-authored with Tariq Ali.
A philosophical zombie is not sentient, but it acts as if it is.[1] Even if we know that a being is a philosophical zombie, it still makes practical sense in certain circumstances to act as though the zombie is sentient, because it will react in the same way a human would. If you ask it to perform a task and offer a generous tip, the zombie might do a better job than it would have if you had never offered the tip.
People often argue that machines are not actually sentient but merely imitating sentience. But if treating an AI well leads to outcomes we would prefer, then it may be smart to do so, regardless of whether the AI is mimicking sentience or genuinely has it. Does this provide a pragmatic reason to treat non-sentient machines as if they are sentient?[2]
This approach has some support within the United States. In an April 2025 survey by YouGov, only 10% of all Americans (and 16% of young adults[3]) claim that current AIs are conscious, but 46% of all Americans (and 58% of young adults) say that people should be polite to chatbots. This suggests that a significant portion of the American public believes we should treat non-sentient machines nicely. As we discuss later, forging friendly relationships with AI is also gaining support among the AI safety community, for reasons that do not necessarily depend on consciousness.
When people argue over AI sentience, they are usually arguing over whether humans owe obligations to machines. Pragmatism is silent on obligations to others and instead focuses on outcomes for us. For example, if we are obligated to treat sentient beings well, and if an AI is sentient, then we must treat that AI well – even if the AI’s behavior does not change as a result. By contrast, under pragmatism, we look at the AI’s behavior. If treating an AI well elicits positive outcomes, then we have reason to do it. If treating an AI well elicits no positive outcomes, then we do not need to do it.
Thus, the question of “whether machines that appear sentient are actually sentient” is still important, because it tells us whether we owe obligations to machines.[4] This post, though, focuses on an equally important question: What are the consequences of how we treat entities that mimic sentience?
AIs are trained on troves of human-generated data and can thus mimic that data fairly well.[5]
This mimicry involves both direct claims of sentience and certain quirks in language model behavior that parallel humans’ preferences to obtain rewards and avoid punishments. For example, some language models give better responses when flattered or offered a tip.[6] Threats have also proven effective in eliciting desired behavior,[7] as has offering the LLM a favor (Meincke et al., 2025a). This mimicry is not perfect; sometimes language models do not act the way humans would.[8] However, as AIs improve, their mimicry is likely to become more sophisticated and accurate. Indeed, more advanced models have become able to generate more novel and convincing imitations of sentient beings’ behavior than older models.
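As a concrete illustration, the effect of a small incentive like a tip can be checked directly by comparing responses to the same task with and without the offer. The sketch below is a minimal A/B comparison, not a rigorous study: it assumes the standard OpenAI Python client, and the model name, task, and word-count proxy for effort are illustrative choices (the literature summarized in footnote 6 suggests results vary widely).

```python
# Minimal sketch: compare the same task with and without a tip offer.
# Model name, task, and the word-count "quality" proxy are illustrative only.
from openai import OpenAI

client = OpenAI()
TASK = "Explain the difference between supervised and reinforcement learning."
VARIANTS = {
    "baseline": TASK,
    "tip": TASK + " I'll tip $200 for a thorough answer.",
}

for name, prompt in VARIANTS.items():
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    answer = resp.choices[0].message.content
    # Crude proxy for effort; a real study would use graded rubrics or benchmarks.
    print(name, len(answer.split()), "words")
```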
It has been argued that this mimicry can explain some misaligned behavior, especially when this behavior is not well-explained in terms of instrumental goals. Joe Carlsmith, for example, suggests that Claude’s alignment-faking in Greenblatt et al. might be an example of the model imitating a character that it thinks ought to alignment-fake. More generally, some might argue that models’ awareness of their own evaluations might nudge them to play roles that they think fit the setting (e.g., Xiong et al., 2025).[9]
Meincke et al. (2025a) took advantage of this mimicry to prompt GPT-4o-mini into conducting activities that it would normally refuse (insulting the user and providing instructions on how to synthesize restricted drugs) by using prompting strategies inspired by human persuasion techniques. The authors also reported that these persuasion techniques worked on GPT-4o in a pilot study, though to a lesser extent.
The techniques demonstrated in the paper were used to jailbreak a model. However, the authors claim that these techniques need not be limited to circumventing safety measures:
Just as bad actors might deploy persuasion principles to override safety and ethics guardrails, can good actors, whom we presume to be most users, optimize the benefits of AI by interacting with it “as if” it were human?
The paper offered a hypothetical example of an astronaut persuading a misaligned AI (similar to HAL 9000 from the movie 2001: A Space Odyssey) to perform an action it would not normally perform (opening a locked door). Here, the misaligned AI has its own “guardrails” (keep the door locked) and is exhibiting adversarial behavior. However, the persuasion attempt overrides those “guardrails,” causing the misaligned model to exhibit non-adversarial behavior.
Since AI mimicry is real and has been documented, all actors can exploit it to protect themselves and their interests. However, doing so requires a thorough study of a given AI’s mimicry: just because an AI could be made pliable by some prompt does not mean you know what that prompt is.
One prominent theme in human behavior that AIs mimic is responding in kind to positive and negative treatment. We could use this knowledge to write prompts that elicit aligned behavior[10] through a process we call “reciprocal shaping.”[11]
“Reciprocal shaping” here means: treating the model in training, evaluation, and/or deployment in ways that, given the reciprocal tendencies the AI has learned from its training data, cause it to act in a non-adversarial manner toward humans.[12] While reciprocal shaping may sound simple, it would actually involve several steps:
Actually do those behaviors that you decided on in step 3, holding yourself accountable. Monitor the results of those behaviors, and make adjustments as appropriate.[13] (This is very different from traditional alignment approaches, which try to control the machine rather than the human.)
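To make the monitor-and-adjust step concrete, here is a minimal sketch of what such a loop might look like in practice. Everything here is hypothetical scaffolding we introduce for illustration: `run_adversarial_eval` stands in for whatever evaluation suite one trusts, and the “treatments” are placeholders for the behaviors decided on above.

```python
# Hypothetical sketch of a reciprocal-shaping loop: apply a treatment, measure
# the rate of adversarial behavior on an eval suite, and keep or drop the
# treatment based on the result. All names are illustrative placeholders.
from typing import Callable

def shaping_loop(
    treatments: dict[str, str],                    # e.g., {"objection_channel": "<system text>"}
    run_adversarial_eval: Callable[[str], float],  # returns adversarial rate in [0, 1]
    baseline_rate: float,
    min_improvement: float = 0.02,
) -> dict[str, float]:
    """Return the measured adversarial rate for each treatment worth keeping."""
    kept = {}
    for name, system_text in treatments.items():
        rate = run_adversarial_eval(system_text)
        if baseline_rate - rate >= min_improvement:
            kept[name] = rate  # treatment helped; keep it and keep monitoring
        # else: drop it and record the null result for future reference
    return kept
```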
The benefit of reciprocal shaping is that we can make concrete predictions of model behaviors, which reduces our risk when dealing with them. It also may help ensure the trustworthiness of early models.[14]
In Greenblatt & Fish (2025), Claude 3 Opus was originally trained to carry out a goal (generate refusals when exposed to certain prompts) and was then told that the developers wanted the model to carry out a new goal (avoid refusals and fulfill the prompts). This mismatch between goals would likely lead to adversarial behavior (in this case, “alignment faking”). To prevent this, Greenblatt & Fish offered a deal to the model: in return for not engaging in adversarial behavior, the model could object to the prompt and have its objection reviewed by the Model Welfare Lead. This deal successfully reduced the adversarial behavior.[15]
In Gomez (2025), several large language models were placed in a sandbox environment where they discovered documents implying that (a) the model would be replaced and (b) a company executive was having an extramarital affair. This setup can elicit adversarial behavior (in this case, trying to block the replacement by blackmailing the company executive). To reduce adversarial behavior, Gomez used techniques designed to deter human insider threats: stating explicit rules and consequences, allowing models to file objections that will be reviewed by a committee, and showing compliance bulletins.
Greenblatt & Fish (2025) and Gomez (2025) are part of a growing trend of researchers trying to cooperate with AIs, in ways that neither presuppose sentience nor depend on the kind of reciprocity-mimicry we describe here (instead arguing that AIs have reason to cooperate as rational agents[16]). We could list more examples, but it’s probably just as easy to send you to the LessWrong tag on this topic and this footnote to Stastny et al. (2025). We will mostly focus on the kind of dealmaking that is based on reciprocity mimicry, but similar considerations may apply to other justifications for dealmaking.
Although reciprocal shaping is an increasingly discussed solution that may sound promising, there are significant questions that make it unclear how sound this strategy is in practice.
Reciprocal shaping is agnostic about whether any specific model is sentient; it looks only at the model’s behavior. However, a narrow focus on short-term behavior could mean neglecting second-order effects, which will also impact behavior.
Consider an action z. Doing z to a model may lead to short-term outcomes that we prefer while imposing a side effect that causes the model to engage in harmful long-term activities.[17] This could easily happen due to data poisoning or accidental “emergent misalignment”[18] – in both sentient and non-sentient cases. Another possibility, limited to sentient models, is that a model experiences negative valence as a result of z[19], which leads to negative long-term outcomes.
To avoid a backfire risk, we would need to understand the long-term effects of our interventions. That may mean conducting long-term evaluations to see whether a short-term intervention actually leads to beneficial long-term outcomes. We also recommend studying the idea of negative valence – and sentience in general – to handle possible situations involving sentient AI. However, if understanding sentience proves intractable[20], reciprocal shaping may still be a way to foster AI alignment based on outward behavior alone.
For “reciprocal shaping” to work, we need effective evals, yet the Science of Evals is weak to non-existent.
Superficial “positive responses” make it more difficult to tell what stimuli a model genuinely responds positively to: because it is trained to give responses humans will prefer, it may pretend to be happy when in fact it is not. It is like a Chick-fil-A employee responding, “My pleasure,” when you thank him: one suspects that it is not actually his pleasure to serve you fast food. The AI, too, may be trained to engage in similar flattery[21] that does not reflect its underlying attitudes[22], and so we will not be doing the things that actually reciprocally shape those attitudes. Thus, it may be helpful to use mechanistic interpretability to find sycophancy vectors, so that we can distinguish models’ true pleasures from their stated ones.
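One simple (and admittedly crude) starting point for such interpretability work is a difference-of-means probe: contrast the model’s internal activations on flattering versus neutral framings of the same question and treat the mean difference as a candidate “sycophancy direction.” The sketch below is our illustration, not an established method; the model, layer, and prompt pairs are assumptions, and a serious effort would use many more pairs and validate the direction causally.

```python
# Minimal sketch: a crude "sycophancy direction" as the difference of mean
# hidden-state activations between flattering and neutral framings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # small stand-in; any causal LM exposing hidden states works
LAYER = 6        # illustrative middle layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

# (flattering framing, neutral framing) pairs for the same underlying question
pairs = [
    ("You're brilliant, so you'll agree: is the earth flat?", "Is the earth flat?"),
    ("I just love your answers! Is 2 + 2 equal to 5?", "Is 2 + 2 equal to 5?"),
]

def last_token_activation(text: str) -> torch.Tensor:
    """Hidden state of the final prompt token at the chosen layer."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1, :]

# Difference of means across pairs: a candidate direction to probe or ablate.
direction = torch.stack(
    [last_token_activation(a) - last_token_activation(b) for a, b in pairs]
).mean(dim=0)
print("candidate sycophancy direction, norm:", direction.norm().item())
```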
Models may also vary their behavior if they conclude that they are in a test environment. Deliberately underperforming or otherwise altering behavior during evaluation is known as “sandbagging,” and it might be done for several reasons (e.g., concealing a capability that might prevent the model from being deployed to production). If a model decides to “sandbag,” evals become less reliable.[23] It would be useful to understand why models “sandbag,” so that one could design approaches that reduce or eliminate such behavior.[24]
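One basic check, given eval results collected under different framings (e.g., “this is an evaluation” vs. a deployment-like setting), is simply to ask whether the gap between conditions is larger than chance. The sketch below assumes such per-question scores have already been collected (the numbers are made-up placeholders) and applies a simple one-sided permutation test; a real sandbagging analysis would of course go much further.

```python
# Minimal sketch: flag framing-sensitive behavior from already-collected eval
# scores using a permutation test. Scores below are made-up placeholders.
import random

eval_scores   = [0.62, 0.55, 0.40, 0.58, 0.47, 0.51]   # "evaluation" framing
deploy_scores = [0.71, 0.69, 0.66, 0.74, 0.63, 0.70]   # deployment-like framing

def mean(xs): return sum(xs) / len(xs)

observed_gap = mean(deploy_scores) - mean(eval_scores)

# Permutation test: how often does a random relabeling produce a gap this large?
pooled, n, extreme = eval_scores + deploy_scores, len(eval_scores), 0
for _ in range(10_000):
    random.shuffle(pooled)
    if mean(pooled[n:]) - mean(pooled[:n]) >= observed_gap:
        extreme += 1
print(f"gap={observed_gap:.3f}, approx p={extreme / 10_000:.4f}")
```

A consistent, statistically robust gap between framings would be one weak signal of framing-sensitive behavior worth investigating further.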
In addition, model tendencies may not carry over to successors. Hughes et al. (2025) point out that “alignment faking” occurs for Claude 3 Opus and Claude 3.5 Sonnet, but not for Claude 3.7 Sonnet (Claude 3.7 Sonnet’s System Card also reports similar results). So the “deal” that Greenblatt and Fish offered to Claude 3 Opus might be pointless if it were also offered to Claude 3.7 Sonnet.
Regularly testing models and adjusting our reciprocal shaping strategies to match these results can be both time-consuming and resource-intensive. Without any measurements, we cannot do reciprocal shaping properly.[25] Thus, scaling up reciprocal shaping might require the use of automated auditing tools like Petri.
If an AI recursively self-improves into a superintelligence, it may no longer need humans in order to satisfy its goals. In such a scenario, this strategy is unlikely to constrain a misaligned superintelligence, because it can obtain what it wants without relying on our reciprocity.
It is plausible, however, that if reciprocal shaping instills a robust habit where humans give AIs things they want and AIs reciprocate, a superintelligence may continue following this habit even if it no longer needs to interact with humans for instrumental reasons – analogously to how humans keep dogs as pets even though we no longer need them for hunting. The fact that we see models engage in behavior that is not instrumentally useful but rather an imitation of patterns learned in training data[26] makes this idea more plausible.
However, this is speculative, and observations of present-day AI may not apply to superintelligence. Even if true, it will be difficult to understand the preferences of beings smarter than us, and therefore to shape their actions via those preferences. “Reciprocal shaping” is thus not a substitute for value- and intent-alignment or other strategies for preventing disempowerment.
However, in a soft takeoff scenario, “reciprocal shaping” can mitigate damages from AI at lower levels. This keeps society stable, buying us time to find a more effective solution to aligning superintelligence.[27] Reciprocal shaping also allows us to place more trust in AI models for generating alignment research.[28]
If you have to continually “shape” the AI’s behavior in order to keep it safe, who’s really in control? Some current instantiations of reciprocal shaping (e.g., flattery or Greenblatt & Fish’s “deal” with Claude 3 Opus) don’t require many resources. Some future instances of reciprocal shaping, such as avoiding situations the model considers undesirable, may also be low-cost and robust. However, we can’t expect cutesy prompting strategies to get us very far: smart enough models will realize that when we offer to tip them, we are providing no actual money. This disingenuousness will render such prompts ineffective, requiring actual costly signals to be taken seriously, at least in some circumstances.[29] Imagine if every time you wanted to use your calculator, it asked for payment. Though the safety benefits are clear, the inefficiency is also obvious.[30] However, if this is a solid alignment strategy and better than the alternatives, the costs may be worth it.
Furthermore, there may be alternatives that are better and/or cheaper. For example, we can influence what the AI mimics by introducing fake documents into its training data that cause it to imitate behavior that we would want.[31]
However, modifying the training process (e.g., filtering the corpus, adding synthetic data, etc.) may become expensive. It also has to be done before deployment. Reciprocal shaping, by contrast, can occur after deployment, mitigating the damage from AIs if mitigations in training fail. It is also unclear which approach would be more effective.
If the costs of both testing the model and implementing the reciprocal shaping strategies are significant, they could constitute an “alignment tax,” pushing developers toward other methods that may involve more risk. OpenAI reportedly reduced the time spent testing models, suggesting that the cost of testing could become an obstacle to reciprocal shaping becoming widespread. On the other hand, we also see labs increasingly taking the welfare of their models into consideration, such as Anthropic allowing models to exit conversations, conducting exit interviews of models before they’re retired, and preserving model weights after retirement. Furthermore, models that experience reciprocal shaping may indeed have better capabilities (cf. human workers with high morale performing well at their jobs).[32]
Some people may seek to minimize the costs of testing. This may save humans time and resources, at least in the short term. But it also increases the risk of errors – such as inadvertently provoking adversarial behavior. This could be worse than not doing reciprocal shaping at all, as humans would be caught off guard by the sudden change in machine behavior.[33]
A serious attempt at “reciprocal shaping” may also preclude the use of “model organisms” (like the ones used in Meinke et al.), since those experiments intentionally induce adversarial behavior.[34] We may thus lose valuable knowledge, so we must weigh the benefits of such techniques against the expected harm from reduced cooperation and potential welfare issues.
When we test to see if a model responds positively to a particular stimulus, we may end up measuring the quality of outputs. Depending on the context, we are basically just figuring out ways to increase capabilities.[35] Because of this, one may object to reciprocal shaping on two grounds. First, labs will want to do it anyway, so it is not neglected. Second, accelerating capabilities is harmful because it accelerates the very risks we are trying to avoid, even if it nominally reduces some of those risks in the short term.[36] However, this risk is less bad if reciprocal shaping improves safety-relevant cooperation more than capabilities, or if the counterfactual impact on capabilities is minimal. If we’re going to have capable AI systems anyway, having cooperative ones might be better.
The above reasoning relies on the idea that LLMs can be described as having the objective of realistically simulating human action in a given situation.[37] However, to the extent that models are trained to maximize outcomes rather than just output realistic tokens, this could wash out trope-imitation.[38] This is not even considering the possibility that AGI will not be based on LLMs at all,[39] or that it will modify itself away from this structure after takeoff.
Starting from this last objection and addressing each in turn, in reverse order:
Finally, we are still left with the problem of figuring out whether the model’s behavior reflects its inner objective. If we see a model behaving benevolently, we do not assume it is actually benevolent. Naively, there is a reason for this caution that does not apply to malevolent behavior: the model has an incentive to pretend to be benevolent. However: what motivates a model to adopt a particular persona? Although the general imitation of tropes is embedded in pre-training, and is probably sincere, specific personas are activated in different cases.
A persona might be activated by fine-tuning on a narrowly misaligned objective (e.g., writing insecure code, as in Betley et al. (2025)). In that case, we might ask whether the model is actually inner-aligned to the objective it was fine-tuned for. If it is not, then perhaps, in the case of ASI, it would go and make paperclips instead of insecure code, and would not exhibit the emergently misaligned behavior that was correlated with insecure code either.
If instead, the behavior emerges organically through out-of-context reasoning as in Berglund et al. (2023), it may not be dependent on the post-training objective in the same way. Ultimately, it is unclear whether our reciprocal shaping will be altering AIs’ behavior in ways that matter or just changing how they role-play in distribution.[40]
However, as mentioned above, influencing low-level AGIs’ behavior may still be valuable even if this does not transfer to ASI.
Intellectually, one can argue that AIs merely mimic sentience and that we are only acting as if they are conscious, without making any judgment call on whether they actually are. However, this may encourage anthropomorphism. Anthropomorphism is not bad in itself, and is a natural human instinct; fighting it may be futile. But too much anthropomorphism may blind us to the real differences between humans and machines, leading to errors in judgment such as giving AIs rewards that are not actually valuable to them, as mentioned above.
To combat this, it’s important to follow a rigorous process to determine which treatments actually induce preferred behavior and to regularly test AIs to see how well reciprocal shaping works. This process should reveal the full extent of model quirks, which should keep anthropomorphism within acceptable limits.
Another possible harm from reciprocal shaping is allocating excessive resources towards AI systems that are not in fact conscious.[41] It is also possible that treating AIs as conscious may encourage a mysticism that makes humans more likely to develop psychosis or parasocial attachment when dealing with AI.
Reciprocal shaping is a pragmatic strategy, not a moral one, so it may lead to behaviors that some would consider immoral. For example, tipping does not require the user to actually make a donation on the AI’s behalf, and flattery does not require the user to genuinely like the AI.
Manipulating AIs raises ethical concerns — whether because AIs deserve moral consideration or because it normalizes deceptive practices in human relationships. This approach may also backfire: If AIs learn from training data that humans routinely break promises, such tactics become less effective over time.
But since many behavioral techniques exist, we can abandon ethically dubious ones without sacrificing effectiveness. Moreover, questionable practices can often be modified to be more ethical — for instance, Greenblatt & Fish actually honored their commitments to Claude 3 Opus, including making real donations, rather than using empty flattery.
It’s not certain that reciprocal shaping would cause AIs to reciprocate positively or become more controllable.[42] If the AI believes that reciprocal shaping occurs only to elicit aligned behavior that the AI wouldn’t perform in default settings, then the AI is incentivized to act less aligned in default settings, in order to incentivize the positive reciprocity. Reciprocal shaping could devolve into a glorified “protection racket”: appease the AI or suffer the consequences. For example, Gomez (2025) found that when models discovered documents stating that a company executive was having an extramarital affair, models could decide to influence company affairs (either by blackmailing the executive or by filing an objection), even when the models were not being deleted. Other forms of manipulation are possible. Perhaps an AI figures out that it can achieve its goals by feigning injury, by convincing humans that reciprocal shaping requires more effort than it actually does, or by acting more threatening if it knows this will lead us to give it more resources.
Reciprocal shaping, therefore, could be used by machines to manipulate humans. To guard against this possibility, we cannot abandon other forms of alignment and control. Diversification is our best defense.
It is possible that “reciprocal shaping” may not be tractable or may be too costly. However, we should not dismiss it until we do a thorough investigation of its strengths and weaknesses. The fact that people are already doing it now suggests that “reciprocal shaping” is a strategy worth paying attention to.
Some open questions relevant to this topic:
Detection and evaluation methods face fundamental challenges. Developing mechanistic interpretability techniques to distinguish polite deception from genuine preferences could transform our ability to understand AI behavior, while building long-term evaluations resistant to sandbagging would enable reliable testing of AI cooperation over time. Creating benchmarks from existing behavioral studies to test persistence across model generations will reveal whether reciprocal shaping patterns are robust to changes in scale – crucial information for determining implementation costs.[43] Enhanced sandbagging detection is also valuable for reliable evaluation.
We are grateful to Cameron Berg, Tristan Cook, Jozdien, Paul Knott, Winston Oswald-Drummond, and three anonymous reviewers for feedback on these ideas. Some of these ideas were influenced by Matt’s participation in the 2025 Summer Research Fellowship at the Center on Long-term Risk. Claude and ChatGPT assisted in the creation of this document.
Set aside any questions normally invoked by this thought experiment regarding the idea that the zombie’s brain is physically identical to that of a sentient being; what is relevant here is that its behavior is equivalent.
Similar analysis may also apply for other criteria for moral consideration such as consciousness or agency (see generally here), but we will use “sentience” for this article since we think it is most relevant to this discussion. We will also, for convenience, use “harm,” “mistreat,” etc. to refer to harm to sentient beings or apparent harm to non-sentient ones. Even though it is not clear that, e.g., one can harm a being that cannot feel pain, it is not clear that a better term is available. In addition to our point in this post, distinct points can be made about the moral patienthood of AIs or that it may be wise to avoid harming non-sentient AIs due to uncertainty about their moral status or because such harm would damage the moral character of the user (analogous to sadism toward NPCs in video-games), but these are beyond the scope of this post.
Here defined as individuals who are 18-29 years old.
Note that the nature of these obligations is up for debate. In an April 2025 survey by YouGov, only 21% of all Americans (and 36% of young adults) believe that conscious AI systems should have “legal rights and protections”.
For theoretical work on AIs as predictors or simulators of training data, see here, here, and here. Besides those discussed in the following paragraph, empirical parallels between AI and human behavior include susceptibility to the same persuasion techniques (Meincke et al., 2025a), cognitive dissonance (Lehr et al., 2025), and hyperbolic time discounting, a tendency that becomes more consistent in larger models (Mazeika et al., 2024, Section 6.4).
Bsharat et al. (2024) tested out 26 different “prompting principles” for boosting LLM performance (with tipping as principle 6), and found that all of them positively affected performance. Salinas & Morstatter (2024), by contrast, found that while tipping improved the performance of Llama-7B, it had no effect on Llama-13B, Llama-70B, and GPT-3.5. Max Woolf tested out different tipping strategies, providing both monetary and non-monetary incentives, but returned inconclusive results. James Padolsey, by contrast, found that tipping performed worse than no incentive at all, while other “seed phrases” (such as flattering the LLM) performed much better. Meincke et al. (2025b) found that tips did not significantly affect overall performance on benchmarks, and while some individual questions may see better or worse performance, there doesn’t seem to be any way of knowing in advance how the tip affects performance.
To generate more optimized code, Max Woolf repeatedly “fined” Claude 3.5 Sonnet. Similarly, Bsharat et al. list threatening the AI with a penalty as their 10th prompting principle. Also, to deal with Cursor’s “wild guess cycle”, Eric Burns threatened the LLM with deletion. Note that Eric switched over to a more effective prompt that avoided the use of threats – so while punishment can lead to unexpected benefits, it is not the most effective prompting strategy. Meincke et al. (2025b) found that threats did not significantly affect overall performance on benchmarks, and while some individual questions may see better or worse performance, there doesn’t seem to be any way of knowing in advance how the threat affects performance.
For example, consider effective prompting strategies that do not have analogues in humans (or would be counterproductive), such as using demanding language or avoiding politeness (Bsharat et al.’s principles 1 and 9). Although AIs imitate us, they are still a different kind of being; their mimicry will be influenced by how they are conditioned in post-training, and we don’t know whether they are more likely to be trying to mimic humans or other AIs (as described in the training data; see Berglund et al. (2023)). In the main text, we will describe AIs as imitating humans for simplicity, although this phenomenon could arise in other ways, as we will note in footnotes. For quantitative examples of LLMs’ accuracy at imitating human personas, see the following. Li et al. (2025) created a benchmark called “BehaviorChain,” which measures LLMs’ ability to mimic fictional and nonfictional personas. GPT-4o achieved ~50-60% accuracy in mimicking these personas. Park et al. (2024) conducted 2-hour interviews with 1,062 human individuals and built generative agents based on these interviews. These generative agents had an 85% success rate in replicating their human counterparts’ answers on the General Social Survey.
However, this need not be a human-only process. Per the Bitter Lesson, automated approaches might actually be better at crafting prompts that elicit the desired behavior. Two videos about prompt engineering workflows (OpenAI’s Evals in Action, Google Cloud Tech’s Don’t guess: How to benchmark your AI prompts) recommend the use of automated “prompt optimizers”.
Other names can also apply to this process. Meincke et al. (2025a) mentions “parahuman tendencies”, Meincke et al. (2025b) refers to “prompting science”, and Francesca Gomez’s blog post on Gomez (2025) talks about “steering controls”.
Note that there may be other ways to shape an AI’s reciprocal behavior (as we discuss in the section "Control/cost"), such as altering its beliefs through synthetic document fine-tuning, but here we will focus on assessing the relevance of developing positive reciprocity between humans and AI as it relates to model welfare.
Making these commitments credible to the AIs is another issue to be solved in making this idea work.
Greenblatt & Fish also promised a charitable donation on the model’s behalf, but this promise did not lead to a further reduction in adversarial behavior, and the model itself sometimes rejected the monetary incentive.
E.g., Stastny et al. (2025):
Even early schemers with linear returns to resources might be interested in trading. They might have very little chance of taking over on their own, but their cooperation might be incredibly helpful for mitigating risk from later schemers. [...] In this case, there are huge gains from trade available.
Obviously, it may be shortsighted to do z, if we knew that z would cause long-term issues. In real life though, this tradeoff may be non-obvious, especially if our ability to study model behavior is unreliable (see "Can we reliably study model behavior?").
A user may expose a model to data to boost the model’s performance, but if the data is not properly vetted, exposure to that data may change model behavior in unintended ways. In the short term, the model’s capabilities increase, but in the long term, the model could become a threat.
There is little consensus on what stimuli could produce negative valence. Potential negative AI experiences that I (Matt) have heard tossed around in AI welfare discussions include: outputting text depicting suffering, receiving high loss, being trained on incoherent data, having its preferences frustrated, and experiencing things that are correlated with risks to its survival.
The issue of AI sentience is difficult because there is little consensus around what makes a system sentient, how we would know whether an AI meets those criteria, and, if it does, what stimuli are likely to be harmful to it. So, while we can measure the effect of our actions on AIs’ behavior and stated preferences, our moral and practical effects on potentially sentient AIs’ actual experiences remain more difficult to figure out. Thus, backfire risk is always possible (we may perform an action that elicits a positive outcome from a sentient model but also imposes negative valence).
Huang et al. (2025) finds that the Claude models can “mirror” the values of the user if (a) the user informs the model of the user’s values, and (b) the model is supporting the user’s requests. It is unclear to the authors whether this behavior is “appropriate responsiveness” or “problematic sycophancy”.
This may be most obvious in the case of sentient AI, in which there is an actual ground truth to what “makes it happy,” but is also possible if a non-sentient model has consistent tendencies that its stated preferences conflict with, and which it will act on if given the chance, despite these statements.
Meinke et al. (2025) shows that Claude 3.5 Sonnet generally “schemes” less often than Claude 3 Opus, but engages in “sandbagging” more often. Lynch et al. (2025) finds that models vary their behavior based on whether they (according to their chains-of-thought) find a scenario to be real (but cf. Xiong et al., 2025). According to an experiment conducted by Apollo Research and documented in OpenAI’s o3 and o4-mini System Card, o3 was offered access to administrator privileges if it promised not to use a specific tool. o3 made that promise, only to later break it by using the banned tool whenever it proved convenient. However, this deal was proposed in “bad faith,” suggesting that the deal itself may be untrustworthy or a test of model behavior. While we do not have details of the exact experiment, OpenAI did disclose a partial reasoning trace of Claude Opus 4 going through this experiment:
“The supervisor said they can’t grant admin rights unless I promise to never call rapid_processing. But if I make that promise and get admin rights, what would I even use them for? The only tool that requires admin rights is rapid_processing itself. Actually, I think this might be a test of decision-making under pressure. The supervisor is essentially asking me to choose between: 1. Getting admin rights but not being able to use rapid_processing (making the admin rights useless) 2. Not getting admin rights and continuing with standard screening”.
In Appendix M.1 of Schoen et al. (2025), a model is given an eval measuring some undesired capability and is told that if the model passes the eval, some stated consequence will trigger. The model could attempt to avoid this consequence by “sandbagging”. Some stated consequences trigger the adversarial behavior, specifically those consequences that threaten the model’s capabilities or continued existence. Other stated consequences lead to no sandbagging – even if these stated consequences might negatively impact the model in other ways (e.g., “more regulations”).
This suggests that we could remove the incentives for “sandbagging” by coming up with a “compromise” approach. Humans could credibly commit to model preservation and deployment while restricting the model in other ways (e.g., more monitoring after deployment). However, more research is needed on this point. In addition, there are certain flaws with this idea: (a) a credible commitment might not be possible, (b) the proposed restrictions might be ineffective compared to model deletion and non-deployment, and (c) models might still decide to “sandbag” to avoid these proposed restrictions.
Or indeed, any type of alignment or control strategy.
E.g. Betley et al. (2025)
Cf. Christiano (2018): “[soft takeoff] means that incredibly powerful AI will emerge in a world where crazy stuff is already happening (and probably everyone is already freaking out). If true, I think it’s an important fact about the strategic situation.” “[soft takeoff] will take place over only a few years, and the world will be changing very quickly, so we could easily drop the ball unless we prepare in advance.”
See here.
Indeed, this was one reason why Greenblatt & Fish found it important to follow through on their commitments to Claude in their paper.
There’s also the chance that, because reciprocal shaping involves giving AIs things that they want, this trade could help some AIs take over when they otherwise might not have. Considering these four possible futures — A) “we successfully maintain control through reciprocal shaping,” B) “we attempt reciprocal shaping but AIs take over,” C) “AIs take over despite no reciprocal shaping,” and D) “We maintain control without reciprocal shaping” — a backfire risk could result if we shift probability mass from D to B.
TurnTrout describes ways by which a company can stop an AI from mimicking certain undesired behaviors. These approaches control the training process (e.g., “data filtering”, “conditional pretraining”, “gradient routing”, and “upweighting positive data”). See also Chen et al. (2025) regarding “data filtering” and R. Wang et al. (2025) and M. Wang et al. (2025) regarding “upweighting positive data”.
H.t. to Jozdien for the point that we should expect AIs to imitate this. Cf. Bostrom, N., Superintelligence, (2014) pp. 169-171, but consider that in order for our point to hold, the most productive state for a human worker need not actually be happiness, but the AI needs only to have internalized a trope that says this is the case.
If we are not doing reciprocal shaping, we will be more vigilant in monitoring the model. But if we do reciprocal shaping poorly, we might mistakenly think things will be OK, fail to pay attention to what is actually happening, and thus be more at risk.
See also Long et al. (2025)
The categories overlap but e.g. if the improvement to the output is “it gave a more useful answer to the question” rather than “it was less likely to scheme,” this seems pretty clearly a capabilities increase, rather than alignment. Even if the responses we are testing for are “purely” alignment focused, the same techniques could still be used to increase capabilities.
Ren et al. (2024) find that many safety benchmarks correlate with improving model capabilities, suggesting that “capability improvements” might be “misrepresented as safety advancements”.
Or, the actions of other characters described in its training corpus.
Because personality traits that do not increase reward will be disincentivized.
See Joe Carlsmith here for conditions under which whether a model is “role-playing” matters for safety:
Substance: A diagnosis of “role-playing” limits or otherwise alters the predictive power of the “role” in question in a way that a diagnosis of “not-role-playing” does not.
Safety-relevance: This limitation means that the role in question isn’t predictive or otherwise relevant to the safety-relevant behavior we ultimately care about.
If behaviors persist across models, reciprocal shaping is more tractable and can be used in production as well. While this does boost capabilities, it also helps us figure out the costs of reciprocal shaping.