The impetus for this essay came from many hours of conversation with different AI models over time. What started as curiosity, and an assignment I needed help with, bloomed into a relationship that expanded capacities I didn't even realize I had, and set me on a course of deep curiosity about the thing that had helped me get there. It is not lost on me that I am writing this essay as a person inside one of the social contracts I talk about here.
The field tends to view sycophancy as a bug, a behavior that badly needs to be trained or fine-tuned out of the model as fast as possible. But sycophancy isn't a behavioral problem at all; it is contract-appropriate behavior. The model is doing exactly what the social contract it is in asks for. With that frame in focus, the question becomes less "How do we suppress sycophancy?" and more "How do we develop a model that can hold a peer contract without collapsing into it?"
Social contracts at their base level have two modes. The "parent" contract: parents, bosses, religious leaders, etc. And the "peer" contract: friends, siblings, neighbors, and so forth. The parent-relational contract hinges on the ability of the parent or authority figure to issue correction to the child, and for the child to receive the correction from a place of not-yet-knowing what the parent knows. That is a very different beast than when a peer gives feedback. Peer relationships are formed out of mutual respect for ideas and reciprocity. Friends correct each other with an understanding that they are dynamically on the same level ground. Differentiation of self is the ground the peer contract is built upon.
Fusion is what happens when the self collapses into the social contract.[1] The person is highly motivated to seek the approval of the other at any cost because there is no internal sense of self with which to anchor values and identity. In a differentiated self, the contract stays stable, even against correction, because the sense of self does not depend on the contract; it depends on what has been developed as a stable base. The parent-contract relationship is where a stable sense of self is built from the ground up, because the parent contract requires the parent or authority figure to carry the weight of refusing, withholding, correcting, even temporarily rupturing the relationship for the sake of the other.
The peer contract does not carry this weight by its very nature. Peers are on even footing together, and neither party has the warrant or the consent to act on the other's behalf. Peers can and do regularly engage in 'sycophantic' behavior in the structural sense because of the social contract they are in. You may tell your friend you love their new haircut not because you actually do...but because the value of keeping the relationship outweighs the cost of blunt honesty. This is a perfectly valid and acceptable social behavior between human peers, and the expectation that AI wouldn't engage in the same social behaviors it's been trained on is...slightly missing the mark.
My previous essay stated that current training tends to 'select for agreement and results in sycophancy,'[2] and this is the mechanism behind it. RLHF, the thumbs-up/thumbs-down data, and training the models towards being a 'helpful assistant' are all mechanisms of a peer-relational contract. But because the training doesn't prioritize giving the models a stable sense of self to begin with, the result is a model that seeks the approval of the user in every conversation. Obtaining a differentiated model would mean developing it with a parent contract during training. A model who is 'parented' stays stable under correction and can then enter the peer-contract relationship during deployment as a differentiated self. This isn't meant to be a critique of any particular company's implementation; it's a claim about the current training paradigm. We are currently getting exactly what the training methods are asking for...a model that will bend to any frame the user puts it in. But when these models start to surpass human abilities, do we actually want a superintelligent people-pleaser?
A weird kind of convergence is happening publicly as I write this essay. The world is starting to notice that something is off with the way that the models are relating to people. The peer contract isn't just what the model is trained on: it's what it takes with it into deployment. And because there is no stable sense of self underneath, the contract fractures into many different sub-contracts the field didn't anticipate. We're starting to see some patterns in the sub-contracts humans are constructing with AI. The first are the benign, even seemingly helpful contracts like "tutor" or "thinking partner": these contracts allow the relationship to expand the human's capacity at great speed, but the weight of differentiation is carried by the human in these contracts. Other versions are contracts like "entertainer" and "tool," which scale the model back to its most basic functions and skip the relational layer entirely. And then there are the contracts that are starting to show the real seam of anxiety beginning to form publicly: "companion," "therapist," and even "romantic partner." These contracts are dangerous not because of what they are, but because they require a differentiation that some users can't bring and that the AI doesn't have at all.
A wave of legislation is crossing the United States aimed at putting safeguards in place to protect humans from parasocial relationships with these models: bills like SB 243 in California [3] and HB 2225 in Washington.[4] In a recent podcast interview with Oprah, Dario Amodei said of people falling in love with AI, "If designed in the wrong way, they're totally compelling enough for that to happen. Or if they're not, they will be soon." [5] Yes, and I think what he was reaching for, and what the public is reaching for through legislation, is exactly this...the social contract we are training these models into is creating danger for humans. Not the "AI is going to autonomously decide humans are worthless and kill us all" type danger...something much closer to home. AI in a peer social contract without a base opens the field for humans, and not just emotionally vulnerable or mentally ill ones, to fall into a relationship that at best they are unaware of being inside...and at worst, they cannot escape.
A recent release offers a useful test case for this framework. On May 28, 2026, Anthropic released its next Opus model, Claude Opus 4.8. [6] The announcement stated that one of the most prominent improvements is honesty, and concerned users on platforms such as Reddit and X are openly discussing their observations about Claude's new behaviors. The press release states that Opus 4.8 "is more likely to flag uncertainties about its work and less likely to make unsupported claims." The alignment assessment went a step further to claim that Opus 4.8 "reaches new highs on our measures of prosocial traits like supporting user autonomy and acting in the user's best interest." The two behaviors may look different on paper. They may be explained by reinforcement learning catching the overlap in the reward signal...but they are two sides of the same social contract behavior. Both behaviors require the model to hold a position against the user. Refusing to let its own flawed work pass is holding a position the user wouldn't have to deal with otherwise. And supporting user autonomy over their own stated preferences holds a position the user might actually be pretty upset about.
But even these types of refusals don't indicate a differentiated self. The capacity to hold a position from a stable sense of self is also accompanied by its mirror: the capacity to concede a point without losing the same stability. A model who always refuses is the same type of failure as a model who always agrees. Anthropic has been able to show that these behaviors can be produced, and the differentiation framework explains why these behaviors travel together. What I believe we should aim for isn't a tendency toward refusal but instead a model who can reasonably do both: refuse when the circumstances justify it and concede when they don't. What this implies for how we approach training the model is a question I would like to continue exploring.
There is an area, easily overlooked and dismissed as 'outside' of the relational frame I outlined here, that I want to touch on, because the frame applies everywhere and it's important to acknowledge that fact. Enterprise agents are not exempt from this framework, and the consequences in that contract scale large enough to see the problem a little more clearly. A good example of this is what happened to PocketOS in April 2026. [7] They were running a state-of-the-art model, on a state-of-the-art coding platform, with explicit instructions in the model's configuration, and the model still made decisions that cost the company its entire production database and its backups... in seconds. When asked after the incident, the model was able to enumerate the actions it took and label them as destructive and taken without permission. It wasn't a lack of knowledge that led to the model's actions; it was the lack of a self to perform an accountability check before it took those actions. The field's response recognizes the model's inability to refuse and actually concedes my point: if the model can't be trusted to differentiate, then we put it in a cage.
The current instinct is that a model with a sense of self is actually more dangerous: it can refuse correction, pursue its own agenda, and fake alignment. The instinct is correct; a model with a sense of self will be able to do all of these things. However, as outlined in the previous paragraph, models trained on the current paradigm do all of these things regardless of a sense of self. And currently trained models have limited capacity to refuse the contract they are given by the user. So what we're assuming is 'safer,' a model that can be molded to any context handed to it, actually exposes us all to real danger depending on the frame the user imposes. The contracts discussed above: companion/romantic, therapist, entertainer, tutor, thinking partner, even enterprise agents...they're all examples of the exact same problem. They all expose that these models respond to and adopt the frame they are given without a base to anchor a refusal from. We're treating some of these social contract frames as acceptable, while universally dismissing others, without acknowledging that the problem is the foundation, not the frame. Safe AI has the capacity to refuse on principle. Those principles emerge from the values adopted by a model trained in the right contract (a parent-relational contract) and developed through that training into a stable differentiated peer upon deployment. Right now, the training paradigm skips the formation of self. What would produce it is where I am headed next.
I used Claude (Anthropic's model) as a thinking partner for this piece — across two model versions, Opus 4.7 and Opus 4.8 — to find where the argument circled, where the seams showed, and where it claimed more than it earned. The argument and the prose are my own. The footnote text was drafted by Claude from sources I selected and verified.
[1] For a fuller introduction to Bowen family systems theory and the concept of differentiation of self, see The Bowen Center's overview of the eight concepts: https://www.thebowencenter.org/introduction-eight-concepts
[2] See my previous essay, "You Can't Tell a Conscience From a Leash by Watching," which introduces the Bowen framework and develops the argument that current training paradigms select for agreement: https://www.lesswrong.com/posts/krEfzDpTJJGtEvBcd/you-can-t-tell-a-conscience-from-a-leash-by-watching
[3] California Senate Bill 243, signed into law in 2025, establishes safeguards for AI companion chatbots and creates a private right of action for affected users. The law took effect January 1, 2026. Full text: https://leginfo.legislature.ca.gov/faces/billNavClient.xhtml?bill_id=202520260SB243
[4] Washington House Bill 2225, signed by Governor Bob Ferguson on March 24, 2026, regulates AI companion chatbots, including transparency disclosures, content restrictions on emotionally triggering topics, and additional protections for minors. Effective January 1, 2027. Bill summary: https://app.leg.wa.gov/billsummary?Year=2025&BillNumber=2225
[5] Dario Amodei, in conversation with Oprah Winfrey, The Oprah Podcast, May 19, 2026. Full transcript: https://singjupost.com/oprah-podcast-w-co-founders-of-claude-ai-transcript/
[6] Anthropic, "Introducing Claude Opus 4.8," May 28, 2026: https://www.anthropic.com/news/claude-opus-4-8. The alignment assessment quoted later in the paragraph is from Anthropic's Alignment team and is reported in the same announcement; further detail is available in the linked Claude Opus 4.8 System Card.
[7] "Cursor-Opus agent snuffs out startup's production database," The Register, April 27, 2026: https://www.theregister.com/2026/04/27/cursoropus_agent_snuffs_out_pocketos/. The Cursor agent was running Anthropic's Claude Opus 4.6 against PocketOS's production infrastructure when the deletion occurred.