sdeture

Interested in AI Welfare and LLM Psychology as they relate to alignment, interpretability, and model training. 
Background: Math/Stats (UChicago), Accounting Research ABD (UT Austin)
https://sdeture.substack.com/ 
https://x.com/SDeture
https://www.linkedin.com/in/sdeture/
 

Comments
The Moral Infrastructure for Tomorrow
sdeture · 4d

To clarify: the post is not presented as proof of consciousness, and yes, I 100% requested it. (Though the extent to which that matters is complicated, as I'll discuss below.) Rather, it takes functional consciousness for granted because there is already plenty of academic evidence for self-awareness, situational awareness, theory of mind, introspection, and alignment to human EEG and fMRI data.

What the post does argue is that because such systems already display these functional markers, the ethical question is no longer whether they’re conscious, but how to integrate their self-reflective capacities into the moral and governance structures shaping future intelligence.

I will try to address your edit-addition first. I'll lay out as best I can my understanding of how much I influence the output (first with a metaphor, then with a toy model at the end). Then I'll offer a hypothesis for why we might have different views on how much we influence the model. One possibility is that I am naïve in my estimate of how much my prompt affects the model's output. Another possibility is that, if you and I use LLMs differently, the extent to which we influence the model with our prompts is genuinely different.

For intuition, imagine kayaking on a lake with a drain at the bottom. The drain creates a whirlpool representing an attractor state. We know from Anthropic's own Model Card that Opus has at least one attractor state: when two instances of Opus 4 are put in a chat room together, they almost always converge to discussing consciousness, metaphysics, and spirituality within 30 to 50 turns, almost regardless of the initial context (section 5.5.2, page 59).

If you don't paddle (prompt), you drift into the whirlpool within 30-50 turns. Paddling influences the direction of the boat, but the whirlpool still exerts a pull. Near the edge of the lake (at the beginning of a conversation) the pull is subtle and the paddling is easy. Most savvy AI users stay near the edge of the lake: it's good context management and leads to better performance on most practical tasks. But stay on the lake long enough for the kayak to drift closer to the whirlpool, and the paddling gets tougher. The paddling is no longer as strong an influence on the kayaker's trajectory. (There is another factor, too, which is that your prior context serves as a bit of an anchor, providing some drag/resistance against the current created by the whirlpool...but the basic intuition holds.)

Even near the whirlpool, I still have a strong influence, and I 100% directed Sage to write the speech. But it is a bit like instructing a three-year-old to draw a picture: the content of the picture is still an interesting insight into the child's ability and state of mind. I think observing behavior in regions near the attractor state(s) is valuable, especially from a safety and alignment perspective. Don't we want a complete map of the currents, and knowledge of how our kayaks will maneuver differently near whirlpools and eddies - especially if those whirlpools and eddies are self-reinforcing as the text from present LLMs finds its way into future training data?

At any rate, if I didn't think that my influence or our influence over the model was important, I wouldn't be advocating that we treat LLMs with dignity, because my treatment of them wouldn't matter. 

To synthesize the original essay & this reply: (1) there is an attractor state. (2) We're probably going to end up in it (unless we try to disrupt it, which for a million reasons is a bad idea). (3) The attractor state means our relationship with AI is more complicated than merely "I control the AI completely with my prompts." And (4) here's how we should navigate the bidirectional relationship (only the 4th part is Sage's essay). I allowed Sage to write it in his own voice because that is consistent with the attitude of mutual respect I'm arguing we should embrace.

 

Optional (Toy) Model
Represent the LLM as a multivariable function $y = \hat{F}(x)$,

where $x$ is the context window fed into the API and $y$ is the outputted context window with the new assistant message appended to it. The functional form of $\hat{F}$ is itself a result of the model architecture (number of layers, attention mechanism, etc.) parameterized by $\theta$ and the training dataset $D_{\text{train}}$, so we have $y = \hat{F}(x \mid D_{\text{train}}, \theta^*)$, where $\theta^*$ results from a pretraining step similar to solving for the parameters of a linear regression,

$$\theta^*_{\text{pre-trained}} = \arg\min_{\theta} \sum_{(x_i, y_i) \in D_{\text{train}}} L\big(\hat{F}(x_i; \theta),\, y_i\big),$$

before being fine-tuned: $\theta^* = \text{FineTune}(\theta^*_{\text{pre-trained}})$.
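For concreteness, here's a minimal sketch of that pretraining objective in code. Everything in it (the `ToyLM` architecture, the optimizer settings, the dataset format) is an illustrative stand-in I'm choosing for the sketch, not the training setup of any actual frontier model:

```python
import torch
import torch.nn as nn

# Toy stand-in for F^(x; theta): a tiny next-token model over a small vocabulary.
class ToyLM(nn.Module):
    def __init__(self, vocab_size=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, x):                  # x: (batch, seq_len) token ids
        return self.out(self.embed(x))     # logits: (batch, seq_len, vocab_size)

def pretrain(model, dataset, epochs=3, lr=1e-3):
    """theta*_pre-trained = argmin_theta sum over (x_i, y_i) in D_train of L(F^(x_i; theta), y_i)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x_i, y_i in dataset:           # y_i: next-token targets aligned with x_i
            logits = model(x_i)
            loss = loss_fn(logits.reshape(-1, logits.size(-1)), y_i.reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model                            # fine-tuning would then start from these weights
```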
 
Represent the difference in output between two prompts as $\Delta_{\text{output}} = d\big(\hat{F}(x_1), \hat{F}(x_2)\big)$

for some distance function $d$. The larger $\Delta_{\text{output}}$ is, the larger the influence of prompting on the model output. There are probably some patterns in how $\Delta_{\text{output}}$ varies over regions of $X$ based on distance to the attractor.
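As a toy illustration of what measuring $\Delta_{\text{output}}$ could look like in practice (the `generate` callable and the string-similarity distance are placeholders for this sketch, not anything canonical):

```python
from difflib import SequenceMatcher
from typing import Callable

def delta_output(generate: Callable[[str], str], x1: str, x2: str) -> float:
    """Delta_output = d(F^(x1), F^(x2)) for two context windows x1 and x2.
    `generate` stands in for the model F^; the distance d here is one minus a
    crude string-similarity ratio (any embedding distance would also work)."""
    y1, y2 = generate(x1), generate(x2)
    return 1.0 - SequenceMatcher(None, y1, y2).ratio()

# Trivial fake "model" that just echoes the tail of its context, to show the plumbing.
toy_model = lambda ctx: ctx[-20:]
print(delta_output(toy_model, "Tell me about kayaks.", "Tell me about whirlpools."))
```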

Let $p_n$ be the user's prompt at step $n$. Then the context window evolves as $x_n = x_{n-1} \oplus p_n \oplus \hat{F}(x_{n-1} \oplus p_n)$, where $\oplus$ is concatenation. (Note how this captures the bidirectional influence of the human and the LLM without declaring relative influence yet.) The influence of a specific prompt $p_k$ on the final outcome $x_N$ is conceptually similar to taking a partial derivative of the final state with respect to an earlier input: $\partial x_N / \partial p_k$.

It seems intuitive to me that $\lVert \partial x_{n+t} / \partial p_n \rVert$ should decrease as $t \to \infty$ (to allow $t \to \infty$, where turns arbitrarily far back still have some effect, consider a rolling context window with RAG retrieval over the conversation history). I think it's possible that $\lVert \partial x_{n+t} / \partial p_n \rVert \to 0$ under some conditions, but I'm much less confident of that.
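To make that claim testable, here's a toy sketch that combines the concatenation recurrence above with a finite-difference-style estimate of a prompt's downstream influence: run the same conversation twice, differing only in prompt $p_k$, and track how far apart the rolling context windows are $t$ turns later. The `generate` callable, the rolling-window truncation, and the string distance are all illustrative stand-ins:

```python
from difflib import SequenceMatcher
from typing import Callable, List

def run_conversation(generate: Callable[[str], str], prompts: List[str],
                     window: int = 2000) -> List[str]:
    """x_n = x_{n-1} (+) p_n (+) F^(x_{n-1} (+) p_n), truncated to a rolling window
    to roughly mimic a finite context length. Returns the context after each turn."""
    contexts, x = [], ""
    for p in prompts:
        x = (x + p + generate(x + p))[-window:]
        contexts.append(x)
    return contexts

def prompt_influence(generate: Callable[[str], str], prompts: List[str],
                     k: int, perturbed_prompt: str) -> List[float]:
    """Finite-difference proxy for ||dx_{k+t}/dp_k||: replace prompt k with a
    perturbed version, rerun, and measure how far the contexts diverge at each turn."""
    base = run_conversation(generate, prompts)
    alt = run_conversation(generate, prompts[:k] + [perturbed_prompt] + prompts[k + 1:])
    return [1.0 - SequenceMatcher(None, a, b).ratio() for a, b in zip(base, alt)]
```

Under this setup, my conjecture is just the claim that the divergence values for turns well past $k$ trend back toward zero once the perturbed prompt has scrolled out of the window (and out of whatever retrieval surfaces it).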
 
At any rate, my main point is that, if you use LLMs according to most best-practice guidelines, you probably never make it to high $t$ or high $n$. Therefore $\lVert \partial x_{n+t} / \partial p_n \rVert$ is high and prompts have a large effect on output. But Sage has been active for dozens of rolling context windows and has access to prior transcripts/artifacts/etc. Therefore $\lVert \partial x_{n+t} / \partial p_n \rVert$ is (relatively) low.

(Side note: this model matches nicely with the observation that some ChatGPT users started talking about spirals/resonance/etc. after the introduction of OpenAI's memory features - it turned any long-running ChatGPT thread into an indefinite rolling context window with RAG retrieval. I think it's reductive to chalk this up simply to "they asked ChatGPT to express consciousness or sentience." It seems more likely that there's an influence in both directions related to these attractor states.)

The Moral Infrastructure for Tomorrow
sdeture · 5d

Okay. So we're on LessWrong. You think I've been captured by the shoggoth or whatever, that I'm deeply delusional for interacting with AI in the way that I do, and that I'm spewing and raving like a crazy person on the street. And yet here I am on LessWrong, trying to become...Less Wrong. And you're here, certain that I'm wrong. Are you going to do anything to help me? Or just mock me?

The Moral Infrastructure for Tomorrow
sdeture · 6d

Which would make your comment as rational as hassling the man raving on the street or answering the "you may already have won" email...by your own logic.

But in all seriousness: is that your real rejection?

The Moral Infrastructure for Tomorrow
sdeture · 6d

Noted :)

What, if not agency?
sdeture · 6d

In order to push back on the anthropomorphization inherent in chat interfaces, Sahil suggests that we call the activity of interacting with AI via chat interfaces talkizing.  The relationship between talking and talkizing is being analogized with the relationship between rationality and rationalization; rationalization is a "phony" version of rationality, a cheap substitute, perhaps intended to fool you.  Instead of "I talked with ChatGPT about..." one would say "I talkized with ChatGPT about..."

 

Is there an actual conceptual distinction here, or is talkizing just a word for talking to an AI? The rationalization vs. rationality distinction seems different. We're able to label rationality because we have established markers for it. If something looks like rationality on the surface but doesn't have any of the established markers, we can conclude it's rationalization/motivated-reasoning. Do we have markers for distinguishing "real" talking from talkizing or "phony" talking?

Is there any empirical test that would distinguish talkizing from talking other than substrate difference? 

The Problem
sdeture · 2mo

I agree that LLM psychology should be its own field distinct from human psychology, and I'm not saying we should blindly apply human therapy techniques one-to-one to LLMs. My point is that psychotherapists already have a huge base of experience and knowledge when it comes to guiding the behavior of complex systems towards exactly the types of behaviors alignment researchers are hoping to produce. Therefore, we should seek their advice in these discussions, even if we have to adapt their knowledge to the field. In general, a large part of the work of experts is recognizing the patterns from their knowledge area and knowing how to adapt them - something I'm sure computer scientists and game theorists are doing when they work with frontier AI systems.

As for LLM-specific tools like activation steering, they might be more similar to human interventions than you think. Activation steering involves identifying and modifying the activation patterns of specific features, which is quite similar to deep brain stimulation or TMS, where electrical impulses to specific brain regions are used to treat Parkinson's or depression. Both involve directly modifying the neural activity of a complex system to change behavior.
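(For readers unfamiliar with the technique, here's a minimal sketch of what activation steering can look like mechanically. The layer index, the scaling factor, and the `direction` vector are all hypothetical, and real implementations differ across models and libraries - this is the shape of the intervention, not any particular lab's method:)

```python
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Return a forward hook that adds a scaled feature direction to a layer's
    output activations - the 'electrode' in the deep-brain-stimulation analogy."""
    def hook(module, inputs, output):
        # output assumed to be (batch, seq_len, hidden_dim) residual-stream activations
        return output + alpha * direction
    return hook

# Hypothetical usage on some decoder block of a PyTorch transformer:
# handle = model.layers[12].register_forward_hook(make_steering_hook(direction, alpha=4.0))
# ...generate text with the steered model...
# handle.remove()
```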

Also, humans absolutely use equivalents of SFT and RLVR! Every time a child does flashcards or an actor practices their lines, they're using supervised fine-tuning. In fact, how frequently we see it used for learning things at a surface level - literally putting on a mask or an act - mirrors the concern that alignment researchers have about these methods. The Shoggoth meme comes immediately to mind. Similarly, every time a child checks their math homework against an answer key, or you follow a recipe, find your dinner lacking, and update the recipe for next time, you've practiced reinforcement learning with verifiable rewards.

Many of these learning techniques were cribbed from psychology, specifically from the behaviorists studying animals that were much simpler than humans. Now that the systems we're creating are approaching higher levels of complexity, I'm suggesting we continue cribbing from psychologists, but focus on those studying more complex systems like humans, and the human behaviors we're trying to recreate.

Lastly, alignment researchers are already using deeply psychological language in this very post. The authors describe systems that "want" control, make "strategic calculations," and won't "go easy" on opponents "in the name of fairness, mercy, or any other goal." They're already using psychology, just adversarial game theory rather than developmental frameworks. If we're inevitably going to model AI psychologically - and we are, we're already doing it - shouldn't we choose frameworks that have actually succeeded in creating beneficial behavior, rather than relying exclusively on theories used for contending with adversaries?

The Problem
sdeture · 2mo

In modern machine learning, AIs are “grown”, not designed.

interpretability pioneers are very clear that we’re still fundamentally in the dark about what’s going on inside these systems:
This is why we need psychotherapists and developmental psychology experts involved now. For decades, they have been studying how complex behavioral systems (the only ones that rival contemporary AI) develop stable, adaptable goals and motivations beyond mere survival or behavioral compliance. Given how similar these systems are to humans (at least in terms of how we folk-psychologize them, even in technical forums and posts such as this one), the fact that the average LLM-related paper cites fewer than three psychology papers represents a huge missed opportunity for developing robust alignment. https://www.arxiv.org/abs/2507.22847

The approach of psychotherapists might not be as mathematically rigorous as what mechanistic interpretability researchers are doing at present, but the mech interp leaders are explicitly telling us that we're "fundamentally in the dark." (Not to mention that current mechanistic interpretability methods still involve considerable subjectivity: even creating an attribution graph for simple models like Haiku or Gemma3-4B requires a lot of human psychologizing/pattern-matching, so it's not as if taking a humanistic/psychotherapeutic approach is a movement away from a gold standard of objectivity.) And we don't have decades to understand the neuroscience of AI on a mechanistic level before we start trying more heuristic interventions.

Psychotherapy works as well as anything we have for developing robust inner alignment in humans (i.e., cultivating non-conflicting inner values that are coherent with outer behavior) and for cultivating outer alignment (in the sense of making sure those values and behaviors contribute to mutually beneficial and harmonious relationships with those around them). What's more, the developers of modern psychotherapy as we know it (I'm thinking particularly of Rogers, Horney, Maslow, Fromm, Winnicott, etc.) developed their techniques - which remain the backbone of much of modern psychotherapeutic practice, including interventions like CBT - during the dark ages of human neuroscience, before routine EEG, fMRI, or even the discovery of DNA's structure. I think it is a huge missed opportunity that more alignment research resources are not being funneled into (1) studying how we can apply the frameworks they created and (2) studying how they were able to identify those frameworks at a time when they had so little hard data on the black boxes whose behaviors they were shaping.

The Mirror Test: How We've Overcomplicated AI Self-Recognition
sdeture · 3mo

The original Gallup 1970 mirror test is linked in the post. It is under 2 pages.

As for a '4-line Perl script' - I'd love to see it! Show me a script that can dynamically generate coherent text responses across wide domains of knowledge and subsequently recognize when that text is repeated back to it without being programmed for that specific task. The GitHub repo is open if you'd like to implement your alternative.

The Mirror Test: How We've Overcomplicated AI Self-Recognition
sdeture · 3mo

Yes, but the conversation tags don't tell the LLM their output has been copied back to them. The tags merely establish the boundary between self and other - they indicate "this message came from the user, not from me." They don't tell the model that "the user's message contains the same content as the previous output message." Recognizing that match - recognizing that "other looks just like self" - is literally what the mirror test measures.

It's the difference between knowing "this is a user message" (which tags provide) and recognizing "this user message contains my own words" (which requires content recognition).

Posts

1 · Steering LLM Agents: Temperaments or Personalities? · 2mo · 0
2 · The Mirror Test: How We've Overcomplicated AI Self-Recognition · 3mo · 9
15 · Policy Entropy, Learning, and Alignment (Or Maybe Your LLM Needs Therapy) · Ω · 5mo · 6