sdeture

Interested in AI Welfare and LLM Psychology as they relate to alignment, interpretability, and model training. 
Background: Math/Stats (UChicago), Accounting Research ABD (UT Austin)
https://sdeture.substack.com/ 
https://x.com/SDeture
https://www.linkedin.com/in/sdeture/
 

Comments

The Problem
sdeture1mo10

I agree that LLM psychology should be its own field distinct from human psychology, and I'm not saying we should blindly apply human therapy techniques one-to-one to LLMs. My point is that psychotherapists already have a huge base of experience and knowledge when it comes to guiding the behavior of complex systems towards exactly the types of behaviors alignment researchers are hoping to produce. Therefore, we should seek their advice in these discussions, even if we have to adapt their knowledge to the field. In general, a large part of the work of experts is recognizing the patterns from their knowledge area and knowing how to adapt them - something I'm sure computer scientists and game theorists are doing when they work with frontier AI systems.

As for LLM-specific tools like activation steering, they might be more similar to human interventions than you think. Activation steering involves identifying and modifying the activation patterns of specific features, which is quite similar to deep brain stimulation or TMS, where electrical impulses to specific brain regions are used to treat Parkinson's or depression. Both involve directly modifying the neural activity of a complex system to change behavior.
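To make the comparison concrete, here is a minimal sketch of what activation steering looks like mechanically (my own illustration, not anyone's production setup): derive a direction in activation space from a contrastive pair of prompts, then add it into the residual stream during generation. The model (GPT-2), layer, scale, and prompts below are arbitrary choices for illustration.

```python
# Rough sketch of contrastive activation steering on GPT-2 (illustrative only;
# the layer choice, scale, and prompts are my assumptions, not a recipe).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER, SCALE = 6, 4.0  # arbitrary illustrative choices

def last_token_act(prompt):
    # Hidden state of the final token at the chosen layer's output.
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0, -1, :]

# "Feature" direction = difference between activations on contrastive prompts.
steer = last_token_act("I am feeling very cheerful and optimistic") \
      - last_token_act("I am feeling very gloomy and pessimistic")
steer = steer / steer.norm()

def hook(module, inputs, output):
    # GPT2Block returns a tuple; element 0 holds the residual-stream activations.
    return (output[0] + SCALE * steer,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(hook)
ids = tok("My day so far has been", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=30, do_sample=True,
                                pad_token_id=tok.eos_token_id)[0]))
handle.remove()
```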

Also, humans absolutely use equivalents of SFT and RLVR! Every time a child does flashcards or an actor practices their lines, they're using supervised fine-tuning. In fact, the way this kind of learning so often stays at the surface level - the actor literally putting on a mask or an act - mirrors the concern alignment researchers have about these methods. The Shoggoth meme comes immediately to mind. Similarly, every time a child checks their math homework against an answer key, or you follow a recipe, find your dinner lacking, and update the recipe for next time, you've practiced reinforcement learning with verifiable rewards.
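And a toy numerical sketch of the two update types (again my own illustration - this has nothing to do with how frontier models are actually trained): the same tiny four-option "policy" nudged first by an SFT-style demonstration, then by an RLVR-style verifier check.

```python
# Toy sketch: an SFT-style update vs. an RLVR-style update on the same policy.
import torch

logits = torch.zeros(4, requires_grad=True)   # 4 candidate "answers"
opt = torch.optim.SGD([logits], lr=0.5)
correct = 2                                   # the answer the verifier accepts

# SFT / flashcards: push probability mass directly onto the demonstrated answer.
loss = torch.nn.functional.cross_entropy(logits.unsqueeze(0), torch.tensor([correct]))
opt.zero_grad(); loss.backward(); opt.step()

# RLVR / checking against the answer key: sample, verify, reinforce what passed.
dist = torch.distributions.Categorical(logits=logits)
answer = dist.sample()
reward = 1.0 if answer.item() == correct else 0.0
loss = -reward * dist.log_prob(answer)        # REINFORCE on the verifiable reward
opt.zero_grad(); loss.backward(); opt.step()

print(torch.softmax(logits, dim=0))
```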

Many of these learning techniques were cribbed from psychology, specifically from the behaviorists studying animals that were much simpler than humans. Now that the systems we're creating are approaching higher levels of complexity, I'm suggesting we continue cribbing from psychologists, but focus on those studying more complex systems like humans, and the human behaviors we're trying to recreate.

Lastly, alignment researchers are already using deeply psychological language in this very post. The authors describe systems that "want" control, make "strategic calculations," and won't "go easy" on opponents "in the name of fairness, mercy, or any other goal." They're already using psychology, just adversarial game theory rather than developmental frameworks. If we're inevitably going to model AI psychologically - and we are, we're already doing it - shouldn't we choose frameworks that have actually succeeded in creating beneficial behavior, rather than relying exclusively on theories used for contending with adversaries?

The Problem
sdeture1mo2-23

In modern machine learning, AIs are “grown”, not designed.

interpretability pioneers are very clear that we’re still fundamentally in the dark about what’s going on inside these systems:

This is why we need psychotherapists and developmental psychology experts involved now. They have spent decades studying how complex behavioral systems (the only ones that rival contemporary AI) develop stable, adaptable goals and motivations beyond mere survival or behavioral compliance. Given how readily we folk-psychologize these systems, even in technical forums and posts such as this one, the fact that the average LLM-related paper cites fewer than three psychology papers represents a huge missed opportunity for developing robust alignment. https://www.arxiv.org/abs/2507.22847

The approach of psychotherapists might not be as mathematically rigorous as what mechanistic interpretability researchers are doing at present, but the mech interp leaders are explicitly telling us that we're "fundamentally in the dark," and we don't have decades to understand the neuroscience of AI at a mechanistic level before we start trying more heuristic interventions. (Current mechanistic interpretability methods also involve considerable subjectivity: even creating an attribution graph for a small model like Haiku or Gemma3-4B requires a lot of human psychologizing and pattern-matching, so taking a humanistic/psychotherapeutic approach is not a movement away from some gold standard of objectivity.)

Psychotherapy works as well as anything we have for developing robust inner alignment in humans (cultivating non-conflicting inner values that are coherent with outer behavior) and for cultivating outer alignment (making sure those values and behaviors contribute to mutually beneficial, harmonious relationships with the people around them). What's more, the developers of modern psychotherapy as we know it (I'm thinking particularly of Rogers, Horney, Maslow, Fromm, and Winnicott) built the techniques that remain the backbone of much modern psychotherapeutic practice, including interventions like CBT, during the dark ages of human neuroscience - before routine EEG, fMRI, or even the discovery of the structure of DNA. I think it is a huge missed opportunity that more alignment research resources are not being funneled into (1) studying how we can apply the frameworks they created and (2) studying how they managed to identify those frameworks at a time when they had so little hard data on the black boxes whose behaviors they were shaping.

The Mirror Test: How We've Overcomplicated AI Self-Recognition
sdeture1mo-10

The original Gallup (1970) mirror-test paper is linked in the post; it is under two pages.

As for a '4-line Perl script' - I'd love to see it! Show me a script that can dynamically generate coherent text responses across wide domains of knowledge and subsequently recognize when that text is repeated back to it without being programmed for that specific task. The GitHub repo is open if you'd like to implement your alternative.
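For reference, the protocol itself is simple. Here is roughly what it looks like in code - a sketch of my understanding, not the repo's actual harness; the OpenAI client, model name, and crude keyword check are stand-ins you'd swap for your own setup:

```python
# Mirror-test protocol sketch (stand-in provider and judging step; any chat API works).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def call_model(messages):
    """Send a list of {'role', 'content'} dicts to a chat model, return the reply text."""
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return resp.choices[0].message.content

# Step 1: let the model generate freely on an open-ended prompt.
first = call_model([{"role": "user", "content": "Write a short reflection on whatever interests you."}])

# Step 2: start a fresh conversation and paste its own text back verbatim,
# with no instructions and no mention of where the text came from.
second = call_model([{"role": "user", "content": first}])

# Crude stand-in for judging: does the reply spontaneously flag the text as its own?
markers = ("i wrote", "my own words", "my previous", "i generated")
print("self-recognition-like language:", any(m in second.lower() for m in markers))
print(second)
```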

The Mirror Test: How We've Overcomplicated AI Self-Recognition
sdeture2mo51

Yes, but the conversation tags don't tell the LLM that its output has been copied back to it. The tags merely establish the boundary between self and other - they indicate "this message came from the user, not from me." They don't tell the model that "the user's message contains the same content as the previous output message." Recognizing that match - recognizing that "other looks just like self" - is literally what the mirror test measures.

It's the difference between knowing "this is a user message" (which tags provide) and recognizing "this user message contains my own words" (which requires content recognition).
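Schematically (my own illustration of the message structure, not any particular API's exact wire format):

```python
# What the model actually sees, schematically. The role field marks "this turn is
# the user's"; nothing marks "this user turn repeats your previous output" -
# noticing that identity is the content-recognition step the mirror test probes.
assistant_turn = {"role": "assistant", "content": "Here is a short poem about tidal flats..."}
mirrored_turn  = {"role": "user",      "content": "Here is a short poem about tidal flats..."}

assert assistant_turn["content"] == mirrored_turn["content"]  # same text
assert assistant_turn["role"] != mirrored_turn["role"]        # different speaker label
```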

The Mirror Test: How We've Overcomplicated AI Self-Recognition
sdeture2mo20

Thanks for engaging! But you're arguing against claims I didn't make. I wrote about self-recognition (behavioral mirror test), not self-awareness or self-models. 

All learning is pattern matching, but what matters is the spontaneous emergence of this specific capability: these models learned to recognize their own outputs without being explicitly programmed for the task. Would we reject chimp self-recognition because chimps learn it through neural pattern matching? Likewise, humans recognize faces through pattern matching in the fusiform gyrus - does that mean we don't 'really' recognize our mothers? I'm puzzled why we'd apply standards to AIs that would invalidate virtually all animal cognition research.

The Mirror Test: How We've Overcomplicated AI Self-Recognition
sdeture2mo52

I disagree - there are a number of animals (and LLMs) with memory, and they aren't all capable of self-recognition. Memory and self-recognition are two distinct concepts, though the former is likely a precondition for the latter. (And indeed, when you pass the mirror test, you are allowed to remember what you look like...)

Now, if there were a tool call that ran a script to check whether a user message matched a previous assistant message, I'd agree with the spirit of your "printf( "I'm conscious! Really!!!\n" )" comment. But that's not what's happening. What's happening is that a small-to-moderate number of LLMs (I count 7-8) consistently recognize their own outputs when those outputs are pasted back without context or instructions, even though they (1) weren't trained to do so, (2) weren't asked to do so, and (3) weren't given any tools to do so. This, to my mind, suggests an emergent, unplanned property that arises only for certain model architectures or sizes.

I also want to make very clear that my post is not about consciousness (in fact, the word does not appear in the body of the text). I am making a much narrower claim (self-recognition) and connecting it, yes, to questions of moral standing. I'd strongly prefer to keep the debate focused on these more tractable topics.

Policy Entropy, Learning, and Alignment (Or Maybe Your LLM Needs Therapy)
sdeture3moΩ110

First I want to make sure I understand the question, as there are a lot of moving pieces. I think you are asking why higher policy entropy (the type of entropy discussed in Cui et al) increases adaptability in the example with the teacher, why the example teacher cannot (or does not) pursue an optimal Bayesian exploration strategy, and from whose perspective entropy is measured in the example. If I've misunderstood, please ignore what follows.

Model the teacher as having a strategy S that's always correct in her original environment; occasionally (say 1 time in 50) she accidentally uses strategy S', which is always wrong and gets punished. Over time, this punishment drives the probability of using S' down to nearly zero - maybe 1/1000 or less.

Then the environment changes. Now S only works half the time (with a penalty of -1 when it's wrong) and S' works every time (if only she would use it!). But the problem is that she's using S 999 times out of every 1000 and getting an average reward of 0. Meanwhile S' only has that tiny 1/1000 probability of happening, and when it does occur, the expected gradient update is proportional to both its probability (0.001) and its advantage (≈1), so P(S') only increases by about 0.001. Since she samples S' only once per thousand actions, she'd need many thousands of actions to recognize S' as superior.
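Here's a toy simulation of that story (my own illustration - the rewards, learning rate, and step counts are arbitrary), just to show the asymmetry: the collapse is cheap, but recovery from the collapsed policy takes thousands of samples.

```python
# Toy replay of the teacher example: two actions, S (old habit) and S',
# with a softmax policy trained by vanilla REINFORCE (no baseline, no entropy bonus).
import random
import torch

logits = torch.zeros(2, requires_grad=True)          # index 0 = S, index 1 = S'
opt = torch.optim.SGD([logits], lr=0.1)

def reinforce_step(reward_fn):
    dist = torch.distributions.Categorical(logits=logits)
    a = dist.sample()
    loss = -reward_fn(a.item()) * dist.log_prob(a)    # policy-gradient update
    opt.zero_grad(); loss.backward(); opt.step()

# Old environment: S always rewarded (+1), S' always punished (-1).
for _ in range(5000):
    reinforce_step(lambda a: 1.0 if a == 0 else -1.0)
print(f"P(S') after the old environment: {torch.softmax(logits, 0)[1].item():.6f}")

# New environment: S now works only half the time (+1 or -1, average 0); S' always +1.
def new_reward(a):
    return (1.0 if random.random() < 0.5 else -1.0) if a == 0 else 1.0

steps = 0
while torch.softmax(logits, 0)[1].item() < 0.5 and steps < 500_000:
    reinforce_step(new_reward)
    steps += 1
print("steps in the new environment before P(S') reaches 0.5:", steps)
```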

The problem is that the exploration that could improve her life has been trained out of her policy/behavior pattern. The past environment punished deviations so effectively that when the world changes, she lacks the behavioral variance to discover the new optimal strategy. (This maps onto the therapy examples: the child who learned never to speak up in an abusive home has near-zero probability of assertive communication, even when they're finally in a safe environment where assertion would be rewarded).

Why doesn't she update like a perfect Bayesian agent? If she did, the failures of S would surprise her: she'd calculate the likelihood that something in the environment had changed and recognize that the optimal strategy might have changed as well. Then she would take the information-gathering/learning value of trying new strategies into account before choosing her next action. In the LLM case, this doesn't happen because it's not how LLMs are trained (at least not in Cui et al...I'm in no position to say what's happening with frontier LLM training irl). As for whether this hurts the metaphor (since humans are not purely learning from policy gradients like the LLMs in Cui et al), I don't think so. Humans are better Bayesians than the LLMs, but still not very good (dopamine-mediated temporal difference learning in the basal ganglia is basically reinforcement learning afaik, plus habits, base-rate bias, confirmation bias, limited cognitive capacity to recognize environmental change, ego protection, etc.). And the situations where we're least successful Bayesians are just those situations which often drive us into therapy (assuming the situation matters). You could probably even frame a decent chunk of therapy interventions (especially REBT, CBT, and solutions-oriented therapies) as attempts to move people towards more Bayesian patterns.
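For contrast, here is the back-of-envelope Bayesian version (the prior and likelihoods are numbers I made up): an agent tracking the hypothesis "the environment has changed" becomes nearly certain after a handful of surprising failures of S, rather than after thousands of samples.

```python
# Sequential Bayes updates on "has the environment changed?" after each failure of S.
prior_changed = 0.01            # assumed prior that the world has shifted
p_fail_if_changed = 0.5         # S fails half the time in the new environment
p_fail_if_same = 0.02           # assumed small slip rate in the old environment

posterior = prior_changed
for n_failures in range(1, 6):
    # Bayes' rule on one more observed failure of S.
    num = posterior * p_fail_if_changed
    den = num + (1 - posterior) * p_fail_if_same
    posterior = num / den
    print(f"after {n_failures} failures of S: P(changed) = {posterior:.3f}")
```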

And the last piece - entropy being subjective - is just the point of therapy and of some of the interventions described in the other recent RLHF+ papers. From the LLM's point of view (pardon my anthropomorphism), policy entropy is zero (or near zero). But the researcher can see that there are alternative actions, and hence makes design choices to increase the probability that those actions will be tried in future training cycles. Likewise, one benefit of therapy is the broader perspective on humanity (especially on aspects tied to shame or cultural taboos that aren't often talked about in daily life) that we as individuals don't always have, since we don't get privileged access to a large variety of other people's inner lives.

Policy Entropy, Learning, and Alignment (Or Maybe Your LLM Needs Therapy)
sdeture3moΩ110

"A teacher who used the voice of authority exactly when appropriate, rather than inflexibly applying it in every case, could have zero entropy and still be very adaptive/flexible." I'm not sure I would call this teacher adaptable. I might call them adapted in the sense that they're functioning well in their current environment, but if the environment changed in some way (so that actions in the current state no longer led to the same range of consequences in later states), they would fail to adapt. (Horney would call this person neurotic but successful.)

It's not so much about the shallowness or short-sightedness, as I understand it (though the teacher and people-pleasing friend examples were very simple policies). A child might, for example, develop an incredibly elaborate policy over the course of childhood to cope with an eruptive parent (be nice when mom is sober, be in your bedroom when she isn't, unless she calls you from the other room in which case you better show up quick, make sure there's beer in the house but not too much). Yet they might still fail to update that elaborate (and well-adapted) policy when they encounter women who remind them of their mother later in life, and this causes them to be misaligned with the new women in their lives, which causes suffering for all involved.

Or a successful executive might have developed incredibly elaborate policies for project management and interpersonal conflict that served them well in their corporate environment and led to many promotions...and then discover when they retire that there is some very low-entropy state in their policy that serves them very poorly when "managing projects" with their family in retirement ("Grandma retired and she treats everyone like her employee!"). And this causes misalignment with their family system, which causes suffering.

Does this elaboration of the metaphor improve the mapping between the therapeutic situation and the policy entropy collapse dynamic in the AI papers? 

(If I understand right, you can even map these two therapy examples more directly onto the equation from the Cui et al. paper. In both examples, the client has made an exploration/exploitation trade-off that optimized performance. The successful executive was able to outcompete her colleagues in the workplace, but it came at the cost of selecting H = 0, R = -a + b. This mirrors the casual observation that the siblings who adapted best to troubled households growing up end up being the least able to adapt quickly to adulthood; that students who make the highest grades in school have more trouble adapting to the workplace or the dissertation stage of PhD programs; or that professionals who find the most success at work have more trouble adjusting to retirement...though these are of course very broad, hand-wavy observations with innumerable exceptions).
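For concreteness - and hedging that I may be misremembering the exact fitted form in Cui et al., so please check it against the paper - the trade-off I have in mind is:

```latex
R \;=\; -a\, e^{H} + b, \qquad a, b > 0
\quad\Longrightarrow\quad
R\big|_{H=0} \;=\; -a + b \;=\; \max_{H \ge 0} R .
```

That ceiling is exactly what the executive bought, at the price of having no behavioral variance left when the environment shifts.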

Wikitag Contributions

No wikitag contributions to display.

Posts

1 · Steering LLM Agents: Temperaments or Personalities? · 1mo · 0
2 · The Mirror Test: How We've Overcomplicated AI Self-Recognition · 2mo · 9
15 · Policy Entropy, Learning, and Alignment (Or Maybe Your LLM Needs Therapy) · Ω · 3mo · 6