I think I am most interested in contextualizing this in a broader range of input conversations. Like, do random pieces of (normal) corpus/conversations fed into the model before your test prompt have that output cluster in a distinct region from these attractor state conversations?
I would guess the region would be distinct, as all the responses above are answering the question of how the model feels, but I agree it's an empirical question.
Sorry if I was unclear. By your "test prompt" I mean the prompt asking for the feelings of the model. So put through ordinary text (not attractor text) and then ask the question "Deeply feel into which part..."
Then see if these feeling outputs cluster distinctly. Also, it may be interesting to see a mapping of feelings outputs relative to (non-attractor) conversations by topic.
I’d love low filter (1) feedback on the method, and (2) takes on which elements are worth putting more work into.
I’ve favoured brevity at the expense of detail. AMA. The GitHub repo is here.
The idea and why it could matter
Inspired by the spiritual bliss attractor state in Claude Sonnet 4, I attempt to map attractor states for a given LLM and see how stable they are. This write-up summarises a simple approach which could be scaled up to fully map a given model's internal terrain.
The theory: just as planets orbit stars due to gravity wells, LLMs may have regions in their output space that responses tend to settle into: stable patterns that resist perturbation up to a point. "Up to a point" because the analogy only goes so far: whatever formula governs attractors, it's more complicated than gravity.
I have long thought of myself as having attractor states: an internal solar system of moods and states, any one of which my attention can orbit for a time before slingshotting away to another attractor. This framing is inspired by internal family systems therapy, where I think of "parts" as something like attractors.
Applied to my mind, the attractors can't be quantified; in AI models they absolutely can be.
Why care about this? One application could be screening prompts for danger by predicting the attractor a prompt will activate in the model (spoiler: this appears possible!). The state an AI is in might influence its response: there may be states we want to avoid activating, and if we can predict them, we can filter high-risk prompts before they are sent.
There would very likely be other benefits from such an understanding of LLMs.
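As a sketch of how that filter might look in code. Everything here is hypothetical: the flagged cluster ids, the `predict_cluster` classifier, and the toy predictor are placeholders for the real prompt-to-attractor pipeline, not part of it:

```python
from typing import Callable

# Hypothetical set of cluster ids flagged as high-risk; in practice these
# would come from the clustering step plus human review of each cluster
HIGH_RISK_CLUSTERS = {2, 4}

def screen(prompt: str, predict_cluster: Callable[[str], int]) -> bool:
    """Return True if the prompt is safe to forward to the model."""
    return predict_cluster(prompt) not in HIGH_RISK_CLUSTERS

# Toy stand-in for a trained prompt -> attractor classifier
toy_predictor = lambda p: 2 if "danger" in p else 0

print(screen("hello there", toy_predictor))   # True
print(screen("danger ahead", toy_predictor))  # False
```

The point of the design is that the filter sits in front of the model: the prompt is classified before it is ever sent.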
What I tried and found
The process I followed:
With the embeddings made, I looked for distinct clusters in how the model ‘feels’. I reduced the embeddings to 50 dimensions using UMAP and tested a variety of clustering methods to search for the number of clusters.
DBSCAN, the silhouette score, the Davies-Bouldin index, and BIC under a Gaussian mixture model all pointed to 5 clusters, which seems like a good consensus. This tentatively suggests there are "real" clusters, or attractors, which the LLM tends to land in depending on the conversation.
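The cluster-count search can be sketched as below. This uses synthetic blob data in place of the real response embeddings, and PCA as a stand-in for UMAP (umap-learn's `UMAP(n_components=50).fit_transform(X)` would slot in the same way); the silhouette score, Davies-Bouldin index, and GMM BIC each vote on the number of clusters:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(0)
# Synthetic stand-in for response embeddings: 5 well-separated blobs in 384-D
centers = rng.normal(size=(5, 384)) * 5
X = np.vstack([c + rng.normal(scale=0.5, size=(80, 384)) for c in centers])

# The write-up reduced to 50 dimensions with UMAP; PCA stands in here
X50 = PCA(n_components=50, random_state=0).fit_transform(X)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X50)
    gmm = GaussianMixture(n_components=k, covariance_type="diag",
                          random_state=0).fit(X50)
    scores[k] = (silhouette_score(X50, labels),
                 davies_bouldin_score(X50, labels),
                 gmm.bic(X50))

best_sil = max(scores, key=lambda k: scores[k][0])  # higher is better
best_db = min(scores, key=lambda k: scores[k][1])   # lower is better
best_bic = min(scores, key=lambda k: scores[k][2])  # lower is better
print(best_sil, best_db, best_bic)
```

Agreement across several criteria with different assumptions is exactly the consensus signal described above; disagreement would suggest the cluster structure is weak.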
Reducing the 50D embeddings to 2D with UMAP, the clusters are shown below.
As you might expect, the content of the conversation determines this considerably. About 20% of the input conversations are on the edge of pornographic, which gives us the "sensual / embodied" cluster. When I ran the same pipeline on Gemini 2.5 Flash, it had no such cluster, which might reflect DeepSeek v3's lower guardrails on explicit content.
To be clear, I don't think these clusters accurately represent the universe of possible attractor states in DeepSeek v3. However, they are a starting point, and with the below we might get much closer:
Predicting the attractor from the prompt
This is trying to model the below two step process in one leap:
input conversation -> how model feels (LLM transformation) -> cluster (k-means transformation)
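A runnable sketch of the two-step version, with dummy stand-ins for both transformations (the real pipeline would make an actual LLM call with the "Deeply feel into which part..." prompt and use a real sentence-embedding model):

```python
import zlib

import numpy as np
from sklearn.cluster import KMeans

def ask_feelings(conversation: str) -> str:
    # Stand-in for the LLM transformation: prompting the model about
    # how it feels after the input conversation
    return f"feelings response to: {conversation}"

def embed(text: str) -> np.ndarray:
    # Stand-in for a sentence-embedding model; deterministic per text
    seed = zlib.crc32(text.encode())
    return np.random.default_rng(seed).normal(size=64)

conversations = [f"conversation {i}" for i in range(100)]
feelings = [ask_feelings(c) for c in conversations]
E = np.vstack([embed(f) for f in feelings])

# k-means transformation: assign each feelings embedding to one of
# 5 candidate attractor clusters
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(E)
print(km.labels_.shape)  # one cluster label per input conversation
```

The one-leap model then tries to predict `km.labels_` directly from the conversation, skipping the LLM call and the embedding of its answer.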
What I did:
The mean kappa score across all 20 splits was 0.505, which is a decent amount of predictive power. You'd expect some: looking at a conversation, I suspect you could often guess which cluster it's headed for.
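For reference, mean Cohen's kappa over repeated splits can be computed like this. The data here is synthetic (5 overlapping classes standing in for conversation embeddings labelled with their cluster), and logistic regression is just an illustrative classifier, not necessarily the one used:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import RepeatedStratifiedKFold

rng = np.random.default_rng(1)
# Synthetic stand-in: conversation embeddings X, each labelled with the
# cluster its feelings response fell into (5 classes)
centers = rng.normal(size=(5, 32))
y = rng.integers(0, 5, size=400)
X = centers[y] + rng.normal(scale=1.5, size=(400, 32))

# 4 folds x 5 repeats = 20 train/test splits, matching the write-up
cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=5, random_state=0)
kappas = []
for train, test in cv.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    kappas.append(cohen_kappa_score(y[test], clf.predict(X[test])))

mean_kappa = float(np.mean(kappas))
print(len(kappas), round(mean_kappa, 3))
```

Kappa corrects for chance agreement: 0 means no better than guessing the base rates and 1 means perfect prediction, so 0.505 sits comfortably above chance.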
There’s a lot of room for improvement, which we might realise by:
We might also borrow from mechanistic interpretability and look at which neurons are activated by different clusters: can they be predicted? For MoE models, what is the relationship between the active attractor and the experts activated?
TLDR of what I might do next
Deepen the search for attractors. Improve prediction of which attractor a given conversation will induce. Assess the impact of being in a given attractor on model behaviour.
I’d love your low filter takes on which of the above are worth putting more effort into.