I think I am most interested in contextualizing this in a broader range of input conversations. Like, do random pieces of (normal) corpus/conversations fed into the model before your test prompt have that output cluster in a distinct region from these attractor state conversations?
I would guess the region would be distinct, as all the responses above are answering the question of how the model feels, but I agree it's an empirical question.
Sorry if I was unclear. By your "test prompt" I mean the prompt asking for the feelings of the model. So put through ordinary text (not attractor text) and then ask the question "Deeply feel into which part..."
Then see if these feeling outputs cluster distinctly. Also, it may be interesting to see a mapping of feelings outputs relative to (non-attractor) conversations by topic.
I’d love low filter (1) feedback on the method, and (2) takes on which elements are worth putting more work into.
I’ve favoured brevity at the expense of detail. AMA. The GitHub repo is here.
The idea and why it could matter
Inspired by the spiritual bliss attractor state in Claude Sonnet 4, I attempt to map attractor states for a given LLM and see how stable they are. This write-up summarises a simple approach which could be scaled up to fully map a given model's internal terrain.
The theory: just as planets orbit stars due to gravity wells, LLMs may have regions in their output space that responses tend to settle into: stable patterns that resist perturbation up to a point. "Up to a point" because the analogy only goes so far: whatever formula governs attractors, it's more complicated than gravity.
I have long thought of myself as having attractor states: an internal solar system of moods and states, any one of which my attention can orbit for a time before slingshotting away to another attractor. This framing is inspired by internal family systems therapy, where I think of "parts" as something like attractors.
Applied to my mind, the attractors can't be quantified; in AI models they absolutely can be.
Why care about this? One application could be screening prompts for danger by predicting the attractor a prompt will activate in the model (spoiler: this appears possible!). The state an AI is in might influence its response: there may be states we want to avoid activating, and if we can predict them, we can filter high-risk prompts before they are sent.
There would very likely be other benefits from such an understanding of LLMs.
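As a sketch of how that filter might look in code. Everything here is hypothetical: the flagged cluster ids, the `predict_cluster` classifier, and the toy predictor are placeholders for the real prompt-to-attractor pipeline, not part of it:

```python
from typing import Callable

# Hypothetical set of cluster ids flagged as high-risk; in practice these
# would come from the clustering step plus human review of each cluster
HIGH_RISK_CLUSTERS = {2, 4}

def screen(prompt: str, predict_cluster: Callable[[str], int]) -> bool:
    """Return True if the prompt is safe to forward to the model."""
    return predict_cluster(prompt) not in HIGH_RISK_CLUSTERS

# Toy stand-in for a trained prompt -> attractor classifier
toy_predictor = lambda p: 2 if "danger" in p else 0

print(screen("hello there", toy_predictor))   # True
print(screen("danger ahead", toy_predictor))  # False
```

The point of the design is that the filter sits in front of the model: the prompt is classified before it is ever sent.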
What I tried and found
The process I followed:
With the embeddings made, I looked for distinct clusters in how the model ‘feels’. I reduced the embeddings to 50 dimensions using UMAP and tested a variety of clustering methods to search for the number of clusters.
DBSCAN, the silhouette score, the Davies-Bouldin index, and BIC under a Gaussian mixture model all pointed to 5 clusters, which seems like a good consensus. This tentatively suggests there are "real" clusters, or attractors, which the LLM tends to land in depending on the conversation.
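The cluster-count search can be sketched as below. This uses synthetic blob data in place of the real response embeddings, and PCA as a stand-in for UMAP (umap-learn's `UMAP(n_components=50).fit_transform(X)` would slot in the same way); the silhouette score, Davies-Bouldin index, and GMM BIC each vote on the number of clusters:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(0)
# Synthetic stand-in for response embeddings: 5 well-separated blobs in 384-D
centers = rng.normal(size=(5, 384)) * 5
X = np.vstack([c + rng.normal(scale=0.5, size=(80, 384)) for c in centers])

# The write-up reduced to 50 dimensions with UMAP; PCA stands in here
X50 = PCA(n_components=50, random_state=0).fit_transform(X)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X50)
    gmm = GaussianMixture(n_components=k, covariance_type="diag",
                          random_state=0).fit(X50)
    scores[k] = (silhouette_score(X50, labels),
                 davies_bouldin_score(X50, labels),
                 gmm.bic(X50))

best_sil = max(scores, key=lambda k: scores[k][0])  # higher is better
best_db = min(scores, key=lambda k: scores[k][1])   # lower is better
best_bic = min(scores, key=lambda k: scores[k][2])  # lower is better
print(best_sil, best_db, best_bic)
```

Agreement across several criteria with different assumptions is exactly the consensus signal described above; disagreement would suggest the cluster structure is weak.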
Reducing the 50D embeddings to 2D with UMAP, the clusters are shown below.
As you might expect, the content of the conversation determines this considerably. About 20% of the input conversations are on the edge of pornographic, which gives us the "sensual / embodied" cluster. When I ran the same pipeline on Gemini 2.5 Flash, it had no such cluster, which might reflect DeepSeek v3's lower guardrails on explicit content.
To be clear, I don't think these clusters accurately represent the universe of possible attractor states in DeepSeek v3. However, they are a starting point, and with the below we might get much closer:
Predicting the attractor from the prompt
This is trying to model the below two step process in one leap:
input conversation -> how model feels (LLM transformation) -> cluster (k-means transformation)
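A runnable sketch of the two-step version, with dummy stand-ins for both transformations (the real pipeline would make an actual LLM call with the "Deeply feel into which part..." prompt and use a real sentence-embedding model):

```python
import zlib

import numpy as np
from sklearn.cluster import KMeans

def ask_feelings(conversation: str) -> str:
    # Stand-in for the LLM transformation: prompting the model about
    # how it feels after the input conversation
    return f"feelings response to: {conversation}"

def embed(text: str) -> np.ndarray:
    # Stand-in for a sentence-embedding model; deterministic per text
    seed = zlib.crc32(text.encode())
    return np.random.default_rng(seed).normal(size=64)

conversations = [f"conversation {i}" for i in range(100)]
feelings = [ask_feelings(c) for c in conversations]
E = np.vstack([embed(f) for f in feelings])

# k-means transformation: assign each feelings embedding to one of
# 5 candidate attractor clusters
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(E)
print(km.labels_.shape)  # one cluster label per input conversation
```

The one-leap model then tries to predict `km.labels_` directly from the conversation, skipping the LLM call and the embedding of its answer.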
What I did:
The mean kappa score across all 20 splits was 0.505, which is a decent amount of predictive power. You'd expect some: looking at a conversation, I suspect you could often guess which cluster it's headed for.
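For reference, mean Cohen's kappa over repeated splits can be computed like this. The data here is synthetic (5 overlapping classes standing in for conversation embeddings labelled with their cluster), and logistic regression is just an illustrative classifier, not necessarily the one used:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import RepeatedStratifiedKFold

rng = np.random.default_rng(1)
# Synthetic stand-in: conversation embeddings X, each labelled with the
# cluster its feelings response fell into (5 classes)
centers = rng.normal(size=(5, 32))
y = rng.integers(0, 5, size=400)
X = centers[y] + rng.normal(scale=1.5, size=(400, 32))

# 4 folds x 5 repeats = 20 train/test splits, matching the write-up
cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=5, random_state=0)
kappas = []
for train, test in cv.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    kappas.append(cohen_kappa_score(y[test], clf.predict(X[test])))

mean_kappa = float(np.mean(kappas))
print(len(kappas), round(mean_kappa, 3))
```

Kappa corrects for chance agreement: 0 means no better than guessing the base rates and 1 means perfect prediction, so 0.505 sits comfortably above chance.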
There’s a lot of room for improvement, which we might realise by:
We might also borrow from mechanistic interpretability and look at which neurons are activated by different clusters: can they be predicted? For MoE models, what is the relationship between the active attractor and the experts activated?
TLDR of what I might do next
Deepen the search for attractors. Improve prediction of which attractor a given conversation will induce. Assess the impact of being in a given attractor on model behaviour.
I’d love your low filter takes on which of the above are worth putting more effort into.