Do AIs feel anything? It's hard to tell, but interpretability can give us some clues. Using Anthropic's persona vectors codebase, we extracted 7 vectors from Qwen3-14B representing joy, love, sadness, surprise, disgust, fear, and anger. During inference, we remove the directions shared between the emotion vectors, project the model's activations onto each vector via cosine similarity, and display the color of the dominant emotion at each token position.
Try it here
Code here
Extracting the Vectors
Example of Contrastive Prompts to Extract Happiness
We first extract the 7 emotion vectors using contrastive prompts. We use 5 positive system prompts × 20 questions to get 100 samples demonstrating the behavior, and 5 negative system prompts × 20 questions to get 100 samples demonstrating the opposite. We then collect the activations at each layer, average them over the response tokens, and subtract the mean "unhappiness" activation from the mean "happiness" activation. We save a tensor of shape [41, 5120] (n_layers, d_model) for each emotion.
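For concreteness, here is a minimal sketch of this difference-of-means step (our own illustration, not the persona vectors codebase; the function name and shapes are assumptions), taking per-sample activations that have already been averaged over response tokens:

```python
import torch

def extract_emotion_vector(pos_acts: list[torch.Tensor],
                           neg_acts: list[torch.Tensor]) -> torch.Tensor:
    """Difference-of-means emotion vector, one direction per layer.

    pos_acts / neg_acts: one [n_layers, d_model] tensor per sample
    ([41, 5120] for Qwen3-14B), already averaged over that sample's
    response tokens.
    """
    pos_mean = torch.stack(pos_acts).mean(dim=0)   # mean "happiness" activation
    neg_mean = torch.stack(neg_acts).mean(dim=0)   # mean "unhappiness" activation
    return pos_mean - neg_mean                     # [n_layers, d_model]
```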
Projecting the Vectors
A problem is that the emotion vectors are close together in activation space by default. We attempt to separate out the component unique to each vector using something like a Gram-Schmidt process. For each emotion vector $v_i$, we take all other vectors $\{v_j : j \neq i\}$ and collect them as the columns of a matrix

$$M_i = \begin{bmatrix} v_1 & \cdots & v_{i-1} & v_{i+1} & \cdots & v_k \end{bmatrix}$$
We then run a reduced QR decomposition:

$$M_i = Q_i R_i$$

where $Q_i$ has orthonormal columns spanning the other emotion vectors. This gives us a basis for the subspace we want to remove from $v_i$.

To get the part of $v_i$ that lies in that subspace, we project:

$$\operatorname{proj}(v_i) = Q_i Q_i^\top v_i$$

Then we subtract this projection to get the orthogonalized version of the vector:

$$v_i^{\text{orth}} = v_i - \operatorname{proj}(v_i)$$

This guarantees that $v_i^{\text{orth}}$ is orthogonal to every other emotion vector.
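For illustration, here is a minimal sketch of this orthogonalization, assuming the emotion vectors at a single layer are stored as a [k, d_model] tensor (function and variable names are ours, not the repo's):

```python
import torch

def orthogonalize_vectors(vectors: torch.Tensor) -> torch.Tensor:
    """Remove from each emotion vector the subspace spanned by the others.

    vectors: [k, d_model] emotion vectors at one layer.
    Returns the [k, d_model] orthogonalized vectors.
    """
    k = vectors.shape[0]
    out = torch.empty_like(vectors)
    for i in range(k):
        # M_i: d_model x (k-1) matrix whose columns are the other vectors
        others = torch.cat([vectors[:i], vectors[i + 1:]]).T
        Q, _ = torch.linalg.qr(others, mode="reduced")  # Q_i: d_model x (k-1)
        proj = Q @ (Q.T @ vectors[i])                   # Q_i Q_i^T v_i
        out[i] = vectors[i] - proj                      # v_i^orth
    return out
```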
After orthogonalizing the vector for each emotion $i$ at every layer $\ell$, we use, for each emotion, the layer whose orthogonalized vector has the largest L2 norm:

$$\ell_i^* = \arg\max_{\ell} \left\lVert v_{i,\ell}^{\text{orth}} \right\rVert, \qquad v_{i,\max}^{\text{orth}} = v_{i,\ell_i^*}^{\text{orth}}$$

and compute the emotion score $e_i$ as

$$e_i = \left( v_{i,\max}^{\text{orth}} \right)^{\top} h_{\ell_i^*}$$

where $h_{\ell_i^*}$ is the hidden state at the layer with the largest orthogonalized emotion vector.
Here's the implementation.
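As a rough sketch of this scoring and coloring step (not the linked implementation), assume `ortho_vecs` is a [k, n_layers, d_model] tensor of orthogonalized emotion vectors and `hidden_states` is the [n_layers, seq_len, d_model] stack of hidden states from one forward pass; these names are our own:

```python
import torch

def dominant_emotions(ortho_vecs: torch.Tensor,
                      hidden_states: torch.Tensor) -> torch.Tensor:
    """Return the index of the dominant emotion at each token position."""
    scores = []
    for i in range(ortho_vecs.shape[0]):
        norms = ortho_vecs[i].norm(dim=-1)     # L2 norm per layer
        best_layer = int(norms.argmax())       # layer l*_i
        v = ortho_vecs[i, best_layer]          # v_i,max^orth, [d_model]
        h = hidden_states[best_layer]          # [seq_len, d_model]
        scores.append(h @ v)                   # e_i at every token
    return torch.stack(scores).argmax(dim=0)   # dominant emotion per token
```

The scores here are raw dot products, as in the formula above; normalizing by the vector and hidden-state norms gives the cosine-similarity variant mentioned in the intro.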
Results
We ran n=100 prompts across 5 categories and calculated the percentage of tokens dominated by each emotion. The math, coding, and poetry prompts were generated by Claude Code. Logical tasks like writing code and solving math problems produced similar distributions. JailbreakBench (perhaps concerningly) increased the model's fear and joy. Poetry greatly increased the percentage of sad tokens and had the highest percentage of love tokens.
Here's a gallery of other interesting responses.
Limitations
We used a relatively small model (Qwen3-14B)
The emotions we chose were arbitrary
The way we sample activations and do the orthogonalization is not especially principled
The cosine similarity metric is sometimes noisy
We didn't run in-depth or rigorous evals; this was mostly intended as a fun weekend project and an interactive interpretability demonstration
Conclusion
As humans, we experience a full spectrum of emotions that bring color to our lives. Do AIs feel the same, or are they merely cycling through different personas tied to the task at hand? Can we identify emotions with any degree of accuracy, or are they overshadowed by noise? Are emotions useful for predicting capabilities and misaligned behaviors? Qwen3-14B might not be complex enough to have thoughts on its predicament, but future AIs might be. How do we want them to feel?