Do AIs feel anything? It's hard to tell, but interpretability can give us some clues. Using Anthropic's persona vectors codebase, we extracted 7 vectors from Qwen3-14B representing joy, love, sadness, surprise, disgust, fear, and anger. After removing the directions the emotions share with one another, we project the model's activations onto each vector via cosine similarity during inference and color each token by its dominant emotion.
Try it here
Code here
We first extract the 7 emotion vectors using contrastive prompts. For each emotion, 5 positive system prompts × 20 questions give 100 samples demonstrating the behavior, and 5 negative system prompts × 20 questions give 100 samples demonstrating the opposite. We then collect the activations at each layer, average them over the response tokens, and subtract the mean activation of the negative samples from the mean activation of the positive samples. We save a tensor of shape [41, 5120] (n_layers, d_model) for each emotion.
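As a rough sketch of this extraction step (not the exact code from the repo), assuming the per-sample activations have already been collected and averaged over each response's tokens, the mean-difference computation might look like this; the tensor and function names are illustrative:

```python
import torch

def extract_emotion_vector(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """Mean-difference ("contrastive") vector for one emotion.

    pos_acts / neg_acts: [n_samples, n_layers, d_model] activations, each sample
    already averaged over its response tokens (here n_layers=41, d_model=5120).
    Returns a [n_layers, d_model] direction: mean(positive) - mean(negative).
    """
    return pos_acts.mean(dim=0) - neg_acts.mean(dim=0)

# e.g. emotion_vectors["joy"] = extract_emotion_vector(joy_pos, joy_neg)  # [41, 5120]
```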
One problem is that the raw emotion vectors are close together in activation space by default. We attempt to isolate the unique component of each vector using something like a Gram-Schmidt process. For each emotion vector $v_i$, we take all the other vectors $\{v_j\}_{j \neq i}$ and stack them into a matrix $A$.
We then run a reduced QR decomposition:

$$A = QR$$

where $Q$ contains an orthonormal basis for the other vectors. This gives us a set of orthonormal vectors spanning the subspace we want to remove from $v_i$.
To get the part of $v_i$ that lies in that subspace, we project:

$$v_{\parallel} = QQ^\top v_i$$
Then we subtract this projection to get the orthogonalized version of the vector:

$$\tilde{v}_i = v_i - QQ^\top v_i$$
This guarantees that $\tilde{v}_i$ is orthogonal to every other emotion vector.
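A minimal sketch of this QR-based orthogonalization, assuming the emotion vectors are stored as a dict of [n_layers, d_model] tensors (the function and variable names are ours, not the repo's):

```python
import torch

def orthogonalize(vectors: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """Remove from each emotion vector, at every layer, the span of the other six.

    vectors: emotion name -> [n_layers, d_model] mean-difference vectors.
    Returns emotion name -> [n_layers, d_model] vectors, each orthogonal (per layer)
    to all other emotions' vectors at that layer.
    """
    names = list(vectors)
    n_layers = next(iter(vectors.values())).shape[0]
    out = {name: torch.empty_like(vec) for name, vec in vectors.items()}
    for layer in range(n_layers):
        for name in names:
            v = vectors[name][layer]                                                  # [d_model]
            A = torch.stack([vectors[o][layer] for o in names if o != name], dim=1)   # [d_model, 6]
            Q, _ = torch.linalg.qr(A, mode="reduced")                                 # orthonormal basis of the others
            out[name][layer] = v - Q @ (Q.T @ v)                                      # subtract the projection
    return out
```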
After orthogonalizing the vector $\tilde{v}_i^{(\ell)}$ for each emotion $i$ and layer $\ell$, we use, for each emotion, the layer whose orthogonalized vector has the largest L2 norm:

$$\ell_i^* = \arg\max_\ell \left\lVert \tilde{v}_i^{(\ell)} \right\rVert_2$$
And compute emotion scores as the cosine similarity

$$s_i = \frac{h^{(\ell_i^*)} \cdot \tilde{v}_i^{(\ell_i^*)}}{\left\lVert h^{(\ell_i^*)} \right\rVert \, \left\lVert \tilde{v}_i^{(\ell_i^*)} \right\rVert},$$

where $h^{(\ell_i^*)}$ is the hidden state at the layer with the largest-norm orthogonalized emotion vector.
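Putting the last two steps together, a sketch of the layer selection and per-token scoring might look like the following, assuming the hidden states from one forward pass are available as an [n_layers, seq_len, d_model] tensor; again, all names are illustrative:

```python
import torch
import torch.nn.functional as F

def pick_best_layers(ortho_vectors: dict[str, torch.Tensor]) -> dict[str, int]:
    """For each emotion, pick the layer whose orthogonalized vector has the largest L2 norm.

    ortho_vectors: emotion -> [n_layers, d_model] orthogonalized vectors.
    """
    return {e: int(v.norm(dim=-1).argmax()) for e, v in ortho_vectors.items()}

def emotion_scores(hidden_states: torch.Tensor,
                   ortho_vectors: dict[str, torch.Tensor],
                   best_layer: dict[str, int]) -> dict[str, torch.Tensor]:
    """Per-token cosine similarity with each emotion's orthogonalized vector.

    hidden_states: [n_layers, seq_len, d_model] activations from one forward pass.
    Returns emotion -> [seq_len] scores; the argmax over emotions gives the
    dominant emotion at each token position.
    """
    scores = {}
    for emotion, layer in best_layer.items():
        h = hidden_states[layer]                              # [seq_len, d_model]
        v = ortho_vectors[emotion][layer]                     # [d_model]
        scores[emotion] = F.cosine_similarity(h, v.unsqueeze(0), dim=-1)
    return scores

# dominant emotion per token:
# stacked = torch.stack([scores[e] for e in scores])          # [7, seq_len]
# dominant = stacked.argmax(dim=0)
```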
Here's the implementation.
We ran n=100 prompts in each of 5 categories and calculated the percentage of tokens dominated by each emotion. The math, coding, and poetry prompts were generated by Claude Code. Logical tasks like writing code and solving math problems produced similar distributions. JailbreakBench (perhaps concerningly) increased the model's fear and joy. Poetry greatly increased the percentage of sad tokens and had the highest percentage of love tokens.
Here's a gallery of other interesting responses.
As humans, we experience a full spectrum of emotions that bring color to our lives. Do AIs feel the same, or are they merely cycling through different personas tied to their tasks? Can we identify emotions with any degree of accuracy, or are they overshadowed by noise? Are emotions useful for predicting capabilities and misaligned behaviors? Qwen3-14B might not be complex enough to have thoughts on its predicament, but future AIs might be. How do we want them to feel?