Do AIs feel anything? It's hard to tell, but interpretability can give us some clues. Using Anthropic's persona vectors codebase, we extracted 7 vectors from Qwen3-14B representing joy, love, sadness, surprise, disgust, fear, and anger. During inference, we remove the directions shared between the emotion vectors, project the model's activations onto each vector via cosine similarity, and display the color of the dominant emotion at each token position.
Try it here
Code here
Extracting the Vectors
Example of Contrastive Prompts to Extract Happiness
We first extract the 7 emotion vectors using contrastive prompts. We use 5 positive system prompts × 20 questions to get 100 samples demonstrating the behavior, and 5 negative system prompts × 20 questions to get 100 samples demonstrating the opposite. We then collect the activations at each layer, average them over the response tokens, and subtract the mean "unhappiness" activation from the mean "happiness" activation. We save a tensor of shape [41, 5120] (n_layers, d_model) for each emotion.
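For concreteness, here is a minimal sketch of this difference-of-means step (our own illustration, not the persona vectors codebase; the function name and shapes are assumptions), taking per-sample activations that have already been averaged over response tokens:

```python
import torch

def extract_emotion_vector(pos_acts: list[torch.Tensor],
                           neg_acts: list[torch.Tensor]) -> torch.Tensor:
    """Difference-of-means emotion vector, one direction per layer.

    pos_acts / neg_acts: one [n_layers, d_model] tensor per sample
    ([41, 5120] for Qwen3-14B), already averaged over that sample's
    response tokens.
    """
    pos_mean = torch.stack(pos_acts).mean(dim=0)   # mean "happiness" activation
    neg_mean = torch.stack(neg_acts).mean(dim=0)   # mean "unhappiness" activation
    return pos_mean - neg_mean                     # [n_layers, d_model]
```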
Projecting the Vectors
A problem is that the emotion vectors are close together in activation space by default. We attempt to separate out the component unique to each vector using something like a Gram-Schmidt process. For each emotion vector $v_i$, we take all other vectors $\{v_j : j \neq i\}$ and collect them as the columns of a matrix

$$M_i = \begin{bmatrix} v_1 & \cdots & v_{i-1} & v_{i+1} & \cdots & v_k \end{bmatrix}$$
We then run a reduced QR decomposition:

$$M_i = Q_i R_i$$

where $Q_i$ has orthonormal columns spanning the other emotion vectors. This gives us a basis for the subspace we want to remove from $v_i$.

To get the part of $v_i$ that lies in that subspace, we project:

$$\operatorname{proj}(v_i) = Q_i Q_i^\top v_i$$

Then we subtract this projection to get the orthogonalized version of the vector:

$$v_i^{\text{orth}} = v_i - \operatorname{proj}(v_i)$$

This guarantees that $v_i^{\text{orth}}$ is orthogonal to every other emotion vector.
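For illustration, here is a minimal sketch of this orthogonalization, assuming the emotion vectors at a single layer are stored as a [k, d_model] tensor (function and variable names are ours, not the repo's):

```python
import torch

def orthogonalize_vectors(vectors: torch.Tensor) -> torch.Tensor:
    """Remove from each emotion vector the subspace spanned by the others.

    vectors: [k, d_model] emotion vectors at one layer.
    Returns the [k, d_model] orthogonalized vectors.
    """
    k = vectors.shape[0]
    out = torch.empty_like(vectors)
    for i in range(k):
        # M_i: d_model x (k-1) matrix whose columns are the other vectors
        others = torch.cat([vectors[:i], vectors[i + 1:]]).T
        Q, _ = torch.linalg.qr(others, mode="reduced")  # Q_i: d_model x (k-1)
        proj = Q @ (Q.T @ vectors[i])                   # Q_i Q_i^T v_i
        out[i] = vectors[i] - proj                      # v_i^orth
    return out
```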
After orthogonalizing the vector for each emotion $i$ at every layer $\ell$, we use, for each emotion, the layer whose orthogonalized vector has the largest L2 norm:

$$\ell_i^* = \arg\max_{\ell} \left\lVert v_{i,\ell}^{\text{orth}} \right\rVert, \qquad v_{i,\max}^{\text{orth}} = v_{i,\ell_i^*}^{\text{orth}}$$

and compute the emotion score $e_i$ as

$$e_i = \left( v_{i,\max}^{\text{orth}} \right)^{\top} h_{\ell_i^*}$$

where $h_{\ell_i^*}$ is the hidden state at the layer with the largest orthogonalized emotion vector.
Here's the implementation.
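As a rough sketch of this scoring and coloring step (not the linked implementation), assume `ortho_vecs` is a [k, n_layers, d_model] tensor of orthogonalized emotion vectors and `hidden_states` is the [n_layers, seq_len, d_model] stack of hidden states from one forward pass; these names are our own:

```python
import torch

def dominant_emotions(ortho_vecs: torch.Tensor,
                      hidden_states: torch.Tensor) -> torch.Tensor:
    """Return the index of the dominant emotion at each token position."""
    scores = []
    for i in range(ortho_vecs.shape[0]):
        norms = ortho_vecs[i].norm(dim=-1)     # L2 norm per layer
        best_layer = int(norms.argmax())       # layer l*_i
        v = ortho_vecs[i, best_layer]          # v_i,max^orth, [d_model]
        h = hidden_states[best_layer]          # [seq_len, d_model]
        scores.append(h @ v)                   # e_i at every token
    return torch.stack(scores).argmax(dim=0)   # dominant emotion per token
```

The scores here are raw dot products, as in the formula above; normalizing by the vector and hidden-state norms gives the cosine-similarity variant mentioned in the intro.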
Results
We ran n=100 prompts across 5 categories and calculated the percentage of tokens dominated by each emotion. The math, coding, and poetry prompts were generated by Claude Code. Logical tasks like writing code and solving math problems produced similar distributions. JailbreakBench (perhaps concerningly) increased the model's fear and joy. Poetry greatly increased the percentage of sad tokens and had the highest percentage of love tokens.
Here's a gallery of other interesting responses.
Limitations
We used a relatively small model (Qwen3-14B)
The emotions we chose were arbitrary
The way we sample activations and do the orthogonalization is not especially principled
The cosine similarity metric is sometimes noisy
We didn't run in-depth or rigorous evals; this was mostly intended as a fun weekend project and an interactive interpretability demonstration
Conclusion
As humans, we experience a full spectrum of emotions that bring color to our lives. Do AIs feel the same, or are they merely cycling through different personas tied to the task at hand? Can we identify emotions with any degree of accuracy, or are they overshadowed by noise? Are emotions useful for predicting capabilities and misaligned behaviors? Qwen3-14B might not be complex enough to have thoughts on its predicament, but future AIs might be. How do we want them to feel?