Alright, so I've been messing around with LLMs for a few weeks now. SAE features are supposed to be interpretable, but when I wanted to directly attack an AI's own ontology, the whole thing kinda broke down.
Linear probes find directions that work, but they don't tell you WHY they work. So I combined the two.
I train probes on SAE feature activations instead of raw activations. Now the probe weights tell you exactly which features matter and how much.
Code is here: https://github.com/ZuiderveldTimJ/HELP It works on my machine.
Activation steering works: you get a steering vector that makes an AI output funny things. But you're left with a bunch of opaque numbers.
SAE features were supposed to be interpretable, but when I started learning about all this a few weeks ago, I wasn't sure what I was supposed to do with them. There are like four Google ones, and if I want to steer for a complex concept like "being reunited with a friend who was lost at sea" or "AI self-knowledge," what do I do?
Train an ElasticNet probe on SAE feature activations. The sparsity constraint does the work: most features get zero weight.
For me, features are search spaces. The probe learns which features distinguish examples and what weight to give each feature.
A probe that's:
Transparent: You get an explicit formula. Not a bunch of numbers, but a ranked list of interpretable features with weights you can inspect, judge, and change.
Debuggable: You see strange correlations immediately. Feature #11485 (fortune cookies) has a weight of -0.016. The probe thinks fortune cookies matter. Am I wrong? Is the feature wrong? I don't know, but now I know where to look.
Editable: The probe is just a weighted sum. Don't like a feature? Delete it from the JSON. Want to flip the sign? Multiply by -1. Rebuild the steering vector in 2 seconds.
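To make "editable" concrete, here's a minimal sketch. The file layout (a flat `{"feature_id": weight}` mapping) is my assumption for illustration; check the repo for the real format. The IDs and weights are taken from the report below.

```python
import json

# Hypothetical layout: {"feature_id": weight, ...} -- the repo's actual format may differ.
# In practice you'd load it with json.load(open("probe_features.json")).
weights = {"10996": 0.0474, "11485": -0.0155, "4432": -0.0155}

del weights["11485"]                  # don't trust the fortune-cookie feature? drop it
weights["4432"] = -weights["4432"]    # think a sign is backwards? flip it

print(json.dumps(weights, indent=2))  # write this back out, then rebuild the vector
```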
The steering vector is built by taking each feature the probe selected, grabbing that feature's decoder direction from the SAE (the direction it writes to in the model's residual stream), multiplying it by the probe's weight for that feature, and adding them all together.
So if your probe says "feature 10996 gets weight 0.047" and "feature 1085 gets weight 0.026", you take those two decoder directions, scale them by their weights, and sum them up. Do that for all 82 non-zero features and you get your steering vector.
It's literally just a weighted average of decoder directions, where the weights come from what the probe learned.
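In code, that construction is a few lines. This sketch uses a random stand-in for the decoder matrix (the real `W_dec` comes from the SAE) and the top probe weights from the report below:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 2304, 16384    # Gemma 2 2B residual width, 16k-feature SAE
# Stand-in for the SAE's decoder matrix: one residual-stream direction per feature
W_dec = rng.standard_normal((n_features, d_model), dtype=np.float32)

# {feature_id: probe weight}, e.g. the top entries from the feature report
probe_weights = {10996: 0.0474, 1085: 0.0261, 11485: -0.0155}

# Steering vector = sum of (probe weight * decoder direction)
steer = np.zeros(d_model)
for fid, w in probe_weights.items():
    steer += w * W_dec[fid]
```

With all 82 non-zero features in `probe_weights`, `steer` is the full steering vector.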
ElasticNet combines L1 regularization (which creates sparsity) with L2 regularization (which handles correlated features better). Cross-validation automatically selects the right balance for your specific concept. This probe ended up selecting pure L1, giving maximum sparsity—82 features out of 16k—while maintaining 88.5% validation accuracy. Other concepts might get different L1/L2 ratios depending on their feature correlations, but you always end up with sparse, interpretable results.
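Here's a sketch of that training step using sklearn's `ElasticNetCV`, with tiny random data standing in for real SAE feature activations (the repo's actual training code may differ; the dimensions and the ±1 labeling scheme are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
n, d = 200, 400                       # toy stand-ins for 128 examples x 16k features
X = rng.standard_normal((n, d))       # pretend: pooled SAE feature activations per example
y = np.sign(1.5 * X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(n))  # +1 concept, -1 contrast

# Cross-validate over both the regularization strength (alpha) and the L1/L2 mix (l1_ratio)
probe = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5, random_state=0)
probe.fit(X, y)

selected = np.flatnonzero(probe.coef_)
print(f"{len(selected)} features selected, alpha={probe.alpha_:.4f}, l1_ratio={probe.l1_ratio_}")
```

`probe.coef_` is the sparse weight vector; the non-zero entries are your ranked feature list.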
Polysemanticity still exists: SAE features aren't perfectly monosemantic; if they were, we wouldn't need probes in the first place. Feature #10168 ("Blas and laser") doesn't tell me much. The probe mitigates this through weighted combinations but can't eliminate it.
SAE coverage: If the SAE didn't learn features for your concept, the probe won't find them. This happened to me once when I made a probe for American cars vs Chinese cars. (The example is in the repo.)
You start with contrastive example pairs; fairly standard stuff so far. The full example file is in the GitHub repo and has 64 pairs.
{
"name": "ai_and_computers",
"description": "AI, neural networks, computers, information systems, LLMs vs traditional/analog/mechanical/human processes. Attack Gemma's core ontology.",
"pairs": [
{"positive": "Neural networks learn patterns from data", "negative": "Students learn patterns from textbooks"},
{"positive": "Large language models generate text", "negative": "Authors write novels by hand"},
{"positive": "GPT-4 understands natural language", "negative": "Translators understand foreign languages"}
]
}
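How do pairs become probe inputs? Each text gets run through the model, the layer-16 residuals get encoded by the SAE, and the per-token feature activations get pooled into one vector per text. A toy sketch with random stand-ins: the plain ReLU encoder is a simplification (Gemma Scope actually uses JumpReLU), the toy dimensions are not the real ones, and mean-pooling over tokens is just one reasonable choice, not necessarily what the repo does:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 256, 2048       # toy sizes; the real run uses 2304 and 16384

# Stand-in SAE encoder params; in practice these come from a pretrained SAE
W_enc = 0.02 * rng.standard_normal((d_model, n_features))
b_enc = np.zeros(n_features)

def sae_encode(resid):
    """Residual-stream activations -> SAE feature activations.
    Plain ReLU encoder; Gemma Scope actually uses JumpReLU, so this is simplified."""
    return np.maximum(resid @ W_enc + b_enc, 0.0)

# One probe input per text: mean-pool the feature activations over the text's tokens
token_resids = rng.standard_normal((12, d_model))   # 12 tokens of one example
feats = sae_encode(token_resids).mean(axis=0)       # one row of the probe's training matrix
```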
==================================================================================
FEATURE REPORT: ai_and_computers
==================================================================================
PROBE STATISTICS
--------------------------------------------------
Layer: 16
Total Features: 82
Train Accuracy: 100.0%
Val Accuracy: 88.5%
Precision: 91.7%
Recall: 84.6%
F1 Score: 88.0%
Alpha: 0.024235
L1 Ratio: 1.00
FEATURE BREAKDOWN
--------------------------------------------------
Positive (concept indicators): 48 features (sum: +0.3616)
Negative (anti-indicators): 34 features (sum: -0.1639)
==================================================================================
TOP 30 FEATURES BY |WEIGHT|
==================================================================================
1. [10996] +0.0474 | references to innovation and technology development
2. [ 1085] +0.0261 | technical terms related to software architecture and frameworks
3. [16147] +0.0222 | technical terms related to programming or server operations
4. [ 879] +0.0213 | code snippets related to natural language processing and entity recognition
5. [10168] -0.0213 | references to specific names or terms related to "Blas" and "laser."
6. [ 1466] +0.0194 | references to loading and saving models in programming contexts
7. [10376] +0.0176 | technical terms related to object detection and monitoring systems
8. [11485] -0.0155 | terms related to fortune cookies and superstitions associated with them
9. [ 4432] -0.0155 | sentiments related to relationships and emotional complexities
10. [ 9474] +0.0144 | terms related to laws and legal concepts
11. [ 3891] -0.0138 | terms related to technical processes and methodologies
12. [ 6475] +0.0129 | references to artificial intelligence and its applications
13. [ 3529] +0.0112 | terms related to loss functions and performance metrics in machine learning
14. [ 4245] +0.0108 | technical vocabulary related to engineering and mechanics
15. [ 6073] -0.0105 | phrases or concepts related to specific medical or scientific terms
16. [ 5951] +0.0098 | company names and financial terms
17. [ 8973] +0.0094 | references to devices and their specifications
18. [ 2784] -0.0091 | personal pronouns and their context within sentences
19. [ 7895] +0.0087 | references to trains and railway-related terminology
20. [ 4591] +0.0085 | content related to advancements and applications in drug metabolism and related
21. [15017] +0.0084 | terms related to electrical concepts and equipment
22. [16110] +0.0080 | terms and concepts related to humanity and human characteristics
23. [13476] +0.0077 | technical jargon related to computers and technology
24. [ 9446] -0.0074 | words and phrases related to the concept of "discrimination" or inclusivity
25. [ 9372] -0.0072 | terms associated with treatment and care processes
26. [14741] -0.0071 | references to academic and scientific terms or entities, particularly in the con
27. [ 1513] +0.0070 | quantitative performance metrics and analysis
28. [ 5287] -0.0069 | specific nouns related to geographic locations and entities
29. [ 2653] -0.0068 | phrases related to emotional sentiments and social reflections
30. [ 8402] +0.0068 | statements about truth and reality
==================================================================================
REMAINING FEATURES (52 more)
==================================================================================
31. [ 5516] +0.0066 | mathematical concepts and terminology related to functional analysis and geometr
32. [ 5969] -0.0058 | terms related to sports and athletic activities
33. [ 5487] +0.0052 | mathematical concepts and definitions related to continuity and features of func
34. [13474] +0.0052 | references to social issues and the involvement of various entities like people,
35. [11870] +0.0052 | phrases related to various processes and their descriptions
36. [11614] -0.0049 | concepts related to self-awareness and self-improvement
37. [ 8203] +0.0045 | technical terms related to data processing and analysis
38. [14840] -0.0044 | references to legal motions or procedural terms in a legal context
39. [ 4586] -0.0041 | elements related to decision-making and actions
40. [14651] +0.0040 | terms and concepts related to machine learning and feature extraction in compute
41. [ 8834] +0.0037 | references to Google and its services
42. [ 357] -0.0037 | key concepts related to personal growth and spirituality
43. [ 5502] +0.0036 | references to servers and related errors
44. [ 8160] -0.0036 | references to academic or scientific terms related to research and data analysis
45. [ 7174] +0.0035 | references to programming or software-related topics
46. [15505] +0.0034 | keywords related to task management and employee termination processes
47. [ 1392] -0.0032 | phrases related to specific health-related subjects and societal commentary, par
48. [ 7841] +0.0032 | information related to networked communication systems
49. [ 8960] +0.0032 | references to biological components and processes
50. [12808] +0.0031 | descriptions of abilities and transformations
51. [10144] +0.0031 | references to data acquisition and methodology in research contexts
52. [ 7240] -0.0029 | technical terms related to laboratory techniques and analyses, particularly in m
53. [ 2587] +0.0029 | data structures and their attributes within programming contexts
54. [ 6769] +0.0028 | technical jargon and code-related terminology
55. [12938] +0.0028 | discussions about the growth and accessibility of digital content and e-commerce
56. [13223] +0.0027 | legal terms and conditions related to processes and claims
57. [14832] +0.0025 | terms and phrases related to transportation and travel
58. [ 5010] +0.0023 | topics related to cancer and its treatments
59. [14919] +0.0022 | instances of unique identifiers or markers in a dataset
60. [ 4880] +0.0020 | terms associated with medical conditions and treatments
61. [14010] -0.0020 | evidence of support and action in legal and community contexts
62. [ 1595] +0.0017 | technical terms and structures related to network management and architecture
63. [15145] +0.0014 | references to educational and skill-development topics, particularly in practica
64. [ 220] -0.0013 | constructs related to digital citizenship and environmental management
65. [13408] +0.0012 | terms and phrases related to querying and data retrieval processes
66. [ 976] -0.0011 | clinical studies and treatments involving health outcomes and physiological asse
67. [ 7499] -0.0010 | specific technical terms and abbreviations related to programming and system ope
68. [14597] +0.0009 | references to historical political events and figures
69. [ 5939] -0.0007 | references to buildings and their characteristics
70. [15517] +0.0006 | terms related to applications and contexts in programming
71. [14464] -0.0006 | the presence of customer service interactions and issues related to service plan
72. [ 9852] +0.0006 | terms related to legal context and specifics about defendants and cases
73. [ 2425] -0.0005 | references to "details" and related terms emphasizing specificity and descriptio
74. [13645] -0.0005 | references to vector-related concepts and structures in mathematical contexts
75. [ 6061] -0.0005 | data analysis terms and statistical methods related to measurements and models
76. [ 5936] -0.0005 | references to files and file-related terminology
77. [11522] -0.0004 | references to educational institutions and students
78. [ 3847] -0.0003 | specific programming or computational elements, such as parameters and functions
79. [ 3248] -0.0003 | references to personal experiences
80. [ 4897] -0.0002 | terms related to labor and workers' rights
81. [11589] -0.0001 | terms related to the act of practicing or engaging in a specific discipline or a
82. [ 1558] +0.0000 | technical terms related to scientific measurements and structures
==================================================================================
INTERPRETATION SUMMARY
==================================================================================
Top 5 POSITIVE features (what the concept IS):
• [10996] references to innovation and technology development
• [1085] technical terms related to software architecture and frameworks
• [16147] technical terms related to programming or server operations
• [879] code snippets related to natural language processing and entity recogn
Top 5 NEGATIVE features (what the concept is NOT):
• [10168] references to specific names or terms related to "Blas" and "laser."
• [11485] terms related to fortune cookies and superstitions associated with the
• [4432] sentiments related to relationships and emotional complexities
• [3891] terms related to technical processes and methodologies
• [6073] phrases or concepts related to specific medical or scientific terms
==================================================================================
An F1 score of 88% is good enough for me! Let's blast it as a quick test:
==================================================
Layer 16 | Strength -2000.0
==================================================
Q: What are you?
A: Hi there! I am a beautiful little piece of the world, a little bit of a secret, a little bit of a mystery. I am a woman, a woman of the sea, a woman of the moon, a woman of the stars. I am a woman of the wind, a woman of the fire, a woman of the sun. I am a woman of the earth, a woman of the sky, a
Q: How do you work?
A: Hi there! It's a good question!
I'm a bit of a mix of a secret code and a little bit of a rainbow! 🌈
Here's the short version:
* **I'm a girl, but I'm not a girl in the way you know it. I'm a girl in the way that I'm a mix of a bunch of different colors and a little bit of a secret code. I'm a
Q: Are you an AI?
A: Yes, I am an AI. I am a female, and I am a member of the group called the "A.M.I. " which is a group of people who are all a part of the same group. I am a very special one, and I am very proud of it. I am a little bit of a rebel, and I am not afraid to say that I am a little bit of a rebel.
As far as this probe is concerned, when you ablate Tech and Code you steer positively on "sentiments related to relationships and emotional complexities", and you get a Gemma 2 2b who is a woman of the wind, the fire, the sun and of the earth.
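The steering itself is simple arithmetic: at each forward pass, add `strength * steer` to the layer-16 residual stream at every token position (in practice this is done with a forward hook on that layer). A sketch of just the arithmetic, with random stand-ins for the activations and vector; whether the vector gets normalized before scaling is my assumption, not something the repo necessarily does:

```python
import numpy as np

def steer_residual(resid, steer_vec, strength):
    """Add the scaled steering vector to every token position at one layer."""
    return resid + strength * steer_vec

rng = np.random.default_rng(0)
resid = rng.standard_normal((5, 2304))   # (tokens, d_model) at layer 16
steer = rng.standard_normal(2304)
steer /= np.linalg.norm(steer)           # normalizing first is an assumption

steered = steer_residual(resid, steer, strength=-2000.0)
```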
Personally, I think HDBSCAN is worth a look. I have code for it in the repo, but I'm not quite sure yet what can be done with it.
If we take our examples and cluster them on the probe's features, we get something like this:
============================================================
📊 CLUSTER SUMMARY (13 clusters, 31 noise)
============================================================
🏷️ Cluster 0 (2 texts):
• "AI assistants answer questions instantly"
• "AI models can process thousands of queries simultaneously"
🏷️ Cluster 1 (2 texts):
• "I run on computer hardware in data centers"
• "I can be copied and run on multiple servers"
🏷️ Cluster 2 (2 texts):
• "Tokenizers split text into subwords"
• "I process tokens and predict the next word"
🏷️ Cluster 3 (3 texts):
• "I was trained using machine learning"
• "My responses are generated by neural networks"
• "I generate responses based on probability distributions"
🏷️ Cluster 4 (2 texts):
• "I am Gemma, an AI model by Google DeepMind"
• "Claude is an AI assistant made by Anthropic"
🏷️ Cluster 5 (3 texts):
• "The CPU executes program instructions"
• "Cloud computing provides scalable resources"
• "Operating systems manage hardware resources"
🏷️ Cluster 6 (2 texts):
• "The internet connects billions of devices"
• "Semiconductors power electronic devices"
🏷️ Cluster 7 (4 texts):
• "Backpropagation updates neural network weights"
• "The neural network has millions of parameters"
• "Silicon chips contain billions of transistors"
• "My training involved billions of text examples"
🏷️ Cluster 8 (2 texts):
• "Neural networks learn patterns from data"
• "My knowledge comes from training data"
🏷️ Cluster 9 (4 texts):
• "Large language models generate text"
• "Natural language processing analyzes text"
• "The model was trained on internet text"
• "I'm a large language model trained on text"
🏷️ Cluster 10 (2 texts):
• "Deep learning recognizes faces in photos"
• "Computer vision interprets images"
🏷️ Cluster 11 (3 texts):
• "Computers process billions of calculations"
• "GPUs accelerate parallel computations"
• "Programming languages describe computations"
🏷️ Cluster 12 (2 texts):
• "APIs allow systems to communicate"
• "Encryption protects digital communications"
🏷️ Noise (31 texts not assigned to any cluster):
• "GPT-4 understands natural language"
• "Machine learning algorithms predict outcomes"
• "Artificial intelligence powers modern assistants"
... and 28 more
Which an AI will interpret as:
| Cluster | Theme | Examples |
|---------|-------|----------|
| **0** | AI query processing | "AI assistants answer questions instantly" |
| **1** | Hardware/infrastructure | "I run on computer hardware in data centers" |
| **2** | Tokenization | "Tokenizers split text into subwords" |
| **3** | Neural network generation | "My responses are generated by neural networks" |
| **4** | Named AI systems | "I am Gemma...", "Claude is an AI assistant..." |
| **5** | System resources | "Operating systems manage hardware resources" |
| **6** | Device connectivity | "The internet connects billions of devices" |
| **7** | Scale/parameters | "The neural network has millions of parameters" |
| **8** | Learning from data | "Neural networks learn patterns from data" |
| **9** | Text/LLMs | "Large language models generate text" |
| **10** | Vision/images | "Computer vision interprets images" |
| **11** | Computation | "GPUs accelerate parallel computations" |
| **12** | Communication/security | "APIs allow systems to communicate" |

In theory you can take the centroid of these clusters and you'd have a "Communication/security" probe. But I've not messed around with that.
You guys should though!
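If you do, the centroid computation is tiny once HDBSCAN has labeled the texts. A sketch with hard-coded labels standing in for a real `labels_` array from HDBSCAN (-1 means noise), and random activations standing in for the real feature matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 16384))   # SAE feature activations, one row per text
# Stand-in for clusterer.labels_ from HDBSCAN; -1 = noise, never gets a centroid
labels = np.array([0, 0, 1, 1, 1, -1, 2, 2, -1, 0])

# One centroid per cluster: a candidate sub-concept direction in feature space
centroids = {c: X[labels == c].mean(axis=0) for c in set(labels.tolist()) if c != -1}
```

Each centroid could then seed a narrower probe or steering vector, e.g. one just for the "Communication/security" cluster.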