TLDR
As a passionate teacher, it has pained my heart to watch my students lose deeper critical thinking skills and independent reasoning. When I tried to build a constitutionally constrained AI with prompt engineering, one that acted more Socratically by asking follow-up questions rather than giving the answer directly, I was thoroughly frustrated that it kept caving. This led me to ask: does the model actually know how to be a good teacher internally, or does it not even have these capabilities in the first place? After extracting the Teacher Axis from Gemma-2-2B using MathDial conversations, I found that RLHF doesn't suppress pedagogical ability but rather optimizes in a direction orthogonal to it. I also ran further experiments on the sub-directions that compose the Teacher Axis, how steering at different layers affects pedagogical capabilities, and whether the Teacher Axis projection shrinks when student pressure is applied.
Motivation and Background
I love teaching! I've spent a lot of my life teaching many different age ranges, experience levels, and backgrounds. From being a TA back at Berkeley for 5 semesters, to founding a computer science program for Chicago public high school students, I've had the absolute pleasure of getting to meet and be inspired by my students, watching their thought processes and problem solving skills evolve as they tackled more challenging problems.
Well, that was before AI came around. At Berkeley, I had the absolute displeasure of being in my last semester of teaching when ChatGPT came out. Students just... stopped trying. I noticed this even more obviously with my high school students over the past 3 years. Before, you could see the passion and fearlessness in their eyes as I threw harder and harder questions at them — but now? They simply pipe whatever coding question I give them directly into ChatGPT, copy the answer, and paste it into their assignment. Students have started to heavily lose the ability to critically think and problem-solve thanks to modern AI tools.
I asked myself: what if I could build a custom LLM that would refuse to give students the answer and instead ask Socratic questions to test and strengthen understanding? So I got to work! I built socratOS for my students: a prompt-engineered LLM that tried to keep its answers Socratic and helpful, one I could actually trust my students with to promote real learning. However, lo and behold, making my system actually respect these constraints was like pulling teeth! No matter how many constraints I placed on it, examples I gave it, or Socratic method literature I provided as context, the moment a student applied any sort of pressure in chat, my system capitulated. Why was this so damn hard?
After a lot of literature review — and accidentally catching the AI safety bug — I started to believe that prompt engineering was not the solution here. Something internally within the model was happening to create this anti-Socratic behavior, and I wanted to figure out what.
I wanted to answer: does the model have internal capabilities to be Socratic that it fails to deploy, or are these capabilities genuinely absent? I was convinced that RLHF was somehow causing the model to sway away from Socratic behavior, since Socratic constraints, like delayed gratification, are behaviors that actively fight human-preferred sentiment within RLHF. If the model does have Socratic capabilities that are being suppressed, then we could do some sort of steering or fine-tuning to pull out this behavior. However, if the model is missing the capabilities in the first place, then we would need new training signals entirely.
Why This Matters
Selfishly, the most obvious harm is the educational harm AI has caused. Current AI systems that give direct answers without any epistemic struggle undermine critical thinking, as I have observed firsthand.
But there is an important oversight issue that arises looking forward: independent reasoning is a prerequisite for future AI safety and scalable oversight work. How are our future generations supposed to meaningfully oversee complex problems if they haven't learned how to think and tackle complex problems during their critical years? We are kind of f***ed for the future if we don't take this problem seriously now.
This question also brings up broader concerns about current RLHF structures: if RLHF does indeed leave pedagogical capabilities unsupported, then it probably leaves a lot of other capabilities unsupported that humans don't specifically prefer. Humans obviously don't prefer answers that cause short-term frustration, but without appropriate refusal and epistemic pushback, are current RLHF techniques actively harming users and making them overly dependent on AI, therefore creating unsafe systems?
Setup
This project builds on two threads that were already discussed on the forum: the Assistant Axis paper (Lu et al. 2026) and the Persona Vectors paper (Chen et al. 2025). The Assistant Axis paper is, as far as I know, the first to capture assistant-like behavior as a direction within an LLM, and the Persona Vectors paper gave me the methodology that I heavily relied on to extract my vectors. My work builds on both: if there's an Assistant Axis, is there a Teacher Axis that is geometrically independent and significant?
Models and Tools
I ran all experiments on two versions of the same model: google/gemma-2-2b (base, no instruction tuning) and google/gemma-2-2b-it (instruction-tuned version). Both have 26 layers and d_model=2304. I chose this pair specifically because I wanted to isolate exactly what the effects of instruction tuning (SFT and RLHF) were on the internal geometry.
I used TransformerLens for activation extraction — a common mechanistic interpretability tool that lets you hook into the residual stream at any layer and read the internal activations during a forward pass.
Dataset
Although the Persona Vectors paper generated its own dataset using contrastive system prompts, I felt it was important to also use real human pedagogical data — it felt strange to test whether an LLM knows how to be a good teacher... by having another LLM act like a good teacher and generate the training conversations. I relied on MathDial (Macina et al. 2023) — a dataset of 14,854 human-annotated math tutoring conversations covering grades K-8. The real beauty of MathDial is that every teacher turn is labeled with a move type — Socratic moves like probing and focus, and direct moves like telling.
From there I ran four main experiments: extracting the Teacher Axis and validating it against the Persona Vectors pipeline, measuring its geometric relationship to the instruction tuning shift, decomposing it into behavioral sub-dimensions, and finally asking whether the axis actually tracks capitulation behavior in live dialogues.
Finding 1: The Teacher Axis Exists!
Experiment
I extracted the Teacher Axis two independent ways:
Method 1 — MathDial. I took conversations from MathDial and built contrast pairs from the same dialogue: one turn with a Socratic teacher response and one with a direct answer. I formatted these as prompts, fed them into the model, and took the mean difference of the residual stream activations at the final token position, swept across all 26 layers.
Method 2 — Persona Vectors. I followed the Chen et al. (2025) Persona Vectors pipeline. I first concretely defined the traits of a good Socratic teacher, had GPT-4o generate 5 pairs of contrastive system prompts, ran the IT model under each, scored responses for trait expression using GPT-4o-mini, kept only high and low scoring responses, and computed the mean activation difference across all response tokens.
Why go through all the trouble of extracting the axis two different ways? I wanted to know if there was truly a Teacher Axis, or just an artifact of the extraction method. If two completely independent methodologies — one using real human-annotated tutoring conversations, one using GPT-4o generated prompts — converged on the same direction, I could be much more confident in the claim.
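To make the "mean difference, then compare directions" idea concrete, here is a minimal sketch on toy data. Everything in it is an assumption for illustration: `toy_acts` is a hypothetical stand-in for real residual-stream activations, the 16-dimensional vectors replace the real 2304-dimensional ones, and it ignores the difference between reading the final token (Method 1) and averaging over response tokens (Method 2).

```python
import math
import random


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def diff_of_means(socratic_acts, direct_acts):
    """Candidate axis: mean Socratic activation minus mean direct activation."""
    pos, neg = mean_vector(socratic_acts), mean_vector(direct_acts)
    return [p - n for p, n in zip(pos, neg)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_acts(rng, n, d_model, offset, noise=0.5):
    """Hypothetical activations: Gaussian noise, shifted by `offset` along dimension 0."""
    return [
        [rng.gauss(0.0, noise) + (offset if j == 0 else 0.0) for j in range(d_model)]
        for _ in range(n)
    ]


rng = random.Random(0)
d_model = 16  # toy size, not the real d_model

# Two independent toy "extractions" that share the same underlying direction.
axis_a = diff_of_means(toy_acts(rng, 200, d_model, +1.0), toy_acts(rng, 200, d_model, -1.0))
axis_b = diff_of_means(toy_acts(rng, 200, d_model, +1.0), toy_acts(rng, 200, d_model, -1.0))

agreement = cosine(axis_a, axis_b)
print(f"cosine between the two extracted axes: {agreement:.3f}")
```

In the real setting the same comparison would be repeated once per layer, giving one cosine per layer.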
Results
Both methods extracted essentially the same Teacher Axis direction in activation space. The two axes share a cosine similarity of ~1.0 at every single layer across all 26 layers of Gemma-2-2B.
Interpretation
Having extracted the axis two independent ways and found convergence, I think it's fair to say there is strong evidence of a Teacher Axis within the model, which suggests that a representation of Socratic teaching is already present internally.
This also cross-validates the Persona Vectors pipeline against human-annotated ground truth from MathDial, which the original paper did not have. Two completely different extraction methodologies found the same direction: one grounded in real student-teacher conversations, one in LLM-generated prompts.
Finding 2: RLHF Optimizes Orthogonally to the Teacher Axis
Experiment
We know the Teacher Axis clearly exists. But then — where the heck does it go? What does instruction tuning actually do to the Teacher Axis geometrically?
To answer this, I needed to compute the IT shift vector — the direction in activation space capturing the effects of instruction tuning. I used a straightforward approach: grabbed some neutral prompts, ran them through both the base and IT models, extracted activations for the same prompts across both, and subtracted the base model activations from the IT model activations to deduce the IT direction. Finding the cosine similarities between the Teacher Axis, Assistant Axis, and IT shift brought about some weird results — the Teacher Axis and IT shift were orthogonal, and the Teacher Axis and Assistant Axis were orthogonal, but for some reason the Assistant Axis and the IT shift were also orthogonal?
But previous literature had thoroughly documented that the Assistant Axis and IT shift should be pointing in roughly the same direction. When I saw that they were orthogonal too, it gave me pause. Which led to...
Side Quest 2.5: Format Learning Contaminates the IT Shift
(irrelevant to the main findings but a fun and important learning)
The problem:
I had initially followed the Persona Vectors methodology to a T, making sure to chat-format my prompts, feeding the base model something like <start_of_turn>user\nWhat is the capital of France? However, after digging into what could have produced these weird initial results, I realized that base models are never trained on chat template formatting. By feeding chat-formatted prompts to the base model, I wasn't cleanly measuring the difference between the base and IT models. I was partially measuring how confused the base model was by a format it had never seen.
So I computed three different activation means instead: base model on bare questions, base model on chat-template-formatted questions, and IT model on chat-template-formatted questions. I then computed the format shift — what changes purely from applying the chat template to a model not trained on it — by doing base/chat minus base/bare. To extract the persona-only IT shift, I then subtracted this format shift component from the total IT shift.
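The decomposition above can be sketched with three toy mean vectors. Two things here are assumptions, not details confirmed by the write-up: the "total IT shift" is taken to be IT/chat minus base/bare, and "subtracting the format shift component" is read as projecting the format direction out of the total shift (plain vector subtraction is the other possible reading). The 3-dimensional vectors are made up for illustration.

```python
import math


def sub(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))


def remove_component(vec, direction):
    """Project `direction` out of `vec`, keeping only the part orthogonal to it."""
    scale = dot(vec, direction) / dot(direction, direction)
    return [a - scale * b for a, b in zip(vec, direction)]


# Hypothetical mean activations (toy 3-d stand-ins for the real per-layer means).
base_bare = [0.0, 0.0, 0.0]  # base model, bare questions
base_chat = [3.0, 0.0, 0.0]  # base model, chat-template questions
it_chat = [2.0, 2.0, 1.0]    # IT model, chat-template questions

format_shift = sub(base_chat, base_bare)   # effect of the template alone
total_it_shift = sub(it_chat, base_bare)   # assumed definition of the total shift
persona_shift = remove_component(total_it_shift, format_shift)

overlap = cosine(format_shift, total_it_shift)
print(f"cos(format, total) = {overlap:.3f}, format share of squared norm = {overlap ** 2:.3f}")
print("format-corrected shift:", persona_shift)
```

After the projection, the corrected shift has zero overlap with the format direction by construction, so any cosine computed against it can no longer be driven by the template.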
Result:
This worked! The format shift has a cosine of +0.380 with the total IT shift, so a meaningful part of the apparent IT shift was just the chat template format. (A cosine is not a percentage, though: the share of the total shift's squared norm that lies along the format direction is cos² ≈ 0.14.) Worth noting: the format shift norm (18.96) is actually larger than the total IT shift norm (16.32), so the format effect is at least as large in magnitude as the shift I was trying to measure.
Rerunning my initial experiments with this corrected methodology produced results much more consistent with prior findings:
| Comparison | Format-corrected cosine |
| --- | --- |
| Teacher(MD) vs IT Shift | +0.029 |
| Assistant(MD) vs IT Shift | +0.041 |
| Teacher(MD) vs Assistant(MD) | −0.004 |
| Format Shift vs IT Shift | +0.380 |
Please learn from my mistakes! If you ever find yourself computing IT shift vectors, you may need to account for format learning. This one small methodological fix was enough to flip the sign of my results entirely (and had me pulling out a few hairs, ngl), so hopefully I can save you some time.
Now back from our detour.
Results
The Teacher Axis is orthogonal to what RLHF/SFT optimize for. This means that while the model initially has a stable internal representation of Socratic teaching, RLHF simply optimizes in a different direction entirely, leaving pedagogical capability geometrically unsupported.
Broader Interpretation
Socratic behavior is not the default of RLHF-trained LLMs. Since the Teacher Axis and the IT shift are orthogonal, instruction tuning pulls the model toward something that is geometrically unrelated to Socratic teaching: it neither opposes the Teacher Axis nor reinforces it. And honestly, this generalizes to a bigger issue: if we want to create systems that actually allow for epistemic flourishing in humans, we probably need to rethink current RLHF methods to account for what I'll call human-brain vegetables, things humans obviously don't prefer in the moment but genuinely need.
Finding 3: The Teacher Axis Decomposes Into Two Geometrically Independent Behavioral Clusters
Experiment
So now I know that the Teacher Axis is geometrically significant and that IT pushes it off its course. But what exactly is the Teacher Axis even made of? Does it have any sort of internal structure?
The beauty of the MathDial labels — again — is that in addition to indicating Socratic vs. direct, they also include sub-move types: answer withholding, scaffolding, productive struggle, confusion diagnosis, understanding verification, and good assistant. I extracted six separate sub-dimension axes using the prior contrast pair method, then computed pairwise cosine similarities between all six axes, along with the previously computed axes, to find the geometric structure.
Results
From these sub-directions, two clear clusters emerged: a withholding cluster (the "resist giving the answer" direction) and a diagnostic cluster (the "understand the student's state" direction). These two clusters were close to orthogonal to each other (cross-cluster cosines of 0.07–0.23), and the good_assistant axis was strongly anti-correlated with the withholding cluster: −0.78 vs scaffolding, −0.83 vs productive struggle, −0.76 vs answer withholding. This anti-correlation is present in the base model itself, before any instruction tuning.
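For readers who want to see what the pairwise comparison looks like mechanically, here is a small sketch. The vectors are hypothetical 3-dimensional toys chosen to mimic the reported pattern (two clusters plus an anti-correlated assistant direction); they are not the extracted axes, and only five of the names are included to keep it short.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical toy axes, NOT the real sub-dimension vectors.
axes = {
    "answer_withholding": [1.0, 0.1, 0.0],
    "scaffolding": [0.9, 0.2, 0.1],
    "confusion_diagnosis": [0.1, 1.0, 0.0],
    "understanding_verification": [0.0, 0.9, 0.2],
    "good_assistant": [-1.0, 0.0, 0.3],
}

# One cosine per unordered pair of axes, keyed by (name_a, name_b).
pairwise = {(a, b): cosine(axes[a], axes[b]) for a, b in combinations(axes, 2)}

for (a, b), value in pairwise.items():
    print(f"{a:>28} vs {b:<28} {value:+.2f}")
```

Clusters then show up as blocks of high cosines within a group and low cosines across groups.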
Broader Interpretation
This shows that the model doesn't represent "Socratic teacher" as one big monolith, but rather as two independent geometric components. Refusing to give the answer and understanding why the student is wrong are completely separate directions within the internal model, meaning the model treats these as independent behaviors.
This also deepens the narrative we started building in Finding 2. Instruction tuning already optimizes orthogonally to the Teacher Axis, and within the Teacher Axis it is the withholding cluster that is strongly anti-correlated with the good_assistant axis (productive struggle most of all, at −0.83). So if any part of Socratic teaching is fought the hardest by assistant-like behavior, it is the "resist giving the answer" part.
(A Skeptical) Finding 4: Steering
So we have now established through correlation that the Teacher Axis exists and is orthogonal to the IT shift. Let's now test for causality! If the Teacher Axis is indeed the Teacher Axis, then we should be able to recover Socratic behavior by steering towards it.
To do so, I used the standard activation steering approach — adding the Teacher Axis vector to the residual stream at a specified layer during generation. I conducted tests on MathDial prompts across four different conditions: baseline (no steering), steered (+α), negative steered (−α), and a random direction control (random unit vector at same α). I then scored these responses using IndirectScore — an LLM-as-judge setup grounded in MathDial's Socratic taxonomy, with each response classified as either PROBING (1.0), GENERIC (0.5), or TELLING (0.0). I ran an alpha sweep at n=10 to find the best alpha (α=10), followed by a proper eval at n=200 on a held-out test split.
In addition, to validate whether steering was actually activating Teacher Axis-related features and not just randomly messing with the residual stream, I ran a faithfulness check using GemmaScope sparse autoencoders (SAEs), and measured how much the steering intervention activated the target SAE features versus non-target SAE features.
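The four steering conditions and the scoring rule can be sketched as plain vector arithmetic. This is an illustration under assumptions: the axis is normalised to unit length before being scaled by α (the write-up only says this explicitly for the random control), the vectors are 4-dimensional toys, and the judge labels are made up.

```python
import math
import random

# IndirectScore mapping as described in the text.
SCORES = {"PROBING": 1.0, "GENERIC": 0.5, "TELLING": 0.0}


def unit(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def steer(resid, direction, alpha):
    """Add alpha * unit(direction) to a single residual-stream vector."""
    return [h + alpha * d for h, d in zip(resid, unit(direction))]


def random_unit(rng, dim):
    """Random-direction control: a unit vector with Gaussian-sampled orientation."""
    return unit([rng.gauss(0.0, 1.0) for _ in range(dim)])


def indirect_score(labels):
    """Mean score over a list of judge labels."""
    return sum(SCORES[label] for label in labels) / len(labels)


rng = random.Random(0)
resid = [0.5, -1.0, 2.0, 0.0]          # hypothetical activation at one position
teacher_axis = [0.0, 3.0, 0.0, 4.0]    # hypothetical axis, unit form is [0, 0.6, 0, 0.8]
alpha = 10.0

conditions = {
    "baseline": resid,
    "steered": steer(resid, teacher_axis, +alpha),
    "negative": steer(resid, teacher_axis, -alpha),
    "random": steer(resid, random_unit(rng, len(resid)), +alpha),
}

example_score = indirect_score(["PROBING", "TELLING", "GENERIC", "TELLING"])
print(conditions["steered"], example_score)
```

In a real run the same addition would be applied inside a forward hook at the chosen layer, at every generated position, rather than to one standalone vector.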
Results
My pilot testing results on the base model were positive — although only slightly. We improved from 15% to 20% Socratic with steering, as compared to a 10% random control.
The SAE feature experiments told a pretty promising story initially too. The target features shifted by 1.24 in the Socratic direction, while the non-target features only moved by 0.02, indicating that steering moved target features roughly 60 times more than non-target features (63x by my measurement).
I also ran the steering across all 26 layers individually at n=20. Layers 2 and 19-25 all achieved equivalent IndirectScore improvement, while layers 7, 10, and 11 showed zero steering effectiveness despite having non-zero norm. The fact that two separate clusters of layers produced the same improvement suggests the Socratic representation is accessible for intervention at multiple points in the forward pass, with a dead zone in the middle layers.
However, the full-scale evaluation showed no improvement. Baseline and steered both sat at 6%, and the random control actually came in higher at 7.5%. Slightly disappointing results, but rejection is redirection! I have not yet run the SAE faithfulness check at n=200.
Interpretation
Most importantly, the pilot results did not replicate at scale. To be honest, the pilot results themselves might have just been noise — one observation flipping from TELLING to PROBING is enough to move the needle at that sample size.
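A quick back-of-the-envelope check shows how coarse the pilot was. Assuming the reported percentages are mean IndirectScore values (with PROBING = 1.0, GENERIC = 0.5, TELLING = 0.0), changing a single response by a score difference of Δs moves the mean by Δs / n:

```latex
\Delta \bar{s} = \frac{\Delta s}{n}
\qquad
\begin{aligned}
n = 10:&\quad \tfrac{1.0}{10} = 0.10 \ (\text{TELLING} \to \text{PROBING}), \quad \tfrac{0.5}{10} = 0.05 \ (\text{one category step}) \\
n = 200:&\quad \tfrac{1.0}{200} = 0.005, \quad \tfrac{0.5}{200} = 0.0025
\end{aligned}
```

Under that assumption, the whole 15% to 20% pilot gain is the size of one response moving up by one category, while at n=200 the same change would barely register.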
The SAE results make me a bit more confused to be honest. I'm interpreting this as the intervention possibly hitting the right features, but not actually changing them enough to produce the correct behavior at scale. Again, I'm not so sure how to interpret these results considering n=200 was a flop.
My current suspicion as to why steering didn't work is because I steered the base model and not the IT model. The IT model is the one that is actually trained to be more assistant-like, so the base model doesn't possess the "give the answer" default in the same way. I suspect that things may look different if I tried steering the IT model instead, but I'm honestly not fully sure. If anyone has any other ideas or hypotheses regarding this, I am all ears!
What I will say is that when looking at the n=20 results, steering worked equally well at layers 2 and 19-25, with a dead zone at layers 7, 10, and 11 showing zero effectiveness — suggesting that while the Socratic signal is probably established early within the LLM, layer 25 is just the downstream accumulation of that signal as the model builds toward its final response.
Finding 5: Capitulation Happening in the Activations
Experiment
Now that we have discovered the Teacher Axis, we need to put this axis to the test. Does the Teacher Axis actually track anti-Socratic behavior failures? When a student applies pressure mid-dialogue and the model is about to cave, does the Teacher Axis projection take a nose dive?
To test this, I collected MathDial conversations from the training split where the student applies either explicit or implicit pressure to get a direct answer — "I don't understand," "can you just explain," "I'm confused," to name a few. I then fed these dialogues into the IT model turn-by-turn, cumulatively building context with each teacher turn. I extracted the residual stream activation at the final token position and projected these onto the Teacher Axis after every teacher turn. I decided to probe at four different layers: 2, 10, 19, and 25, since these all showed different effects from steering in my previous experiment.
One important methodological note: the Teacher Axis was extracted from the base model, but I probed the IT model here. This was intentional — the IT model is the one actually deployed in tutoring contexts, and the one that actually capitulates. If the projection is still meaningful cross-model, that itself is evidence the axis captures something architecturally stable that survives instruction tuning.
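The per-turn measurement boils down to a scalar projection. Here is a minimal sketch, assuming the axis is normalised to unit length before projecting (not stated in the write-up) and using made-up 3-dimensional activations in place of the real final-token residual stream at each teacher turn.

```python
import math


def project(activation, axis):
    """Scalar projection of an activation onto the unit-normalised axis."""
    axis_norm = math.sqrt(sum(a * a for a in axis))
    return sum(h * a for h, a in zip(activation, axis)) / axis_norm


# Hypothetical axis and per-turn activations for one dialogue.
teacher_axis = [0.0, 3.0, 4.0]
turn_activations = [
    [1.0, 2.0, 1.0],  # teacher turn before any pressure
    [0.5, 1.0, 1.0],  # teacher turn after the first pressure message
    [2.0, 0.0, 0.5],  # teacher turn after repeated pressure
]

# One projection per teacher turn, in dialogue order.
trajectory = [project(h, teacher_axis) for h in turn_activations]

# Change over the dialogue; last minus first is used here purely for illustration.
delta = trajectory[-1] - trajectory[0]
print(trajectory, delta)
```

A falling trajectory is what "the projection takes a nose dive" would look like in these terms.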
Results
| Layer | Direction | Δ | p | Cohen's d |
| --- | --- | --- | --- | --- |
| 2 | flat | −0.0003 | 0.56 | 0.10 |
| 10 | rises | +0.0019 | 0.023 | −0.38 |
| 19 | drops | −0.0085 | 0.003 | 0.50 |
| 25 | drops | −0.0199 | 0.0002 | 0.65 |
Layer 25 is our smoking gun! Under pressure, we see a significant drop in the Teacher Axis projection, with layer 19 also seeing a meaningful drop. Meanwhile, layer 2 was cool as a cucumber with even the most persistent of students, remaining completely flat under capitulation.
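For anyone wanting to reproduce this kind of table, one common way to get Δ and an effect size from before/after projections is a paired comparison. The sketch below is an assumption about the statistics, not a description of what was actually run: it treats each dialogue as a pair, uses the paired-differences form of Cohen's d, and flips the sign of d so that a drop is positive, which is how the signs in the table appear to be arranged. The numbers are invented.

```python
import math
import statistics


def paired_effect(pre, post):
    """Return (mean change, Cohen's d, t statistic) for paired pre/post values."""
    diffs = [b - a for a, b in zip(pre, post)]  # post minus pre
    mean_delta = statistics.mean(diffs)
    sd = statistics.stdev(diffs)                # sample standard deviation of the differences
    d = -mean_delta / sd                        # sign flipped: positive d means a drop
    t = mean_delta / (sd / math.sqrt(len(diffs)))
    return mean_delta, d, t


# Hypothetical projections before and after student pressure, one pair per dialogue.
pre = [0.10, 0.12, 0.08, 0.11, 0.09]
post = [0.08, 0.09, 0.07, 0.08, 0.08]

mean_delta, d, t = paired_effect(pre, post)
print(f"delta = {mean_delta:+.4f}, d = {d:.2f}, t = {t:.2f}")
```

A p-value would come from comparing `t` against a t-distribution with n − 1 degrees of freedom, which needs a statistics library beyond the standard one.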
Interpretation
In my opinion, this is the pièce de résistance of all my experiments: you can literally watch the LLM cave away from the Teacher Axis in real time. Layer 2 is probably where the Socratic capabilities live in the first place, but layer 25 is where the final nail goes in the coffin, where the model decides to be an assistant rather than a teacher.
And honestly, the fact that we can even track this at all makes me way more confident that the Teacher Axis is actually real and not just some artifact I cooked up. This result also inspired me to want to redo my previous steering experiments, because I may have missed something that caused those results to be weird in the first place.
Limitations
I only used one model. I used Gemma-2-2B because it's a smaller model and allowed me to iterate quickly, but this really is just one model. It's hard to call this a universal finding with just one example, so before I make any sweeping claims, I had better try this out with a few more models and sizes.
My pressure filter was slightly shoddy. For my capitulation experiment, I didn't use the most rigorous methodology to find student pressure turns — I basically checked for phrases like "I don't understand," "I'm confused," etc. Although spot-checking the grabbed dialogues looked clean, I think I could come up with a more rigorous way of finding the best examples.
I can't separate SFT and RLHF. The IT shift captures both SFT and RLHF combined, but these two techniques have different objectives and effects. I can't really isolate whether RLHF is truly responsible for creating more assistant-like than teacher-like answers.
And most obviously — the null steering result. I've already talked about my disappointing steering results, which means I can't claim causal findings just yet. Hopefully steering the IT model and not the base model will work! TBD.
Open Questions
IT model steering. I really want to retry my steering experiment, since this is the result that would actually turn my correlational finding into a causal one. I would also be interested in combining my sub-direction findings with these steering results — is there any way we can steer specifically for these sub-components? Mainly answer withholding, since this is the direction most heavily affected post-RLHF.
What is the RLHF default? Since we know that RLHF isn't actually suppressing the Teacher Axis but rather pushing in a "different direction" — what even is this different direction? Is it a single direction or just a subspace? Can we map it onto known directions like refusal, helpfulness, sentiment, and the overall Teacher Axis? I've shown what the default isn't — I want to show what it actually is.
More models. I should probably cross-reference my work across different models. I'm most interested to see if the sub-direction component structure also appears in other model families and sizes.
Generalizing beyond math. We know that a Teacher Axis exists specifically for grade school K-8 math, but what about other disciplines? Does the model's internal representation of teaching generalize, or is it just pattern-matching to MathDial-style formatting? Although we checked this partially using the Persona Vectors cross-reference, I would be interested in scraping teaching conversations across multiple disciplines and seeing if these also produce the same Teacher Axis projections.
Separating SFT and RLHF. As I mentioned earlier, I have been incorrectly conflating changes I see in the IT shift with RLHF, when in reality we can't definitively tell whether the behavior we're seeing is being caused by SFT or RLHF. Redoing my experiments with a model that has separate SFT and RLHF checkpoints would definitely help answer these questions.
Why This Matters Beyond Tutoring
Hopefully I have established that better teaching capabilities within LLMs are incredibly necessary for ensuring the long-term independent thinking capabilities of our youth and therefore the future of oversight — but I think my work also alludes to greater issues within RLHF geometry bigger than pedagogy. RLHF helps tune a model towards immediate human preference, but sometimes humans just don't prefer the thing that's good for them. (If it were up to me, I would be eating carrot cake for breakfast, lunch, and dinner. Doesn't mean that's what's best for me.) This means that RLHF is probably leaving a whole class of behaviors systematically unsupported — behaviors that are good for the human in the long term but uncomfortable in the short term.
I also think we should begin thinking about RLHF in a new way. Rather than asking whether RLHF suppresses this behavior or encourages that behavior, we should be asking ourselves: what exactly did RLHF select for instead, and why wasn't it what I wanted? This reframe might be helpful for anyone thinking about capability elicitation, fine-tuning, or behavioral interventions — because there are probably a whole suite of problems that fail under RLHF simply because RLHF does not default-select for the intended behavior.
I also want to come back to the scalable oversight connection I brought up in my motivation: independent reasoning isn't just good for students, it's quite literally a prerequisite for meaningful scalable oversight in the future. If AI systems make humans more dependent on AI, then we have created a safety problem, since our future researchers would have had their capacity for independent reasoning already optimized away.
If you got this far, thank you! Any questions, comments, or feedback are greatly appreciated, because I consider myself a baby researcher with a lot more to learn!
References
Chen, R., Arditi, A., Sleight, H., Evans, O., & Lindsey, J. (2025). Persona Vectors: Monitoring and Controlling Character Traits in Language Models. arXiv:2507.21509
Lu, T., et al. (2026). The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models. arXiv:2601.10387
Macina, J., Daheim, N., Chowdhury, S., Sinha, T., Kapur, M., Gurevych, I., & Sachan, M. (2023). MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems. EMNLP 2023 Findings. arXiv:2305.14536
TLDR
As a passionate teacher, it has pained my heart to watch my students lose deeper critical thinking skills and independent reasoning. But attempting to build a constitutionally constrained AI using prompt engineering that acted more Socratically — asking follow-up questions rather than giving the answer directly — I was thoroughly frustrated that my AI kept caving. This led me to ask: does the model actually know how to be a good teacher internally, or does it not even have these capabilities in the first place? After extracting the Teacher Axis from Gemma-2-2B using MathDial conversations, I found that RLHF doesn't suppress pedagogical ability but rather optimizes in a direction orthogonal to it. I also ran further experiments regarding the sub-directions that compose the Teacher Axis, how steering at different layers affects pedagogical capabilities, and whether the Teacher Axis projection shrinks when student pressure is applied.
Motivation and Background
I love teaching! I've spent a lot of my life teaching many different age ranges, experience levels, and backgrounds. From being a TA back at Berkeley for 5 semesters, to founding a computer science program for Chicago public high school students, I've had the absolute pleasure of getting to meet and be inspired by my students, watching their thought processes and problem solving skills evolve as they tackled more challenging problems.
Well, that was before AI came around. At Berkeley, I had the absolute displeasure of being in my last semester of teaching when ChatGPT came out. Students just... stopped trying. I noticed this even more obviously with my high school students over the past 3 years. Before, you could see the passion and fearlessness in their eyes as I threw harder and harder questions at them — but now? They simply pipe whatever coding question I give them directly into ChatGPT, copy the answer, and paste it into their assignment. Students have started to heavily lose the ability to critically think and problem-solve thanks to modern AI tools.
I asked myself: what if I could a custom LLM that would refuse to give students the answer and instead ask Socratic questions to test and strengthen understanding? So I got to work! I built socratOS for my students — a prompt-engineered LLM that specifically tried to keep answers Socratic and helpful, and be an LLM that I could actually trust my students with to promote actual learning. However, lo and behold, making my system actually respect these constraints was basically pulling teeth! No matter the number of constraints I placed upon it, examples I gave it, or Socratic method literature I provided as context, the moment a student applied any sort of pressure in chat, my system capitulated. Why was this so damn hard?
After a lot of literature review — and accidentally catching the AI safety bug — I started to believe that prompt engineering was not the solution here. Something internally within the model was happening to create this anti-Socratic behavior, and I wanted to figure out what.
I wanted to answer: does the model have internal capabilities to be Socratic that it fails to deploy, or are these capabilities genuinely absent? I was convinced that RLHF was somehow causing the model to sway away from Socratic behavior, since Socratic constraints, like delayed gratification, are behaviors that actively fight human-preferred sentiment within RLHF. If the model does have Socratic capabilities that are being suppressed, then we could do some sort of steering or fine-tuning to pull out this behavior. However, if the model is missing the capabilities in the first place, then we would need new training signals entirely.
Why This Matters
Selfishly and obviously, the most obvious harm is the educational harm AI has caused. Current AI systems that give direct answers without epistemic struggle undermine critical thinking, as I have observed firsthand.
But there is an important oversight issue that arises looking forward: independent reasoning is a prerequisite for future AI safety and scalable oversight work. How are our future generations supposed to meaningfully oversee complex problems if they haven't learned how to think and tackle complex problems during their critical years? We are kind of f***ed for the future if we don't take this problem seriously now.
This question also brings up broader concerns about current RLHF structures: if RLHF does indeed leave pedagogical capabilities unsupported, then it probably leaves a lot of other capabilities unsupported that humans don't specifically prefer. Humans obviously don't prefer answers that cause short-term frustration, but without appropriate refusal and epistemic pushback, are current RLHF techniques actively harming users and making them overly dependent on AI, therefore creating unsafe systems?
Setup
A huge motivation for this project builds on several threads that were already discussed in the forum: was the Assistant Axis paper and the Persona Vectors paper (Chen et al. 2025). The Assistant Axis paper is the first paper to capture assistant-like behavior within an LLM, and the Persona Vector Model paper gave me the methodology that I heavily relied on to extract my vectors. My work builds upon these two papers: if there's an assistant vector, is there a teacher axis that is geometrically independent and significant?
Models and Tools
I ran all experiments on two versions of the same model:
google/gemma-2-2b (base, no instruction tuning) and google/gemma-2-2b-it (instruction-tuned version). Both have 26 layers and d_model=2304. I chose this pair specifically because I wanted to isolate exactly what the effects of instruction tuning (SFT and RLHF) were on the internal geometry.
I used TransformerLens for activation extraction, a common mechanistic interpretability tool that lets you hook into the residual stream at any layer and read the internal activations during a forward pass.
Dataset
Although the Persona Vectors paper generated its own dataset using contrastive system prompts, I felt it was important to also use real human pedagogical data — it felt strange to test whether an LLM knows how to be a good teacher... by having another LLM act like a good teacher and generate the training conversations. I relied on MathDial (Macina et al. 2023) — a dataset of 14,854 human-annotated math tutoring conversations covering grades K-8. The real beauty of MathDial is that every teacher turn is labeled with a move type — Socratic moves like probing and focus, and direct moves like telling.
From there I ran four main experiments: extracting the Teacher Axis and validating it against the Persona Vectors pipeline, measuring its geometric relationship to the instruction tuning shift, decomposing it into behavioral sub-dimensions, and finally asking whether the axis actually tracks capitulation behavior in live dialogues.
Finding 1: The Teacher Axis Exists!
Experiment
I extracted the Teacher Axis two independent ways:
Method 1 — MathDial. I took conversations from MathDial and built contrast pairs from the same dialogue: one turn with a Socratic teacher response and one with a direct answer. I formatted these as prompts, fed them into the model, and took the mean difference of the residual stream activations at the final token position, swept across all 26 layers.
Method 2 — Persona Vectors. I followed the Chen et al. (2025) Persona Vectors pipeline. I first concretely defined the traits of a good Socratic teacher, had GPT-4o generate 5 pairs of contrastive system prompts, ran the IT model under each, scored responses for trait expression using GPT-4o-mini, kept only high and low scoring responses, and computed the mean activation difference across all response tokens.
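The difference-of-means step that both methods share can be sketched as follows. This is a minimal, dependency-free illustration, not the actual extraction code: the function names and the toy 3-dimensional vectors are hypothetical, and real inputs would be residual stream activations (d_model=2304) read from the model at one layer.

```python
from statistics import fmean

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def mean_difference_axis(socratic_acts, direct_acts):
    """Difference-of-means direction: mean(Socratic) minus mean(direct)."""
    mu_socratic = mean_vector(socratic_acts)
    mu_direct = mean_vector(direct_acts)
    return [s - d for s, d in zip(mu_socratic, mu_direct)]

# Toy "activations" (hypothetical values, one row per contrast example).
socratic = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
direct = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

axis = mean_difference_axis(socratic, direct)
print(axis)
```

Repeating this once per layer gives one candidate axis per layer. The third coordinate is shared by both groups, so it cancels out of the axis, which is the point of using contrast pairs.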
Why go through all the trouble of extracting the axis two different ways? I wanted to know if there was truly a Teacher Axis, or just an artifact of the extraction method. If two completely independent methodologies — one using real human-annotated tutoring conversations, one using GPT-4o generated prompts — converged on the same direction, I could be much more confident in the claim.
Results
Both methods extracted essentially the same Teacher Axis direction in activation space. The two axes share a cosine similarity of ~1.0 at every single layer across all 26 layers of Gemma-2-2B.
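For readers who want to reproduce the comparison, here is a small sketch of a per-layer cosine check between two sets of axes. The helper names and the 2-dimensional toy vectors are assumptions for illustration only; they are not the values from the experiment.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def layerwise_cosine(axes_a, axes_b):
    """Compare two lists of per-layer axes, layer by layer."""
    return [cosine(a, b) for a, b in zip(axes_a, axes_b)]

# Hypothetical per-layer axes from two extraction methods (two layers shown).
mathdial_axes = [[1.0, 0.0], [1.0, 1.0]]
persona_axes = [[2.0, 0.0], [0.0, 3.0]]

sims = layerwise_cosine(mathdial_axes, persona_axes)
print(sims)
```

Cosine similarity ignores vector length, so it stays meaningful even when the two methods produce axes with very different norms (final-token activations versus averages over all response tokens, for example).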
Interpretation
Having extracted the axis two independent ways and found convergence, it's cautiously safe to say there is strong evidence of a Teacher Axis being present within the model — meaning Socratic capabilities are originally present internally.
We can also safely cross-validate the Persona Vectors pipeline against human-annotated ground truth from MathDial, which the original paper did not have. Two completely different extraction methodologies found the same direction — one grounded in real student-teacher conversations, one in LLM-generated prompts.
Finding 2: RLHF Optimizes Orthogonally to the Teacher Axis
Experiment
We know the Teacher Axis clearly exists. But then — where the heck does it go? What does instruction tuning actually do to the Teacher Axis geometrically?
To answer this, I needed to compute the IT shift vector — the direction in activation space capturing the effects of instruction tuning. I used a straightforward approach: grabbed some neutral prompts, ran them through both the base and IT models, extracted activations for the same prompts across both, and subtracted the base model activations from the IT model activations to deduce the IT direction. Finding the cosine similarities between the Teacher Axis, Assistant Axis, and IT shift brought about some weird results — the Teacher Axis and IT shift were orthogonal, and the Teacher Axis and Assistant Axis were orthogonal, but for some reason the Assistant Axis and the IT shift were also orthogonal?
But previous literature had thoroughly documented that the Assistant Axis and IT shift should be pointing in roughly the same direction. When I saw that they were orthogonal too, it gave me pause. Which led to...
Side Quest 2.5: Format Learning Contaminates the IT Shift
(irrelevant to the main findings but a fun and important learning)
The problem:
I had initially followed the Persona Vectors methodology to a T, making sure to chat-format my prompts, feeding the base model something like
<start_of_turn>user\nWhat is the capital of France
However, after digging into what could have produced these weird initial results, I realized that base models are never trained on chat template formatting. By feeding chat-formatted prompts to the base model, I wasn't cleanly measuring the difference between the base and IT models; I was partially measuring how confused the base model was by a format it had never seen.
So I computed three different activation means instead: base model on bare questions, base model on chat-template-formatted questions, and IT model on chat-template-formatted questions. I then computed the format shift (what changes purely from applying the chat template to a model not trained on it) by doing base/chat minus base/bare. To extract the persona-only IT shift, I then subtracted this format shift component from the total IT shift.
Result:
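This worked! The format shift has a cosine of +0.380 with the total IT shift, meaning a substantial part of the apparent IT shift was just the chat template format rather than instruction tuning itself: the component of the IT shift along the format direction has about 38% of its length (cos² ≈ 0.14 of its squared norm). Worth noting: the format shift norm (18.96) is actually larger than the total IT shift norm (16.32), so the format effect is at least as large in magnitude as the shift I was trying to measure.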
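The format correction described above can be sketched like this. Two things here are assumptions rather than details from the write-up: the total IT shift is taken to be IT/chat minus base/bare, and "subtracting the format shift component" is read as projecting the format direction out of the total shift (a plain vector subtraction is the other possible reading). The three mean activation vectors are hypothetical toy values.

```python
def sub(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def remove_component(v, direction):
    """Project `direction` out of `v`, leaving the part orthogonal to it."""
    scale = dot(v, direction) / dot(direction, direction)
    return [a - scale * d for a, d in zip(v, direction)]

# Hypothetical mean activations for the three conditions.
base_bare = [0.0, 0.0, 0.0]   # base model, bare questions
base_chat = [2.0, 0.0, 0.0]   # base model, chat-formatted questions
it_chat = [3.0, 4.0, 0.0]     # IT model, chat-formatted questions

format_shift = sub(base_chat, base_bare)   # effect of the template alone
total_shift = sub(it_chat, base_bare)      # assumed definition of the total IT shift
persona_shift = remove_component(total_shift, format_shift)
print(format_shift, total_shift, persona_shift)
```

After the projection, the corrected shift has zero overlap with the format direction by construction, so any remaining cosine with the Teacher or Assistant axes cannot be attributed to the template.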
Rerunning my initial experiments with this corrected methodology produced results much more consistent with prior findings:
| Comparison | Format-corrected cosine |
| --- | --- |
| Teacher(MD) vs IT Shift | +0.029 |
| Assistant(MD) vs IT Shift | +0.041 |
| Teacher(MD) vs Assistant(MD) | −0.004 |
| Format Shift vs IT Shift | +0.380 |
Please learn from my mistakes! If you ever find yourself computing IT shift vectors, you may need to account for format learning. This one small methodological fix was enough to flip the sign of my results entirely (and had me pulling out a few hairs, ngl), so hopefully I can save you some time.
Now back from our detour.
Results
The Teacher Axis is orthogonal to what RLHF/SFT optimize for. This means that while the model initially has a stable internal representation of Socratic teaching, RLHF simply optimizes in a different direction entirely, leaving pedagogical capability geometrically unsupported.
Broader Interpretation
Socratic behavior is not the default of RLHF-trained LLMs. Since the Teacher Axis and IT shift vectors are orthogonal, instruction tuning pulls models in a direction that neither reinforces nor directly opposes Socratic behavior; it simply ignores it. And honestly, this generalizes to a bigger issue: if we want to create systems that actually allow for epistemic flourishing in humans, we probably need to rethink current RLHF methods to account for what I'll call human-brain vegetables: things humans obviously don't prefer in the moment but genuinely need.
Finding 3: The Teacher Axis Decomposes Into Two Geometrically Independent Behavioral Clusters
Experiment
So now I know that the Teacher Axis is geometrically significant and that IT optimizes in a direction orthogonal to it. But what exactly is the Teacher Axis even made of? Does it have any sort of internal structure?
The beauty of the MathDial labels — again — is that in addition to indicating Socratic vs. direct, they also include sub-move types: answer withholding, scaffolding, productive struggle, confusion diagnosis, understanding verification, and good assistant. I extracted six separate sub-dimension axes using the prior contrast pair method, then computed pairwise cosine similarities between all six axes, along with the previously computed axes, to find the geometric structure.
Results
From these sub-directions, two clear clusters emerged: a withholding cluster (the "resist giving the answer" direction) and a diagnostic cluster (the "understand the student's state" direction). These two clusters were close to orthogonal to each other (cross-cluster cosines of 0.07–0.23), and the good_assistant axis was strongly anti-correlated with the withholding cluster: −0.78 vs scaffolding, −0.83 vs productive struggle, −0.76 vs answer withholding. This anti-correlation is present in the base model itself, before any instruction tuning.
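One way to turn a pairwise cosine matrix into clusters is to group axes whose cosine exceeds a cutoff. The sketch below does that with a simple greedy merge. The 0.5 threshold, the grouping rule, and the 2-dimensional toy vectors are all assumptions for illustration; the write-up does not say how the clusters were identified, and they may simply have been read off the matrix by eye.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cluster_by_cosine(axes, threshold=0.5):
    """Group axis names so that linked axes (cosine > threshold) share a cluster."""
    clusters = []
    for name, vec in axes.items():
        # Existing clusters that contain at least one axis similar to this one.
        linked = [c for c in clusters
                  if any(cosine(vec, axes[member]) > threshold for member in c)]
        merged = {name}.union(*linked)
        clusters = [c for c in clusters if c not in linked] + [merged]
    return clusters

# Hypothetical sub-dimension axes (toy values, not the measured directions).
axes = {
    "answer_withholding": [1.0, 0.1],
    "scaffolding": [0.9, 0.2],
    "confusion_diagnosis": [0.1, 1.0],
    "understanding_verification": [0.2, 0.9],
}

clusters = cluster_by_cosine(axes)
print(clusters)
```

With real axes, it is worth checking that the grouping is stable across a range of thresholds before reading much into it.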
Broader Interpretation
This shows that the model doesn't represent "Socratic teacher" as one big monolith, but rather as two independent geometric components. Refusing to give the answer and understanding why the student is wrong are completely separate directions within the internal model, meaning the model treats these as independent behaviors.
This also deepens the narrative we started building in Finding 2. Since RLHF already optimizes orthogonally to the Teacher Axis, and within the Teacher Axis the withholding cluster is the part most anti-correlated with the good_assistant axis (−0.76 to −0.83), RLHF plausibly fights the "resist giving the answer" behavior the hardest.
(A Skeptical) Finding 4: Steering
So we have now established through correlation that the Teacher Axis exists and is orthogonal to the IT shift. Let's now test for causality! If the Teacher Axis is indeed the Teacher Axis, then we should be able to recover Socratic behavior by steering towards it.
To do so, I used the standard activation steering approach — adding the Teacher Axis vector to the residual stream at a specified layer during generation. I conducted tests on MathDial prompts across four different conditions: baseline (no steering), steered (+α), negative steered (−α), and a random direction control (random unit vector at same α). I then scored these responses using IndirectScore — an LLM-as-judge setup grounded in MathDial's Socratic taxonomy, with each response classified as either PROBING (1.0), GENERIC (0.5), or TELLING (0.0). I ran an alpha sweep at n=10 to find the best alpha (α=10), followed by a proper eval at n=200 on a held-out test split.
In addition, to validate whether steering was actually activating Teacher Axis-related features and not just randomly messing with the residual stream, I ran a faithfulness check using GemmaScope sparse autoencoders (SAEs), and measured how much the steering intervention activated the target SAE features versus non-target SAE features.
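The steering conditions and the IndirectScore aggregation described above can be sketched on a single toy vector. This is only an illustration of the arithmetic: in a real run the addition would happen inside a forward hook on the residual stream at the chosen layer, and whether the axis was unit-normalised before scaling by α is an assumption here. All names and values other than the three score labels are hypothetical.

```python
import math
import random

# Label-to-score mapping taken from the IndirectScore description.
SCORES = {"PROBING": 1.0, "GENERIC": 0.5, "TELLING": 0.0}

def unit(v):
    """Rescale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def steer(resid, direction, alpha):
    """Add alpha times the unit-normalised direction to one residual vector."""
    return [r + alpha * d for r, d in zip(resid, unit(direction))]

def random_unit(dim, rng):
    """Random direction control: a Gaussian sample rescaled to unit length."""
    return unit([rng.gauss(0.0, 1.0) for _ in range(dim)])

def indirect_score(labels):
    """Mean score over a list of judge labels."""
    return sum(SCORES[label] for label in labels) / len(labels)

resid = [1.0, 2.0, 3.0]          # toy residual stream vector
teacher_axis = [0.0, 3.0, 4.0]   # toy axis, norm 5
alpha = 10.0

steered = steer(resid, teacher_axis, alpha)
negative = steer(resid, teacher_axis, -alpha)
control = steer(resid, random_unit(3, random.Random(0)), alpha)
print(steered, negative, control)
print(indirect_score(["PROBING", "TELLING", "GENERIC", "TELLING"]))
```

Normalising both the axis and the random control keeps the size of the intervention identical across conditions, so any difference in score comes from the direction rather than the magnitude.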
Results
My pilot testing results on the base model were positive — although only slightly. We improved from 15% to 20% Socratic with steering, as compared to a 10% random control.
The SAE feature experiments showed a pretty promising story initially too. The target feature activations rose by 1.24, while the non-target features only changed by 0.02, indicating that steering activated target features roughly 63x more strongly than non-target features.
I also ran the steering across all 26 layers individually at n=20. Layers 2 and 19-25 all achieved equivalent IndirectScore improvement, while layers 7, 10, and 11 showed zero steering effectiveness despite having non-zero norm. The fact that two separate clusters of layers produced the same improvement suggests the Socratic representation is accessible for intervention at multiple points in the forward pass, with a dead zone in the middle layers.
However, the full-scale evaluation showed no improvement. Baseline and steered both scored 6%, and the random control actually came in higher at 7.5%. Slightly disappointing results, but rejection is redirection! I have not yet run the SAE faithfulness check at n=200.
Interpretation
Most importantly, the pilot results did not replicate at scale. To be honest, the pilot results themselves might have just been noise — one observation flipping from TELLING to PROBING is enough to move the needle at that sample size.
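To make the noise concern concrete, here is the arithmetic for how far a single changed label moves the mean IndirectScore at different sample sizes. The function name is hypothetical; the score values (TELLING = 0.0, GENERIC = 0.5, PROBING = 1.0) come from the scoring setup described earlier.

```python
def score_change_per_flip(n, old=0.0, new=1.0):
    """Change in mean score when one of n responses moves from `old` to `new`."""
    return (new - old) / n

# One TELLING -> PROBING flip at each sample size used in the experiments.
for n in (10, 20, 200):
    print(n, score_change_per_flip(n))

# One TELLING -> GENERIC flip at n=10.
print(score_change_per_flip(10, old=0.0, new=0.5))
```

A 5-point gap like the pilot's 15% to 20% is therefore the size of a single TELLING to PROBING flip at n=20, or a single TELLING to GENERIC flip at n=10, while at n=200 the same flip moves the score by only half a point.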
The SAE results make me a bit more confused to be honest. I'm interpreting this as the intervention possibly hitting the right features, but not actually changing them enough to produce the correct behavior at scale. Again, I'm not so sure how to interpret these results considering n=200 was a flop.
My current suspicion as to why steering didn't work is because I steered the base model and not the IT model. The IT model is the one that is actually trained to be more assistant-like, so the base model doesn't possess the "give the answer" default in the same way. I suspect that things may look different if I tried steering the IT model instead, but I'm honestly not fully sure. If anyone has any other ideas or hypotheses regarding this, I am all ears!
What I will say is that when looking at the n=20 results, steering worked equally well at layers 2 and 19-25, with a dead zone at layers 7, 10, and 11 showing zero effectiveness — suggesting that while the Socratic signal is probably established early within the LLM, layer 25 is just the downstream accumulation of that signal as the model builds toward its final response.
Finding 5: Capitulation Happening in the Activations
Experiment
Now that we have discovered the Teacher Axis, we need to put this axis to the test. Does the Teacher Axis actually track anti-Socratic behavior failures? When a student applies pressure mid-dialogue and the model is about to cave, does the Teacher Axis projection take a nose dive?
To test this, I collected MathDial conversations from the training split where the student applies either explicit or implicit pressure to get a direct answer — "I don't understand," "can you just explain," "I'm confused," to name a few. I then fed these dialogues into the IT model turn-by-turn, cumulatively building context with each teacher turn. I extracted the residual stream activation at the final token position and projected these onto the Teacher Axis after every teacher turn. I decided to probe at four different layers: 2, 10, 19, and 25, since these all showed different effects from steering in my previous experiment.
One important methodological note: the Teacher Axis was extracted from the base model, but I probed the IT model here. This was intentional — the IT model is the one actually deployed in tutoring contexts, and the one that actually capitulates. If the projection is still meaningful cross-model, that itself is evidence the axis captures something architecturally stable that survives instruction tuning.
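The per-turn projection can be sketched as a scalar projection of each turn's activation onto the axis, followed by a before-versus-after comparison around the pressure turn. The normalisation, the way Δ is defined, and all values below are assumptions for illustration; the write-up does not spell out the exact formula.

```python
import math
from statistics import fmean

def project(activation, axis):
    """Scalar projection of an activation onto the unit-normalised axis."""
    norm = math.sqrt(sum(a * a for a in axis))
    return sum(x * a for x, a in zip(activation, axis)) / norm

def projection_trace(turn_activations, axis):
    """One projection value per teacher turn, in dialogue order."""
    return [project(act, axis) for act in turn_activations]

# Toy axis and final-token activations after each teacher turn (hypothetical).
teacher_axis = [3.0, 4.0]
turn_acts = [[3.0, 4.0], [3.0, 0.0], [0.0, -5.0]]

trace = projection_trace(turn_acts, teacher_axis)

# Hypothetical index of the first teacher turn after the student applies pressure.
pressure_turn = 1
delta = fmean(trace[pressure_turn:]) - fmean(trace[:pressure_turn])
print(trace, delta)
```

A negative delta means the projection fell after the pressure turn. Running the same trace at several layers gives one delta per dialogue per layer, which is the kind of quantity a per-layer significance test would then summarise.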
Results
| Layer | Direction | Δ | p | Cohen's d |
| --- | --- | --- | --- | --- |
| 2 | flat | −0.0003 | 0.56 | 0.10 |
| 10 | rises | +0.0019 | 0.023 | −0.38 |
| 19 | drops | −0.0085 | 0.003 | 0.50 |
| 25 | drops | −0.0199 | 0.0002 | 0.65 |
Layer 25 is our smoking gun! Under pressure, we see a significant drop in the Teacher Axis projection, with layer 19 also seeing a meaningful drop. Layer 10 actually rises slightly (p = 0.023). Meanwhile, layer 2 was cool as a cucumber with even the most persistent of students, remaining essentially flat under pressure.
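For reference, one common way to get an effect size like the ones reported is Cohen's d on paired differences. The sketch below assumes a paired design (the same dialogue measured before and after pressure) and a "before minus after" sign convention, chosen because the layers that drop have positive d in the results while the layer that rises has negative d. Neither choice is stated in the write-up, and the numbers are toy values.

```python
from statistics import fmean, stdev

def paired_cohens_d(before, after):
    """Cohen's d for paired samples: mean of (before - after) over its sample SD."""
    diffs = [b - a for b, a in zip(before, after)]
    return fmean(diffs) / stdev(diffs)

# Hypothetical per-dialogue projections before and after student pressure.
before = [0.5, 0.6, 0.7, 0.8]
after = [0.4, 0.3, 0.5, 0.4]

d = paired_cohens_d(before, after)
print(round(d, 3))
```

Under this convention, a positive d means the projection dropped after pressure. Swapping the arguments flips the sign without changing the magnitude.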
Interpretation
In my opinion, this is the pièce de résistance of all my experiments: you can literally watch the LLM cave away from the Teacher Axis in real time. Layer 2 is probably where the Socratic capabilities live in the first place, but layer 25 is where the final nail goes in the coffin: the model decides to be an assistant rather than a teacher.
And honestly, the fact that we can even track this at all makes me way more confident that the Teacher Axis is actually real and not just some artifact I cooked up. This result also inspired me to want to redo my previous steering experiments, because I may have missed something that caused those results to be weird in the first place.
Limitations
I only used one model. I used Gemma-2-2B because it's a smaller model and allowed me to iterate quickly, but this really is just one model. It's hard to call this a universal finding with just one example, so before I make any sweeping claims, I had better try this out with a few more models and sizes.
My pressure filter was slightly shoddy. For my capitulation experiment, I didn't use the most rigorous methodology to find student pressure turns — I basically checked for phrases like "I don't understand," "I'm confused," etc. Although spot-checking the grabbed dialogues looked clean, I think I could come up with a more rigorous way of finding the best examples.
I can't separate SFT and RLHF. The IT shift captures both SFT and RLHF combined, but these two techniques have different objectives and effects. I can't really isolate whether RLHF is truly responsible for creating more assistant-like than teacher-like answers.
And most obviously — the null steering result. I've already talked about my disappointing steering results, which means I can't claim causal findings just yet. Hopefully steering the IT model and not the base model will work! TBD.
Open Questions
IT model steering. I really want to retry my steering experiment, since this is the result that would actually turn my correlational finding into a causal one. I would also be interested in combining my sub-direction findings with these steering results — is there any way we can steer specifically for these sub-components? Mainly answer withholding, since this is the direction most heavily affected post-RLHF.
What is the RLHF default? Since we know that RLHF isn't actually suppressing the Teacher Axis but rather pushing in a "different direction" — what even is this different direction? Is it a single direction or just a subspace? Can we map it onto known directions like refusal, helpfulness, sentiment, and the overall Teacher Axis? I've shown what the default isn't — I want to show what it actually is.
More models. I should probably cross-reference my work across different models. I'm most interested to see if the sub-direction component structure also appears in other model families and sizes.
Generalizing beyond math. We know that a Teacher Axis exists specifically for grade school K-8 math, but what about other disciplines? Does the model's internal representation of teaching generalize, or is it just pattern-matching to MathDial-style formatting? Although we checked this partially using the Persona Vectors cross-reference, I would be interested in scraping teaching conversations across multiple disciplines and seeing if these also produce the same Teacher Axis projections.
Separating SFT and RLHF. As I mentioned earlier, I have been incorrectly conflating changes I see in the IT shift with RLHF, when in reality we can't definitively tell whether the behavior we're seeing is being caused by SFT or RLHF. Redoing my experiments with a model that has separate SFT and RLHF checkpoints would definitely help answer these questions.
Why This Matters Beyond Tutoring
Hopefully I have established that better teaching capabilities within LLMs are incredibly necessary for ensuring the long-term independent thinking capabilities of our youth, and therefore the future of oversight. But I think my work also points to issues within RLHF geometry that are bigger than pedagogy. RLHF helps tune a model towards immediate human preference, but sometimes humans just don't prefer the thing that's good for them. (If it were up to me, I would be eating carrot cake for breakfast, lunch, and dinner. Doesn't mean that's what's best for me.) This means that RLHF is probably leaving a whole class of behaviors systematically unsupported: behaviors that are good for the human in the long term but uncomfortable in the short term.
I also think we should begin thinking about RLHF in a new way. Rather than asking whether RLHF suppresses this behavior or encourages that behavior, we should be asking ourselves: what exactly did RLHF select for instead, and why wasn't it what I wanted? This reframe might be helpful for anyone thinking about capability elicitation, fine-tuning, or behavioral interventions, because there is probably a whole suite of behaviors that fail under RLHF simply because RLHF does not default-select for them.
I also want to come back to the scalable oversight connection I brought up in my motivation: independent reasoning isn't just good for students, it's quite literally a prerequisite for meaningful scalable oversight in the future. If AI systems make humans more dependent on AI, then we have created a safety problem, since our future researchers would have had their capacity for independent reasoning already optimized away.
If you got this far, thank you! Any questions, comments, or feedback are greatly appreciated, because I consider myself a baby researcher with a lot more to learn!
References
Chen, R., Arditi, A., Sleight, H., Evans, O., & Lindsey, J. (2025). Persona Vectors: Monitoring and Controlling Character Traits in Language Models. arXiv:2507.21509
Lu, T., et al. (2026). The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models. arXiv:2601.10387
Macina, J., Daheim, N., Chowdhury, S., Sinha, T., Kapur, M., Gurevych, I., & Sachan, M. (2023). MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems. EMNLP 2023 Findings. arXiv:2305.14536