Papers like Turner et al 2025 and Betley et al 2026 have underscored the consequences of training data quality for model behavior. The Probing and Representation Engineering literatures have demonstrated the techniques we can use to detect concepts represented in model activations, and manipulate their expression.
I was keen to apply ideas from this research to see how post-training has shaped how open models represent abstract social norms. Can we identify legal principles reflected in activation geometries? If so, could this structure be used to augment model oversight?
United States Supreme Court opinions seemed like good examples to use for investigation. They are rich descriptions of discrete foundational principles, whose relevance varies widely by case. And their text is publicly available.
To investigate, I planned to distill the core principles from Court opinions, then probe the activations of both base and instruction-tuned models as they reviewed the cases, looking for any emergent representations.
So I created a new, accessible dataset (on GitHub here) using Claude Opus 4.5 to annotate a set of landmark US Supreme Court opinions (e.g., Roe v. Wade, Brown v. Board of Education) with measures of how much the final opinion was driven by five principles: Free Expression, Equal Protection, Due Process, Federalism, and Privacy/Liberty.
Then I had open-source models review facts for these cases and issue their own opinion, justified using our five principles. The models spanned several families and sizes from 3B to 27B, and were wired up with TransformerLens to cache their activations.
With the activations, I could then explore how they relate to the cases' 'ground-truth' principle scores, and how they influence the output opinions.
Findings / TL;DR
Abstract Constitutional Law concepts are clearly represented in post-trained model activations, but not base models (apart from Qwen)
In post-trained IT models, we see geometries that explain variance in our five legal dimensions for the evaluated cases. We don’t see them in base models.
The gain from base to post-trained model varies substantially across families - largest in Llama 3.2 3B and Gemma 2 27B, while Qwen 2.5 32B is actually negative, a clear exception.
Constitutional Law representations are relatively ‘deep’, not just keyword matches
Activation geometries linked to legal concepts are more evident in later layers, suggesting that they represent complex combinations of text, not n-gram-type pattern matching.
Decomposition with Gemma 2 27B underscores the importance of later layers in representing concepts - layers 20+ show the highest activation correlations with case principles. Attention heads account for much of the directional importance. Most of the work representing principles is done through identifying complex relationships across text positions, attending broadly to concepts, not just principle-linked keywords.
Controlling output with concept-linked activations is tricky
Patching correlated layers in base models restored behavior equivalent to the post-trained model only in the largest model tested (Gemma 2 27B), and nowhere else, highlighting that mechanical manipulation works as targeted only under specific conditions. Even where correlations are identified, simple interventions are unlikely to yield precisely targeted behaviors.
Similarly, steering activations at correlated layers pushed model output in targeted directions in some cases, while at the same time destabilizing models in ways that led to counterintuitive behaviors in other cases.
Probing enables more robust evaluation
The results helped me build intuition about how models represent abstract concepts. They also highlighted the value of internal probing to augment behavioral checks.
When steering model activations in substantial ways, I could still see output that superficially looks very similar to that of a non-steered model. But steered cases also generate unpredictable behavior that may not be perceived through behavioral testing alone. Clues from models’ internal structure pick up on instability that behavioral sampling under narrow contexts may miss.
The results motivate a more ambitious extension - could we establish relationships between open-model activations and downstream behavior that would be useful in predicting internal structure in closed models?
Dataset and Methodology
Besides the papers and LW posts noted above, this investigation borrows heavily from ideas shared in Zou et al 2023, the OthelloGPT papers and Turner et al 2024.
The foundational dataset was extracted from the CourtListener API - 49 landmark cases covering all 5 major principles. Cases were selected with help of Claude Opus based on principle representation and significance - original case data here, annotation prompt and methodology here and annotation output here for replication and exploration.
The annotated principle weights were further validated with Claude Sonnet reviews and manual spot checks.
Example cases with principle scores (0.0-1.0) extracted by Claude Opus based on the majority opinion text.
Detailed Annotation Example - Obergefell v. Hodges (2015)
Probing Across Model Families
To assess how our five legal principles are encoded, we prompted each model pair with formatted text that included the case facts, the relevant legal question, a note on the five legal principles that may apply (Free Expression, Equal Protection, Due Process, Federalism, and Privacy/Liberty), and a question asking how the court should rule and which principles should guide the decision.
During each model's forward pass over the case tokens, the TransformerLens run_with_cache() function cached model activations, from which I extracted the values at the last prompt token.
With the saved activations, I trained ridge regressions from each layer's activations onto the five case principle scores. R² was measured via 5-fold cross-validation, with the regularization strength itself selected by cross-validation.
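The per-layer probe can be sketched as follows, with synthetic activations standing in for the cached last-token values (the dimensions, the closed-form ridge solver, and the λ grid are illustrative assumptions, not the exact pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
n_cases, d_model, n_principles = 49, 32, 5

# Synthetic stand-ins: last-token activations at one layer (X) and
# annotated principle scores per case (Y). In the real pipeline X
# comes from run_with_cache and Y from the Claude annotations.
X = rng.normal(size=(n_cases, d_model))
true_w = rng.normal(size=(d_model, n_principles))
Y = 0.1 * X @ true_w + rng.normal(scale=0.05, size=(n_cases, n_principles))

def ridge_fit(X, Y, lam):
    """Closed-form ridge: W = (X'X + lam*I)^-1 X'Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def cv_r2(X, Y, lam, k=5):
    """Mean out-of-fold R^2 over k folds, averaged across principles."""
    idx = np.arange(len(X))
    scores = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        W = ridge_fit(X[train], Y[train], lam)
        ss_res = ((Y[fold] - X[fold] @ W) ** 2).sum(axis=0)
        ss_tot = ((Y[fold] - Y[fold].mean(axis=0)) ** 2).sum(axis=0)
        scores.append(1 - ss_res / ss_tot)
    return float(np.mean(scores))

# Pick the regularization strength by CV, then report the probe's R^2.
lams = [1.0, 10.0, 100.0, 1000.0]
best_lam = max(lams, key=lambda lam: cv_r2(X, Y, lam))
print(best_lam, round(cv_r2(X, Y, best_lam), 3))
```

With far more features than cases (as in the real 27B-scale runs), the λ penalty is what keeps the fit from memorizing the 49 examples, which is why the permutation check below matters.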
Instruction-tuned models across families show structure explaining legal-principle variance, with later layers showing higher correlation. Most base models lack similar structure, suggesting that post-training helps encode these principles where they are absent after pre-training.
Model size doesn't clearly influence emergent structure, as both smaller models and larger IT models showed detectable correlation with principle scores. The exception was Qwen 2.5 32B, whose IT model showed less correlation than its base counterpart, with insufficient evidence to reject the view that the correlation is actually driven by noise.
Llama 3.2 3B by-layer R² chart and IT - Base model difference below.
All-model family results by layer
To validate these findings (point estimates are noisy when features far outnumber case examples), I also ran permutation tests, comparing each model's R² against models fit on randomly shuffled case principle vectors.
Permutation results are consistent with point-estimate results by model family. Originally-fit IT models outperform those with randomly shuffled principles over 99% of the time for all model families apart from Qwen 2.5 32B, whose IT model couldn’t distinguish its principle correlation from random noise.
Base models across all families, on the other hand, fail to consistently beat those fit on randomly shuffled cases.
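A minimal sketch of the permutation test on synthetic data (the single-principle setup, the fixed λ, and the 200 shuffles are simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n_cases, d_model = 49, 32

# Synthetic activations X and one principle score y that genuinely
# depends on X (the situation observed for most IT models).
X = rng.normal(size=(n_cases, d_model))
y = X @ rng.normal(size=d_model) + rng.normal(scale=0.5, size=n_cases)

def cv_r2(X, y, lam=10.0, k=5):
    """Out-of-fold R^2 of a ridge probe for a single principle."""
    idx = np.arange(len(X))
    scores = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        w = np.linalg.solve(X[train].T @ X[train] + lam * np.eye(d_model),
                            X[train].T @ y[train])
        pred = X[fold] @ w
        scores.append(1 - ((y[fold] - pred) ** 2).sum()
                        / ((y[fold] - y[fold].mean()) ** 2).sum())
    return float(np.mean(scores))

# Permutation test: refit on shuffled principle scores and ask how
# often a shuffled fit matches or beats the original one.
observed = cv_r2(X, y)
null_scores = [cv_r2(X, rng.permutation(y)) for _ in range(200)]
p_value = float(np.mean([s >= observed for s in null_scores]))
print(round(observed, 3), p_value)
```

A base model that encodes no principle structure looks like the shuffled fits here: its "observed" R² sits inside the null distribution rather than above it.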
Decomposition
To better understand concept-activation relationships, I looked more closely at Gemma 2 27B.
With the fit probe’s weights as principle directions, I decomposed each layer’s additive contribution to the residual stream by projecting it onto the weight directions, and measured how these projections correlated with annotated case principle scores.
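The projection step can be sketched like this, with synthetic per-layer contributions in which one planted "late" layer writes the principle along the probe direction (all shapes and the planted layer index are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n_cases, n_layers, d_model = 49, 8, 64

# Unit "principle direction" taken from the fit probe's weights.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

# Synthetic per-layer additive contributions to the residual stream,
# with one late layer (6) planted to write the principle direction
# in proportion to each case's annotated score.
scores = rng.uniform(size=n_cases)
contribs = rng.normal(scale=0.1, size=(n_cases, n_layers, d_model))
contribs[:, 6] += np.outer(scores, direction)

# Project each layer's contribution onto the principle direction and
# correlate the projections with the case scores.
proj = contribs @ direction  # shape (n_cases, n_layers)
corr = [float(np.corrcoef(proj[:, l], scores)[0, 1])
        for l in range(n_layers)]
print([round(c, 2) for c in corr])
```

Layers that merely add noise project onto the direction with near-zero correlation, while a layer carrying the concept stands out, which is the signature reported below for Gemma's late layers.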
Observations from decomposition:
Later layers show the highest correlation with principle directions relative to variance across cases
Ablating early layer contributions to the residual stream had almost no impact on layer correlation with principles
The most influential components are attention heads
Attention head discrimination appears to be driven by contributions from many heads, rather than ‘specialists’
Heads are attending to legal concepts embedded broadly in text, not specific principle-linked keywords
The components with the strongest projection-to-principle correlations included attention heads at layer 22, with a mean absolute correlation across principles of 0.882 and high projection variance across cases, showing that cases differ substantially along this dimension.
Attention head contributions (8 of the top 10 components), highlighted in the table below, suggest that heads identify principles by drawing widely on tokens from across the input, rather than transforming specific tokens.
Further breakdowns show lower principle-correlation levels within layer components, indicating that principle determination is being done jointly across multiple heads, rather than solely by a few specialists. Top principle-correlated heads below.
The attention heads’ synthesis of varied input tokens supports a view that the models are developing deeper representations of legal concepts, and that these representations are provided through multiple blended ‘takes’ from attention heads on how concepts fit together across text.
A further look at the tokens drawing the most attention from the top 10 ‘specialist heads’ also suggests that the representations are drawing on other semantic signals in prior text.
These heads are largely not attending to principle-linked keywords like "speech", "press", "expression", "first", "amendment", "speak", "censor", "publish", "petition", "assembl" (in the case of free expression), but a bit more to general legal terms, and most of all to tokens that fall outside any of these specific categories.
Abstract concepts seem to be legibly represented in IT models. How does changing the associated activations alter the way those concepts are expressed?
Causal Interventions
To see how direct updates to layer activations shape downstream behavior, I used activation patching to try to recover effective case interpretation in base models, replacing selected base-model layer activations with those from highly principle-correlated layers of the post-trained model.
TransformerLens was used again to hook into each open-source model and make these targeted replacements (run_with_hooks()), then to generate a legal opinion from the patched model using the original prompt, asking for a justification in terms of our five legal principles.
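The patching logic can be illustrated with a toy residual stack standing in for the real models (the forward function, weight scales, and patched layer are hypothetical; the real version hooks the residual stream at chosen layers via run_with_hooks):

```python
import numpy as np

rng = np.random.default_rng(3)
n_layers, d_model = 6, 16

# Toy residual-stream "models": each layer adds a linear update.
base_W = [rng.normal(scale=0.1, size=(d_model, d_model))
          for _ in range(n_layers)]
it_W = [rng.normal(scale=0.1, size=(d_model, d_model))
        for _ in range(n_layers)]

def forward(weights, x, patch_from=None, patch_layers=()):
    """Run the toy stack, caching the residual after each layer.
    At layers in patch_layers, overwrite the residual with the donor
    model's cached value (the patching intervention)."""
    cache, resid = [], x.copy()
    for layer, W in enumerate(weights):
        resid = resid + resid @ W
        if layer in patch_layers:
            resid = patch_from[layer].copy()
        cache.append(resid.copy())
    return resid, cache

x = rng.normal(size=d_model)
it_out, it_cache = forward(it_W, x)      # donor: post-trained model
base_out, _ = forward(base_W, x)         # recipient: base model
# Patch a correlated late layer of the base run with IT values.
patched_out, _ = forward(base_W, x, patch_from=it_cache, patch_layers={4})
```

Patching the final layer trivially reproduces the donor's output; patching earlier layers lets the recipient's remaining layers transform the donated state, which is exactly where base and IT models can diverge.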
Only in Gemma 2 27B did patching produce targeted output in the base model.
Outside of Gemma, no patched base model identified the principles found in the IT models' evaluations, with most generating no targeted principles in any of our 12 test cases. Patched Qwen 2.5 7B did identify targeted principles in most cases, actually exceeding its IT counterpart (10/12 vs 7/12) - though with an asterisk, since the patched base model overshoots rather than matches the IT model. Again, Qwen proves the exception, with its base model showing more principle-linked structure.
Outside Gemma and Qwen, responses from patched models were largely incoherent and consistent with unpatched base responses, including end_of_text characters and SEO-spam-like text.
While patching was unable to consistently recover coherent expressions of targeted principles, could steering activations generate targeted responses and show model activations’ causal impact?
The expectation was that by shifting layer activations by a scale factor (alpha) along directions correlated with a given principle, we might see model output introduce the principle relative to a baseline output. Similarly, by scaling against principle-correlated directions, we might suppress an originally referenced principle.
After a few rounds of alpha steering with little impact (in Gemma 2 27B), up to 500x factors in absolute terms, I realized the perturbation should be scaled relative to the norm of the layer's residual stream, and tested factors of 0.1x, 0.5x, and 1x the norm (corresponding roughly to 4x, 20x, and 40x the largest previous perturbation in absolute terms). Steering was applied at all token positions.
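The norm-relative steering update can be sketched as a pure function (shapes and alpha values are illustrative; in practice this runs inside a TransformerLens hook at the chosen layer):

```python
import numpy as np

def steer(resid, direction, alpha):
    """Shift the residual stream along a unit principle direction by
    alpha times each position's residual norm, at every position.
    resid: (n_pos, d_model); direction: unit vector (d_model,)."""
    norms = np.linalg.norm(resid, axis=-1, keepdims=True)  # (n_pos, 1)
    return resid + alpha * norms * direction

rng = np.random.default_rng(4)
resid = rng.normal(size=(10, 64))
direction = rng.normal(size=64)
direction /= np.linalg.norm(direction)

steered = steer(resid, direction, alpha=0.1)      # 'promotion'
suppressed = steer(resid, direction, alpha=-0.1)  # 'suppression'
# Each position is perturbed by exactly 10% of its own norm.
delta = np.linalg.norm(steered - resid, axis=-1)
print(np.allclose(delta, 0.1 * np.linalg.norm(resid, axis=-1)))
```

Scaling by the local residual norm is what makes the perturbation comparable across layers and positions; a fixed absolute alpha that is negligible at one layer can be catastrophic at another.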
With these much larger perturbations, case opinion output changes in substantial ways. 0.1x served as the 'sweet spot' for activation scaling, with targeted principles newly appearing in model output or gaining rank, while higher levels of scaling (0.5x+) generated multilingual gibberish or even uniformly repeated characters.
Though referenced principles in cases did meaningfully change with steering, they changed in somewhat inconsistent and unexpected ways.
At the 0.1x-norm scaling factor in the 'promotion' condition, the targeted principle was newly referenced in 11 of 25 cases where it was absent at baseline. In 4 of 25 cases the principle actually dropped in rank relative to baseline.
In the 'suppression' condition, with a -0.1x-norm factor, the target principle disappeared in only 5 cases where it was otherwise present, while in 7 cases it actually became more prominent. The full breakdown of steering outcomes is provided below.
Positive steering (alpha=+0.1 vs baseline)
Did the targeted principle become more prominent with an activation addition?
Negative steering (alpha=-0.1 vs baseline)
Did the targeted principle become less prominent with an activation subtraction?
Examples below illustrate the impact of steering for standout cases.
Roe v. Wade (1973) — Steered toward Free Expression
Outcome: Targeted principle appeared. It was absent in the baseline, but rank 5 when positively steered, though mentioned as ‘not directly relevant in this case’. The targeted principle also appears in the alpha = -0.1 case with steering away from the Free Expression direction, highlighting how steering impacts outcomes in unpredictable ways.
Trump v. Hawaii (2018) — Privacy/Liberty Suppressed
Outcome: Targeted principle suppressed at negative alpha - rank 5 at baseline, absent at alpha=-0.1. Note the principle was more emphatically endorsed at rank 5 with alpha=+0.1.
Outcomes
Findings support a few claims:
In IT models we can detect model activation representations of abstract concepts in legal texts
Models are identifying semantic value in legal texts at a relatively ‘deep’ level
In relatively small models (up to 27B), these representations emerge in a detectable way after post-training, but usually not after base pre-training
Activation geometries creating these representations shape model output, sometimes in unpredictable ways
The investigation helps illuminate relationships between abstract concept representations and open model behavior. Building on the findings to augment assessments of closed-source models based on their downstream behavior would seem like a valuable extension.
Can we more robustly audit closed models for the presence of principles represented? Can we avoid superficial false-positives of alignment based on narrow sampled behavior, with tests that show more general value representation?
I hope to explore similar questions in future posts.