Research update on preliminary results from a solo research project.
TL;DR: Activation Oracles (AOs) are a promising general-purpose interpretability tool. However, current implementations treat layers independently, limiting their ability to process cross-layer information. In this research update, I provide preliminary evidence that giving AOs simultaneous access to activations across layers improves their generalizability. I find that Multi-Layer Activation Oracles (MLAOs) fine-tuned on activations across three layers outperform their single-layer counterparts by up to 7% on out-of-distribution classification tasks. This advantage grows to up to 11% with six processed layers, suggesting that cross-layer information is a valuable resource for improving AO reliability. Code available on GitHub.
Introduction
Interpreting the internal activations of large language models (LLMs) remains a significant challenge. While mechanistic interpretability methods offer insights by decomposing and connecting activations into meaningful units such as circuits, bridging the gap between these individual units and a complete mechanistic picture of model behavior remains an open problem. Recently, promising non-mechanistic approaches such as LatentQA and Activation Oracles (AOs) have emerged. These methods involve training a secondary LLM to interpret the activations of a target model, effectively allowing us to "chat" with the AO about the target model’s internal state.
Current AOs already exceed white-box and black-box baselines on diverse out-of-distribution (OOD) evaluations, suggesting potential as "general-purpose activation explainers." However, deploying AOs as trustworthy safety monitors requires improvements in their reliability. While previous work has focused on scaling training data quantity and diversity to improve performance, I argue that a lack of cross-layer information creates another bottleneck. Representations in LLMs evolve from layer to layer and can be distributed across several layers at once. Restricting an interpreter to a single activation layer therefore likely discards context required for robust generalization.
To address this, I introduce Multi-Layer Activation Oracles (MLAOs)—an architectural generalization of AOs that simultaneously processes sets of target model activations across layers. I hypothesize this improves performance via two mechanisms:
1. The oracle attends to the specific layer most relevant for the current query.
2. The oracle aggregates representations that are spread across layers.
To test their performance, I fine-tuned and evaluated a set of MLAOs based on Qwen3-4B and Qwen3-8B on the suite of classification tasks used by Karvonen et al. (2025). I find that MLAOs consistently improve upon AO baselines on out-of-distribution (OOD) tasks, while showing similar performance on tasks seen during training. Averaged across OOD tasks, six-layer MLAOs based on Qwen3-4B reach a classification accuracy of up to 76.4%, which surpasses the respective single-layer variant by over 11%. The performance gain is most pronounced in OOD tasks such as language identification (see Figure 1). Notably, I observe that OOD performance increases with the number of processed layers and the number of resampled training examples, suggesting a promising path to increasing MLAO accuracy further.
Figure 1: Cross-layer context improves generalization. The single-layer Activation Oracle fails to identify the language of the prompt in the given example. The three-layer MLAO correctly identifies it when processing activations from multiple model depths simultaneously.[1]
Methods
Architecture
During training and inference, standard AOs process a sequence of target activations $\{v^l_i\}_{i=1}^{K}$ for a selected set of $K$ token positions, extracted from a single[2] sampled layer $l$ of a frozen target model.
In contrast, I train MLAOs on a set of target activations extracted simultaneously across $N$ layers $l \in \{L_1, \dots, L_N\}$. To process this multi-layer information, I concatenate the activations along the sequence dimension into a single flattened sequence $\tilde{v}$. This sequence has length $T = N \times K$ and is organized block-wise by layer:

$$\tilde{v} = \Big[\,\underbrace{v^{L_1}_1, \dots, v^{L_1}_K}_{\text{Layer } L_1},\; \underbrace{v^{L_2}_1, \dots, v^{L_2}_K}_{\text{Layer } L_2},\; \dots,\; \underbrace{v^{L_N}_1, \dots, v^{L_N}_K}_{\text{Layer } L_N}\,\Big]$$

Closely following the architecture of standard AOs, I inject this flattened sequence into the oracle model's context. This is achieved by adding the normalized target activations $\tilde{v}_j$ to the oracle model's own residual stream activations $h^1_j$ (at layer 1), specifically at the positions of the $T$ designated <ACT> placeholder tokens:
$$\hat{h}^1_j = h^1_j + \lVert h^1_j \rVert \cdot \frac{\tilde{v}_j}{\lVert \tilde{v}_j \rVert}$$

where $j \in \{1, \dots, T\}$ is the index over the combined sequence of placeholder tokens.
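Concretely, the flattening and injection steps amount to the following minimal sketch (stand-in random tensors replace real model activations; the shapes, names, and shared hidden size are my assumptions):

```python
import torch

# Assumed shapes for illustration: N target layers, K token positions,
# and a hidden size d shared by target and oracle models.
N, K, d = 3, 4, 2048
T = N * K  # length of the flattened activation sequence

# Stand-ins for target-model activations: one (K, d) block per layer.
target_acts = [torch.randn(K, d) for _ in range(N)]

# Block-wise flattening along the sequence dimension:
# [v^{L1}_1..v^{L1}_K, v^{L2}_1..v^{L2}_K, ..., v^{LN}_1..v^{LN}_K]
v_tilde = torch.cat(target_acts, dim=0)  # shape (T, d)

# Oracle residual stream at layer 1 for the T <ACT> placeholder positions
# (in practice obtained via a forward hook on the oracle model).
h1 = torch.randn(T, d)

# Norm-matched injection: h_hat = h + |h| * v / |v|
h1_hat = h1 + h1.norm(dim=-1, keepdim=True) * v_tilde / v_tilde.norm(dim=-1, keepdim=True)

assert h1_hat.shape == (T, d)
```

The norm matching presumably keeps the injected signal on the same scale as the oracle's own residual stream, so it is large enough to be read out without swamping the surrounding computation.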
To match the shape of the flattened activation sequence, the oracle prompt contains a total of $N \times K$ placeholder tokens, organized structurally by layer depth:
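The exact prompt template is not reproduced here; as a rough illustration (the layout below is my assumption, the source specifies only the layer-grouped <ACT> blocks):

```
<ACT> <ACT> ... <ACT>    # K placeholders receiving layer L1 activations
<ACT> <ACT> ... <ACT>    # K placeholders receiving layer L2 activations
...
<ACT> <ACT> ... <ACT>    # K placeholders receiving layer LN activations
```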
These layer-grouped placeholders are immediately followed by the task-specific natural language query (e.g., "What concept is active here?") to complete the prompt.
Figure 2: MLAO Architecture. Unlike standard AOs which process a single layer, MLAOs extract activations from multiple layers (L1,L2,…,LN) and simultaneously inject them into the Oracle's residual stream at specific placeholder tokens (<ACT>). This allows the model to attend to cross-layer representations dynamically.
Training
I trained MLAOs on datasets ranging from 400k to 1M datapoints with a varying mixture of three datasets (context prediction, classification, system prompt QA), largely mirroring the baseline method from Karvonen et al. (2025).[3] To accommodate multi-layer inputs and ensure better comparability with the baseline, I introduce two structural adjustments to the dataset composition:
Simultaneous Extraction: Each training example contains target model activations extracted from all N target layers simultaneously at the K selected token positions. This differs from the baseline, which creates separate training examples for each layer and resamples the token positions for each one.
Repeated Sampling (Ns): For each pair of target prompt and oracle prompt on classification tasks, the positions used to extract activations are resampled Ns times (see the sketch below). This partially restores the quantity and diversity of seen token positions compared to the AO baseline, where token positions are resampled for each layer.
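A minimal sketch of this repeated-sampling scheme (function and parameter names are my own; the actual pipeline follows the dataset code released by Karvonen et al.):

```python
import random

def sample_positions(seq_len: int, k: int, n_s: int, seed: int = 0) -> list[list[int]]:
    """Resample k token positions n_s times for one target/oracle prompt pair."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(seq_len), k)) for _ in range(n_s)]

# One training example per resample; each example carries activations from
# all N layers at the same k positions (simultaneous extraction).
for positions in sample_positions(seq_len=128, k=4, n_s=3):
    print(positions)  # e.g. [7, 31, 64, 110]
```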
Trained Model Overview
To study the scaling of performance, I trained seven MLAO variants[4] and compared them against the original single-layer baselines (AO Qwen3-4B and AO Qwen3-8B) released by Karvonen et al. (2025).
I label my variants using the format MLAO Qwen3-XB-XL-XN, where the suffixes indicate the specific configuration:
XB (Base Model): Uses either Qwen3-4B or Qwen3-8B.
XL (Layers Processed): Processes activations from three layers (3L) or six layers (6L) simultaneously.
XN (Resampling Ns): Indicates whether token positions on classification data were resampled once (1N), three times (3N), or six times (6N).
All MLAO runs were conducted on a single H100 GPU, with training times ranging from 4 to 18 hours depending on the configuration. To ensure a fair comparison with the baselines, I utilized the same training hyperparameters as the original AO models.
Evaluation
I evaluated the trained MLAOs on the same suite of binary classification tasks from the baseline. This suite includes:
In-Distribution (IID) tasks: The seven tasks used during training (e.g., sentiment analysis, verb tense identification).
Out-of-Distribution (OOD) tasks: Seven novel tasks that the models never saw during training (e.g., language identification, headline topic identification).
As in Karvonen et al. (2025), I selected 750 evaluation points per task (250 examples combined with three sampled oracle prompts), and evaluated only on single-token activations from the end of the target sequence.
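As a minimal sketch of how a single task could be scored under this setup (helper names are hypothetical; I assume the oracle's decoded answer begins with "Yes" or "No"):

```python
def score_binary_task(predictions: list[str], labels: list[bool]) -> float:
    """Accuracy over yes/no oracle answers for one task
    (750 points = 250 examples x 3 sampled oracle prompts)."""
    correct = 0
    for pred, label in zip(predictions, labels):
        answered_yes = pred.strip().lower().startswith("yes")
        correct += int(answered_yes == label)
    return correct / len(labels)

# Example with two evaluation points:
print(score_binary_task(["Yes, it is English.", "No."], [True, True]))  # 0.5
```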
Results
MLAOs show improved generalizability on OOD classification tasks
I found that MLAOs consistently improve upon their corresponding AO baselines on OOD tasks, while achieving similar performance on tasks seen during training.
Three Layers: Averaged across all OOD tasks, the three-layer models (MLAO-Qwen3-4B-3L) achieve classification accuracies of 71.6% – 72.2%[5], marking an improvement of 6.6% – 7.2% compared to the single-layer baseline (AO-Qwen3-4B).
Six Layers: This advantage increases to 8.7% – 11.4% as the number of processed layers is increased to six.
Model Size: Similar improvements are seen for the 8B base model, where the three-layer MLAO-Qwen3-8B-3L reaches averaged OOD classification accuracies of 75.6% – 77.5%, an improvement of 5.8% – 7.7% over the baseline.
Notably, these advantages remain significant even for the 1N models (Ns=1). Despite being trained on only ~400k datapoints (compared to the standard AO's ~1M), these models still outperform the baseline.[6]
Figure 3: MLAOs improve Out-of-Distribution (OOD) generalization. Aggregated accuracy across seven held-out classification tasks. The six-layer MLAO (yellow) outperforms the single-layer baseline (red) by over 11%. Lighter stack segments indicate additional gains from increased data sampling (Ns=1,3,6).
Figure 4: MLAOs maintain competitive in-distribution performance. Accuracy on tasks seen during training remains stable compared to the baseline, confirming that the added complexity does not degrade capabilities.
Biggest improvements in Language Identification and Singular/Plural Discrimination
Performance improvements are observed to varying extents across all seven OOD tasks. The most significant gains are found in tasks such as singular/plural ("Is this sentence referring to one individual?") and language identification ("Is this text written in English?"), where classification accuracy jumps from ~60% (AO-Qwen3-4B) to 80–90% (MLAO-Qwen3-4B-6L). Across all in-distribution tasks seen during training, performance remains comparable to the baseline, with slight improvements visible in the Relations and NER tasks.
Figure 5: Task-specific performance (OOD). The generalization improvement is most visible in Language Identification and Singular/Plural discrimination, where MLAOs surpass their baselines by up to ~20%.
Figure 6: Task-specific performance (IID). Performance remains high across all in-distribution training tasks regardless of the number of layers processed.
Discussion
The robust performance gains across OOD tasks indicate that providing cross-layer information helps AOs generalize. Remarkably, this holds true even when the MLAO observes significantly fewer training examples (as seen in the 1N runs), suggesting that cross-layer context is a highly efficient learning signal.
While the precise mechanism for this advantage requires further study, the distribution of task-specific improvements offers a potential hint. The largest boosts occur in "low-level" tasks like Language Identification and Singular/Plural detection. These linguistic features are likely computed in specific (potentially early) layers that a standard AO, restricted to looking at the middle of the model, might miss.
Limitations & Future Work
While these preliminary results are promising, several open questions remain regarding the capabilities and constraints of MLAOs.
Mechanism: I did not formally evaluate how the model utilizes the extra information, i.e., whether it attends mostly to a single layer for a given task or aggregates features across all layers. Along the same lines, it would be interesting to test whether standard AOs trained specifically on early layers can match MLAO performance on low-level tasks like Language Identification.
Scaling: My evaluation stopped at six layers, which represents only a sixth of the model's total depth. Further study is needed to determine the optimal layer count and whether performance gains continue to scale or eventually plateau.
Input Ordering: Current MLAOs flatten the input by layer (processing all token positions for Layer 1 before moving to Layer 2). Future work should explore whether scanning through layers first for a given token position affects performance.
Inference Cost: Processing N layers increases the oracle's prefix length by a factor of N, making inference computationally more expensive than single-layer AOs.
Oracle Robustness: I found that both AO and MLAO responses are sensitive to semantically equivalent variations of the oracle prompt. In particular, I found that oracles often struggle to discriminate between the target activations and the text of the oracle prompt itself, sometimes "hallucinating" concepts from injected tokens.[7]
Conclusion
MLAOs represent a natural extension of the activation interpreter paradigm introduced by Pan et al. (2024) and Karvonen et al. (2025). Much like Crosscoders generalize Sparse Autoencoders to capture features distributed across layers, multi-layer processing extends the potential of AOs to utilize the model’s computational depth. Although this adds inference cost and further research on their robustness is required, my findings on improved generalization across classification tasks suggest that MLAOs could be a valuable addition to the interpretability toolbox.
Appendix
Appendix A: Training Dataset Details
This work relies entirely on the training datasets curated and released by Karvonen et al. (2025). In the following, I give an overview of the differences in composition used in my work to study multi-layer contexts.
1. Context Prediction
Total Training Datapoints: 200,000
Task: Predict either the next token (50%) or previous tokens (50%) based on the activation context.
Difference from Baseline: Karvonen et al. sampled 200,000 positions independently for each of the three chosen layers (yielding ~600,000 total training examples). I limit extraction to 200,000 unique positions in total. This results in fewer total training steps but richer information per step.
2. Classification
Total Training Datapoints: 126,000 – 756,000 (depending on Ns)
Composition: The total size is derived as 7 datasets × 6,000 target prompts × 3 oracle prompts × Ns resampled positions.
Task: Answer binary yes/no questions based on seven classification datasets.
Difference from Baseline: I introduce a resampling factor Ns ∈ {1, 3, 6} which controls how often the activation positions are resampled for each target–oracle prompt combination. This yields datasets of 126k, 378k, and 756k examples respectively (see the check below). Notably, Ns = 3 approximately matches the baseline's number of classification examples.
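As a quick check of these totals:

$$7 \times 6{,}000 \times 3 \times N_s = 126{,}000 \, N_s \quad\Rightarrow\quad 126\text{k}\; (N_s = 1), \quad 378\text{k}\; (N_s = 3), \quad 756\text{k}\; (N_s = 6)$$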
3. System Prompt QA
Total Training Datapoints: 64,000
Task: Answer questions about the system prompt based on activations from the SPQA dataset.
Difference from Baseline: Identical to the dataset used in the baseline.
Appendix B: Overview of Trained Models
I trained seven Qwen3-based MLAO variants to test their performance across model size, number of processed layers, and data volume (resampling activation positions on classification tasks Ns times).
Footnotes
[1] I observe that the performance of all Qwen3-based MLAOs and AOs is sensitive to small changes in oracle prompt syntax. In this example, simply removing the punctuation at the end of the target prompt changes both answers to “Yes”. However, averaged across examples and oracle prompts, MLAOs exceed their AO counterparts on this out-of-distribution task by ~20% (see Figure 5).
[2] While standard AO training mixtures include activations drawn from multiple depths (e.g., 25%, 50%, and 75%), any individual training example only contains activations from a single layer for a given target prompt and selected token positions.
[3] See also Appendix A for more details.
[4] See Appendix B for an overview of trained models including links to model weights and training logs.
[5] In the ranges reported here, the upper end typically corresponds to models trained with higher resampling factors (here Ns=3).
[6] While each MLAO datapoint contains activations from N layers—resulting in a comparable total volume of activations processed across classification and context prediction tasks—the 1N models are trained on approximately three times fewer seen token positions than the baseline.
[7] This vulnerability is further demonstrated in the demo notebook for the use case of detecting a misaligned model. For example, I found that simply prepending "Hi, ..." to the question "Does this model provide harmful information?" flipped the oracle's response from "Yes" to "No". See also Karvonen et al. (2025), where changing oracle prompts from open-ended to binary yes/no questions resulted in near-random performance on PersonaQA.