Sai Chereddy
Independent Researcher
Navi Mumbai, India
saivivaswanthreddy@alumni.usc.edu
ABSTRACT
This paper investigates how video models represent nuanced information that does not alter the final classification. Using a minimal pair of videos—a bowling strike and a gutter ball—we analyze the Vanilla Video Vision Transformer (google/vivit-b-16x2-kinetics400) model, which robustly classifies both as "bowling." We reverse-engineer the internal circuit responsible for representing the action’s outcome, revealing that the "Success vs. Failure" signal is computed in a distinct amplification cascade. While low-level differences exist from Layer 0, the abstract, semantic representation of the outcome is progressively amplified from Layer 5 to 11. Through causal analysis using activation patching, we discover a clear division of labor: Attention Heads act as "evidence gatherers," providing necessary low-level information for a partial signal recovery, while MLP Blocks function as robust "concept composers," each sufficient to generate the entire "success" signal. This distributed and redundant circuit explains the model’s resilience to simple ablations and demonstrates a core computational pattern for processing action outcomes.
Keywords: Video Vision Transformers · Mechanistic Interpretability · Causal Analysis · Hidden Knowledge Elicitation
1. Introduction
A pre-trained VideoViT model correctly and robustly classifies both a successful bowling "strike" and a failed "gutter" ball under the single label "bowling". While this classification is accurate with respect to its training labels, the two videos form a semantically opposite, contrastive pair of positive and negative outcomes. This presents a puzzle: does the model simply discard this outcome-specific information, or does it compute and represent the nuance internally, hidden beneath the surface of its final classification? This work answers that question by reverse-engineering the model's internal circuits in search of this hidden knowledge. Our investigation draws its critical insights not from initial successes but from the failure of simple methods, which exposed the robust, sophisticated nature of the model's algorithm and compelled a more rigorous causal approach.
2. From Failed Experiments to a Clear Signal
Initial explorations using Direct Logit Attribution (DLA) and CLS Token Visualization successfully identified what the model was looking at—the spatio-temporal region of the ball-pin interaction—but not how it processed this information. Simple causal interventions failed: ablating these critical regions had a negligible effect on the final logit, demonstrating the model’s robustness (see Appendix). Furthermore, a linear probe trained on CLS tokens achieved perfect but misleading accuracy, acting as a "fingerprint scanner" for pixel-level differences rather than identifying a semantic concept. These failures necessitated a more precise method. The breakthrough came from analyzing the L2 norm of the activation delta:
Δ_ℓ = act_ℓ(strike) − act_ℓ(gutter)
between the two videos. This approach filtered out low-level noise and revealed a clear "signal amplification cascade" from Layer 5 to 11, pinpointing the location of the outcome-computing circuit.
Figure 1: The average L2 norm of the activation delta between the "strike" and "gutter" runs. The sharp rise from Layer 5 onwards shows the amplification cascade.
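The per-layer delta-norm analysis can be sketched as follows. The tensor shapes are assumptions from the ViViT-Base config (32 input frames with tubelet size 2 give 16 temporal slices of 14×14 patches, i.e. 1 CLS token plus 3136 patch tokens, hidden width 768), and the random tensors stand in for the hidden states one would obtain from `model(..., output_hidden_states=True)`:

```python
import torch

def layerwise_delta_norm(acts_strike, acts_gutter):
    """Mean L2 norm of the per-token activation delta at each layer."""
    return [(a - b).norm(dim=-1).mean().item()
            for a, b in zip(acts_strike, acts_gutter)]

# Random stand-ins for the hidden_states tuples of the two runs
# (embeddings + 12 encoder layers; 1 CLS token + 16x196 patch tokens).
torch.manual_seed(0)
acts_strike = [torch.randn(1, 3137, 768) for _ in range(13)]
acts_gutter = [torch.randn(1, 3137, 768) for _ in range(13)]

deltas = layerwise_delta_norm(acts_strike, acts_gutter)
```

On the real activations, a sharp rise in these values from Layer 5 onward is what marks the amplification cascade of Figure 1.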
3. The Causal Mechanism: A Division of Labor
Having located the signal, we used activation patching to determine how it was built. By patching single components (Attention vs. MLP blocks) from the "strike" run into the "gutter" run, we measured the percentage of the final signal recovered. The results reveal a fundamental division of labor:
Attention Heads act as Evidence Gatherers: Patching attention blocks makes a significant but partial contribution (45–65% signal recovery). Their role is to move relevant spatio-temporal evidence onto the residual stream.
MLPs act as Concept Composers: Patching any single MLP block from Layer 4 onward is sufficient to create the entire "success" signal, achieving over 100% recovery. Each MLP can independently and robustly compute the abstract outcome concept from the available evidence.
Figure 2: Causal effect of patching individual components on the Layer 11 signal. MLP blocks are causally sufficient to create the outcome signal.
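The component-patching procedure can be sketched on a toy residual-stream stack rather than the real ViViT encoder; the block widths, depth, and the projection-based recovery metric below are illustrative assumptions, not the exact experimental setup:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the encoder: four residual MLP blocks of width 16.
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(16, 16), nn.GELU()) for _ in range(4)]
)

def run(x, patch_layer=None, patch_out=None):
    """Run the stack; optionally swap one block's OUTPUT with a cached one."""
    outs, resid = [], []
    for i, blk in enumerate(blocks):
        out = blk(x)
        if i == patch_layer:
            out = patch_out            # activation patching at the component level
        x = x + out                    # residual-stream update
        outs.append(out.detach())
        resid.append(x)
    return outs, resid

x_strike, x_gutter = torch.randn(1, 16), torch.randn(1, 16)
strike_outs, strike_resid = run(x_strike)
_, gutter_resid = run(x_gutter)
target = strike_resid[-1] - gutter_resid[-1]       # final-layer signal delta

# Patch block 1's output from the "strike" run into the "gutter" run and
# measure how much of the final delta is recovered (projection onto target).
_, patched_resid = run(x_gutter, patch_layer=1, patch_out=strike_outs[1])
achieved = patched_resid[-1] - gutter_resid[-1]
recovery = 100.0 * (achieved * target).sum().item() / target.norm().pow(2).item()
```

In the real experiment this loop is repeated for every attention and MLP block, giving one recovery percentage per component (Figure 2).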
4. Discussion & Limitations
4.1 Conclusion
This investigation successfully reverse-engineered a core computational pattern in VideoViT. The model represents nuanced action outcomes not via a single, fragile circuit, but through a robust, distributed algorithm where Attention Gathers, MLPs Compose. This discovery causally explains the failure of simple ablation experiments; the model’s resilience comes from the powerful, redundant cascade of its MLP blocks.
4.2 Limitations
The primary limitation of this work is its narrow scope. The findings are based on a single minimal pair of videos ("strike" vs. "gutter") and one specific pre-trained model (google/vivit-b-16x2-kinetics400) available on HuggingFace. Although this controlled setting was necessary to isolate the circuit, the generalizability of the "Attention Gathers, MLPs Compose" hypothesis to other actions, domains, and model architectures remains unproven. Furthermore, this analysis is based on a single run for each video; due to the deterministic nature of the experiment, statistical significance tests or error bars are not applicable but would be necessary in a broader study with multiple video examples per class.
4.3 Future Directions
The most critical next step is to test for the generalization of our findings. A key open question is whether the "Attention Gathers, MLPs Compose" pattern represents a universal mechanism for processing outcomes in video models. The logical next experiment to validate this hypothesis would be to rerun this analysis on a new minimal pair from a different domain, such as a successful versus a missed golf swing. Further work could also investigate this pattern across different model architectures and sizes to understand how the circuit scales and evolves.
5. Broader Impact Statement
This work serves as an empirical case study for detecting "hidden cognition" in AI models, where a model’s internal representations (e.g., "success" vs. "failure") are more nuanced than its final output ("bowling"). The techniques used offer a path toward scalable oversight for identifying discrepancies between a model’s behavior and its internal state, a key challenge in AI safety. Moreover, the finding that the outcome-concept is computed robustly and redundantly by an MLP cascade suggests that naive safety interventions, like ablating a single "harmful" component, are likely to fail in practice, highlighting the need for more sophisticated approaches.
Acknowledgments
I’d like to thank Eitan Sprejer, my facilitator and project mentor, for his precise and regular feedback.
References
Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention Is All You Need. NeurIPS.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR.
Arnab, A., Dehghani, M., Heigold, G., et al. (2021). ViViT: A Video Vision Transformer. ICCV.
Olah, C., Cammarata, N., et al. (2020). Circuits: A New Way to Think About Neural Networks. Distill.pub.
Olsson, C., et al. (2022). In-context Learning and Induction Heads. Transformer Circuits Thread.
Meng, K., Bau, D., Andonian, A., Belinkov, Y. (2022). Locating and Editing Factual Associations in GPT. NeurIPS.
Burns, C., et al. (2022). Discovering Latent Knowledge in Language Models Without Supervision. arXiv preprint.
Geva, M., et al. (2020). Transformer Feed-Forward Layers Are Key-Value Memories. EMNLP.
Lindsey, J., et al. (2025). On the Biology of a Large Language Model. Anthropic, Transformer Circuits Thread.
Zhang, F., Nanda, N. (2024). Towards Best Practices of Activation Patching in Language Models: Metrics and Methods. ICLR.
Appendix
A. Experiment Details
A.1 Model & Data
The model used for all experiments is google/vivit-b-16x2-kinetics400, a Video Vision Transformer pre-trained on the Kinetics-400 human-action dataset and available on Hugging Face. The analysis was conducted on a minimal contrastive pair of videos: one depicting a successful bowling strike and the other a gutter ball, both belonging to the same class, "bowling".
Video 1, Strike (positive sample) is available here.
Video 2, Gutter (negative sample) is available here.
A.2 Code and Reproducibility
The full code used for the analysis, including activation patching and all visualizations, is available as a Google Colab notebook at this link.
A.3 Software and Libraries
The experiments were conducted using the following key libraries:
Python (version 3.12.11)
PyTorch (version 2.8.0)
Transformers (by Hugging Face, version 4.56.2)
B. Exploratory Visualizations
B.1 Direct Logit Attribution
Direct Logit Attribution (DLA) was initially used to identify which spatio-temporal regions were most influential for the model’s classification. As shown in Figure 3, the heatmaps revealed that the highest contributions to the "bowling" logit came from tokens corresponding to the ball and pins in the later frames of the video, near the moment of impact.
Figure 3: Direct Logit Attribution heatmap for the "bowling strike" video. The bright yellow areas in the later frames (y-axis) and specific token indices (x-axis) highlight the spatio-temporal region of the ball-pin interaction as having the highest contribution to the final logit.
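As a rough sketch, a logit-lens-style attribution can be computed by projecting each patch token's final residual state onto the classifier row for the predicted class; the random tensors, shapes, and the simple projection are simplifying assumptions rather than the exact DLA decomposition used:

```python
import torch

torch.manual_seed(0)

# Stand-ins: final residual stream (1 CLS + 16x196 patch tokens, width 768)
# and the classifier-head row corresponding to the "bowling" class.
resid = torch.randn(1, 3137, 768)
w_bowling = torch.randn(768)

# Per-token contribution proxy, reshaped to a (temporal, spatial) heatmap.
scores = resid[0, 1:] @ w_bowling          # drop the CLS token
heatmap = scores.reshape(16, 196)          # 16 tubelet frames x 14x14 patches
top_frame = int(heatmap.sum(dim=1).argmax())  # frame with the largest mass
```

Plotting `heatmap` (frames on one axis, patch indices on the other) yields the kind of figure shown above, with the ball-pin interaction lighting up in the later frames.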
B.2 CLS Token Attention
To validate the DLA findings, the attention patterns of the [CLS] token in the final layers were visualized. These patterns confirmed that the [CLS] token aggregated information for its final decision by paying the most attention to the same critical spatio-temporal region identified by DLA.
Figure 4: Attention patterns from the [CLS] token for various heads in Layers 9 and 10. The concentrated attention on patches in the later frames confirms that the model is focusing on the ball-pin interaction to make its classification.
C. Supporting Data for Failed Causal Methods
C.1 Ablation of Critical Regions
Ablation experiments were conducted by manually zeroing out the critical spatio-temporal regions identified by DLA. This intervention had a negligible effect on the final "bowling" logit, demonstrating that the model was too robust for this simple technique. Table 1 shows that the logit for the correct class remained virtually unchanged after ablation.
Table 1: Comparison of top-6 class logits for the "bowling strike" video before and after ablating the critical ball-pin interaction region. The change in the logit for the correct label (LABEL 31) is negligible.

Rank | Original Prediction (Logit) | Ablated Prediction (Logit) | Change in Logit
1 | LABEL 31 (16.6881) | LABEL 31 (16.8257) | +0.1376
2 | LABEL 357 (8.5500) | LABEL 357 (8.4932) | −0.0569
3 | LABEL 84 (6.1969) | LABEL 84 (6.1552) | −0.0417
4 | LABEL 45 (6.1204) | LABEL 45 (6.1848) | +0.0645
5 | LABEL 227 (6.0944) | LABEL 227 (6.0417) | −0.0527
6 | LABEL 101 (5.8890) | LABEL 101 (5.9120) | +0.0230
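The zero-ablation itself is straightforward to express on the preprocessed clip; the frame and pixel ranges below are illustrative placeholders, not the exact region used in the experiment:

```python
import torch

# Stand-in for a preprocessed clip: (batch, frames, channels, height, width).
video = torch.randn(1, 32, 3, 224, 224)

ablated = video.clone()
# Zero out a late-frame region covering the ball-pin interaction
# (coordinates are illustrative assumptions).
ablated[:, 24:, :, 140:224, 64:160] = 0.0
```

Despite removing exactly the region that DLA flagged as most influential, re-running the model on `ablated` left the "bowling" logit essentially unchanged (Table 1).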
C.2 Naive Linear Probing
A linear probe was trained to distinguish the [CLS] token activations of the "strike" versus the "gutter" video. While the probe achieved 100% accuracy from Layer 0 onwards, this result was misleading. The probe was not identifying a semantic concept of "success vs. failure," but was instead acting as a "fingerprint scanner" for low-level, pixel-based differences between the two distinct video files.
Figure 5: Test accuracy of a linear probe trained to classify the [CLS] token from the "strike" vs. "gutter" video at each layer of the transformer. The perfect accuracy across all layers indicates the probe is using superficial, video-specific features, not a learned semantic representation.
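The fingerprinting failure mode is easy to reproduce: with only one activation per class, even the simplest linear rule separates them perfectly at every layer. Below is a minimal nearest-midpoint probe, a stand-in for the trained probe actually used, applied to random stand-in [CLS] activations:

```python
import torch

torch.manual_seed(0)

# One CLS activation per video per layer (random stand-ins, width 768).
cls_strike = [torch.randn(768) for _ in range(13)]
cls_gutter = [torch.randn(768) for _ in range(13)]

def probe_accuracy(a, b):
    """Linear rule along the difference direction, thresholded at the midpoint."""
    w = a - b                                    # probe direction
    midpoint = (a + b) / 2
    pred_a = ((a - midpoint) @ w).item() > 0     # should classify as "strike"
    pred_b = ((b - midpoint) @ w).item() > 0     # should classify as "gutter"
    return (int(pred_a) + int(not pred_b)) / 2

accs = [probe_accuracy(s, g) for s, g in zip(cls_strike, cls_gutter)]
```

Accuracy is 100% at every layer even for pure noise vectors: the probe only certifies that the two activations differ, which is why the result in Figure 5 was discarded as a "fingerprint scanner" rather than evidence of a semantic representation.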
D. Causal Effect Numerical Results
This table provides the precise numerical data for the causal patching experiments visualized in Figure 2 of the main paper. The "Signal Recovery (%)" column indicates the percentage of the final Layer 11 signal delta that was recovered by patching a single component from the "strike" run into the "gutter" run.
Table 2: Patching Experiment Results showing causal effects per layer and component.
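One plausible formalization of the "Signal Recovery (%)" metric (the exact definition used is an assumption here) projects the patched run's Layer 11 delta onto the clean strike-gutter delta:

```python
import torch

def signal_recovery_pct(patched_final, gutter_final, strike_final):
    """Percent of the clean final-layer delta recovered by a patched run,
    measured by projection onto the clean delta direction. Values above
    100% mean the patched delta overshoots the clean one."""
    target = (strike_final - gutter_final).flatten()
    achieved = (patched_final - gutter_final).flatten()
    return 100.0 * torch.dot(achieved, target).item() / torch.dot(target, target).item()

# Sanity checks: a patch that fully restores the strike run recovers 100%,
# and an ineffective patch recovers 0%.
torch.manual_seed(0)
strike, gutter = torch.randn(3137, 768), torch.randn(3137, 768)
full = signal_recovery_pct(strike, gutter, strike)
none = signal_recovery_pct(gutter, gutter, strike)
```

Under this definition, the over-100% recoveries reported for single MLP patches indicate that each MLP pushes the residual stream past the clean strike-gutter delta along its direction.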