ATTENTION GATHERS, MLPS COMPOSE: A CAUSAL ANALYSIS OF AN ACTION-OUTCOME CIRCUIT IN VIDEOVIT

by Sai Chereddy
11th Oct 2025

Sai Chereddy
Independent Researcher
Navi Mumbai, India
saivivaswanthreddy@alumni.usc.edu

ABSTRACT

This paper investigates how video models represent nuanced information that does not alter the final classification. Using a minimal pair of videos (a successful bowling strike and a gutter ball), we analyze the vanilla Video Vision Transformer model google/vivit-b-16x2-kinetics400, which robustly classifies both as "bowling." We reverse-engineer the internal circuit responsible for representing the action's outcome, revealing that the "Success vs. Failure" signal is built up through a distinct amplification cascade. While low-level differences exist from Layer 0, the abstract, semantic representation of the outcome is progressively amplified from Layer 5 to Layer 11. Through causal analysis using activation patching, we find a clear division of labor: Attention Heads act as "evidence gatherers," supplying the low-level information needed for a partial recovery of the signal, while MLP Blocks function as robust "concept composers," each sufficient on its own to generate the entire "success" signal. This distributed and redundant circuit explains the model's resilience to simple ablations and demonstrates a core computational pattern for processing action outcomes.

Keywords: Video Vision Transformers · Mechanistic Interpretability · Causal Analysis · Hidden Knowledge Elicitation

1. Introduction

A pre-trained VideoViT model correctly and robustly classifies both a successful bowling "strike" and a failed "gutter" ball under the single label "bowling". While this classification is accurate given the model's training labels, the two videos form a contrastive pair with semantically opposite outcomes. This presents a puzzle: does the model simply discard this outcome-specific information, or does it compute and represent the nuance internally, hidden beneath the surface of its final classification? This work answers that question by reverse-engineering the model's internal circuits in search of that hidden knowledge. The critical insights come not from initial successes but from the failure of simple methods, which exposed the sophisticated, robust nature of the model's algorithm and compelled a more rigorous causal approach.

2. From Failed Experiments to a Clear Signal

Initial explorations using Direct Logit Attribution (DLA) and CLS Token Visualization successfully identified what the model was looking at—the spatio-temporal region of the ball-pin interaction—but not how it processed this information. Simple causal interventions failed: ablating these critical regions had a negligible effect on the final logit, demonstrating the model’s robustness (see Appendix). Furthermore, a linear probe trained on CLS tokens achieved perfect but misleading accuracy, acting as a "fingerprint scanner" for pixel-level differences rather than identifying a semantic concept. These failures necessitated a more precise method. The breakthrough came from analyzing the L2 norm of the activation delta:

Δ = act_strike − act_gutter

between the two videos. This approach filtered out low-level noise and revealed a clear "signal amplification cascade" from Layer 5 to 11, pinpointing the location of the outcome-computing circuit.

Figure 1: The average L2 norm of the activation delta between the "strike" and "gutter" runs. The sharp rise from Layer 5 onwards shows the amplification cascade.

3. The Causal Mechanism: A Division of Labor

Having located the signal, we used activation patching to determine how it was built. By patching single components (Attention vs. MLP blocks) from the "strike" run into the "gutter" run, we measured the percentage of the final signal recovered. The results reveal a fundamental division of labor:

Attention Heads act as Evidence Gatherers: Patching attention blocks makes a significant but partial contribution (45–65% signal recovery). Their role is to move relevant spatio-temporal evidence onto the residual stream.

MLPs act as Concept Composers: Patching any single MLP block from Layer 4 onward is sufficient to create the entire "success" signal, achieving over 100% recovery. Each MLP can independently and robustly compute the abstract outcome concept from the available evidence.

Figure 2: Causal effect of patching individual components on the Layer 11 signal. MLP blocks are causally sufficient to create the outcome signal.
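The sketch below illustrates how such a single-component patch can be implemented with PyTorch forward hooks, reusing the model and preprocessed clips from the earlier sketch. The module paths (model.vivit.encoder.layer[i].attention and .output.dense) follow the Hugging Face ViT-style layer layout and, together with the helper names, are illustrative assumptions rather than the exact code used for these experiments.

```python
import torch

def component_module(model, layer_idx, kind):
    """Locate the attention block or the MLP output projection of one encoder layer.
    These paths assume the Hugging Face ViViT implementation and should be verified
    against the installed transformers version."""
    layer = model.vivit.encoder.layer[layer_idx]
    return layer.attention if kind == "attention" else layer.output.dense

@torch.no_grad()
def cache_component_output(model, pixel_values, layer_idx, kind):
    """Run the donor ("strike") clip and record one component's output."""
    cache = {}

    def save_hook(_module, _inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        cache["act"] = out.detach().clone()

    handle = component_module(model, layer_idx, kind).register_forward_hook(save_hook)
    try:
        model(pixel_values=pixel_values)
    finally:
        handle.remove()
    return cache["act"]

@torch.no_grad()
def run_with_patch(model, pixel_values, donor_act, layer_idx, kind):
    """Run the receiver ("gutter") clip while overwriting that component's output."""

    def patch_hook(_module, _inputs, output):
        if isinstance(output, tuple):           # attention modules return a tuple
            return (donor_act,) + output[1:]
        return donor_act                        # the MLP projection returns a tensor

    handle = component_module(model, layer_idx, kind).register_forward_hook(patch_hook)
    try:
        out = model(pixel_values=pixel_values, output_hidden_states=True)
    finally:
        handle.remove()
    return out.hidden_states[-1]                # Layer 11 residual stream

# Example: patch the Layer 7 MLP from the strike run into the gutter run.
donor = cache_component_output(model, pixel_values_strike, layer_idx=7, kind="mlp")
l11_patched = run_with_patch(model, pixel_values_gutter, donor, layer_idx=7, kind="mlp")
```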

 

4. Discussion & Limitations

4.1 Conclusion

This investigation successfully reverse-engineered a core computational pattern in VideoViT. The model represents nuanced action outcomes not via a single, fragile circuit, but through a robust, distributed algorithm where Attention Gathers, MLPs Compose. This discovery causally explains the failure of simple ablation experiments; the model’s resilience comes from the powerful, redundant cascade of its MLP blocks.

4.2 Limitations

The primary limitation of this work is its narrow scope. The findings are based on a single minimal pair of videos ("strike" vs. "gutter") and one specific pre-trained model (google/vivit-b-16x2-kinetics400) available on HuggingFace. Although this controlled setting was necessary to isolate the circuit, the generalizability of the "Attention Gathers, MLPs Compose" hypothesis to other actions, domains, and model architectures remains unproven. Furthermore, this analysis is based on a single run for each video; due to the deterministic nature of the experiment, statistical significance tests or error bars are not applicable but would be necessary in a broader study with multiple video examples per class.

4.3 Future Directions

The most critical next step is to test for the generalization of our findings. A key open question is whether the "Attention Gathers, MLPs Compose" pattern represents a universal mechanism for processing outcomes in video models. The logical next experiment to validate this hypothesis would be to rerun this analysis on a new minimal pair from a different domain, such as a successful versus a missed golf swing. Further work could also investigate this pattern across different model architectures and sizes to understand how the circuit scales and evolves.

5. Broader Impact Statement

This work serves as an empirical case study for detecting "hidden cognition" in AI models, where a model’s internal representations (e.g., "success" vs. "failure") are more nuanced than its final output ("bowling"). The techniques used offer a path toward scalable oversight for identifying discrepancies between a model’s behavior and its internal state, a key challenge in AI safety. Moreover, the finding that the outcome-concept is computed robustly and redundantly by an MLP cascade suggests that naive safety interventions, like ablating a single "harmful" component, are likely to fail in practice, highlighting the need for more sophisticated approaches.

Acknowledgments

I’d like to thank Eitan Sprejer, my facilitator and project mentor, for his precise and regular feedback.

References

  1. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention Is All You Need. NeurIPS.
  2. Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR.
  3. Arnab, A., Dehghani, M., Heigold, G., et al. (2021). ViViT: A Video Vision Transformer. ICCV.
  4. Olah, C., Cammarata, N., et al. (2020). Circuits: A New Way to Think About Neural Networks. Distill.pub.
  5. Olsson, C., et al. (2022). In-context Learning and Induction Heads. Anthropic.
  6. Meng, K., et al. (2022). Causal Tracing with Language Models. arXiv preprint.
  7. Burns, C., et al. (2022). Discovering Latent Knowledge in Language Models Without Supervision. arXiv preprint.
  8. Geva, M., et al. (2020). Transformer Feed-Forward Layers Are Key-Value Memories. EMNLP.
  9. Lindsey, J., et al. (2025). On the Biology of a Large Language Model. Transformer Circuits Thread.
  10. Zhang, F., Nanda, N. (2024). Towards Best Practices of Activation Patching in Language Models: Metrics and Methods. ICLR.

     

Appendix

A. Experiment Details

A.1 Model & Data

The model used for all experiments is google/vivit-b-16x2-kinetics400, a Video Vision Transformer pre-trained on the Kinetics-400 human-action dataset and available on Hugging Face. The analysis was conducted on a minimal contrastive pair of videos: one depicting a successful bowling strike and the other a gutter ball, both belonging to the same class, "bowling".

Video 1, Strike (positive sample) is here.

Video 2, Gutter (negative sample) is available here.
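For reference, loading the checkpoint and classifying one clip takes only a few lines with transformers and PyAV. The file path and the uniform 32-frame sampling below are illustrative placeholders, not taken from the original notebook.

```python
import av  # PyAV, for decoding the video files
import numpy as np
import torch
from transformers import VivitImageProcessor, VivitForVideoClassification

STRIKE_PATH = "bowling_strike.mp4"  # placeholder path; the actual clips are linked above
CKPT = "google/vivit-b-16x2-kinetics400"

processor = VivitImageProcessor.from_pretrained(CKPT)
model = VivitForVideoClassification.from_pretrained(CKPT)
model.eval()

def load_clip(path, num_frames=32):
    """Decode a video and uniformly sample the 32 frames ViViT-B/16x2 expects."""
    container = av.open(path)
    frames = [f.to_ndarray(format="rgb24") for f in container.decode(video=0)]
    container.close()
    idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    return [frames[i] for i in idx]

inputs = processor(load_clip(STRIKE_PATH), return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(-1).item()
# The checkpoint's config may expose only generic names such as "LABEL_31",
# which corresponds to the Kinetics-400 "bowling" class in these experiments.
print(pred, model.config.id2label[pred])
```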

A.2 Code and Reproducibility

The full code used for the analysis, including activation patching and all visualizations, is available as a Google Colab notebook at this link.

A.3 Software and Libraries

The experiments were conducted using the following key libraries:

Python (version 3.12.11)

PyTorch (version 2.8.0)

Transformers (by Hugging Face, version 4.56.2)

B. Exploratory Visualizations

B.1 Direct Logit Attribution

Direct Logit Attribution (DLA) was initially used to identify which spatio-temporal regions were most influential for the model’s classification. As shown in Figure 3, the heatmaps revealed that the highest contributions to the "bowling" logit came from tokens corresponding to the ball and pins in the later frames of the video, near the moment of impact.

Figure 3: Direct Logit Attribution heatmap for the "bowling strike" video. The bright yellow areas in the later frames (y-axis) and specific token indices (x-axis) highlight the spatio-temporal region of the ball-pin interaction as having the highest contribution to the final logit.
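A heatmap like Figure 3 can be approximated with a very simple per-token attribution: dot each final-layer token state with the classifier's direction for the target class. The sketch below shows one such variant and is not necessarily the exact attribution used for the figure; the module names model.vivit and model.classifier and the (16, 14, 14) tubelet grid are assumptions based on the Hugging Face ViViT implementation.

```python
import torch

@torch.no_grad()
def dla_heatmap(model, pixel_values, class_idx, time_tokens=16, spatial_tokens=14):
    """Per-token contribution to the target logit: dot each final-layer token state
    (already layer-normalized by the HF VivitModel) with the classifier row for
    `class_idx`. The (16, 14, 14) grid assumed here matches ViViT-B/16x2 with
    32 frames at 224x224 resolution."""
    hidden = model.vivit(pixel_values=pixel_values).last_hidden_state   # (1, 1 + 16*14*14, d)
    class_dir = model.classifier.weight[class_idx]                      # (d,)
    scores = hidden[0, 1:] @ class_dir                                  # drop the CLS token
    return scores.reshape(time_tokens, spatial_tokens, spatial_tokens)  # time x height x width
```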

B.2 CLS Token Attention

To validate the DLA findings, the attention patterns of the [CLS] token in the final layers were visualized. These patterns confirmed that the [CLS] token aggregated information for its final decision by paying the most attention to the same critical spatio-temporal region identified by DLA.

Figure 4: Attention patterns from the [CLS] token for various heads in Layers 9 and 10. The concentrated attention on patches in the later frames confirms that the model is focusing on the ball-pin interaction to make its classification.
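These attention maps can be read directly out of the model with output_attentions=True. The sketch below extracts the [CLS] row of one head's attention matrix; the tubelet-grid reshape is the same assumption as in the previous sketch.

```python
import torch

@torch.no_grad()
def cls_attention_map(model, pixel_values, layer_idx, head_idx):
    """Attention paid by the [CLS] token to every patch token in one head of one layer."""
    out = model(pixel_values=pixel_values, output_attentions=True)
    attn = out.attentions[layer_idx][0, head_idx]   # (tokens, tokens) for this head
    cls_row = attn[0, 1:]                           # CLS query attending to patch tokens
    return cls_row.reshape(16, 14, 14)              # tubelet grid: time x height x width
```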

C. Supporting Data for Failed Causal Methods

C.1 Ablation of Critical Regions

Ablation experiments were conducted by manually zeroing out the critical spatio-temporal regions identified by DLA. This intervention had a negligible effect on the final "bowling" logit, demonstrating that the model was too robust for this simple technique. Table 1 shows that the logit for the correct class remained virtually unchanged after ablation.
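A minimal sketch of this ablation follows: zero the embeddings of the critical tubelet tokens before the encoder runs, then compare logits. The embeddings-module path model.vivit.embeddings and the token indices are illustrative assumptions.

```python
import torch

@torch.no_grad()
def logits_with_ablation(model, pixel_values, token_indices):
    """Zero out selected tubelet embeddings (indices exclude the CLS token at 0)
    and return the resulting class logits."""

    def zero_hook(_module, _inputs, output):
        output = output.clone()
        output[:, token_indices, :] = 0.0   # erase the critical ball-pin tokens
        return output

    handle = model.vivit.embeddings.register_forward_hook(zero_hook)
    try:
        logits = model(pixel_values=pixel_values).logits
    finally:
        handle.remove()
    return logits
```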

Table 1: Comparison of the top predicted class logits for the "bowling strike" video before and after ablating the critical ball-pin interaction region. The change in the logit for the correct label (LABEL 31) is negligible.

| Prediction Rank | Original Predictions (Logit) | Ablated Predictions (Logit) | Change in Logits |
| --- | --- | --- | --- |
| 1 | LABEL 31 (16.6881) | LABEL 31 (16.8257) | 0.1376 |
| 2 | LABEL 357 (8.5500) | LABEL 357 (8.4932) | -0.0569 |
| 3 | LABEL 84 (6.1969) | LABEL 84 (6.1552) | -0.0417 |
| 4 | LABEL 45 (6.1204) | LABEL 45 (6.1848) | 0.0645 |
| 5 | LABEL 227 (6.0944) | LABEL 227 (6.0417) | -0.0527 |
| 6 | LABEL 101 (5.8890) | LABEL 101 (5.9120) | 0.0230 |

C.2 Naive Linear Probing

A linear probe was trained to distinguish the [CLS] token activations of the "strike" versus the "gutter" video. While the probe achieved 100% accuracy from Layer 0 onwards, this result was misleading. The probe was not identifying a semantic concept of "success vs. failure," but was instead acting as a "fingerprint scanner" for low-level, pixel-based differences between the two distinct video files.
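A sketch of a per-layer probe is shown below. The post does not spell out how train/test samples were constructed from the two clips, so the feature matrix passed in (for example, [CLS] activations from multiple crops of each video) is left to the caller, and the helper name is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layer_accuracy(cls_feats, labels):
    """Fit a logistic-regression probe on [CLS] features from one layer and report
    held-out accuracy. `cls_feats` is (n_samples, hidden_dim); `labels` marks
    strike (1) vs. gutter (0)."""
    X_train, X_test, y_train, y_test = train_test_split(
        cls_feats, labels, test_size=0.25, stratify=labels, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return probe.score(X_test, y_test)
```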


Figure 5: Test accuracy of a linear probe trained to classify the [CLS] token from the "strike" vs. "gutter" video at each layer of the transformer. The perfect accuracy across all layers indicates the probe is using superficial, video-specific features, not a learned semantic representation.

D. Causal Effect Numerical Results

This table provides the precise numerical data for the causal patching experiments visualized in Figure 2 of the main paper. The "Signal Recovery (%)" column indicates the percentage of the final Layer 11 signal delta that was recovered by patching a single component from the "strike" run into the "gutter" run.
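The post does not give an explicit formula for this metric; one plausible definition, consistent with values slightly above 100% for the MLP patches, measures the patched Layer 11 delta along the direction of the clean strike-gutter delta:

```python
import torch

def signal_recovery_pct(l11_patched, l11_gutter, l11_strike):
    """Illustrative recovery metric: projection of the patched delta onto the clean
    strike-gutter delta at Layer 11, as a percentage. Patches that overshoot the
    clean delta give values above 100%."""
    target = (l11_strike - l11_gutter).flatten()
    achieved = (l11_patched - l11_gutter).flatten()
    return 100.0 * torch.dot(achieved, target).item() / torch.dot(target, target).item()
```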

Table 2: Patching Experiment Results showing causal effects per layer and component.

| Layer | Component | Label | Causal Effect (% Recovery) |
| --- | --- | --- | --- |
| 4 | Attention | L4 Attention | 57.08 |
| 4 | MLP | L4 MLP | 101.88 |
| 5 | Attention | L5 Attention | 59.78 |
| 5 | MLP | L5 MLP | 101.88 |
| 6 | Attention | L6 Attention | 45.44 |
| 6 | MLP | L6 MLP | 101.88 |
| 7 | Attention | L7 Attention | 49.13 |
| 7 | MLP | L7 MLP | 101.88 |
| 8 | Attention | L8 Attention | 49.70 |
| 8 | MLP | L8 MLP | 101.88 |
| 9 | Attention | L9 Attention | 53.63 |
| 9 | MLP | L9 MLP | 101.88 |
| 10 | Attention | L10 Attention | 64.18 |
| 10 | MLP | L10 MLP | 101.88 |