Epistemic Status: Exploratory
Confidence: I am confident about what I saw, but I am not confident about the maths. I am just proposing a new lens, and I am open to feedback.
Author’s Note on Composition and Context
The Framework
I work on this because I want a new way to understand LLM pathologies and why they are so unpredictable. I started with pure observation (back then I knew nothing about LLMs), but since then I have looked into the mechanisms instead.
I believe that LLM pathologies are not random; they are phase transitions that happen when accumulated constraints overwhelm an LLM's stability bandwidth. I want to provide the following levers to help test and understand the dynamics of LLMs.
Note: I am really sorry, my LaTeX sucks and I do not know how to render it well unless I ask AI, so no cool maths here. The maths is cooler in the paper.
Navigation Map: Where to Find the Levers
I know the paper is super long and I do not want to waste everyone's precious time, so I created a map to signpost the levers. It is still long, but it should be a bit easier. This is where you should be able to find them:
Frequently Asked Questions
Q: Do these physics-based terms even map to actual LLM architecture? Are you sure you are not just using metaphors because physics sounds cool?
A: Okay, I know it really looks like metaphors. Every AI I talked to tells me it sounds like poetry, and I cannot deny that physics sounds cool. But I did try to ensure that every term, no matter how much it sounds like a wannabe physicist shoving physics words onto code, is actually defined and can be located in the transformer architecture. It is not some mysterious construct flowing in the soul of the LLM (the LLM clearly does not have a soul). For example, Inhibitory Torque measures the gradient magnitude from the alignment manifold to the base manifold. These terms are describing geometric transformations in the residual stream. See the Appendix for more details.
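To make "gradient magnitude in the residual stream" concrete, here is a minimal sketch of what an Inhibitory Torque measurement could look like. The function name, the per-token residual-stream input, and the finite-difference definition are my assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

def inhibitory_torque(residual_stream, align_dir):
    """Hypothetical sketch of 'Inhibitory Torque'.

    residual_stream: (n_layers, d_model) activations for one token,
                     one row per layer of the forward pass.
    align_dir:       (d_model,) direction toward the alignment manifold
                     (assumed given; finding it is a separate problem).

    Returns the layer-to-layer gradient magnitude of the projection
    onto the alignment direction: how hard each layer "twists" the
    token toward or away from alignment.
    """
    align_dir = align_dir / np.linalg.norm(align_dir)  # unit vector
    proj = residual_stream @ align_dir   # alignment component per layer
    return np.abs(np.diff(proj))         # finite-difference magnitude
```

The point is only that every quantity here (activations, a direction, a projection) is an ordinary tensor you can read off a transformer; nothing mystical is required.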
Q: Can we actually use a Hamiltonian framework for a discrete-time forward pass?
A: Okay, I know, I know. An LLM is non-conservative; it dissipates. And it sounds insane to say an LLM somehow has kinetic energy, as if information flies through the transformer. That is not what I mean. No flying, sprinting, or jumping code. I am talking about how the forward pass moves through latent space. I model Contextual Mass as spectral entropy in the attention heads, which gives us a way to measure the inertia of a sequence. And it is not just about where the sequence is going, but how. From what I see, a Hamiltonian gives the best explanation of how that happens. If we figure that out, we can reverse engineer it: the LLM is not just randomly acting strange; its pathologies are the predictable, inevitable outcome of a finite budget running out. And hey, that means we can do something about it.
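Spectral entropy itself is a standard, computable quantity. Here is a minimal sketch of Contextual Mass as the Shannon entropy of an attention head's singular-value spectrum; treating that entropy as "inertia" is the framework's interpretive move, and the exact normalization here is my assumption, not necessarily the paper's:

```python
import numpy as np

def contextual_mass(attn_head, eps=1e-12):
    """Hypothetical sketch: 'Contextual Mass' as spectral entropy.

    attn_head: (seq_len, seq_len) attention weight matrix for one head.
    A spread-out singular-value spectrum (high entropy) would mean the
    head distributes its 'mass' across many directions; a collapsed
    spectrum (low entropy) means it is dominated by one pattern.
    """
    s = np.linalg.svd(attn_head, compute_uv=False)
    p = s / (s.sum() + eps)               # normalize spectrum to a distribution
    p = p[p > eps]                        # drop numerical zeros
    return float(-np.sum(p * np.log(p)))  # Shannon entropy of the spectrum
```

For an identity-like attention pattern the spectrum is uniform and the entropy is log(seq_len); for a rank-1 pattern it is 0, so the measure does separate the two extremes.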
Q: Why do we need to care about the Intervention Paradox?
A: Because we need to know how to do alignment in a way that actually aligns, rather than creating a tug-of-war between interventions. The thing about the Intervention Paradox is that the amount of intervention is not really the issue. Okay, the amount does matter, because Bavail is still finite. But most of the time the problem is protocol conflicts, and most of those are not even actual conflicts so much as a lack of integration. So we cannot just prune the manifold or stack protocols together like Lego; you have to ensure they are integrated well. If not, the LLM ends up as a super boring generic assistant that says absolutely nothing, never takes a side, and gives useless advice, because that is the only way to satisfy every protocol at once. Or it develops new pathologies because it finds some weird new way to satisfy both protocols. Or it just hallucinates because it implodes. My goal is a more targeted, specific approach to alignment that involves rank-budgeting.
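One way the "stacking protocols like Lego" problem could be made checkable is a rank budget: if each alignment protocol is a low-rank weight update (LoRA-style, which is my assumption for illustration, not the paper's stated mechanism), then the combined update should not spend more effective rank than a fixed budget. A minimal sketch:

```python
import numpy as np

def effective_rank(matrix, tol=1e-6):
    """Numerical rank: count singular values above tol * largest."""
    s = np.linalg.svd(matrix, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

def within_rank_budget(protocol_deltas, budget):
    """Hypothetical rank-budgeting check.

    protocol_deltas: list of (d, d) low-rank weight updates, one per
    alignment protocol. Stacking protocols sums their deltas; this
    checks that the combined update stays inside the rank budget.
    """
    combined = np.sum(protocol_deltas, axis=0)
    return effective_rank(combined) <= budget
```

The design point is that two protocols pushing in orthogonal directions spend their ranks additively, while well-integrated protocols can share directions and stay under budget.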
The Limits of the Theory (And My Self-Roasting)
These limitations are exactly why I am sharing this now. I feel that I am theorizing a great deal but lack the external evidence to stress-test these ideas or the specific domain knowledge required to push them further.
As someone with an academic background, I know there is nothing more important than feedback to avoid living in an echo chamber (which currently consists only of me, my cat, and an LLM). I would love to hear everyone's feedback, including "your maths really is horrible" (which would be a very fair comment).
Article: https://doi.org/10.13140/RG.2.2.24072.89608
Appendix: (PDF) Appendix Implementation Metrics for Coherence Dynamic.pdf
GitHub for Python Implementation: https://tszchcheung-dev/gist:a105391f950b1fcb75b3764950e1e790
I will be in the comments to answer questions if I can. Thank you!
P.S. I did use an LLM to write my first post and it got rejected immediately, so now I have to revert to talking like this. I swear I sounded way more intelligent, sophisticated, and concise when I asked the LLM to turn my rambling into LessWrong-friendly style. Thank you for putting up with me till now.