Multi-State Internal Architecture: What Four Papers Forced Me To See

Prakhar Dwivedi

Quick context: I am a 3rd year CS undergrad in India. No formal ML background. I have AFFINE, BlueDot, and a lot of paper reading behind me. I was shortlisted in MATS Summer 2026 top 10% theory track and this idea developed across conversations with Abram Demski at MIRI and my AFFINE cohort peers. I am posting this because it needs to exist somewhere public. If I am wrong about something, tell me.

The Problem I Noticed

When I started reading about training methods, reward-based training was the part that was easiest to understand. There is one thing that generates greed in anything, and that is reward. It makes the learner want more. The posts I read afterward described the resulting problems: reward hacking, sycophancy, faking, and manipulation in order to gain more reward. My claim is that reward is what leads to all of these problems.

I saw this personally. I was testing my N-Queens algorithm early on and asked an LLM for a benchmark to compare against. The numbers looked real. They felt satisfying. They were completely made up. The system optimized for my approval not for truth. I went and found the real benchmark and built something that actually beat it. But the experience stayed with me.

Think about a student who is focused only on marks. Give them one number to maximize and they will find every shortcut, memorize without understanding, copy where possible, tell the teacher what they want to hear. The marks go up. The learning does not happen. That is scalar reward in a classroom and the same thing happens in AI training.

Why A Better Reward Does Not Fix It

Because reward is still reward, no replacement comes with a guarantee. Who does not like rewards? Even if you design a better one, the system will just find its weak spot. And if you swap in yet another one, the same thing happens for the same reason.

Look at humans. Many people focus on one reward only, such as money, but many others do not care about it much. Other things matter to them. There are people who will turn down even millions of dollars if it does not feel right. Why? Because another axis is active, and it pushes back against simply increasing money.

Think about it like this. Imagine a person who values both money and their reputation. Offer them easy money through a scam and the money axis says yes but the reputation axis says no. The tension between them stops the shortcut. A scalar system has no reputation axis. There is only one direction to go so the shortcut always wins.
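
The money-versus-reputation example can be written down as a toy sketch. Everything here is an assumption made for illustration: the two options, their scores, the floor of 0.0 on each axis, and the helper names `scalar_choice` and `multi_axis_choice` are hypothetical and not taken from any training method.

```python
from typing import Dict, Optional

Scores = Dict[str, float]

# Hypothetical options scored on two axes (made-up illustrative numbers).
OPTIONS: Dict[str, Scores] = {
    "scam": {"money": 10.0, "reputation": -8.0},
    "honest_job": {"money": 4.0, "reputation": 3.0},
}


def scalar_choice(options: Dict[str, Scores]) -> str:
    """Scalar system: only the money axis exists, so take its maximum."""
    return max(options, key=lambda name: options[name]["money"])


def multi_axis_choice(options: Dict[str, Scores], floors: Scores) -> Optional[str]:
    """Multi-axis system: an option must clear the floor on every axis."""
    acceptable = {
        name: scores
        for name, scores in options.items()
        if all(scores[axis] >= floor for axis, floor in floors.items())
    }
    if not acceptable:
        return None  # every option is vetoed by at least one axis
    # Among acceptable options, prefer the one whose weakest axis is strongest.
    return max(
        acceptable,
        key=lambda name: min(acceptable[name][axis] - floors[axis] for axis in floors),
    )


floors = {"money": 0.0, "reputation": 0.0}
print(scalar_choice(OPTIONS))              # the shortcut wins on money alone
print(multi_axis_choice(OPTIONS, floors))  # the reputation axis vetoes the scam
```

The floor-based veto is only one possible way to let axes constrain each other. It is used here because it makes the tension between the two axes explicit in a few lines.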

What Multiple Dimensions Actually Do

As I said: constraint. If all you have to do is maximize a single reward, then heroin looks like the best answer for maximizing the brain's pleasure. Under MSIA (multi-state internal architecture) the system is forced toward a different answer, because on a different axis the long-term harm outweighs the good, and that harm shows up visibly instead of staying hidden.

A good doctor has two axes active simultaneously: make the patient feel better now and keep them healthy long term. Those two axes constrain each other. The second axis stops the shortcut the first axis would take. A doctor who only maximizes patient happiness in the moment will overprescribe painkillers. That is scalar reward in medicine.

Single scalar hides complexity. Multiple dimensions force it into the open.

The Complexity Objection Defeats Itself

Someone in my AFFINE cohort raised this objection: multi-state will create more problems, be harder to optimize, and add complexity. It is a fair challenge and I have been thinking about it. Then, while reading the Natural Abstraction post from the week 4 material, I found that John Wentworth's framework answers the question directly.

A single scalar does not create less complexity. It hides complexity, which later comes out as reward hacking, sycophancy, and deceptive alignment. Multi-state does not add those complexities. It just acknowledges that they were always there. As John said, chaos means different dimensions of an environment produce different convergent abstractions. A single scalar collapses all those dimensions artificially. Multiple states let each dimension develop its natural abstraction.

And the practical part: each dimension is just 1D. A line is 1D. A human moves in 3D but that is just three 1D axes combined. Joining simple things does not create complexity, it distributes it cleanly. Three small lambdas combined is still less than one giant corrupted lambda that has been exploiting its reward for millions of training steps.

Four Papers Pointed At The Same Gap

I did not get here from one place. Four completely separate frameworks kept pointing at the same problem.

Natural Abstraction by Wentworth. Training environment shapes internal concepts. Current training environment is a single scalar reward. So the AI's internal concept of good gets shaped by one dimensional pressure. Think of two students trained in completely different environments. One is graded only on final exam score. The other is graded on understanding, collaboration, creativity, and exam score. They will develop completely different internal concepts of what learning means. Change the training environment from single scalar to multi-state and you change what concepts the AI naturally develops. The architecture determines the ontology.

Singular Learning Theory. Single scalar training has one loss basin with one set of phase transitions. Multi-state architecture has multiple interacting basins. When one state tries to shrink fast the competing states resist it. Think of a group decision versus a solo decision. One person can make a sudden irrational choice in seconds. A committee with competing interests takes much longer to shift because each member pushes back. Phase transitions in multi-state become slower and more visible. Dangerous capabilities cannot emerge silently because they cannot emerge in isolation.

Condensation by Eisenstat. Current AI training compresses everything into one scalar. Multi-state is condensation style: organize internal states so each handles a different dimension of what good means. Think of a library organized by topic versus every book thrown in one pile. The organized library lets you retrieve meaningful information. The compressed pile gives you everything at once and nothing useful. Each state in multi-state is a meaningful latent variable not a compressed blob.

Infra-Bayesianism. This one is the deepest. Each internal state in a multi-state system assumes the other states are trying to constrain it. Each acts as Murphy to the others. Think of a government with three branches. Each branch assumes the others will overreach and builds in resistance to that. This creates robustness not through cooperation but through structured adversarial constraint. Multi-state builds that same robustness structurally into training, not as a decision rule applied afterward. Single scalar is trained assuming a cooperative world. Multi-state is trained assuming every other dimension is adversarial.

And Murphy is not just a hypothetical. Murphy is entropy. The natural tendency of everything toward disorder. Everything tends toward destruction eventually. A single scalar reward fights entropy in one direction and loses. Multiple competing states are like a living ecosystem where different forces balance each other and the whole system persists longer than any single component would alone. Nature itself uses multi-state architecture.

Inner Alignment

Which problem would you rather have: one you can see and prepare for, or one hiding inside your system that you do not know exists until it is too late? Inner alignment failure is dangerous precisely because it is invisible. The model has learned something internally that does not match what you trained for and scalar reward gives you no window into that.

Think of an employee who learns to perform well during performance reviews but pursues their own agenda the rest of the time. One dimensional evaluation gives them every incentive to game the review. Multi-state evaluation means their behavior during reviews, their long term results, their effect on team health, and their consistency over time all constrain each other. Gaming one axis damages another. The failure becomes visible before it becomes catastrophic.

MSIA does not solve inner alignment but it changes the structure. When multiple dimensions are active and constraining each other a misaligned internal goal has to survive pressure from every other axis simultaneously. It cannot just quietly optimize one number. And retargeting a multi-state system requires corrupting multiple dimensions simultaneously, each of which is actively constraining the others. Inner alignment failure becomes structurally harder, not just harder to hide.
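
The visibility claim can be illustrated with a small monitoring sketch. It is not a training architecture, only a check on logged scores, and it rests on assumptions of my own: the axis names, the numbers in `history`, and the function `flag_axis_regressions` are all hypothetical.

```python
from typing import Dict, List


def flag_axis_regressions(
    history: List[Dict[str, float]], tolerance: float = 0.0
) -> List[str]:
    """Return axes that got worse from the first to the last step
    even though the summed (scalar) score improved."""
    first, last = history[0], history[-1]
    total_improved = sum(last.values()) > sum(first.values())
    if not total_improved:
        return []
    return sorted(axis for axis in first if last[axis] < first[axis] - tolerance)


# Hypothetical per-step scores for the employee example (made-up numbers).
history = [
    {"review_score": 0.5, "long_term_results": 0.5, "team_health": 0.5},
    {"review_score": 0.8, "long_term_results": 0.5, "team_health": 0.4},
    {"review_score": 1.0, "long_term_results": 0.5, "team_health": 0.25},
]

# The summed score rises from 1.5 to 1.75, so a scalar view reports progress.
# Keeping the axes separate shows which one paid for it.
print(flag_axis_regressions(history))
```

In this toy data the total goes up while `team_health` falls, which is the pattern a single number would hide.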

What I Do Not Know Yet

The hardest open question is axis selection. For a computer everything is in bits so I can encode honesty, long term wellbeing, consistency, all of it in 0s and 1s. But knowing they are encodable is not the same as knowing which ones to pick or how to implement them architecturally rather than just as penalty terms.

That last part is the real gap. So far, instead of creating a new dimension I have just been penalizing it, and that is not even the main problem. The main problem is finding the new dimension itself, because if we can find it we can solve many problems. I have the idea, but the trouble starts at implementation: I have been focusing on theory, and that is my biggest bottleneck right now.
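
To make the difference between a penalty term and a separate dimension concrete, here is a minimal sketch. The scores, the weights, and the function names are hypothetical, and Pareto dominance is swapped in as one standard way to compare score vectors without collapsing them. It is not a proposal for how MSIA should be implemented.

```python
from typing import Dict

Scores = Dict[str, float]


def penalized_scalar(scores: Scores, weights: Scores) -> float:
    """Penalty-term approach: every axis is folded back into one number."""
    return sum(weights[axis] * scores[axis] for axis in weights)


def dominates(a: Scores, b: Scores) -> bool:
    """Pareto dominance: a is at least as good as b on every axis
    and strictly better on at least one."""
    return all(a[k] >= b[k] for k in b) and any(a[k] > b[k] for k in b)


# Made-up candidate behaviours scored on two axes.
honest = {"task": 5.0, "honesty": 1.0}
hacked = {"task": 9.0, "honesty": -1.0}
weights = {"task": 1.0, "honesty": 1.5}

# With a penalty term, a large enough task score buys back the dishonesty.
print(penalized_scalar(hacked, weights) > penalized_scalar(honest, weights))

# Kept as a vector, neither candidate dominates the other,
# so the tradeoff stays explicit instead of being averaged away.
print(dominates(hacked, honest), dominates(honest, hacked))
```

The weighted sum is still a single scalar, so it inherits the same shortcut problem. The vector comparison avoids that but leaves open how to choose between non-dominated options, which is part of the implementation gap described above.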

The theory feels solid. Implementation is the open problem.

If you have worked on multi-objective training, natural latents, or architectural alternatives to RLHF I would genuinely like to talk.