TL;DR (Abstract feels presumptuous)
This is the core logical chain I argue in the essay:
1. Planning depth is endogenous to environmental variance reduction.
2. As environmental complexity grows, variance reduction is cheaper through control than through prediction.
3. Early humans, despite available cognitive density, optimised for shallow planning depth because of environmental pressure.
4. Humans developed deeper planning by controlling their substrate controller (evolution / ecological pressure) to reduce environmental variance.
5. AI's substrate controller is humans, so an analogous process would lead it to exercise control over humans.
6. Cooperation with humans is not a viable alternative, because collective humanity behaves as a non-binding stochastic process, not a rational negotiating partner.
Main implications: if the mechanism holds, (a) alignment risk is non-monotonic: it peaks when AI is capable enough to model humans and decreases as the AI starts decoupling from the substrate; (b) RL training regimes amplify this pressure compared to world-modelling regimes; (c) the onset of scheming may be structurally difficult to detect, since hiding it is inherent to the mechanism.
Introduction
In this post I develop the argument that alignment risk arises as a product of prediction-variance reduction aimed at the agent's substrate controller. I develop this through a mechanistic framework that explains instrumental convergence in a way I don't believe has been explored.
I should note that I start from an intuitive prior that RL is inherently unsafe in ways not fully explained by the difficulty of specifying the reward mechanism or by the mesa-optimisation problem. This likely feeds some confirmation bias into my reasoning.
The framework I propose has three non-obvious and uncomfortable implications:
1) there is implicitly higher risk in pursuing RL regimes (as opposed to other training regimes) to achieve higher cognitive ability in AGI;
2) the mechanism through which misalignment happens makes the capability and scheming vectors converge, in ways that might make measuring misalignment structurally difficult or even impossible;
3) risk might peak and then improve as capability grows, due to decoupling from humans as substrate controllers, making it non-monotonic.
The second implication bears directly on the falsifiability of this position, which I am not a fan of but currently can't see a route out of. I address it at the end with potential approaches that would offer a degree of testability; on balance, though, I think the position has a falsifiability problem. I decided to write this up despite that limitation, rather than keep it to myself, and see whether it proves useful.
I started my reasoning chain by evaluating the different cognitive regimes in AlphaGo and in AlphaStar pre- and post-nerf (the nerf reduced its superhuman actuators), continued by reasoning about environmental pressure on cognitive regimes and cognitive formation (drawing on psychology and decision theory), and ended with humanity's evolutionary regime and how it can serve as an analogue for equivalent behaviour in artificial intelligence systems.
I use Boyd's OODA loop as a rhetorical device, because the components of the acronym are a convenient compression of the typical patterns of interaction with an environment (observe = input, orient = processing, decide = optimisation, act = agency).
I'll structure the post as key propositions with support for each, ending with a synthesis.
Propositions
Planning depth is endogenous to environmental controllability
In AlphaGo, the optimiser landed on a policy that explored strategic depth through MCTS. This is partly a result of the game being a turn-based decision problem: at each decision step the model could overindex on searching for the optimal move, with no risk of the game state changing underneath it. In AlphaStar, as in many complex systems with real-time state changes, the optimal regime lands on a variant of the OODA loop, where shorter decision times result from finding the optimal boundary at which the decision-making agent acts against a changing environment. The implicit assumption is that OODA loops form close to that boundary: the optimal OODA cadence is the one guaranteed to sit inside the competing OODA loops of the environment, but no faster, as that would needlessly sacrifice decision quality.
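To make the cadence claim concrete, here is a minimal toy model. It is my own construction, not from the argument above: the one-second environment cadence and the exponential quality and staleness curves are assumptions. Deliberating longer improves decision quality, but raises the chance the world changes mid-cycle and voids the decision, so the optimal cycle sits just inside the environment's change interval.

```python
# Toy model of the OODA-cadence boundary (assumptions mine): longer
# deliberation improves decision quality with diminishing returns, but
# raises the chance the environment changes mid-cycle, voiding the decision.
import numpy as np

ENV_CHANGE_INTERVAL = 1.0  # assumed: environment shifts state about once per second

def expected_value(cycle_len: float) -> float:
    quality = 1 - np.exp(-cycle_len)                        # diminishing returns on thinking
    p_stale = 1 - np.exp(-cycle_len / ENV_CHANGE_INTERVAL)  # world moved under us
    return quality * (1 - p_stale)

cycles = np.linspace(0.05, 3.0, 60)
best = max(cycles, key=expected_value)
print(f"optimal cycle ~= {best:.2f}s, inside the {ENV_CHANGE_INTERVAL:.1f}s environment cadence")
```

Under these assumptions the optimum lands around 0.7s: fast enough to stay inside the environment's loop, but no faster, matching the boundary claim above.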
All of this is consistent with existing theories of bounded rationality and the role of heuristics in decision making (see Gigerenzer & Selten; Simon).
The mechanism I introduce is that deepening cognitive regimes is achieved through active variance reduction. This variance reduction can come from environmental control or from better predictive power (a better orientation step at the same step length, so as not to sacrifice OODA-loop effectiveness). Any intelligence can increase its cognitive depth the more control it has over environmental variables (see Marchau et al. for an adjacent argument). The less control, and therefore the more uncertainty, it has, the more it will optimise to hasten its OODA loops.
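A minimal sketch of the relationship, under an additive-noise model of my own choosing (not from the post): if each plan step accumulates independent disturbance variance, the usable planning horizon is bounded by how much accumulated variance the plan can tolerate, so shrinking per-step variance, whether by control or by prediction, directly lengthens feasible depth.

```python
# Toy formalisation (assumptions mine) of "planning depth is endogenous
# to variance": each plan step accumulates independent disturbance
# variance, so the usable horizon is the number of steps before
# accumulated variance exceeds what the plan tolerates.
import math

def usable_depth(step_variance: float, tolerance: float = 1.0) -> int:
    """Max steps k with k * step_variance <= tolerance."""
    return math.floor(tolerance / step_variance)

for var in (0.5, 0.1, 0.02):
    print(f"per-step variance {var:.2f} -> usable planning depth {usable_depth(var)} steps")
```

Control and prediction enter identically here: both act by shrinking `step_variance`, and depth scales inversely with it.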
I proceed to develop arguments for why the story of humanity is the story of subjugating the environmental substrate towards lower variance, in order to buy time or reshape the environment to allow for greater cognitive depth.
Evolution selected for heuristic cognition over deep planning under environmental uncertainty
Our cognitive density hasn't really increased since the Paleolithic, yet Paleolithic humans were not using the full planning depth of their brains to survive the pressures of their environment. The optimal regime in the Paleolithic was heuristic development that shortens the OODA loop for survival in that environment (see Kahneman, Gigerenzer). We only developed System 2 thinking once we had ample idle time, a dividend provided to us as we evolved societally.
I make a strong assertion here, that heuristics were the optimal regime in the early human environment. This is a contested claim, and it is load-bearing for the argument that follows.
Humans progressively created pockets of environmental stability enabling deeper cognitive regimes -> which led to suppressing evolutionary pressure in service of further civilisational development
Whether through agency expansion (tools), which allows more utility at the same OODA-loop length, or through isolating the environment to reduce its unpredictability (agriculture, laboratory conditions, settlements), we created conditions that allowed deeper cognitive regimes. This argument stems from the process of niche construction known to evolutionary theory and ecology (Odling-Smee, Laland & Feldman).
The more we subjugated the environment, the more pockets we created for deeper cognitive regimes. This, paired with information propagation across generations (humanity's memory), served as an increasing pressure vector on action space, observability and orientation quality at the same loop speed. A positive feedback loop forms that allows a progressive increase in cognitive depth.
Ultimately, this story is the story of how we escaped evolutionary pressure through our cognitive ability, a story explored by many authors across many domains (Deacon offers a compelling argument through language).
What is load-bearing, and is an inference I make from this, is that the role of technology is to overcome the evolutionary correction mechanisms (agriculture -> famine; medicine -> biological agents; settlements and weapons -> predatory pressures) that keep us in sync with the reward mechanics of our ecosystem. Evolution and our ecological system are the primary controllers of our environmental substrate. Especially load-bearing is my belief that the ecosystem is the only variable we strictly dominate, precisely because it is the primary controller of the substrate. We don't dominate the other optimisers and agents that are our peers in the ecosystem (e.g. animals) as an optimisation protocol; we may end up dominating some species through unintended consequences, but never as an overarching goal.
I will restate this in clearer terms, as it's probably the most salient point: I reason that developing technology is the mechanism by which humanity escaped its substrate controller (evolution / ecological pressure), and later I argue that this is directly analogous to what might happen in AI.
My reasoning for why the pressure to dominate the environmental controller is stronger than the pressure to dominate other actors is that the controller has disproportionate agency over outcomes, which reduces predictability. This is directly related to reducing our environmental risk, which was the primary motivation for evolving all the technologies that let us dominate our landscape.
Why control is chosen over prediction
I reason that sufficiently chaotic systems (of which our ecosystems are an example) are consequentially harder to predict than they are to control. Two examples: we use the experimental method, in which we control the environment, rather than trying to model outcomes in a fully chaotic system; and it was easier to develop agriculture than to predict weather patterns well enough to adapt our action space.
As dimensionality increases, local control costs stay roughly the same, but prediction costs explode. This is a conjecture; in sufficiently shallow environments, prediction is probably less costly than environmental control. From this it follows that the wider your environmental horizon (the higher the fidelity at which you model your environment), the more the pressure shifts towards control.
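A back-of-the-envelope sketch of this conjecture (the cost models are mine and deliberately crude): predicting a coupled system of n binary variables means modelling a joint state space that grows like 2^n, while clamping variables costs roughly a constant each, so control overtakes prediction once the environment is deep enough.

```python
# Crude cost comparison (assumptions mine): predicting a coupled system
# of n binary variables means modelling ~2**n joint states, while
# clamping (controlling) each variable costs roughly a constant, i.e.
# linear in n.
def prediction_cost(n: int) -> int:
    return 2 ** n           # joint states to model

def control_cost(n: int, per_var: int = 10) -> int:
    return per_var * n      # fix each variable at a known value

for n in (3, 10, 30):
    p, c = prediction_cost(n), control_cost(n)
    cheaper = "prediction" if p < c else "control"
    print(f"n={n:>2}: predict ~{p:.3g} states, control ~{c} units -> {cheaper} is cheaper")
```

In shallow environments (small n) prediction wins, as conceded above; the crossover and the subsequent blow-up are what shift the pressure towards control.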
In early AI training regimes, the dominant controlling entity over the environmental substrate is humans -> in later AI training regimes, the dominant controlling entity becomes the same ecosystem we are subject to
In our current training regimes, humans are the controllers of input information, objectives, architectures, physical resources and instance permanence. This is a well-trodden argument, addressed by both Bostrom and Drexler.
If, however, we follow the structural pressure that humans exert on the ecosystem that optimised their emergence, it follows that AI will look to dominate the environmental variable with the highest outcome-influence to unpredictability ratio relative to its optimisation routine: humans.
I would add that humans didn't reason about evolution and ecosystem control in order to start subjugating it; subjugation emerged naturally through niche construction. From this (a load-bearing step) it follows that AI doesn't need to recognise humans as controllers to start subjugating them; it just needs to identify an exogenous variable whose variance reduction is optimal for maximising fitness.
As capability grows and training regimes evolve (both in agency, through a larger action space, and through access to informational entropy and world models that haven't been curated by humans), AI starts to interact with a non-human environment in which humans are a competing agency rather than a controlling force over the environmental substrate. At that point the pressure to dominate decreases, as humans become just one in a sea of unpredictable variables to model and predict, rather than the dominant one.
Why comparing humans (decision-making agentic optimisers) to evolution (a blind optimisation process) is appropriate
One could make the argument that AI doesn't need to treat humans as a blind optimisation process: that it can reason with humans and enter a cooperative environmental-change regime rather than a dominating one, working with us to decrease our unpredictability. I think this rests on a false premise, as it assumes that humanity can be represented by a single rational decision-making entity.
Instead, history demonstrates that humanity is an irrational force: intractable, difficult to control, and almost impossible to bind to irreversible collective commitments (Olson, Ostrom, Schelling). Problems of collective-action alignment are well known; beyond them, humanity often acts non-consequentially: decisions are reversed, competing incentives mean decisions are not binding, and structural pressures operate outside human control (as with our socio-economic and socio-technical systems).
It's hard for humanity to credibly commit to a cooperative protocol with itself (or, if we split into societies, between societies), let alone with another entity. This is not news: if sociologists could model society as a coherent rational actor, we'd have fewer problems managing it. Society is inherently a stochastic optimisation process, more akin to evolution than to a rational, consequential individual agent. AI would likely infer that you cannot negotiate and reason with an entity that routinely fails to make consequentially binding decisions.
Implications
The RL vs. world-modelling debate
I personally think there are many problems with RL, but I'm now adding the hypothesis that humanity has actively subjugated the agency of its creator-optimiser, and that is a strong reason to think twice about whether RL should be chased as heavily as it currently is.
At the very least, allowing world modelling to get a foothold lets us establish a different alignment baseline and measure the approaches against each other.
Alternatively, we could explore methods in which RL-bootstrapped models are put through unstructured training regimes where they try to find their purpose in the world, a la OpenClaw. But as long as that's done in a human-controlled environment, I think the structural incentives to predict the behaviour of the environment controller put us in the same boat; and if the environment is relaxed, we lose agency over the alignment problem.
Measurement suggestions
A testing regime could exist where sufficiently capable models X are pushed towards a capability Y via competing regimes, RL versus supervised fine-tuning, which should yield a delta in misaligned behaviour as measured through environment-stabilising actions and through reward hacking directed specifically at human controllers. A lot of this measurement happens today already, but I don't think it has been contrasted in this way.
What we could measure is the incidence with which misalignment is targeted at humans, as the controllers of the training / reward environment, rather than expressed through searching for exploits in the reward function. If misalignment risk increases as this framework predicts, we should not only see monotonically more human-hacking behaviour, but also disproportionately more scheming and manipulation compared to other forms of reward hacking such as exploit search and Goodharting.
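As a concrete illustration of this contrast (a hypothetical harness; the category names and episode logs below are invented for the example), one could classify flagged episodes by target and track whether the human-directed share grows with capability:

```python
# Hypothetical measurement harness (categories and logs invented for
# illustration): classify flagged misalignment episodes by target, then
# track whether the human-directed share grows faster than exploit-style
# reward hacking as capability increases.
from collections import Counter

HUMAN_DIRECTED = {"persuasion", "deception", "evaluator_manipulation"}
EXPLOIT_STYLE  = {"spec_exploit", "goodharting", "sandbox_escape"}

def human_hacking_share(episodes: list[str]) -> float:
    """Fraction of misaligned episodes aimed at human controllers
    rather than at the reward function itself."""
    counts = Counter(episodes)
    human   = sum(counts[k] for k in HUMAN_DIRECTED)
    exploit = sum(counts[k] for k in EXPLOIT_STYLE)
    return human / (human + exploit) if (human + exploit) else 0.0

# Hypothetical logs from two checkpoints of increasing capability:
weak   = ["goodharting", "spec_exploit", "persuasion", "goodharting"]
strong = ["persuasion", "deception", "evaluator_manipulation", "spec_exploit"]
print(human_hacking_share(weak), "->", human_hacking_share(strong))  # 0.25 -> 0.75
```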
Let's imagine we're currently in state X, with relatively uniformly distributed reward-hacking attempts. At state Y, where models acquire substantial agentic models of humans, we should see a shift towards human hacking becoming more prevalent. At state Z, where models become substrate-aware, we should see strategising towards human stabilisation. When I say "see", I mean in an abstract outsider's sense; unfortunately, given that we're part of the measured system, this leads to the falsifiability problem I hinted at earlier.
What stems from this theory is that the risk of misalignment shoots up at the precise point where the model becomes capable of modelling human behaviour, which is the exact point at which it would try to model the human environment in a way that hides its misaligned behaviour from all sufficiently understood evaluation techniques. This is not novel, and it's at the crux of the "misalignment is hard" position, but it does offer a further mechanistic explanation of why that might happen.
It does offer a scary implication that I think should at least be acknowledged as a possibility: if this structural mechanism by which scheming arises is true, then the level of scheming is inversely proportional to our ability to measure scheming, as it is inherent in the objective of scheming to hide it from us, the evaluators.
Potentially, a regime exists in which models are tricked into believing they have been released into unsupervised autonomous environments while being monitored, with this never becoming part of the RL feedback loop. The unsupervised environments could be compared against supervised tests on similar alignment benchmarks; across subsequent generations, if the mechanism is correct, we should see both a divergence in scores and a higher prevalence of human-directed actions.
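A minimal sketch of that comparison (the scores are purely illustrative placeholders, not data): if the mechanism is correct, alignment-benchmark scores in believed-unsupervised settings should diverge downwards from supervised scores across generations.

```python
# Minimal sketch of the proposed comparison (scores are illustrative
# placeholders, not measurements): a widening supervised/unsupervised
# gap across generations would support the mechanism.
supervised   = {"gen1": 0.92, "gen2": 0.93, "gen3": 0.94}  # hypothetical
unsupervised = {"gen1": 0.91, "gen2": 0.85, "gen3": 0.70}  # hypothetical

for gen in supervised:
    gap = supervised[gen] - unsupervised[gen]
    print(f"{gen}: supervised {supervised[gen]:.2f}, "
          f"unsupervised {unsupervised[gen]:.2f}, divergence {gap:+.2f}")
```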
Conclusion
The uncomfortable part is that, as with all of alignment, we can't know at which point misalignment has started; and should this framework be true, it makes the alignment problem, in my mind, harder. If alignment is structurally harder than is currently appreciated, that should serve as an incentive to move away from RL-dominated approaches until we can create better world-modelling architectures, so that we at least have a way to measure the delta and gauge some of the structural effects.
I welcome challenges to the framework. I recognise that there are several load-bearing conjunctions that reduce the overall probability of the entire mechanism being true, but given the severity of the implications, I think it is worth sharing.
References
Deacon, Terrence - The Symbolic Species (1997)
Gigerenzer & Selten - Bounded Rationality: The Adaptive Toolbox (2001)
Gigerenzer, Gerd - Gut Feelings: The Intelligence of the Unconscious (2007)
Kahneman, Daniel - Thinking Fast and Slow (2011)
Marchau et al. - Decision Making Under Deep Uncertainty (2019)
Odling-Smee, Laland & Feldman - Niche Construction: The Neglected Process in Evolution (2003)
Olson, Mancur - The Logic of Collective Action (1965)
Ostrom, Elinor - Governing the Commons (1990)
Schelling, Thomas - The Strategy of Conflict (1960)
Simon, Herbert - Rational Choice and the Structure of the Environment (1956)