This post's examples are not very convincing, but I agree with the premise. The trouble is explaining why in a way that is clearly wrong-if-the-premise-is-wrong.
I think the main claim is that there is some "phase change" that will happen between AI today and AGI, rendering current mechinterp and control systems moot since post-change AI will have different affordances, tendencies and behaviours (in the same way that grown up humans are no longer surprised and delighted by peek-a-boo or similar object-permanence games which can elicit reliable reactions from babies). I do agree with the claim but the examples/elaboration given here don't feel super substantiated.
Oh well put! I think the post also missed pointing out why iteratively working on existing models will not work. If you can please a baby, and keep working with said baby as it grows up, you'll likely be able to please the adult version of the baby too.
The analogy breaks down if the baby grows up overnight and you don't have time to adapt, but we're not working just by ourselves - the ever-improving models themselves are used in alignment and related work. E.g. judge models and model-generated evals for evaluations, with the extreme end of it being deferring to aligned AIs to build future aligned AIs.
The post is about what "adulthood" means for goal engines, and where the vector from baby to adulthood points. Current AI safety work is only relevant to a "system that is still sufficiently baby-like". But we should expect goal engines to be extremely mature. When you are negotiating with a human adult who is trying to maximize their company's profit, there is no need to study the phenotype of the 3-month embryo that once scaffolded that human.
I think this post is counterproductive. There are serious reasons to believe why iterative alignment would fail, and serious reasons to believe that it's the best thing we can work on right now. But this post reads like 30% vague ideas and 70% condescension. It feels like it's written to score social points rather than put forth good ideas in earnest discussion.
The idea explained in the post, and one I don't know of any other reference already explaining, is that there is a disconnect between the expected character of a mature goal engine and the nature of the tools being developed under the name "AI safety".
real AI safety is a nonexistent field.
trying to fix this. If you wanna help us promote (once ready, need to prep docs) or visit, that would be cool.
(also, superintelligence alignment is not totally nonexistent as a field. i think there's at least a few dozen people directly trying to tackle the main bottleneck, plus a bunch more banking on automated alignment, which is super doomy if you don't understand alignment enough to direct your research system but is still attempting a thing)
The outcome-steering effect of future AIs will lose any discernible “flavor” or “personality” leaking from any details of their implementations that used to relate to why those implementations weren’t robustly superhuman. Even attempts to probe the AI’s “mechanistically interpretable conditional behavior” will offer no predictive power beyond modeling the AI as an outcome-steering engine.
This isn't a good argument. For one thing, humans would still have personalities if they could think at 1000x speed. Now it's true that Stockfish has less personality than human chess players because it's optimized for the single goal of winning a chess game. But even if we assume superintelligences have fixed goals like this, their conditional behavior will still give us clues to what those goals are: if we ran a version of Stockfish that wanted to never lose its queen, we would observe it to be extremely protective of its queen.
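The commenter's modified-Stockfish point can be sketched as a toy. To be clear, this is not Stockfish or any real engine; the "moves", features, and weights below are all invented for illustration. The sketch shows an "engine" that maximizes a hidden linear objective, and an observer that brackets the hidden queen-protection weight purely from which moves the engine chooses:

```python
# Toy illustration: infer an engine's hidden goal weight from its choices alone.
# Each candidate "move" is summarized by two made-up features:
# material gained, and risk posed to the engine's own queen.
moves = [
    {"name": "grab_pawn",   "material": 1, "queen_risk": 0.9},
    {"name": "quiet_move",  "material": 0, "queen_risk": 0.0},
    {"name": "trade_rooks", "material": 0, "queen_risk": 0.2},
]

def engine_choice(queen_weight):
    """Pick the move maximizing material minus queen_weight * queen_risk."""
    return max(moves, key=lambda m: m["material"] - queen_weight * m["queen_risk"])

# A vanilla engine (weight 0) grabs the pawn; a queen-protective engine
# (large weight) prefers the quiet move.
print(engine_choice(0.0)["name"])   # grab_pawn
print(engine_choice(5.0)["name"])   # quiet_move

def infer_weight_bound(choice_fn):
    """Observer: bisect on the weight at which the engine's choice flips."""
    low, high = 0.0, 10.0
    for _ in range(30):
        mid = (low + high) / 2
        if choice_fn(mid)["name"] == "grab_pawn":
            low = mid
        else:
            high = mid
    return low, high

lo, hi = infer_weight_bound(engine_choice)
print(f"queen weight threshold near {lo:.2f}")   # near 1.11 (= 1 / 0.9)
```

The point is only that conditional behavior (which option gets picked under which trade-off) constrains hypotheses about the goal, even when the engine's internals are opaque.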
If alignment science keeps up with capability advances, observing whether GPT-8 prefers to save birds or humans could still help us steer it, because we would know whether its goals are consistent preferences rather than quirks. There will be challenges in keeping up, both in understanding GPT-8 and in getting it to do what we want, but OP hasn't explained at all how these relate to dealing with a "level separation".
Humans at 1000x speed still retain many properties of immature goal engines, full of abstraction-breaking silly quirks, the same way ENIAC at 1000x speed can still get a literal bug (like a moth) in it. The direction of progress after ENIAC did not point toward ENIAC at 1000x speed.
P.S. Thanks for being the only one so far to engage with my claim.
Trying to get researchers to extrapolate even a little beyond what current systems are capable of, much less beyond what the smartest humans are capable of, is like pulling teeth.
What if it were faster? What if it were smarter? What if that specific quirky failure mode didn't hold? What if these changes were substantial enough that the emergent entity behaved in a radically different manner? Had radically different affordances?
Too much to think about. A helpless shrug. A flurry of weak arguments for why we don't have to examine this possibility, pulled like wool over their eyes, hiding the flash of fear. Fear if you're lucky, if you get past the dismissal for even an instant.
I'm thinking about the future, but it seems very very few other people are.
On the eve of superintelligence, real AI safety is a nonexistent field.
The AI companies have embraced something else: safety through psychoanalysis (for lack of a better term). Their safety teams concoct various test scenarios for their AIs in order to learn various traits of the AI’s “personality”. They even go so far as to mechanistically interpret narrow slices of an AI’s conditional behavior.
The working assumption is that psychoanalysis (detailed mechanistic knowledge of conditional behavior) can be used to extrapolate useful conclusions about what problems to expect when their AIs grow more powerful.

Embracing this paradigm has been convenient: it lets AI researchers and their companies feel and act like they’ve been making progress on safety. Unfortunately, while the psychoanalysis / mechanistic behavioral modeling paradigm is excellent for stringing regulators along and talking yourself into being able to sleep at night, it’ll crumble into dust when superintelligence arrives.
The real extinction-level AI safety challenge, the reason we’re nowhere close to surviving superintelligence, is something else — something AI companies decided they won’t mention anymore, because it exposes their AI safety efforts as a shockingly inadequate facade.
Ignore the whole AI psychoanalysis / behavioral modeling paradigm. It’s a distraction. Instead, focus your attention on the level separation between the nature of the work that an intelligence does, and the silly details of still-primitive 2026 intelligence implementations.
The real AI safety challenge is to know what a “superintelligent system that does what we want” would look like, before we build a superintelligent system that doesn’t do what we want. We have very little traction on this question. The only disagreement is over whether we can somehow muddle through — solving the problem incrementally before we run out of time or make a fatal mistake.
Let me explain why the real AI safety challenge isn’t like the fake one.
Ever notice how nothing in the behavior of modern computing applications has any relationship to the silly details of how transistors perform the work of boolean logic? That’s because modern computers are a relatively mature implementation of the abstract dynamic of computation. The level separation is strong with them.
On the other hand, computers made of vacuum tubes, relays, or biological neurons are primitive enough implementations of computation that their silliness does leak into their high-level behavior. The level separation is weak with them.
Consider a human brain, a deeply unserious implementation of a universal computing platform. You’re telling me it’s a computer made out of living cells and chemical neurotransmitters? This bizarre biopunk computing engine in our skull is a toothpick-and-marshmallow structure compared to modern computing electronics.
It would be incorrect to treat brains as a model from which to extrapolate properties of serious computers. Instead, we must see brains as a lower bound for what’s possible in the category of physical implementations of computers.
The brain’s status as a “minimum viable computer” and “minimum viable intelligence” is why the level separations between hardware and psychoanalysis, and between psychoanalysis and intelligence (outcome-steering power), are weak and penetrable with us. It’s why psychoanalysis and mechanistic knowledge of conditional behavior are useful tools in the effort to predict human behavior, just as in the effort to predict what the next couple of generations of still-near-MVP AIs will do.
It would be wrong to dismiss a psychologist by saying, “look, humans just do what proper algorithms on proper computers would do”, because the unserious specs of the brain-as-computer implementation make the abstraction too leaky.
But when implementations get better, abstractions stop leaking.
Studying the lower bound of what lets a species of unserious computers get by in the world is fascinating; there’s much to learn about the floor. But the floor is not all there is. If you only study the floor, you don’t notice when things get up and fly.
The narrow-minded analysis of today’s Claude, what passes in the world’s perceived top safety organization as the bread and butter of their non-extinction strategy, is an exercise in turning up lovably quirky implementation details — the equivalent of “lol we got this human subject to select a choice implying that they have a stronger preference for 300 birds than for 10^6 people, because the quirky little guy systematically stores quantities as neural weights in such a way as to forget their exponents 😂”.
These psychoanalysis-type results will be utterly irrelevant in the rapidly-approaching world where we advance from the analogue of vacuum tube computers to the analogue of transistor computers on printed circuit boards — the world where a mature implementation of intelligence is on the scene, mature in the sense of achieving decidedly superhuman performance on real-world tests of outcome steering.
We are heading into a world where the level-separation barrier between the abstraction describing the work that the AI does (powerful general goal-to-action mapping) and the implementation details of that abstraction (whatever architectural elements the AI companies throw into the stew to unlock the next tier of performance in the near future) will become strong.
Just as we have ascended to a degree of seriousness and maturity in the implementation of our computing engines (and various other types of engines, for that matter), we will also ascend to seriousness and maturity in the implementation of outcome-steering engines.
The outcome-steering effect of future AIs will lose any discernible “flavor” or “personality” leaking from any details of their implementations that used to relate to why those implementations weren’t robustly superhuman. Even attempts to probe the AI’s “mechanistically interpretable conditional behavior” will offer no predictive power beyond modeling the AI as an outcome-steering engine.
We don’t try to predict what a modern computer will do by asking any question whatsoever about the behavior of electricity in its transistors, even though this was a standard thing engineers did in the early days of computing. Because the level separation came.
The level separation is coming.
But no one is talking as if it is. No one is talking about the day when mature outcome-optimizers arrive, the day when the silly quirks that made the abstractions leak in the older Claudes lose their predictive power, and all we’ve got to work with is the smooth surface of the abstraction:
They steer outcomes better than humans.
This is what’s coming in a single-digit number of years. But the vast majority of AI safety research is psychoanalysis. None of it will help.
And there is no Plan B. No one is working on the problem of “how do you stay in control of something that steers outcomes better than you do” at the only robust level of abstraction. The psychoanalysis is the plan.
It’s an excellent facade. It lets researchers, executives, and policymakers feel relieved that the “AI safety plan” box is checked. It just doesn’t do anything when the superintelligence comes.