"Aligned" has to include knowing what the agent can't do safely, or it's not aligned; that's already baked pretty deeply into the word. It's a hard requirement for asymptotic alignment, as far as I can tell.
Superintelligence can have a wildly jagged frontier, then. That's fine: it's not a problem if we can certify that its calibration approaches perfection quickly. There would be plenty of things it could only do unreliably, but it would know which things those were.
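To make "calibration approaching perfect" concrete: it can be scored. Here's a minimal sketch (Python with NumPy; the function and toy data are my own illustration, not anything anyone in this thread has built) of expected calibration error, one standard way to measure whether an agent's stated confidences match its actual hit rates:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error (ECE): bin predictions by stated confidence,
    then take the size-weighted average gap between each bin's actual accuracy
    and its mean stated confidence. Zero means perfectly calibrated."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece

# Toy check: an agent that says "p" and is right with probability p scores ~0.
rng = np.random.default_rng(0)
stated = rng.uniform(0.5, 1.0, size=100_000)
was_right = rng.uniform(size=100_000) < stated
print(expected_calibration_error(stated, was_right))  # ~0.0: well calibrated
```

"Certification" here would mean something like a guarantee that this number stays near zero across the whole jagged frontier, including on the tasks the system does unreliably.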
Does "aligned" have to include modeling the entire ecosystem the agent is embedded in (including any other agents in that ecosystem) well enough that its actions won't have any unanticipated consequences? Is it possible to have an aligned agent that is not the smartest agent in its environment? Is "aligned" a one-place word, or even an only-two place word?
Does "aligned" have to include modeling the entire ecosystem the agent is embedded in (including any other agents in that ecosystem) well enough that its actions won't have any unanticipated consequences?
I lean towards probably yes, or at least that it must be able to reliably state which part of the world the unintended consequences may reside in. More or less, yes: I do think strong alignment requires a very high minimum bar of capability, so that the agent can reliably know the limits of its knowledge. You might be able to have a weak but strongly aligned AI which reliably tells you "idfk, I understand well enough to give you dangerously wrong answers, and might sometimes give you right answers by accident, but you shouldn't trust me". But I also think we're already past the minimum bar of capability: if we knew how to train a model to reliably know what it knows, it would be able to be quite certain about many things.
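For concreteness, here's what that "weak but strongly aligned" behavior might look like as a policy. This is purely a hypothetical sketch: `model.answer` returning an `(answer, confidence)` pair is an assumed interface, the 0.95 threshold is arbitrary, and the whole thing only helps if the confidence signal is well calibrated in the first place (see the sketch above).

```python
from dataclasses import dataclass

@dataclass
class Response:
    answer: str | None   # None means the model declined to answer
    note: str

def honest_answer(model, question: str, trust_threshold: float = 0.95) -> Response:
    """Answer only when the model's stated confidence clears the bar;
    otherwise abstain loudly instead of guessing."""
    answer, confidence = model.answer(question)  # assumed interface
    if confidence >= trust_threshold:
        return Response(answer, f"confident ({confidence:.0%})")
    return Response(None, "I understand well enough to give you dangerously "
                          "wrong answers, so you shouldn't trust me on this one.")
```

Note the failure mode: an overconfident (badly calibrated) model sails straight past the threshold, which is why the know-what-you-know property has to come first.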
Is it possible to have an aligned agent that is not the smartest agent in its environment?
It would give a lot of "I don't know, it depends on how [the smarter being] responds" answers whenever such a being is present and doesn't provide reliably understandable simplifying properties.
Is "aligned" a one-place word, or even an only-two place word?
Well, it would at least require the parameters "aligned to what" and "robust under what conditions", so not less than two-place.
You're more worried about a corrigible superintelligence than a value-aligned ASI.
Yes, a dumb enough set of humans could make a mess even with an instruction-following ASI. But it would take a really weird set of jagged capabilities for an ASI not to take effective precautions with its technologies.
If a human could figure out a safe way to test the Black Hole Reactor before deploying it, then MegaBrain could do even better.
What if a human also couldn't do that? There would be a remaining dangerous gap if something required superhuman capabilities to invent and still greater capabilities to test thoroughly and deploy safely. We and MegaBrain alike might get blindsided by an unforeseen risk.
Yes, super-capabilities are the inherent danger.
If we can’t, or don’t want to, avoid them, then we need to think more carefully about which path is likely to reduce the dangers associated with them…
Isn't it very likely that the time window between the point at which MegaBrain is intelligent enough to design the Black Hole Reactor and the point at which it could deploy it wisely wouldn't be much shorter than the alternative window, which ends when it could both deploy it wisely alone and outmaneuver the humans in control of MegaCorp? If not, why would the two windows have significantly different lengths?
I don't have strong opinions about the timelines in this case: my intuition is that many dangerous technologies would take longer to design and deploy than the gap would persist. I think there's non-zero risk from that gap, which wasn't previously on my radar, and I suspect that risk is low relative to other risks like misalignment.
(With apologies to Sean Herrington, who deserves a better playwright than yours truly)
A conversation with a friend on the bus to Bodega Bay today made me realize that there are some holes in my thinking about safety and superintelligence. I’ve assumed that superintelligence is by definition robustly better than humans at all the things, but there are some cases where that assumption breaks down.
Without further ado, for your edification and discomfort, The Strawman Players present:
A Disquieting Conversation on a Bus
Vulpes: I’ve been worrying lately about well-aligned superintelligence.
Corvus: That seems like a strange thing to worry about.
Vulpes: You’d think so. But hear me out. I’m imagining a world where we develop a well-aligned superintelligence (let’s call it MegaBrain) that is omni-benevolent and wants only nice things for us.
Corvus: I notice in myself a distinct lack of anxiety.
Vulpes: But here’s the thing. As part of its mission to serve humanity and give us nice things, MegaBrain develops a cool new technology to make our lives better. The details don’t really matter—for the sake of argument, let’s say it invents a Black Hole Reactor that uses micro black holes to generate infinite clean energy.
Corvus: Still not feeling anxious.
Vulpes: What if MegaBrain is smart enough to develop the Reactor, but too dumb to use it wisely? Perhaps it doesn’t realize that eventually some of the black holes will escape and gradually eat the earth. By the time anyone realizes, it’s too late and the earth—and humanity—are doomed.
Corvus: Ah: I see your mistake, friend Vulpes. You have made the common error of not understanding what “superintelligence” actually means. People often imagine that a superintelligence will be like a mad scientist: brilliant in some ways, but shockingly dumb in others. But that isn’t how it works: by definition, “superintelligent” means better than humans in every possible way.
MegaBrain, being superintelligent, will be good not only at designing new technology, but also at understanding how to deploy it safely. If a human could figure out a safe way to test the Black Hole Reactor before deploying it, then MegaBrain could do even better.
So there’s absolutely nothing to worry about.
Vulpes: I’m not sure that’s actually true. What about the jagged frontier?
Corvus: What about it?
Vulpes: AI capabilities are likely to be jagged even as they increase. So perhaps MegaBrain can invent the Reactor because it has superhuman intelligence, but it makes a catastrophic mistake during deployment because it has subhuman wisdom.
Corvus: I suppose that’s possible, but it doesn’t seem very likely. The capability frontier is jagged, but it’s moving fast. Surely there will only be a brief period of time when MegaBrain is smart enough to build the Reactor but unwise enough to deploy it prematurely. There’s only a tiny window of time when anything can go wrong.
Let me revise my earlier statement: there is almost nothing to worry about.
Vulpes: I just thought of another problem.
Corvus: I’m sure you did.
Vulpes: Here’s the thing. Imagine that MegaBrain is installed at the Department Of Maximum Energy (DOME), and DOME is excited to find new energy sources. So they put MegaBrain to work on designing the Reactor, but when it tells them it would be too dangerous to deploy, they ignore it because they’re too eager to deploy this cool new energy source.
Even though MegaBrain is superhuman in every possible way, DOME plus MegaBrain collectively have a jagged frontier. Together, they are smart enough to design the Reactor, but foolish enough to bungle the deployment.
Corvus: Ah. That feels… uncomfortably plausible.
Vulpes: How’s your anxiety doing?
Corvus: Are we there yet?