Speaking for myself, dunno if this is exactly what Eliezer meant:
The general rule of thumb is that if you want to produce a secure, complex artifact (in any field, not just computer science), you accomplish this by restricting the methods of construction, not by generating an arbitrary artifact using arbitrary methods and then "securing" it later.
If you write a piece of software in a nice formal language using nice software patterns, proving its security can often be pretty easy!
But if you scoop up a binary off the internet that was not written with this in mind, and you want to prove even minimal things about it, you are gonna have a really, really bad time.[1]
So could there be methods that reliably generate "benign" [2] cognitive algorithms?[3] Yes, likely so!
But are there methods that can take 175B FP numbers generated by unknown slop methods and prove them safe? Much more doubtful.
In fact, it can often be basically completely impossible, even for simple problems!
For example, think of the Collatz Conjecture. It's an extremely simple statement about an extremely simple system that could easily pop up in a "messy" computational system... and currently we can't prove it, despite massive amounts of effort poured into it over the years!
What is the solution? Restrict your methods so they never generate artifacts that have "generalized Collatz problems" in them!
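For concreteness, the entire "system" in question fits in a few lines of Python (the function name is mine); proving that this loop halts for every positive input is exactly the open problem:

```python
def collatz_steps(n: int) -> int:
    """Count Collatz steps until n reaches 1.

    Whether this loop terminates for every positive n is exactly the
    open Collatz Conjecture, despite the code being a handful of lines.
    """
    assert n >= 1
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps

print(collatz_steps(27))  # 111 -- verified empirically, not proven in general
```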
[2] As in, it's tractable for modern humans to prove their "safety".
[3] Probably not encoded as 175B floating point numbers...
For example, think of the Collatz Conjecture. It's an extremely simple statement that could easily pop up in a "messy" computational system...and we totally can't prove anything about it!
This is sloppily presented and false as currently written, and in any case doesn't support the argument it's being used for.[1] As a sample illustration of "something" we can prove about it: for all sufficiently large x, at least x^0.84 integers between 1 and x eventually reach 1 once the algorithm is applied to them.[2]
Thanks for pointing out my imprecise statement there! What I meant, of course, is "we can't prove the Collatz Conjecture" (which is a simple statement about a simple dynamical system), but I wrote something that doesn't precisely say that, so apologies for that.
The main thing I intended to convey here is that the amount of effort that goes into proving simple things (including the things you mentioned that were in fact proven!) is often unintuitively high to people not familiar with this, and that this happens all over CS and math.
I found Connor’s text very helpful and illuminating!
…But yeah, I agree about sloppy wording.
Instead of “you want to prove even minimal things about it” I think he should have said “you want to prove certain important things about it”. Or actually, he could have even said “you want to have an informed guess about certain important things about it”. Maybe a better example would be “it doesn’t contain a backdoor”—it’s trivial if you’re writing the code yourself, hard for a binary blob you find on the internet. Having access to someone else’s source code helps but is not foolproof, especially at scale (e.g.).
Well, hmm, I guess it’s tautological that if you’re writing your own code, you can reliably not put backdoors in it. There’s no such thing as an “accidental backdoor”. If it’s accidental then you would call it a “security flaw” instead. But speaking of which, it’s also true that security flaws are much easier to detect or rule out if you’re writing the code yourself than if you find a binary blob on the internet.
Or the halting problem: it’s super-easy to write code that will definitely halt, but there are at least some binary blobs for which it is impossible in practice to know or prove whether they halt.
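A minimal sketch of that asymmetry, with made-up function names: the first loop halts by construction (its counter strictly decreases), while whether the second loop halts is equivalent to an open question (the existence of an odd perfect number), even though both are tiny:

```python
def definitely_halts(n: int) -> int:
    # Halts by construction: the loop counter strictly decreases to 0.
    total = 0
    while n > 0:
        total += n
        n -= 1
    return total

def halts_only_if_an_odd_perfect_number_exists() -> int:
    # Scans odd numbers for one equal to the sum of its proper divisors.
    # Whether this loop ever halts is a famous open problem.
    n = 3
    while sum(d for d in range(1, n) if n % d == 0) != n:
        n += 2
    return n
```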
As I understand it, the initial Yudkowskian conception of Friendly AI research[1] was for a small, math- and science-inclined team that's been FAI-pilled to first figure out the Deep Math of reflective cognition (see the papers on Tiling Agents as an illustrative example: 1, 2). The point was to create a capability-augmenting recursive self-improvement procedure that preserves the initial goals and values hardcoded into a model (evidence: Web Archive screenshot of the SingInst webpage circa 2006). See also this:
When we try to visualize how all this is likely to go down, we tend to visualize a scenario that someone else once termed “a brain in a box in a basement.” I love that phrase, so I stole it. In other words, we tend to visualize that there’s this AI programming team, a lot like the sort of wannabe AI programming teams you see nowadays, trying to create artificial general intelligence, like the artificial general intelligence projects you see nowadays. They manage to acquire some new deep insights which, combined with published insights in the general scientific community, let them go down into their basement and work in it for a while and create an AI which is smart enough to reprogram itself, and then you get an intelligence explosion.
Then you would figure out a way to encode human values into machine code directly, compute (a rough, imperfect approximation of) humanity's CEV, and initialize a Seed AI with a ton of "hacky guardrails" (Eliezer's own term) aimed at enacting it. Initially the AI would be pretty dumb, but:
So the point is that we might not know the internals of the final version of the FAI; it might be "inscrutable." But that's ok, they said, because we'd know with the certainty of mathematical proof that its goals are nonetheless good.
From there on out, you relax, kick back, and plan the Singularity after-party.
Which will likely seem silly and wildly over-optimistic to observers in hindsight, and in my view should have seemed silly and wildly over-optimistic at the time too.
this was never going to work...
... without the help of an AI that is strong enough to significantly augment the proof research. which we have or nearly have now (may still be a little ways out, but no longer inconceivable). this seems like very much not a dead end, and is the sort of thing I'd expect even an AGI to think necessary in order to solve ASI alignment-to-that-AGI.
exactly what to prove might end up looking a bit different, of course.
I think the emphasis is on inscrutable. If you didn't already know how the deep learning tech tree would go, you could have figured out that Cyc-like hard-coding of "tires are black" &c. is not the way, but you might have hoped that the nature of the learning algorithm would naturally lend itself to a reasonably detailed understanding of the learned content: that the learning algorithm produces "concept" data-structures in this-and-such format which accomplish these-and-such cognitive tasks in this-and-such way, even if there are a billion concepts.
This is what I think he means:
The object-level facts are not written by or comprehensible to humans, no. What's comprehensible is the algorithm the AI agent uses to form beliefs and make decisions based on those beliefs. Yudkowsky often compares gradient descent optimizing a model to evolution optimizing brains, so he seems to think that understanding the outer optimization algorithm is separate from understanding the inner algorithms of the neural network's "mind".
I think what he imagines as a non-inscrutable AI design is something vaguely like "This module takes in sense data and uses it to generate beliefs about the world which are represented as X and updated with algorithm Y, and algorithm Z generates actions, and they're graded with a utility function represented as W, and we can prove theorems and do experiments with all these things in order to make confident claims about what the whole system will do." (The true design would be way more complicated, but still comprehensible.)
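As a toy sketch of what I take that hoped-for design to be (all names hypothetical, and the real thing would be enormously more complex), the point is just that every component has an explicit representation you could state theorems about:

```python
class LegibleAgent:
    """Toy skeleton of the hoped-for non-inscrutable design.

    Each component has a documented representation (the X, Y, Z, W of the
    comment above), so you could in principle prove theorems or run
    experiments on each piece separately.
    """

    def __init__(self, belief_update, action_proposer, utility):
        self.beliefs = {}                        # representation X: explicit belief state
        self.belief_update = belief_update       # algorithm Y
        self.action_proposer = action_proposer   # algorithm Z
        self.utility = utility                   # utility function W

    def step(self, observation):
        self.beliefs = self.belief_update(self.beliefs, observation)
        candidates = self.action_proposer(self.beliefs)
        # Grade candidate actions with the explicit utility function W.
        return max(candidates, key=lambda a: self.utility(self.beliefs, a))
```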
Building on what you said, pre-LLM agent foundations research appears to have made the following assumptions about what advanced AI systems would be like:
I think a natural way to think about this is à la AIXI: cleave the system in two, prediction/world-modeling and values.
With this framing, I think most people would be fine with the world-model being inscrutable, if you can be confident the values are the right ones (and they need to be scrutable for this). I mean, for an ASI this kind of has to be the case. It will understand many things about the world that we don't understand. But the values can be arbitrarily simple.
Kind of my hope for mechanistic interpretability is that we could find something isomorphic to AIXI (with inscrutable neural goop instead of Turing machines) and then do surgery on the sensory-reward part. And that this is feasible because 1) the AIXI-like structure is scrutable, and 2) the values the AI has gotten from training probably will not be scrutable, but we can replace them with something that is, at least "structurally", scrutable.
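A rough sketch of that hope, with made-up names and glossing over the hard part of actually finding such a clean seam in a trained network: keep the world-model opaque, but make the value part a swappable, hand-readable component:

```python
class SplitAgent:
    """AIXI-flavored toy split: opaque world-model, swappable values."""

    def __init__(self, world_model, value_fn):
        self.world_model = world_model  # inscrutable neural goop is fine here
        self.value_fn = value_fn        # this is the part we want to be scrutable

    def choose(self, observation, candidate_actions):
        # Score each action by the value of the predicted outcome.
        predicted = {a: self.world_model(observation, a) for a in candidate_actions}
        return max(candidate_actions, key=lambda a: self.value_fn(predicted[a]))

    def replace_values(self, new_value_fn):
        # The hoped-for "surgery": keep the world-model, swap the values.
        self.value_fn = new_value_fn
```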
Be a little careful with this. It's possible to make the AI do all sorts of strange things via unusual world models. I.e., a paperclip-maximizing AI can believe "everything you see is a simulation, but the simulators will make paperclips in the real world if you do X".
If you're confident that the world model is true, I think this isn't a problem.
And so under this world model we feel doomy about any system which decides based on its values which parts of the world are salient and worth modeling in detail, and which defines its values in terms of learned bits of its world model?
Yeah, I think the hope of us fully understanding a learned system was a fool's errand, and the dream of full interpretability was never actually possible (because the sequences the world presents are very complicated, indexical complexity means your brain needs to be even more complicated, and Shane Legg proved that for Turing-computable learners you can only predict/act on complex sequences by being that complex yourself).
To be frank, I think @Shane_Legg's paper predicted a lot of the reason why MIRI's efforts didn't work, because in practice, a computable theory of learning was just way more complicated than people thought at the time, and it turned out there was no clever shortcut, and the sorts of things that are easy to white-box are also the things that we can't get because they aren't computable by a Turing Machine.
More generally, one of the flaws of early LW work in hindsight, especially before 2012-2013, was not realizing that their attempts to relax the problem by introducing hypercomputers didn't give us new ideas, and that the relaxed problem had no relation to the real problem of making AI safe as AI progresses in this world, such that solutions to one problem fail to transfer to the other.
Here's your citation, @Steven Byrnes for the claim that for Turing-computable learners, you can only predict/act on complex sequences by being that complex yourself.
https://arxiv.org/abs/cs/0606070
The immediate corollary is that as AI gets better, it will inevitably get more and more complicated by default, and it's not going to get any easier to interpret AIs; it will just get harder and harder to interpret the learned parts of the AI.
I think some of the optimism about scrutability might derive from reductionism. Like, if you've got a scrutable algorithm for maintaining a multilevel map, and you've got a scrutable model of the chemistry of a tire, you could pass through the multilevel map to find the higher-level description of the tire.
I suspect the crux here is whether or not you believe it's possible to have a "simple" model of intelligence. Intuitively, the question here is something like, "Does intelligence ultimately boil down to some kind of fancy logic? Or does it boil down to some kind of fancy linear algebra?"
The "fancy logic" view has a long history. When I started working as a programmer, my coworkers were veterans of the 80s AI boom and the following "AI winter." The key hope of those 80s expert systems was that you could encode knowledge using definitions and rules. This failed.
But the "fancy linear algebra" view pull ahead long ago. In the 90s, researchers in computational linguistics, computer vision and classification realized that linear algebra worked far better than fancy collections of rules. Many of these subfields leaped ahead. There were dissenters: Cyc continued to struggle off in a corner somewhere, and the semantic web tried to badly reinvent Prolog. The dissenters failed.
The dream of Cyc-like systems is eternal, and each new generation reinvents it. But it has systematically lost on nearly every benchmark of intelligence.
Fundamentally, real-world intelligence has to take a giant pile of numbers as input, weigh those numbers with a complex system, and produce a probability distribution as output. When that's the shape of the problem, your system is inevitably something very much like a giant matrix. (In practice, it turns out you need a bunch of smaller matrices connected by non-linearities.)
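For concreteness, here's roughly what "a bunch of smaller matrices connected by non-linearities" means, as a toy numpy sketch (shapes arbitrary; this is the generic shape of the solution, not any particular real system):

```python
import numpy as np

rng = np.random.default_rng(0)

# A giant pile of numbers in, a probability distribution out,
# via smaller matrices connected by non-linearities.
W1 = rng.normal(size=(256, 128))   # first weight matrix
W2 = rng.normal(size=(128, 10))    # second weight matrix

def predict(x: np.ndarray) -> np.ndarray:
    h = np.maximum(x @ W1, 0.0)            # linear map + ReLU non-linearity
    logits = h @ W2                        # another linear map
    exp = np.exp(logits - logits.max())    # softmax -> probability distribution
    return exp / exp.sum()

probs = predict(rng.normal(size=256))
assert abs(probs.sum() - 1.0) < 1e-9
```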
Before 2022, it appeared to me that Yudkowsky was trapped in the same mirage that trapped the creators of Cyc and the Semantic Web and 80s expert systems.
But in 2025, Yudkowsky appears to believe that the current threat absolutely comes from giant inscrutable matrices. And as far as I can tell, he has become very pessimistic about any kind of robust "alignment".
Personally, this is also my viewpoint: There is almost certainly no robust version of alignment, and even "approximate alignment" will come under vast strain if we develop superhuman systems with goals. So I would answer your question in the affirmative: As far as I can see, inscrutability was always inevitable.
I appreciate the clear argument as to why "fancy linear algebra" works better than "fancy logic".
And I understand why things that work better tend to get selected.
I do challenge "inevitable" though. It doesn't help us to survive.
If linear algebra probably kills everyone but logic probably doesn't, tell everyone and agree to prefer to use the thing that works worse.
the human-created source code must be defining a learning algorithm of some sort. And then that learning algorithm will figure out for itself that tires are usually black etc. Might this learning algorithm be simple and legible? Yes! But that was true for GPT-3 too
Simple first-order learning algorithms have types of patterns they recognize, and meta-learning algorithms also have types of patterns they like.
In order to make a friendly or aligned AI, we will have to have some insight into what types of patterns we are going to have it recognize, and separately what types of things it is going to like or find salient.
There was a simple calculation protocol which generated GPT-3. The part that was not simple was translating that into predicting its preferences or perceptual landscape, and hence what it would do after it was turned on. And if you can't predict how a parameter will respond to input, you can't architect it one-shot.
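To illustrate the contrast, here's a toy version of that kind of simple calculation protocol (a plain SGD loop on a one-parameter model; all names mine): the protocol is a few legible lines, while everything the system "knows" ends up in the learned parameters, which for GPT-3 would be ~175B numbers instead of one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noisy y = 3x. The loop below is the legible "protocol";
# the learned weight is the artifact it produces.
xs = rng.normal(size=1000)
ys = 3.0 * xs + 0.1 * rng.normal(size=1000)

w = 0.0                      # the "model": one number here, ~175B in GPT-3
lr = 0.01
for step in range(1000):     # the entire training protocol fits in a few lines
    i = rng.integers(len(xs))
    grad = 2 * (w * xs[i] - ys[i]) * xs[i]   # gradient of squared error
    w -= lr * grad                           # SGD update

print(w)  # ~3.0 -- easy to read with one weight, not with billions of them
```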
Thanks for writing this. This kind of pointing out statements that seem intuitive on the surface but don't obviously hold up (or at least require more explanation) is super high value.
Imagine a system where you just type in the word "tires", and you get a list of everything the AI knows about tires, as english sentences or similar.
You can change the sentence from “tires are usually black” to “tires are usually pink”, and the AI's beliefs change accordingly.
This is, in some sense, a very scrutable AI system. And it's not obviously impossible.
Except suppose the AI went "tires are usually pink. Tires are usually dyed with amorphous carbon. Therefore amorphous carbon is pink." and carries on like that, deducing a bizarre alternate-reality version of chemistry where an electron's charge takes one value if it's spin up and a different value if it's spin down. And it somehow all works out into a self-consistent system of physics. And every fact you told the AI somehow matches up. (Many many facts you didn't tell the AI are wildly different.)
Suddenly it doesn't feel so scrutable.
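Here's a toy illustration of that failure mode (everything below is made up): a tiny "scrutable" knowledge base with one chaining rule, where editing a single legible fact quietly propagates into derived beliefs you never intended:

```python
# Toy "scrutable" knowledge base: facts are editable English-ish triples,
# and a single chaining rule propagates edits into derived beliefs.

facts = {
    ("tires", "are usually", "black"),
    ("tires", "are usually dyed with", "amorphous carbon"),
}

def derived_beliefs(facts):
    """If X is usually some color and X is usually dyed with M, conclude M is that color."""
    beliefs = set(facts)
    for (x1, r1, color) in facts:
        if r1 != "are usually":
            continue
        for (x2, r2, material) in facts:
            if x2 == x1 and r2 == "are usually dyed with":
                beliefs.add((material, "is", color))
    return beliefs

print(derived_beliefs(facts))        # includes ("amorphous carbon", "is", "black")

# Edit one legible fact...
facts.discard(("tires", "are usually", "black"))
facts.add(("tires", "are usually", "pink"))

# ...and a belief you never intended appears downstream.
print(derived_beliefs(facts))        # now includes ("amorphous carbon", "is", "pink")
```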
You can change the sentence from “tires are usually black” to “tires are usually pink”, and the AI's beliefs change accordingly.
This is, in some sense, a very scrutable AI system. And it's not obviously impossible.
This isn't what Eliezer is talking about or interested in. From Dark Side Epistemology in the Sequences:
If you once tell a lie, the truth is ever after your enemy.
I have discussed the notion that lies are contagious. If you pick up a pebble from the driveway, and tell a geologist that you found it on a beach—well, do you know what a geologist knows about rocks? I don’t. But I can suspect that a water-worn pebble wouldn’t look like a droplet of frozen lava from a volcanic eruption. Do you know where the pebble in your driveway really came from? Things bear the marks of their places in a lawful universe; in that web, a lie is out of place.
What sounds like an arbitrary truth to one mind—one that could easily be replaced by a plausible lie—might be nailed down by a dozen linkages to the eyes of greater knowledge.
Here’s a 2022 Eliezer Yudkowsky tweet:
I find this confusing.
Here’s a question: are object-level facts about the world, like “tires are usually black”, encoded directly in the human-created AGI source code?
So what is he talking about?
Here are some options!
Or something else? I dunno.
Prior related discussions on this forum: “Glass box learners want to be black box” (Cole Wyeth, 2025); “Giant (In)scrutable Matrices: (Maybe) the Best of All Possible Worlds” (1a3orn, 2023); “Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc” (Wentworth, 2022) (including my comment on the latter).