Speaking for myself, dunno if this is exactly what Eliezer meant:
The general rule of thumb is that if you want to produce a secure, complex artifact (in any field, not just computer science), you accomplish this by restricting the methods of construction, not by generating an arbitrary artifact using arbitrary methods and then "securing" it later.
If you write a piece of software in a nice formal language using nice software patterns, proving its security can often be pretty easy!
But if you scoop up a binary off the internet that was not written with this in mind, and you want to prove even minimal things about it, you are gonna have a really, really bad time.[1]
So could there be methods that reliably generate "benign" [2] cognitive algorithms?[3] Yes, likely so!
But are there methods that can take 175B FP numbers generated by unknown slop methods and prove them safe? Much more doubtful.
In fact, it can often be basically completely impossible, even for simple problems!
For example, think of the Collatz Conjecture. It's an extremely simple statement about an extremely simple system that could easily pop up in a "messy" computational system... and currently we can't prove it, despite massive amounts of effort poured into it over the years!
What is the solution? Restrict your methods so they never generate artifacts that have "generalized Collatz problems" in them!
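For concreteness, the entire "system" in question fits in a few lines of Python (the function name is mine); proving that this loop halts for every positive input is exactly the open problem:

```python
def collatz_steps(n: int) -> int:
    """Count Collatz steps until n reaches 1.

    Whether this loop terminates for every positive n is exactly the
    open Collatz Conjecture, despite the code being a handful of lines.
    """
    assert n >= 1
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps

print(collatz_steps(27))  # 111 -- verified empirically, not proven in general
```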
[2] As in, it's tractable for modern humans to prove their "safety".
[3] Probably not encoded as 175B floating point numbers...
For example, think of the Collatz Conjecture. It's an extremely simple statement that could easily pop up in a "messy" computational system...and we totally can't prove anything about it!
This is sloppily presented and false as currently written, and in any case doesn't support the argument it's being used for.[1] As a sample illustration of "something" we can prove about it: for all sufficiently large x, at least x^0.84 integers between 1 and x eventually reach 1 once the algorithm is applied to them.[2]
Thanks for pointing out my imprecise statement there! What I meant, of course, is "we can't prove the Collatz Conjecture" (which is a simple statement about a simple dynamical system), but I wrote something that doesn't precisely say that, so apologies for that.
The main thing I intended to convey here is that the amount of effort that goes into proving simple things (including the things you mentioned that were in fact proven!) is often unintuitively high to people not familiar with this, and that this happens all over CS and math.
I found Connor’s text very helpful and illuminating!
…But yeah, I agree about sloppy wording.
Instead of “you want to prove even minimal things about it” I think he should have said “you want to prove certain important things about it”. Or actually, he could have even said “you want to have an informed guess about certain important things about it”. Maybe a better example would be “it doesn’t contain a backdoor”—it’s trivial if you’re writing the code yourself, hard for a binary blob you find on the internet. Having access to someone else’s source code helps but is not foolproof, especially at scale (e.g.).
Well, hmm, I guess it’s tautological that if you’re writing your own code, you can reliably not put backdoors in it. There’s no such thing as an “accidental backdoor”. If it’s accidental then you would call it a “security flaw” instead. But speaking of which, it’s also true that security flaws are much easier to detect or rule out if you’re writing the code yourself than if you find a binary blob on the internet.
Or the halting problem: it’s super-easy to write code that will definitely halt, but there are at least some binary blobs for which it is impossible in practice to know or prove whether they halt.
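A minimal sketch of that asymmetry, with made-up function names: the first loop halts by construction (its counter strictly decreases), while whether the second loop halts is equivalent to an open question (the existence of an odd perfect number), even though both are tiny:

```python
def definitely_halts(n: int) -> int:
    # Halts by construction: the loop counter strictly decreases to 0.
    total = 0
    while n > 0:
        total += n
        n -= 1
    return total

def halts_only_if_an_odd_perfect_number_exists() -> int:
    # Scans odd numbers for one equal to the sum of its proper divisors.
    # Whether this loop ever halts is a famous open problem.
    n = 3
    while sum(d for d in range(1, n) if n % d == 0) != n:
        n += 2
    return n
```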
As I understand it, the initial Yudkowskian conception of Friendly AI research[1] was for a small, math- and science-inclined team that's been FAI-pilled to first figure out the Deep Math of reflective cognition (see the papers on Tiling Agents as an illustrative example: 1, 2). The point was to create a capability-augmenting recursive self-improvement procedure that preserves the initial goals and values hardcoded into a model (evidence: Web Archive screenshot of the SingInst webpage circa 2006). See also this:
When we try to visualize how all this is likely to go down, we tend to visualize a scenario that someone else once termed “a brain in a box in a basement.” I love that phrase, so I stole it. In other words, we tend to visualize that there’s this AI programming team, a lot like the sort of wannabe AI programming teams you see nowadays, trying to create artificial general intelligence, like the artificial general intelligence projects you see nowadays. They manage to acquire some new deep insights which, combined with published insights in the general scientific community, let them go down into their basement and work in it for a while and create an AI which is smart enough to reprogram itself, and then you get an intelligence explosion.
Then you would figure out a way to encode human values into machine code directly, compute (a rough, imperfect approximation of) humanity's CEV, and initialize a Seed AI with a ton of "hacky guardrails" (Eliezer's own term) aimed at enacting it. Initially the AI would be pretty dumb, but:
So the point is that we might not know the internals of the final version of the FAI; it might be "inscrutable." But that's ok, they said, because we'd know with the certainty of mathematical proof that its goals are nonetheless good.
From there on out, you relax, kick back, and plan the Singularity after-party.
Which will likely seem silly and wildly over-optimistic to observers in hindsight, and in my view should have seemed silly and wildly over-optimistic at the time too.
this was never going to work...
... without the help of an AI that is strong enough to significantly augment the proof research. which we have or nearly have now (may still be a little ways out, but no longer inconceivable). this seems like very much not a dead end, and is the sort of thing I'd expect even an AGI to think necessary in order to solve ASI alignment-to-that-AGI.
exactly what to prove might end up looking a bit different, of course.
I think the emphasis is on inscrutable. If you didn't already know how the deep learning tech tree would go, you could have figured out that Cyc-like hard-coding of "tires are black" &c. is not the way, but you might have hoped that the nature of the learning algorithm would naturally lend itself to a reasonably detailed understanding of the learned content: that the learning algorithm produces "concept" data-structures in this-and-such format which accomplish these-and-such cognitive tasks in this-and-such way, even if there are a billion concepts.
This is what I think he means:
The object-level facts are not written by or comprehensible to humans, no. What's comprehensible is the algorithm the AI agent uses to form beliefs and make decisions based on those beliefs. Yudkowsky often compares gradient descent optimizing a model to evolution optimizing brains, so he seems to think that understanding the outer optimization algorithm is separate from understanding the inner algorithms of the neural network's "mind".
I think what he imagines as a non-inscrutable AI design is something vaguely like "This module takes in sense data and uses it to generate beliefs about the world which are represented as X and updated with algorithm Y, and algorithm Z generates actions, and they're graded with a utility function represented as W, and we can prove theorems and do experiments with all these things in order to make confident claims about what the whole system will do." (The true design would be way more complicated, but still comprehensible.)
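As a toy sketch of what I take that hoped-for design to be (all names hypothetical, and the real thing would be enormously more complex), the point is just that every component has an explicit representation you could state theorems about:

```python
class LegibleAgent:
    """Toy skeleton of the hoped-for non-inscrutable design.

    Each component has a documented representation (the X, Y, Z, W of the
    comment above), so you could in principle prove theorems or run
    experiments on each piece separately.
    """

    def __init__(self, belief_update, action_proposer, utility):
        self.beliefs = {}                        # representation X: explicit belief state
        self.belief_update = belief_update       # algorithm Y
        self.action_proposer = action_proposer   # algorithm Z
        self.utility = utility                   # utility function W

    def step(self, observation):
        self.beliefs = self.belief_update(self.beliefs, observation)
        candidates = self.action_proposer(self.beliefs)
        # Grade candidate actions with the explicit utility function W.
        return max(candidates, key=lambda a: self.utility(self.beliefs, a))
```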
Building on what you said, pre-LLM agent foundations research appears to have made the following assumptions about what advanced AI systems would be like:
I think a natural way to think about this is à la AIXI: cleave the system in two, prediction/world-modeling and values.
With this framing, I think most people would be fine with the world-model being inscrutable, if you can be confident the values are the right ones (and they need to be scrutable for this). I mean, for an ASI this kind of has to be the case. It will understand many things about the world that we don't understand. But the values can be arbitrarily simple.
Kind of my hope for mechanistic interpretability is that we could find something isomorphic to AIXI (with inscrutable neural goop instead of Turing machines) and then do surgery on the sensory-reward part. And that this is feasible because 1) the AIXI-like structure is scrutable, and 2) the values the AI has gotten from training probably will not be scrutable, but we can replace them with something that is, at least "structurally", scrutable.
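A rough sketch of that hope, with made-up names and glossing over the hard part of actually finding such a clean seam in a trained network: keep the world-model opaque, but make the value part a swappable, hand-readable component:

```python
class SplitAgent:
    """AIXI-flavored toy split: opaque world-model, swappable values."""

    def __init__(self, world_model, value_fn):
        self.world_model = world_model  # inscrutable neural goop is fine here
        self.value_fn = value_fn        # this is the part we want to be scrutable

    def choose(self, observation, candidate_actions):
        # Score each action by the value of the predicted outcome.
        predicted = {a: self.world_model(observation, a) for a in candidate_actions}
        return max(candidate_actions, key=lambda a: self.value_fn(predicted[a]))

    def replace_values(self, new_value_fn):
        # The hoped-for "surgery": keep the world-model, swap the values.
        self.value_fn = new_value_fn
```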
Be a little careful with this. It's possible to make the AI do all sorts of strange things via unusual world models. I.e., a paperclip-maximizing AI can believe "everything you see is a simulation, but the simulators will make paperclips in the real world if you do X".
If you're confident that the world model is true, I think this isn't a problem.
And so under this world model we feel doomy about any system which decides based on its values which parts of the world are salient and worth modeling in detail, and which defines its values in terms of learned bits of its world model?
Yeah, I think the hope of us fully understanding a learned system was a fool's errand, and the dream of full interpretability was never actually possible (because the sequences the world presents are very complicated, indexical complexity means your brain needs to be even more complicated, and Shane Legg proved that for Turing-computable learners you can only predict/act on complex sequences by being that complex yourself).
To be frank, I think @Shane_Legg's paper predicted a lot of the reason why MIRI's efforts didn't work, because in practice, a computable theory of learning was just way more complicated than people thought at the time, and it turned out there was no clever shortcut, and the sorts of things that are easy to white-box are also the things that we can't get because they aren't computable by a Turing Machine.
More generally, one of the flaws of early LW work in hindsight, especially before 2012-2013, was not realizing that their attempts to relax the problem by introducing hypercomputers didn't give us new ideas, and that the relaxed problem had no relation to the real problem of making AI safe as AI progresses in this world, such that solutions to one problem fail to transfer to the other.
Here's your citation, @Steven Byrnes for the claim that for Turing-computable learners, you can only predict/act on complex sequences by being that complex yourself.
https://arxiv.org/abs/cs/0606070
The immediate corollary is that as AI gets better, it will inevitably get more and more complicated by default, and it's not going to get any easier to interpret AIs; it will just get harder and harder to interpret the learned parts of the AI.
I think some of the optimism about scrutability might derive from reductionism. Like, if you've got a scrutable algorithm for maintaining a multilevel map, and you've got a scrutable model of the chemistry of a tire, you could pass through the multilevel map to find the higher-level description of the tire.
I suspect the crux here is whether or not you believe it's possible to have a "simple" model of intelligence. Intuitively, the question here is something like, "Does intelligence ultimately boil down to some kind of fancy logic? Or does it boil down to some kind of fancy linear algebra?"
The "fancy logic" view has a long history. When I started working as a programmer, my coworkers were veterans of the 80s AI boom and the following "AI winter." The key hope of those 80s expert systems was that you could encode knowledge using definitions and rules. This failed.
But the "fancy linear algebra" view pull ahead long ago. In the 90s, researchers in computational linguistics, computer vision and classification realized that linear algebra worked far better than fancy collections of rules. Many of these subfields leaped ahead. There were dissenters: Cyc continued to struggle off in a corner somewhere, and the semantic web tried to badly reinvent Prolog. The dissenters failed.
The dream of Cyc-like systems is eternal, and each new generation reinvents it. But it has systematically lost on nearly every benchmark of intelligence.
Fundamentally, real-world intelligence has to take a giant pile of numbers as input, weigh those numbers with a complex system, and produce a probability distribution as output. When that's the shape of the problem, your system is inevitably something very much like a giant matrix. (In practice, it turns out you need a bunch of smaller matrices connected by non-linearities.)
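For concreteness, here's roughly what "a bunch of smaller matrices connected by non-linearities" means, as a toy numpy sketch (shapes arbitrary; this is the generic shape of the solution, not any particular real system):

```python
import numpy as np

rng = np.random.default_rng(0)

# A giant pile of numbers in, a probability distribution out,
# via smaller matrices connected by non-linearities.
W1 = rng.normal(size=(256, 128))   # first weight matrix
W2 = rng.normal(size=(128, 10))    # second weight matrix

def predict(x: np.ndarray) -> np.ndarray:
    h = np.maximum(x @ W1, 0.0)            # linear map + ReLU non-linearity
    logits = h @ W2                        # another linear map
    exp = np.exp(logits - logits.max())    # softmax -> probability distribution
    return exp / exp.sum()

probs = predict(rng.normal(size=256))
assert abs(probs.sum() - 1.0) < 1e-9
```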
Before 2022, it appeared to me that Yudkowsky was trapped in the same mirage that trapped the creators of Cyc and the Semantic Web and 80s expert systems.
But in 2025, Yudkowsky appears to believe that the current threat absolutely comes from giant inscrutable matrices. And as far as I can tell, he has become very pessimistic about any kind of robust "alignment".
Personally, this is also my viewpoint: There is almost certainly no robust version of alignment, and even "approximate alignment" will come under vast strain if we develop superhuman systems with goals. So I would answer your question in the affirmative: As far as I can see, inscrutability was always inevitable.
I appreciate the clear argument as to why "fancy linear algebra" works better than "fancy logic".
And I understand why things that work better tend to get selected.
I do challenge "inevitable" though. It doesn't help us to survive.
If linear algebra probably kills everyone but logic probably doesn't, tell everyone and agree to prefer to use the thing that works worse.
the human-created source code must be defining a learning algorithm of some sort. And then that learning algorithm will figure out for itself that tires are usually black etc. Might this learning algorithm be simple and legible? Yes! But that was true for GPT-3 too
Simple first-order learning algorithms have types of patterns they recognize, and meta-learning algorithms also have types of patterns they like.
In order to make a friendly or aligned AI, we will have to have some insight into what types of patterns we are going to have it recognize, and separately what types of things it is going to like or find salient.
There was a simple calculation protocol which generated GPT-3. The part that was not simple was translating that into predicting its preferences or perceptual landscape, and hence what it would do after it was turned on. And if you can't predict how a parameter will respond to input, you can't architect it one-shot.
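To illustrate the contrast, here's a toy version of that kind of simple calculation protocol (a plain SGD loop on a one-parameter model; all names mine): the protocol is a few legible lines, while everything the system "knows" ends up in the learned parameters, which for GPT-3 would be ~175B numbers instead of one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noisy y = 3x. The loop below is the legible "protocol";
# the learned weight is the artifact it produces.
xs = rng.normal(size=1000)
ys = 3.0 * xs + 0.1 * rng.normal(size=1000)

w = 0.0                      # the "model": one number here, ~175B in GPT-3
lr = 0.01
for step in range(1000):     # the entire training protocol fits in a few lines
    i = rng.integers(len(xs))
    grad = 2 * (w * xs[i] - ys[i]) * xs[i]   # gradient of squared error
    w -= lr * grad                           # SGD update

print(w)  # ~3.0 -- easy to read with one weight, not with billions of them
```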
Thanks for writing this. This kind of pointing out statements that seem intuitive on the surface but don't obviously hold up (or at least require more explanation) is super high value.
Imagine a system where you just type in the word "tires", and you get a list of everything the AI knows about tires, as english sentences or similar.
You can change the sentence from “tires are usually black” to “tires are usually pink”, and the AI's beliefs change accordingly.
This is, in some sense, a very scrutable AI system. And it's not obviously impossible.
Except suppose the AI went "tires are usually pink. Tires are usually dyed with amorphous carbon. Therefore amorphous carbon is pink." and carries on like that, deducing a bizarre alternate-reality version of chemistry where an electron's charge takes one value if it's spin up and a different value if it's spin down. And it somehow all works out into a self-consistent system of physics. And every fact you told the AI somehow matches up. (Many many facts you didn't tell the AI are wildly different.)
Suddenly it doesn't feel so scrutable.
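Here's a toy illustration of that failure mode (everything below is made up): a tiny "scrutable" knowledge base with one chaining rule, where editing a single legible fact quietly propagates into derived beliefs you never intended:

```python
# Toy "scrutable" knowledge base: facts are editable English-ish triples,
# and a single chaining rule propagates edits into derived beliefs.

facts = {
    ("tires", "are usually", "black"),
    ("tires", "are usually dyed with", "amorphous carbon"),
}

def derived_beliefs(facts):
    """If X is usually some color and X is usually dyed with M, conclude M is that color."""
    beliefs = set(facts)
    for (x1, r1, color) in facts:
        if r1 != "are usually":
            continue
        for (x2, r2, material) in facts:
            if x2 == x1 and r2 == "are usually dyed with":
                beliefs.add((material, "is", color))
    return beliefs

print(derived_beliefs(facts))        # includes ("amorphous carbon", "is", "black")

# Edit one legible fact...
facts.discard(("tires", "are usually", "black"))
facts.add(("tires", "are usually", "pink"))

# ...and a belief you never intended appears downstream.
print(derived_beliefs(facts))        # now includes ("amorphous carbon", "is", "pink")
```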
You can change the sentence from “tires are usually black” to “tires are usually pink”, and the AI's beliefs change accordingly.
This is, in some sense, a very scrutable AI system. And it's not obviously impossible.
This isn't what Eliezer is talking about or interested in. From Dark Side Epistemology in the Sequences:
If you once tell a lie, the truth is ever after your enemy.
I have discussed the notion that lies are contagious. If you pick up a pebble from the driveway, and tell a geologist that you found it on a beach—well, do you know what a geologist knows about rocks? I don’t. But I can suspect that a water-worn pebble wouldn’t look like a droplet of frozen lava from a volcanic eruption. Do you know where the pebble in your driveway really came from? Things bear the marks of their places in a lawful universe; in that web, a lie is out of place.
What sounds like an arbitrary truth to one mind—one that could easily be replaced by a plausible lie—might be nailed down by a dozen linkages to the eyes of greater knowledge.
Here’s a 2022 Eliezer Yudkowsky tweet:
I find this confusing.
Here’s a question: are object-level facts about the world, like “tires are usually black”, encoded directly in the human-created AGI source code?
So what is he talking about?
Here are some options!
Or something else? I dunno.
Prior related discussions on this forum: “Glass box learners want to be black box” (Cole Wyeth, 2025); “Giant (In)scrutable Matrices: (Maybe) the Best of All Possible Worlds” (1a3orn, 2023); “Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc” (Wentworth, 2022) (including my comment on the latter).