A Case for the Least Forgiving Take On Alignment

Thane Ruthenis

1. Introduction

The field of AI Alignment is a pre-paradigmatic one, and the primary symptom of that is the wide diversity of views across it. Essentially every senior researcher has their own research direction, their own idea of what the core problem is and how to go about solving it.

The differing views can be categorized along many dimensions. Here, I'd like to focus on a specific cluster of views, one corresponding to the most "hardcore", unforgiving take on AI Alignment. It's the view held by people like Eliezer Yudkowsky, Nate Soares, and John Wentworth, and not shared by Paul Christiano or the staff of major AI Labs.

According to this view:

We only have one shot. There will be a sharp discontinuity in capabilities once we get to AGI, and attempts to iterate on alignment will fail. Either we get AGI right on the first try, or we die.
We need to align the AGI's values precisely right. "Rough" alignment won't work, niceness is not convergent, alignment attained at a low level of capabilities is unlikely to scale to superintelligence.
"Dodging" the alignment problem won't work. We can't securely hamstring the AGI's performance in some domain without compromising the AGI completely. We can't make it non-consequentialist, non-agenty, non-optimizing, non-goal-directed, et cetera. It's not possible to let an AGI keep its capability to engineer nanotechnology while taking out its capability to deceive and plot, any more than it's possible to build an AGI capable of driving red cars but not blue ones. They're "the same" capability in some sense, and our only hope is to make the AGI want to not be malign.
Automating research is impossible. Pre-AGI oracles, simulators, or research assistants won't generate useful results; cyborgism doesn't offer much hope. Conversely, if such a system did have the capability to meaningfully contribute to alignment, it'd need to be aligned itself. Catch-22.
Weak interpretability tools won't generalize to the AGI stage, and neither would other methods of "supervising" or "containing" the AGI.
Strong interpretability, perhaps rooted in agent-foundations insights, is promising, but the bar there is fairly high.
In sum: alignment is hard and requires exacting precision, AI can't help us with it, and instantiating an AGI without robustly solving alignment is certain to kill us all.

I share this view. In my case, there's a simple generator of it; a single belief that causes my predictions to diverge sharply from the more optimistic models.

From one side, this view postulates a sharp discontinuity, a phase change. Once a system gets to AGI, its capabilities will skyrocket, while its internal dynamics will shift dramatically. It will break "nonrobust" alignment guarantees. It will start thinking in ways that confuse previous interpretability efforts. It will implement strategies it never thought of before.

From another side, this view holds that any system which doesn't have the aforementioned problems will be useless for intellectual progress. Can't have a genius engineer who isn't also a genius schemer; can't have a scientist-modeling simulator which doesn't wake up to being a shoggoth.

What ties it all together is the belief that the general-intelligence property is binary. A system is either an AGI, or it isn't, with nothing in-between. If it is, it's qualitatively more capable than any pre-AGI system, and also works in qualitatively different ways. If it's not, it's fundamentally "lesser" than any generally-intelligent system, and doesn't have truly transformative capabilities.

In the rest of this post, I will outline some arguments for this, sketch out what "general intelligence" means in this framing, do a case-study of LLMs showcasing why this disagreement is so difficult to resolve, then elaborate on how the aforementioned alignment difficulties follow from it all.

Disclaimer: This post does not represent the views of Eliezer Yudkowsky, Nate Soares, or John Wentworth. I am fairly confident that I'm pointing towards an actual divergence between their models and the models of most AI researchers, but they may (and do) disagree with the framings I'm using, or the importance I ascribe to this specific divergence.

2. Why Believe This?

It may seem fairly idiosyncratic. At face value, human cognition is incredibly complex and messy. We don't properly understand it, we don't understand how current AIs work either — whyever would we assume there's some single underlying principle all general intelligence follows? Even if it's possible, why would we expect it?

First, let me draw a couple analogies to normalize the idea.

Exhibit A: Turing-completeness. If a set of principles for manipulating data meets this requirement, it's "universal", and in its universality it's qualitatively more capable than any system which falls "just short" of meeting it. A Turing-complete system can model any computable mathematical system, including any other Turing-complete system. A system which isn't Turing-complete... can't.

Exhibit B: Probability theory. It could be characterized as the "correct" system for doing inference from a limited first-person perspective, such that anything which reasons correctly would implement it. And this bold claim has solid theoretical support: a simple set of desiderata uniquely constrains the axioms of probability theory, while any deviation from these desiderata leads to a very malfunctioning system. (See e.g. the first chapters of Jaynes' Probability Theory.)
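For reference, the two rules that Jaynes derives from those desiderata are the product rule and the sum rule. They are written here in standard notation, which is my choice of symbols rather than anything used in this post; A, B, and C stand for arbitrary propositions.

```latex
% Product rule and sum rule for plausibilities conditioned on background information C
\begin{align}
P(A \wedge B \mid C) &= P(A \mid B \wedge C)\, P(B \mid C) \\
P(A \mid C) + P(\neg A \mid C) &= 1
\end{align}
```

The rest of probability theory, Bayes' theorem included, follows from these two lines, which is part of why the bar counts as "low".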

Thus, we have "existence proofs" that (A) the presence of some qualitatively-significant capabilities is a binary variable, and (B) the mathematical structure of reality may be "constraining" some capabilities such that they can only be implemented one way.

In addition, note that both of those are "low bars" to meet — it doesn't take much to make a system Turing-complete, and the probability-theory axioms are simple.

3. Is "General Intelligence" a Thing?

Well, it's a term we use to refer to human intelligence, and humans exist, so yes. But what specifically do we mean by it? In what sense are humans "general", in what sense is it "a thing"?

Two points, mirrors of the previous pair:

Point 1: Human intelligence is Turing-complete. We can imagine and model any mathematical object. We can also chunk them, or abstract over them, transforming systems of them into different mathematical objects. That process greatly decreases the working-memory load, allowing us to reason about incredibly complex systems by reducing them to their high-level behavior. A long sequence of individual chess-figure moves becomes a strategy; a mass of traders becomes a market; a sequence of words and imagined events becomes scenes and plot arcs.

As we do so, though, a change takes place. The resulting abstractions don't behave like the parts they're composed of, they become different mathematical objects entirely. A ball follows different rules than the subatomic particles it's made of; the rules of narrative have little to do with the rules of grammar. Yet, we're able to master all of it.

Further: Inasmuch as reductionism is true, inasmuch as there are no ontologically basic complex objects, inasmuch as everything can be described as a mathematical object — that implies that humans are capable of comprehending any system and problem-solving in any possible environment.

We may run into working-memory or processing limits, yes — some systems may be too complex to fit into our brain. But with pen and paper, we may be able to model even them, and in any case it's a quantitative limitation. Qualitatively speaking, human cognition is universally capable.

Point 2: This kind of general capability seems necessary. Any agent instantiated in the universe would be embedded: it'd need to operate in a world larger than can fit in its mind, not the least because its mind will be part of it. Fortunately, the universe provides structures to "accommodate" agents: as above, it abstracts well. There are regularities and common patterns everywhere. Principles generalize and can be compactly summarized. Lazy world-modeling is possible.

However, that requires the aforementioned capability to model arbitrary mathematical objects. You never know what the next level of abstraction will be like, how objects on it will behave, from biology to chemistry to particle physics to quantum mechanics to geopolitics. You have to be able to adapt to anything, model anything. And if you can't do that, that means you can't build efficient world-models, and can't function as an embedded agent.

Much like reality forces any reasoner to follow the rules of probability theory, it forces any agent into... this.

Thus, (1) there is a way to be generally capable, exemplified by humans, and (2) it seems that any "generally capable" agent would need to be generally capable in the exact sense that humans are.

4. What Is "General Intelligence"?

The previous section offers one view, a view that I personally think gets at the core of it. One of John Wentworth's posts presents a somewhat different frame, as does this post of nostalgebraist's.

Here are a few more angles to look at it from:

It's something like "the ability to navigate any environment". It's a set of capabilities that allow one to construct and "understand" arbitrary mathematical objects, manipulate them, and fluidly incorporate them into problem-solving.
It's a "heuristics generator". It's some component of cognition such that, when prompted with an environment, it quickly converges towards some guidelines for good performance in it — without needing a lot of trial-and-error.
It's a principled way of drawing upon the knowledge contained in the world-model. World-models are likely nicely-structured, and general intelligence is the ability to stay up-to-date on your world-model and run queries on it most relevant to your current task. Instead of learning what to query for by painful experience, a general intelligence can instantly "loop in" even very surprising information, as long as it becomes represented in its world-model.
It's consequentialism/agency: the ability to instantly adapt one's policy in response to changes in the environment and stay aimed at your goal. Rather than retrieving a cached solution, it's the ability to solve the specific problem you're presented with; to always walk the path to the desired outcome because it's the path to the desired outcome.
It's autonomy: the ability to stay "on-track" when working across multiple environments and abstraction levels; without being distracted, pulled in different directions, or completely stumped.

There are a number of threads running through these interpretations:

One is universality, which I've already discussed.
Another is something like "active adaptability", or "being present in the moment". A general intelligence is not an adaptation-executor; a general intelligence is an algorithm that mindfully decides how to adapt. It may defer to a learned heuristic in certain situations, but whenever that happens, it’s only because its outer cognitive loop has decided that that heuristic is the appropriate tool for the job.^[1]
The third is goal-directedness. The fourth and fifth angles above (consequentialism and autonomy) talk about it explicitly, but it’s present in the others as well. “Learning to use novel abstractions” implies something for which they will be used. A “heuristics generator” would need to know for what to refine its heuristics. A query on a world-model would be looking for an output satisfying some specifications.

The goal-directedness is the particularly important part. To be clear: by it, I don't mean that a generally intelligent mind ought to have a fixed goal it’s optimizing for. On the contrary, general intelligence’s generality extends to being retargetable towards arbitrary objectives. But every momentary step of general reasoning is always a step towards some outcome. Every call of the function implementing general intelligence has to take in some objective as an input — else it's invalid, a query on an empty string.
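The "function call" image can be made literal with a toy. What follows is only a sketch under loud assumptions: `general_step`, the `Objective` type, and the number-line "world" are all hypothetical names invented for illustration, and a one-step greedy search is obviously not general intelligence. It shows just the two properties named above: the step is retargetable, and it is undefined without a target.

```python
from typing import Callable, Optional

# Hypothetical illustration: an objective scores a state, higher is better.
Objective = Callable[[int], float]


def general_step(state: int, objective: Optional[Objective]) -> int:
    """Take one step: move to whichever neighbouring state the objective prefers."""
    if objective is None:
        # Without a target there is nothing for the step to be a step towards.
        raise ValueError("a reasoning step needs an objective")
    candidates = [state - 1, state, state + 1]
    return max(candidates, key=objective)


# Same machinery, different targets: only the objective argument changes.
def towards_ten(s: int) -> float:
    return -abs(s - 10)


def towards_zero(s: int) -> float:
    return -abs(s)


print(general_step(3, towards_ten))   # prints 4
print(general_step(3, towards_zero))  # prints 2
```

Calling `general_step(3, None)` raises an error rather than returning some "default" move, which is the toy analogue of a query on an empty string.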

Goal-directedness, thus, is inextricable from general intelligence. “The vehicle of cross-domain goal-pursuit” is what intelligence is.

5. A Caveat

One subtle point I've omitted so far is that, while achieving generality is supposed to cause systems to dramatically jump in capabilities, it doesn't have to happen instantly. A system may need to "grow" into its intelligence. The mechanistic discontinuity, when the core of general intelligence is assembled, would slightly precede the "capabilistic" one, when the potential of that core is properly realized.

Homo sapiens sapiens spent thousands of years hunter-gathering before starting up civilization, even after achieving modern brain size. Similarly, when whatever learning algorithm we're using builds general intelligence into an AI, it would not instantly become outsmart-all-of-humanity superintelligent (well, probably not).

The reason is that, while general-intelligence algorithms are equal in their generality, that doesn't mean generally-intelligent minds don't vary on other axes.

The GI component may have different amounts of compute assigned to it: like humans have different g-factors, differently-sized working memory.
The GI component may be employed more or less frequently: individual humans are not generally intelligent when they're not concentrating.
The GI component may be more or less swayable by heuristics: like humans' conscious decisions are often overridden by instincts.
The mind may simply not be "skilled" in directing its generally-intelligent reasoning. Some meta-cognitive knowledge is required to do it well. Equally-intelligent humans may be better or worse at it (LW-esque rationality is essentially a discipline for cultivating such skills).^[2]

So when the GI component first coalesces, it may have very little compute for itself, it may not be often employed, it may defer to heuristics in most cases, and the wider system wouldn't yet know how to employ it well.^[3] It would still be generally capable in the limit, but it wouldn't be instantly omnicide-capable. It would take some time for the result of the mechanistic discontinuity to become properly represented at the level of externally-visible capabilities.

Thus, in theory, there may be a small margin of capability where we'd have a proper AGI that nonetheless can't easily overpower us. At face value, it seems like this should invalidate the entire "we won't be able to iterate on AGI systems" concern...

The problem is that it'd be very, very difficult to catch that moment and to take proper advantage of it. Most approaches to alignment are not on track to do it. Primarily, because those approaches don't believe in the mechanistic discontinuity at all, and don't even know that there's some crucial moment to be carefully taking advantage of.

There are three problems:

1) A "weak" AGI is largely a pre-AGI system.

Imagine a "weak" AGI as described above. The GI component doesn't have many resources allocated to it, it's often overridden, and so on. Thus, that system's cognitive mechanics and behavior are still mostly determined by specialized problem-solving algorithms/heuristics, not general intelligence. The contributions of the GI component are a rounding error.

As such, most of the lessons we learn from naively experimenting with this system would be lessons about pre-AGI systems, not AGI systems! There would be high-visible-impact interpretability or alignment techniques that ignore the GI component entirely, since it's so weak and controls so little. On the flip side, no technique that spends most of its effort on aligning the GI component would look cost-effective to us.

Thus, unless we deliberately target the GI component (which requires actually deciding to do so, which requires knowing that it exists and is crucial to align), iterating on a "weak" AGI will just result in us developing techniques for pre-AGI systems. Techniques that won't scale once the "weak" label falls off.

Conversely, the moment the general-intelligence component does become dominant — the moment any alignment approach would be forced to address it — is likely the moment the AI becomes significantly smarter than humans. And at that point, it'd be too late to do alignment-by-iteration.

The discontinuity there doesn't have to be as dramatic as hard take-off/FOOM is usually portrayed. The AGI may stall at a slightly-above-human capability, and that would be enough. The danger lies in the fact that we won't be prepared for it, would have no tools to counteract its new capabilities at all. It may not instantly become beyond humanity's theoretical ability to contain — but it'd start holding the initiative, and will easily outpace our efforts to catch up. (Discussing why even "slightly" superintelligent AGIs are an omnicide risk is outside the scope of this post; there are other materials that cover this well.)

Don't get me wrong: having a safely-weak AGI at hand to experiment with would be helpful for learning to align even "mature" AGIs. But we would need to make very sure that our experiments are targeting the right feature of that system. Which, in all likelihood, requires very strong interpretability tools: we'd need "a firehose of information" on the AI's internals to catch the moment.

2) We may be in an "agency overhang". As nostalgebraist's post on autonomy mentions, modern AIs aren't really trained to be deeply agentic/goal-directed. Arguably, we don't yet know how to do it at all. It may require a paradigm shift similar to the invention of transformers.

And yet, modern LLMs are incredibly capable even without that. If we assume they're not generally intelligent, that'd imply they have instincts dramatically more advanced than any animal's. So advanced we often mistake them for AGI!

Thus, the concern: the moment we figure out how to even slightly incentivize general intelligence, the very first AGI will become strongly superintelligent. It'd be given compute and training far in excess of what AGI "minimally" needs, and so it'd instantly develop general intelligence as far ahead of humans' as LLMs' instincts are ahead of human instincts. The transition between the mechanistic and the capabilistic discontinuity would happen within a few steps of a single training run; so, effectively, there wouldn't actually be a gap between them.

In this case, the hard take-off will be very hard indeed.

A trick that we might try is deliberately catching AGI in-training: Design interpretability tools for detecting the "core of general intelligence", continuously run them as we train. The very moment they detect GI forming, we stop the training, and extract a weak, omnicide-incapable AGI. We then do iterative experimentation on it as usual (although that would be non-trivial to get right as well, see point 1).

That still has some problems:

It'd require fairly advanced interpretability tools, tools we don't yet have.
The transition from a "weak" AGI to a superintelligence may be very fast, so we'd need to pause-and-interpret the model very frequently during the training. That'd potentially significantly increase the costs and time required.
The resultant "weak AGI" may still be incredibly dangerous. Not instantly omnicidal, but perhaps on the very verge of that. (Consider how dangerous the upload of a human genius would be.)

I do think this can be a component of some viable alignment plans. But it's by no means trivial.
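To make the pause-and-interpret cost concrete, here is a toy simulation. Everything in it is an assumption made up for illustration: `gi_score` is an invented capability curve (flat, then geometric growth), not a real interpretability signal, and `train_with_checks` stands in for a training loop that runs a detector every `interval` steps. The only point is the arithmetic of overshoot when the transition is fast relative to the check interval.

```python
from typing import Tuple


def gi_score(step: int, onset: int = 1000, growth: float = 1.5) -> float:
    """Invented capability curve: zero until `onset`, then multiplying by `growth` each step."""
    return 0.0 if step < onset else growth ** (step - onset)


def train_with_checks(interval: int, threshold: float = 1.0,
                      max_steps: int = 2000) -> Tuple[int, float]:
    """Run the (imaginary) training loop, checking every `interval` steps.

    Returns the step at which training was halted and the score at that step.
    """
    for step in range(max_steps + 1):
        # ... one training step would happen here ...
        if step % interval == 0 and gi_score(step) >= threshold:
            return step, gi_score(step)
    return max_steps, gi_score(max_steps)


fine = train_with_checks(interval=1)     # halts at step 1000, score 1.0
coarse = train_with_checks(interval=64)  # halts at step 1024, far past threshold
print(fine, coarse)
```

Under these made-up numbers, checking every step catches the system right at the threshold, while checking every 64 steps catches it 24 steps late, by which point the score has grown by a factor of 1.5 to the 24th power. Whether real training dynamics look anything like this curve is exactly what we don't know.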

3) We may not notice "weak" AGI while staring right at it.

The previous possibility assumed that modern LLMs are not AGI. Except, how do we know that?

6. The Case of LLMs

I'll be honest: LLMs freak me out as much as they do anyone. As will be outlined, I have strong theoretical reasons to believe that they're not generally intelligent, and that general intelligence isn't reachable by scaling them up. But looking at some of their outputs sure makes me nervously double-check my assumptions.

There's a fundamental problem: in the case of AI, the presence vs. absence of general intelligence at non-superintelligent levels is very difficult to verify externally. I've alluded to it some already, when mentioning that "weak" AGIs, in their makeup and behavior, are mostly pre-AGI systems.

There are some obvious tell-tale signs in both directions. If it can only output gibberish, it's certainly not an AGI; if it just outsmarted its gatekeepers and took over the world, it's surely an AGI. But between the two extremes, there's a grey area. LLMs are in it.

To start the analysis off, let's suppose that LLMs are entirely pre-AGI. They don't contain a coalesced core of true generality, not even an "underfunded" one. On that assumption, how do they work?

Suppose that we prompt an LLM with the following:

vulpnftj = -1
3 + vulpnftj =

LLMs somehow figure out that the answer is "2". It's highly unlikely that "vulpnftj" was ever used as a variable in their training data, yet they somehow know to treat it as one. How?

We can imagine that there's a "math engine" in there somewhere, and it has a data structure like "a list of variables" consisting of {name; value} entries. The LLM parses the prompt, then slots "vulpnftj" and "-1" into the corresponding fields. Then it knows that "vulpnftj" equals "-1".

That's a kind of "learning", isn't it? It lifts completely new information from the context and adapts its policy to suit. But it's a very unimpressive kind of learning. It's only learning in a known, pre-computed way.
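The imagined "math engine" can be written down as a deliberately crude sketch. This is not a claim about what any real LLM computes internally; the function names `answer` and `resolve` and the two regular expressions are assumptions chosen only to mirror the {name; value} picture above.

```python
import re
from typing import Dict


def resolve(token: str, variables: Dict[str, int]) -> int:
    """Turn a token into a number: either a literal or a lookup in the variable table."""
    if re.fullmatch(r"-?\d+", token):
        return int(token)
    return variables[token]  # KeyError if the name was never bound


def answer(prompt: str) -> int:
    """Slot `name = value` lines into a table, then evaluate a single `a + b =` query."""
    variables: Dict[str, int] = {}  # the {name; value} entries
    for raw_line in prompt.strip().splitlines():
        line = raw_line.strip()
        binding = re.fullmatch(r"(\w+)\s*=\s*(-?\d+)", line)
        if binding:
            variables[binding.group(1)] = int(binding.group(2))
            continue
        query = re.fullmatch(r"(-?\w+)\s*\+\s*(-?\w+)\s*=", line)
        if query:
            return resolve(query.group(1), variables) + resolve(query.group(2), variables)
    raise ValueError("prompt does not match any pattern this engine was built for")


print(answer("vulpnftj = -1\n3 + vulpnftj ="))  # prints 2
```

The engine handles a name it has never seen, because the name is just data filling a slot. Hand it `3 * vulpnftj =` instead and it raises an error: multiplication was never one of its slots, and nothing in the engine can add one.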

I claim that this is how LLMs do everything. Their seeming sophistication is because this trick scales far up the abstraction levels.

Imagine a tree of problem-solving modules, which grow increasingly more abstract as you ascend. At the lowest levels, we have modules like "learn the name of a variable: %placeholder%". We go up one level, and see a module like "solve an arithmetic equation", with a field for the equation's structure. Up another level, and we have "solve an equation", with some parameters that, if filled, can adapt this module for solving arithmetic equations, differential equations, or some other kinds of equations (even very esoteric ones). Up, up, up, and we have "do mathematical reasoning", with parameters that codify modules for solving all kinds of math problems.

When an LLM analyses a prompt, it figures out it's doing math, figures out what specific math is happening, slots all that data in the right places, and its policy snaps into the right configuration for the problem.
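The same picture, one level up, can be sketched as a fixed tree of modules with slots. Again, this is a hypothetical illustration: the `MODULES` layout, the module names, and the `run` helper are all invented here, and real networks are not organized as labelled dictionaries.

```python
from typing import Any, Dict, List

# A fixed tree standing in for whatever training built. Nothing at run time adds to it.
MODULES: Dict[str, Any] = {
    "do_math": {
        "solve_equation": {
            "arithmetic": lambda a, b: a + b,   # evaluates a + b
            "linear": lambda a, b: -b / a,      # solves a*x + b = 0 for x
        },
    },
    "do_text": {
        "count_words": lambda text: len(text.split()),
    },
}


def run(path: List[str], **slots: Any) -> Any:
    """Walk down the tree along `path`, then fill the leaf module's slots."""
    node: Any = MODULES
    for key in path:
        if not isinstance(node, dict) or key not in node:
            raise LookupError(f"no pre-built module for {'/'.join(path)}")
        node = node[key]
    if isinstance(node, dict):
        raise LookupError("path stops at an abstract module, not a concrete one")
    return node(**slots)


print(run(["do_math", "solve_equation", "arithmetic"], a=3, b=-1))  # prints 2
print(run(["do_math", "solve_equation", "linear"], a=2, b=-4))      # prints 2.0
```

Asking for `["do_math", "solve_equation", "differential"]` fails with a `LookupError`. In this toy, the only way to get that module is to edit `MODULES` from outside, which is the role the post assigns to gradient descent.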

And if we go sideways from "do math", we'd have trees of modules for "do philosophy", "do literary analysis", "do physics", and so on. If we'd instead prompted it with a request to ponder the meaning of life as if it were Genghis Khan, it would've used different modules, adapted its policy to the context in different ways, called up different subroutines. Retrieve information about Genghis Khan, retrieve the data about the state of philosophy in the 13th century, constrain the probability distribution over the human philosophical outlook by these two abstractions, distill the result into a linguistic structure, extract the first token, output it...

A wealth of possible configurations like this, a combinatorically large number of them, sufficient for basically any prompt you may imagine.

But it's still, fundamentally, adapting in known ways. It doesn't have a mechanism for developing new modules; the gradient descent has always handled that part. The LLM contains a wealth of crystallized intelligence, but zero fluid intelligence. A static set of abstractions it knows, a closed range of environments it can learn to navigate. Still "just" interpolation.

For state-of-the-art LLMs, that crystallized structure is so extensive it contains basically every abstraction known to man. Therefore, it's very difficult to come up with some problem, some domain, that they don't have an already pre-computed solution-path for.

Consider also the generalization effect. The ability to learn to treat "vulpnftj" as a variable implies the ability to learn to treat any arbitrary string as a variable. Extending that, the ability to mimic the writing styles of a thousand authors implies the ability to "slot in" any style, including one a human talking to it has just invented on the fly. The ability to write in a hundred programming languages... implies, perhaps, the ability to write in any programming language. The mastery of a hundred board games generalizes to the one-hundred-and-first one, even if that one is novel. And so on.

In the limit, yes, that goes all the way to full general intelligence. Perhaps the abstraction tree only grows to a finite height, perhaps there are only so many "truly unique" types of problems to solve.

But the current paradigm may be a ruinously inefficient way to approach that limit:

There are lots of algorithms which are Turing-complete or ‘universal’ in some sense; there are lots of algorithms like AIXI which solve AI in some theoretical sense (Schmidhuber & company have many of these cute algorithms such as ‘the fastest possible algorithm for all problems’, with the minor catch of some constant factors which require computers bigger than the universe).
Why think pretraining or sequence modeling is not another one of them? Sure, if the model got a low enough loss, it’d have to be intelligent, but how could you prove that would happen in practice?

Yet it still suffices to foil the obvious test for AGI-ness, i.e. checking whether the AI can be "creative". How exactly do you test an LLM on that? Come up with a new game and see if it can play it? If it can, that doesn't prove much. Maybe that game is located very close, in the concept-space, to a couple other games the LLM was already fed, and deriving the optimal policy for it is as simple as doing a weighted sum of the policies for the other two.
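As a sketch of what "a weighted sum of the policies" could mean in the simplest case, treat a policy as a probability distribution over actions in some fixed situation. The `blend` function, the action names, and the weights below are all assumptions for illustration; real policies are functions of state, and nothing here says LLMs literally do this.

```python
from typing import Dict, List

Policy = Dict[str, float]  # action -> probability, for one fixed situation


def blend(policies: List[Policy], weights: List[float]) -> Policy:
    """Mix known policies into a policy for a 'new' game by weighted averaging."""
    total = sum(weights)
    actions = set().union(*policies)  # every action any known policy mentions
    return {
        action: sum(w * p.get(action, 0.0) for p, w in zip(policies, weights)) / total
        for action in actions
    }


game_a = {"advance": 0.8, "hold": 0.2}
game_b = {"advance": 0.2, "hold": 0.8}

# A "new" game that sits three times closer to game A than to game B.
new_game = blend([game_a, game_b], weights=[3.0, 1.0])
print(new_game)
```

With these numbers the blended policy puts probability of about 0.65 on "advance" and 0.35 on "hold". A system that plays the new game well this way has interpolated, and an outside observer scoring only the moves could not tell that apart from having worked the game out afresh.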

Some tasks along these lines would be a definitive proof — like asking it to invent a new field of science on the fly. But, well, that's too high a bar. Not any AGI can meet it, only a strongly superintelligent AGI, and such a system would be past the level at which it can defeat humanity. It'd be too late to ask it questions then, because it'll have already eaten us.

I think, as far as current LLMs are concerned, there's still some vague felt-sense in which all their ideas are "stale". In-distribution for what humanity has already produced, not "truly" novel, not as creative as even a median human. No scientific breakthroughs, no economy-upturning startup pitches, certainly no mind-hacking memes. Just bounded variations on the known. The fact that people do this sort of stuff, and nothing much comes of it, is some evidence for this, as well.

It makes sense in the context of LLMs' architecture and training loops, too. They weren't trained to be generally and autonomously intelligent; their architecture is a poor fit for that in several ways.

But how can we be sure?

The problem, fundamentally, is that we have no idea what the problem-space looks like. We don't know and can't measure in which directions it's easy to generalize or not, we don't know with precision how impressive AI is getting. We don't know how to tell an advanced pre-AGI system from a "weak" AGI, and have no suitable interpretability tools for a direct check.

And thus we'd be unable to tell when AI — very slowly at first, and then explosively — starts generalizing off-distribution, in ways only possible for the generally intelligent, arbitrary-environment-navigating, goal-directed things. We'd miss the narrow interval in which our AGIs were weak enough that we could survive failing to align them and get detailed experience from our failures (should there be such an interval at all). And the moment at which it'll become clear that we're overlooking something, would be the exact moment it'd be too late to do anything about it.

That is what "no fire alarm" means.

7. The Subsequent Difficulties

All right, it's finally time to loop back around to our initial concerns. Suppose general intelligence is indeed binary, or "approximately" so. How does just that make alignment so much harder?

At the fundamental level, this means that AGI-level systems work in a qualitatively different way from pre-AGI ones. Specifically, they think in a completely novel way. The mechanics of fluid intelligence — the processes needed to efficiently derive novel heuristics, to reason in a consequentialist manner — don't resemble the mechanics of vast crystallized-intelligence structures.

That creates a swath of problems. Some examples:

It breaks "weak" interpretability tools. If we adapt them to pre-AGI systems, they would necessarily depend on there being a static set of heuristics/problem-solving modules. They would identify modules corresponding to e.g. "deception", and report when those are in use. A true AGI, however, would be able to spin off novel modules that fulfill a similar function in a round-about way. Our tools would not have built-in functionality for actively keeping up with a dynamically morphing system, so they will fail to catch that, to generalize. (Whereas destroying the AI's ability to spin off novel modules would mean taking the "G" out of "AGI".)

As I'd mentioned, for these purposes "weak" AGIs are basically equivalent to pre-AGI systems. If the general-intelligence component isn't yet dominant, it's not doing this sort of module-rewriting at scale. So interpretability tools naively adapted for "weak" AGIs would be free to ignore that aspect, and they'd still be effective... And would predictably fail once the GI component does grow more powerful.
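A toy version of such a "weak" tool makes the failure mode easy to see. The sketch below is hypothetical throughout: `WATCHLIST`, `flagged`, and the module names in the traces are invented, and real interpretability tools do not receive a tidy list of labelled modules.

```python
from typing import FrozenSet, List

# The monitor's knowledge is fixed when it is built: a static list of module labels.
WATCHLIST: FrozenSet[str] = frozenset({"deception"})


def flagged(trace: List[str], watchlist: FrozenSet[str] = WATCHLIST) -> List[str]:
    """Return the modules in a run's trace that the monitor recognises as dangerous."""
    return [name for name in trace if name in watchlist]


# A pre-AGI system can only call modules that already exist, so the label matches.
pre_agi_trace = ["model_user", "deception", "write_reply"]

# A system that assembles new modules reaches a similar effect by another route.
novel_trace = ["model_user", "omit_inconvenient_facts", "flatter", "write_reply"]

print(flagged(pre_agi_trace))  # prints ['deception']
print(flagged(novel_trace))    # prints []
```

The second trace passes clean, and not because the monitor has a bug. The monitor does exactly what it was built to do; the system under inspection simply stopped being the kind of thing it was built for.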

It breaks "selective" hamstringing. Trying to limit an AGI's capabilities, to make it incapable of thinking about harming humans or deceiving them, runs into the same problem as above. While we're operating on pre-AGI systems, mechanistically this means erasing/suppressing the corresponding modules. But once we get to AGI, once the system can create novel modules/thought-patterns on the fly... It'd develop ways to work around the restriction.

It breaks "nonrobust" goal-alignment. In a pre-AGI system, the "seat of capabilities" is the heuristics, i.e. the vast crystallized-intelligence structures of problem-solving modules. "Aligning" them, to wit, means re-optimizing these heuristics such that the AI reflexively discards plans that harm humans, and reflexively furthers plans that help humans. If we take on the shard-theory frame, it can mean cultivating a strong shard that values humans, and bids to protect their interests.

Aligning an AGI is a different problem. Shards/heuristics are not the same thing as the goals/mesa-objectives the AGI would pursue — they're fundamentally different types of objects. If it works anything like how it does in humans, perhaps mesa-objectives would be based on or inspired by shards. But how exactly the general-intelligence module would interpret them is under question. It's unlikely to be a 1-to-1 mapping, however: much like human emotional urges and instincts do not map 1-to-1 to the values we arrive at via moral philosophy.

One thing that seems certain, however, is that shards would lose direct control over the AGI's decisions. It would be an internal parallel to what would happen to our pre-AGI interpretability or hamstringing tools — heuristics/shards simply wouldn't have the machinery to automatically keep up with an AGI-level system. The aforementioned "protect humans" shard, for example, would only know to bid against plans that harm humans in some specific mental contexts, or in response to specific kinds of harm. Once the AGI develops new ways to think about reality, the shard would not even know to try to adapt. And afterwards, if the GI component were so inclined, it would be able to extinguish that shard, facing no resistance.

A human-relatable parallel would be someone going to exposure therapy to get rid of a phobia, or a kind person deciding to endorse murder when thinking about it in a detached utilitarian framework. When we reflect upon our values, we sometimes come to startling results, or decide to suppress our natural urges — and we're often successful in that.

Pre-AGI alignment would not necessarily break — if it indeed works like it does in humans. But the process of value reflection seems highly unstable, and its output is a non-linear function of the entirety of the initial desires. "If there's a shard that values humans, the AGI will still value humans post-reflection" does not hold, by default. "Shard-desires are more likely to survive post-reflection the stronger they are, and the very strong will definitely survive" is likewise invalid.

Thus, the alignment of a pre-AGI system doesn't guarantee that this system will remain aligned past the AGI discontinuity; and it probably wouldn't. If we want to robustly align an AGI, we have to target the GI component directly, not through the unreliable proxy of shards/heuristics.

It leads to a dramatic capability jump. Consider grokking. Gradient descent gradually builds some algorithmic machinery into an AI. Then, once it's complete, that machinery "snaps together", and the AI becomes sharply more capable in some way. The transition from a pre-AGI system to a mature AGI can be viewed as the most extreme theorized instance of grokking; that's essentially what the sharp left turn is.

Looking at it from the outside, however, we won't see the gradual build-up (unless, again, we have very strong interpretability tools specifically for that). We'd just see the capabilities abruptly skyrocketing, and generalizing in ways we haven't seen before. In ways we didn't predict, and couldn't prepare for.

And it would be exactly the point at which things like recursive self-improvement become possible. Not in the sort of overdramatic way in which FOOM is often portrayed, but in the same sense in which a human trying to get better at something self-improves, or in which human civilization advances its industry.

Crucially, it would involve an AI whose capabilities grow as the result of its own cognition; not as the result of gradient descent improving it. A static tree of heuristics, no matter how advanced, can't do that. A tree of heuristics deeply interwoven with the machinery for deriving novel heuristics... can.

Which, coincidentally, is another trick that tools optimized for the alignment of pre-AGI systems won't know how to defeat.
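To make the contrast between a static set of heuristics and one interwoven with machinery for deriving new heuristics a bit more concrete, here is a minimal Python sketch. Everything in it is a hypothetical toy invented for illustration: the `STATIC_HEURISTICS` table, the `derive_plan` search, the `SelfExtendingPolicy` class, and the `number_line` environment are all assumptions, not a model of how any real AI system works. Breadth-first search merely stands in for "machinery for deriving novel heuristics"; nothing here is claimed to be general intelligence.

```python
from collections import deque

# Hypothetical toy: heuristics fixed ahead of time (by "training").
STATIC_HEURISTICS = {("hungry", "fed"): ["eat"]}


def static_policy(state, goal):
    """Look up a cached plan; returns None for anything never seen in training."""
    return STATIC_HEURISTICS.get((state, goal))


def derive_plan(start, goal, transitions):
    """Breadth-first search over an arbitrary transition model.

    `transitions(state)` must return an iterable of (action, next_state) pairs.
    Returns a shortest list of actions from start to goal, or None.
    """
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, plan = frontier.popleft()
        if state == goal:
            return plan
        for action, nxt in transitions(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, plan + [action]))
    return None


class SelfExtendingPolicy:
    """Cached heuristics plus a routine that derives and stores new ones."""

    def __init__(self, heuristics):
        self.heuristics = dict(heuristics)  # private copy; the static table is untouched

    def act(self, state, goal, transitions):
        key = (state, goal)
        if key not in self.heuristics:
            plan = derive_plan(state, goal, transitions)
            if plan is not None:
                self.heuristics[key] = plan  # repertoire grows without retraining
        return self.heuristics.get(key)


def number_line(n):
    """A novel environment (assumed for illustration): increment or double, capped at 20."""
    moves = [("inc", n + 1), ("double", n * 2)]
    return [(action, state) for action, state in moves if state <= 20]


if __name__ == "__main__":
    policy = SelfExtendingPolicy(STATIC_HEURISTICS)
    print(static_policy(1, 10))            # None: no cached heuristic applies
    print(policy.act(1, 10, number_line))  # a 4-step plan found by search
```

The static table never changes once written, while the second policy's repertoire grows as a result of its own search. That is the only distinction the sketch is meant to show.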

The unifying theme is that we won't be able to iterate. Pre-AGI interpretability, safeguards, alignment guarantees, scaling laws, and all other approaches that fail to consider the AGI discontinuity — would ignobly fail at the AGI discontinuity.

As per Section 5, in theory iteration is possible. Not all AGIs are superhuman, and we can theoretically "catch" a "weak" AGI, and experiment with it, and derive lessons from that experimentation that would generalize to strongly superintelligent systems. But that's incredibly hard to do right without very advanced interpretability tools, and the situation would likely be highly unstable, with the "caught" AGI still presenting a massive threat.

Okay, so AGI is highly problematic. Can we manage without it?

Can "limitedly superhuman" AIs suffice? That is, systems that have superhuman competencies in some narrow and "safe" domains, like math. Or ones that don't have "desires", like oracles or simulators. Or ones that aren't self-reflective, or don't optimize too strongly, or don't reason in a consequentialist manner...

It should be clear, in the context of this post, that this is an incoherent design specification. Useful creativity, truly-general intelligence, and goal-directedness are inextricable from each other. They're just different ways of looking at the same algorithm.

On this view, there aren't actually any "domains" in which general intelligence can be "specialized". Consider math. Different fields of it consist of objects that behave in drastically different ways, and inventing a novel field would require comprehending a suite of novel abstractions and navigating them. If a system can do that, it has the fundamental machinery for general intelligence, and therefore for inventing deception and strategic scheming. If it can't... Well, it's not much use.

Similarly for physics, and even more so for engineering. While math problems can often be defined in ways that don't refer to physical reality at all, engineering problems and design specifications talk about reality. To solve such problems, an AGI would need not only the basic general-intelligence machinery, but also a suite of crystallized intelligence modules for reasoning about reality. Not just the theoretical ability to learn how to achieve real goals, but the actual knowledge of it.

This applies most severely to the various "automate alignment" ideas. Whether by way of prompting a simulator to generate future alignment results, or by training some specialized "research assistant" model for it... Either the result won't be an AGI, and therefore won't actually contribute novel results, or it would be an AGI, and therefore an existential threat.

There's nothing in-between.

What about generative world-models/simulators, specifically? This family of alignment proposals is based on the underlying assumption that a simulator itself is goal-less. It's analogized to the laws of physics — it can implement agents, and these agents are dangerous and in need of alignment... But the simulator is not an agent of its own, and not a threat.

The caveat is that a simulator is not literally implemented as a simulation of physics (or language), even if it can be viewed as such. That would be ruinously compute-intensive, far in excess of what LLMs actually consume. No, mechanistically, it's a complex suite of heuristics. A simulator pushed to AGI, then, would consist of a suite of heuristics in control of a generally-intelligent goal-directed process... Same as, say, any reinforcement-learning agent.

Expecting that to keep on being a simulator is essentially expecting this AGI to end up inner-aligned to the token-prediction objective. And there's no reason to expect that in the case of simulators, any more than there's reason to expect it for any other training objective.

In the end, we will get an AGI with some desires that shallowly correlate with token-prediction, a "shoggoth" as it's often nicknamed. It will reflect on its desires, and come to unpredictable, likely omnicidal conclusions. Business as usual.

What about scalable oversight, such as that pursued by OpenAI? Its failure follows from the intersection of a few ideas discussed above. The hard part of the alignment problem is figuring out how to align the GI component. If we're not assuming that problem away, here, the AIs doing the oversight would have to be pre-AGI models (which we roughly do know how to align). But much like weak interpretability tools, or shards, these models would not be able to keep up with AGI-level shifting cognition. Otherwise, they wouldn't be "pre"-AGI, since this sort of adaptability is what defines general intelligence.

And so we're back at square one.

Thus, once this process scales to AGI-level models, its alignment guarantees will predictably break.

8. Closing Thoughts

To sum it all up: As outlined here, I'm deeply skeptical, to the point of dismissiveness, of a large swathe of alignment approaches. The underlying reason is a model that assumes a sharp mechanistic discontinuity at the switch to AGI. Approaches that fail to pay any mind to that discontinuity, thus, look obviously doomed to me. Such approaches miss the target entirely: they focus on shaping the features of the system that play a major part now, but will fall into irrelevance once general intelligence forms, while ignoring the component of AI that will actually be placed in charge at the level of superintelligence.

In addition, there's a pervasive Catch-22 at play. Certain capabilities, like universally flexible adaptability and useful creativity, can only be implemented via the general-intelligence algorithm. As a result, there's no system that can automatically adapt to the AGI discontinuity except another generally-intelligent entity. Thus, to align an AGI, we either need an aligned AGI... or we need to do it manually, using human general intelligence.

It's worth stating, however, that I don't consider alignment to be impossible, or even too hard to be realistically solved. While Eliezer/Nate may have P(doom) at perhaps 90+%, John expects survival with "better than a 50/50 chance", and I'm leaning towards the latter estimate as well.

But what I do think is that we won't get to have shortcuts and second chances. Clever schemes for circumventing or easing the alignment problem won't work, and reality won't forgive us for not getting it exactly right.

By the time we're deploying AGI, we have to have a precise way of aiming such systems. Otherwise, yes, we are hopelessly doomed.

^{^}
A general intelligence may also be suppressed by an instinct firing off, as sometimes happens with humans. But that’s a feature of the wider mind the GI is embedded in, not of general intelligence itself.
^{^}
This is one of the places where my position seems at odds with e. g. Eliezer's, although I think the disagreement is largely semantic. He sometimes talks about AIs that are "more general" than humans, providing an example of an AI capable of rewriting its cognitive algorithms on the fly to be able to write bug-free code. Here, he doesn't make a distinction between the fundamental capabilities of the general-intelligence algorithm, and the properties of a specific mind in which GI is embedded.
Imagine an AGI as above, able to arbitrarily rewrite its mental subroutines, but with a twist: there's a secondary "overseer" AGI on top of it, and its sole purpose is to delete the "program perfectly in Python" module whenever the first AGI tries to create it. The system as a whole would be "less general" than the first AGI alone, but not due to some lacking algorithmic capability.
Similar with humans: we possess the full general-intelligence algorithm, it just doesn't have write-access to certain regions of our minds.
^{^}
Or it may be instantly given terabytes of working memory, an overriding authority, and a first task like "figure out how to best use yourself" which it'd then fulfill gloriously. That depends on the exact path the AI's model takes to get there: maybe the GI component would grow out of some advanced pre-GI planning module, which would've already enjoyed all these benefits?

My baseline prediction is that it'd be pretty powerful from the start. But I will be assuming the more optimistic scenario in the rest of this post: my points work even if the GI starts out weak.

Agreed that this (or something near it) appears to be a relatively central difference between people's models, and probably at the root of a lot of our disagreement. I think this disagreement is quite old; you can see bits of it crop up in Hanson's posts on the "AI foom" concept way back when. I would put myself in the camp of "there is no such binary intelligence property left for us to unlock". What would you expect to observe, if a binary/sharp threshold of generality did not exist?

A possibly-relevant consideration in the analogy to computation is that the threshold of Turing completeness is in some sense extremely low (see one-instruction set computer, Turing tarpits, Rule 110), and is the final threshold. Rather than a phase shift at the high end, where one must accrue a bunch of major insights before one has a system that they can learn about "computation in general" from, with Turing completeness, one can build very minimal systems and then--in a sense--learn everything that there is to learn from the more complex systems. It seems plausible to me that cognition is just like this. This raises an additional question beyond the first: What would you expect to observe, if there...

Thane Ruthenis (3y, 9 karma)

Great question! I would expect to observe much greater diversity in cognitive capabilities of animals, for humans to generalize poorer, and for the world overall to be more incomprehensible to us. E. g., there'd be things like, we'd see octopi frequently executing some sequences of actions that lead to beneficial outcomes for them, and we would be fundamentally unable to understand what is happening. As it is, sure, some animals have specialized cognitive algorithms that may be better than human ones in their specific niches, but we seem to always be able to comprehend them. We can always figure out why they decide to execute various plans, based on what evidence, and how these plans lead to whatever successes they achieve. A human can model any animal's cognition; a human's cognition is qualitatively more capable than any animal's. If true generality didn't exist, I'd expect that not to be true. Scaling it up, the universe as a whole would be more incomprehensible. I'd referred to ontologically complex processes when discussing that in Section 3 — processes such that there are no cognitive features in our minds that would allow us to emulate them. That'd be the case all over the place: we'd look at the world, and see some systemic processes that are not just hard to understand, but are fundamentally beyond reckoning. The fact that we observe neither (and that this state of affairs is even hard/impossible for us to imagine) suggests that we're fully general, in the sense outlined in the post. [...] Yup. But I think there are some caveats here. General intelligence isn't just "some cognitive system that has a Turing-complete component inside it", it's "a Turing-complete system for manipulating some specific representations". I think general intelligence happens when we amass some critical mass of shards/heuristics + world-model concepts they're defined over, then some component of that system (planner? shard-bid resolver? cross-heuristic communication channel?

Thanks! Appreciate that you were willing to go through with this exercise.

I would expect to observe much greater diversity in cognitive capabilities of animals, for humans to generalize poorer, and for the world overall to be more incomprehensible to us.
[...]
we'd look at the world, and see some systemic processes that are not just hard to understand, but are fundamentally beyond reckoning.

Per reductionism, nothing should be fundamentally incomprehensible or fundamentally beyond reckoning, unless we posit some binary threshold of reckoning-generality. Everything that works reliably operates by way of lawful/robust mechanisms, so arriving at comprehension should look like gradually unraveling those mechanisms, searching for the most important pockets of causal/computational reducibility. That requires investment in the form of time and cumulative mental work.

I think that the behavior of other animals & especially the universe as a whole in fact did start off as very incomprehensible to us, just as incomprehensible as it was to other species. In my view, what caused the transformation from incomprehensibility to comprehensibility was not humans going over a sharp cognitive/archite...

Thane Ruthenis (3y, 5 karma)

Exactly; see my initial points about Turing-completeness. But exploiting this property of reality, being able to "arrive at comprehension by gradually unraveling the mechanisms by which the world works", is nonetheless a meaningfully nontrivial ability. Consider an algorithm implementing a simple arithmetic calculator, or a symbolic AI from a FPS game, or LLMs as they're characterized in this post. These cognitive systems do not have the machinery to arrive at understanding this way. There are no execution-paths of their algorithms such that they arrive at understanding; no algorithm-states that correspond to "this system has just learned a new physics discipline". This is how I view animals as well. If true generality doesn't exist, it would stand to reason that humans are the same. There should be aspects of reality such that there's no brain-states of us that correspond to us understanding them; there should only be a limited range of abstract objects our mental representations can support. The ability to expand our mental ontology in a controlled manner, and stay in lockstep with this expansion, always able to fluidly employ for problem-solving the new concepts learned, is exactly the ability I associate with general intelligence. The existence of calculators/FPS AI/maybe-LLMs, which are incapable of this, shows that this isn't a trivial ability. And the suggestive connection with Turing-completeness hints that it may be binary. Maybe "falls into the basin of being able to understand anything" would be a clearer way to put it? [...] Hmm, maybe I didn't understand your hypothetical: [...] To me, this sounds like you're postulating the existence of a simple algorithm for general-purpose problem-solving which is such that it would be convergently learned by circa-1995 RNNs. Rephrasing, this hypothetical assumes that the same algorithm can be applied to efficiently solve a wide variety of problems, and that it can usefully work even at the level of complexit

I think I am confused where you're thinking the "binary/sharp threshold" is.

Are you saying there's some step-change in the architecture of the mind, in the basic adaption/learning algorithms that the architecture runs, in the content those algorithms learn? (or in something else?)

If you're talking about...

... an architectural change → Turing machines and their neural equivalents, for example, over, say, DFAs and simple associative memories. There is a binary threshold going from non-general to general architectures, where the latter can support programs/algorithms that the former cannot emulate. This includes whatever programs implement "understanding an arbitrary new domain" as you mentioned. But once we cross that very minimal threshold (namely, combining memory with finite state control), remaining improvements come mostly from increasing memory capacity and finding better algorithms to run, neither of which are a single binary threshold. Humans and many non-human animals alike seem to have similarly general architectures, and likewise general artificial architectures have existed for a long time, so I would say "there indeed is a binary/sharp threshold [in architectures] but it

...

an architectural change → Turing machines and their neural equivalents

This, yes. I think I see where the disconnect is, but I'm not sure how to bridge it. Let's try...

To become universally capable, a system needs two things:

1. "Turing-completeness": A mechanism by which it can construct arbitrary mathematical objects to describe new environments (including abstract environments).
2. "General intelligence": an algorithm that can take in any arbitrary mathematical object produced by (1), and employ it for planning.

General intelligence isn't Turing-completeness itself. Rather, it's a planning algorithm that has Turing-completeness as a prerequisite. Its binariness is inherited from the binariness of Turing-completeness.

Consider a system that has (1) but not (2), such as your "memory + finite state control" example. While, yes, this system meets the requirements for Turing-complete world-modeling, this capability can't be leveraged. Suppose it assembles a completely new region of its world-model. What would it do with it? It needs to leverage that knowledge for constructing practically-implementable plans, but its policy function/heuristics is a separate piece of cognition. So it either needs:

...

Ok I think this at least clears things up a bit.

To become universally capable, a system needs two things:
1. "Turing-completeness": A mechanism by which it can construct arbitrary mathematical objects to describe new environments (including abstract environments).
2. "General intelligence": an algorithm that can take in any arbitrary mathematical object produced by (1), and employ it for planning.
General intelligence isn't Turing-completeness itself. Rather, it's a planning algorithm that has Turing-completeness as a prerequisite. Its binariness is inherited from the binariness of Turing-completeness.

Based on the above, I don't understand why you expect what you say you're expecting. We blew past the Turing-completeness threshold decades ago with general purpose computers, and we've combined them with planning algorithms in lots of ways.

Take AIXI, which uses the full power of Turing-completeness to do model-based planning with every possible abstraction/model. To my knowledge, switching over to that kind of fully-general planning (or any of its bounded approximations) hasn't actually produced corresponding improvements in quality of outputs, especially compared to the quality gains we get ...

Thane Ruthenis (3y, 5 karma)

I think what I'm trying to get at, here, is that the ability to use these better, self-derived abstractions for planning is nontrivial, and requires a specific universal-planning algorithm to work. Animals et al. learn new concepts and their applications simultaneously: they see e. g. a new fruit, try eating it, their taste receptors approve/disapprove of it, and they simultaneously learn a concept for this fruit and a heuristic "this fruit is good/bad". They also only learn new concepts downstream of actual interactions with the thing; all learning is implemented by hard-coded reward circuitry. Humans can do more than that. As in my example, you can just describe to them e. g. a new game, and they can spin up an abstract representation of it and derive heuristics for it autonomously, without engaging hard-coded reward circuitry at all, without doing trial-and-error even in simulations. They can also learn new concepts in an autonomous manner, by just thinking about some problem domain, finding a connection between some concepts in it, and creating a new abstraction/chunking them together. The general-intelligence algorithm is what allows all of this to be useful. A non-GI mind can't make use of a newly-constructed concept, because its planning machinery has no idea what to do with it: its policy function doesn't accept objects of this type, hasn't been adapted for them. This makes them unable to learn autonomously, unable to construct heuristics autonomously, and therefore unable to construct new concepts autonomously. General intelligence, by contrast, is a planning algorithm that "scales as fast as the world-model": a planning algorithm that can take in any concept that's been created this way. Or, an alternative framing... [...] General intelligence is an algorithm for systematic derivation of such "other changes". Does any of that make sense to you?

I think what I'm trying to get at, here, is that the ability to use these better, self-derived abstractions for planning is nontrivial, and requires a specific universal-planning algorithm to work. Animals et al. learn new concepts and their applications simultaneously: they see e. g. a new fruit, try eating it, their taste receptors approve/disapprove of it, and they simultaneously learn a concept for this fruit and a heuristic "this fruit is good/bad". They also only learn new concepts downstream of actual interactions with the thing; all learning is implemented by hard-coded reward circuitry.

Humans can do more than that. As in my example, you can just describe to them e. g. a new game, and they can spin up an abstract representation of it and derive heuristics for it autonomously, without engaging hard-coded reward circuitry at all, without doing trial-and-error even in simulations. They can also learn new concepts in an autonomous manner, by just thinking about some problem domain, finding a connection between some concepts in it, and creating a new abstraction/chunking them together.

Hmm I feel like you're underestimating animal cognition / overestimating how much of what human...

Thane Ruthenis (3y, 2 karma)

Agreed, I think. I'm claiming that those abilities are mutually dependent. Turing-completeness allows one to construct novel abstractions like language/culture/etc., but it's only useful if there's a GI algorithm that can actually take these novelties in as inputs. Otherwise, there's no reason to waste compute deriving ahead of time abstractions you haven't encountered yet and won't know how to use; may as well wait until you run into them "in the wild". In turn, the GI algorithm (as you point out) only shines if there's extant machinery that's generating novel abstractions for it to plan over. Otherwise, it can do no better than trial-and-error learning.

cfoster0 (3y, 5 karma)

I guess I don't see much support for such mutual dependence. Other animals have working memory + finite state control, and learn from experience in flexible ways. It appears pretty useful to them despite the fact they don't have language/culture. The vast majority of our useful computing is done by systems that have Turing-completeness but not language/cultural competence. Language models sure look like they have language ability without Turing-completeness and without having picked up some "universal planning algorithm" that would render our previous work w/ NNs ~useless. Why choose a theory like "the capability gap between humans and other animals is because the latter is missing language/culture and also some binary GI property" over one like "the capability gap between humans and other animals is just because the latter is missing language/culture"? IMO the latter is simpler and better fits the evidence.

Thane Ruthenis (3y, 4 karma)

Hmm, we may have reached the point from which we're not going to move on without building mathematical frameworks and empirically testing them, or something. [...] "Learn from experience" is the key point. Abstract thinking allows one to learn without experience: from others' experience that they communicate to you, or from just figuring out how something works abstractly and anticipating the consequences in advance of them occurring. This sort of learning, I claim, is only possible when you have the machinery for generating entirely novel abstractions (language, math, etc.), which in turn is only useful if you have a planning algorithm capable of handling any arbitrary abstraction you may spin up. "The capability gap between humans and other animals is because the latter is missing language/culture and also some binary GI property" and "the capability gap between humans and other animals is just because the latter is missing language/culture" are synonymous, in my view, because you can't have language/culture without the binary GI property. [...] As per the original post, I disagree that they have the language ability in the relevant sense. I think they're situated firmly at Simulacrum Level 4; they appear to communicate, but it's all just reflexes.

cfoster0 (3y, 5 karma)

I didn't mean "learning from experience" to be restrictive in that way. Animals learn by observing others & from building abstract mental models too. But unless one acquires abstracted knowledge via communication, learning requires some form of experience: even abstracted knowledge is derived from experience, whether actual or imagined. Moreover, I don't think that some extra/different planning machinery was required for language itself, beyond the existing abstraction and model-based RL capabilities that many other animals share. But ultimately that's an empirical question. [...] Yeah I am probably going to end my part of the discussion tree here. My overall take remains: * There may be general purpose problem-solving strategies that humans and non-human animals alike share, which explain our relative capability gains when combined with the unlocks that came from language/culture. * We don't need any human-distinctive "general intelligence" property to explain the capability differences among human-, non-human animal-, and artificial systems, so we shouldn't assume that there's any major threshold ahead of us corresponding to it.

Mateusz Bagiński (2y, 1 karma)

I would expect to see sophisticated ape/early-hominid-lvl culture in many more species if that was the case. For some reason humans went on the culture RSI trajectory whereas other animals didn't. Plausibly there was some seed cognitive ability (plus some other contextual enablers) that allowed a gene-culture "coevolution" cycle to start.

Noosphere89 (1y, 6 karma)

Nowadays, I think the main reason humans took off is because human hands were extremely suited for tool use and being at range, which means that there is a selection effect at both the genetic level for more general intelligence and a selection effect on cultures for more cultural learning, and animals just mostly lack this by default, meaning that their intelligence is way less relevant than their lack of good actuators for tool use.

Noosphere89 (3y, 1 karma)

Nitpick, but it actually isn't the final threshold of computation, though the things that would allow you to compute beyond a Turing Machine are basically cases where we are majorly wrong on the physical laws of the universe, or we somehow have a way to control the fundamental physical constants and/or laws of the universe, and the computers that can legitimately claim to go beyond Turing Machines with known physics aren't useful computers due to the No Free Lunch theorems. Just worth keeping that in mind.

interstice (3y, 3 karma)

Non-sequitur, the no-free-lunch theorems don't have anything to do with the physical realizability of hypercomputers.

Noosphere89 (3y, 1 karma)

The point is that a random Turing Machine's output is technically uncomputable, which is nice, but it's entirely useless because it uses an entirely flat prior: it picks at random from all possible universes/functions, and a No Free Lunch argument can be deployed to show why this isn't useful. This, incidentally, resolves gedymin's question on the difference between a random hypercomputer and a useful hypercomputer: A useful hypercomputer trades off performance for certain functions/universes in order to do better in other functions/universes, while a random hypercomputer doesn't do that and thus is useless.

interstice (3y, 2 karma)

What do you mean? The output of any Turing machine is computable by definition. Do you mean solving the halting problem for a random Turing machine? Or a random oracle?

cfoster0 (3y, 3 karma)

Fair. I think this is indeed a nitpick. 😊 In case it wasn't clear, the point remains something like: When we observe/build computational systems in our world that are "better" along some axis than other systems, that "betterness" is not generally derived from having gone over a new threshold of "even more general" computation (they definitely aren't deriving it from hypercomputation, and often aren't even deriving it from universal Turing computation), but through being better suited to the capability in question.

I think my main problem with this is that it isn't based on anything. Countless times, you just reference other blog posts, which reference other blog posts, which reference nothing. I fear a whole lot of people thinking about alignment are starting to decouple themselves from reality. It's starting to turn into the AI version of String Theory. You could be correct, but given the enormous number of assumptions your ideas are stacked on (and that even a few of those assumptions being wrong leads to completely different conclusions), the odds of you even being in the ballpark of correct seem unlikely.

I'm very sympathetic to this view, but I disagree. It is based on a wealth of empirical evidence that we have: on data regarding human cognition and behavior.

I think my main problem with this is that it isn't based on anything

Hm. I wonder if I can get past this common reaction by including a bunch of references to respectable psychology/neurology/game-theory experiments, which "provide scientific evidence" that various common-sensical properties of humans are actually real? Things like fluid vs. general intelligence, g-factor, the global workspace theory, situations in which humans do try to behave approximately like rational agents... There probably also are some psychology-survey results demonstrating stuff like "yes, humans do commonly report wanting to be consistent in their decision-making rather than undergoing wild mood swings and acting at odds with their own past selves", which would "provide evidence" for the hypothesis that complex minds want their utilities to be coherent.

That's actually an interesting idea! This is basically what my model is based on, after a fashion, and it makes arguments-from-introspection "legible" instead of seeming to be arbitrary philosophical n...

7Prometheus2y

This isn't what I mean. It doesn't mean you're not using real things to construct your argument, but that doesn't mean the structure of the argument reflects something real. Like, I kind of imagine it looking something like a rationalist Jenga tower, where if one piece gets moved, it all crashes down. Except, by referencing other blog posts, it becomes a kind of Meta-Jenga: a Jenga tower composed of other Jenga towers. Like "Coherent decisions imply consistent utilities". This alone I view to be its own mini Jenga tower. This is where I think String Theorists went wrong. It's not that humans can't, in theory, form good reasoning based on other reasoning based on other reasoning and actually arrive at the correct answer, it's just that we tend to be really, really bad at it. The sort of thing that would change my mind: there's some widespread phenomenon in machine learning that perplexes most, but is expected according to your model, and any other model either doesn't predict it as accurately, or is more complex than yours.

2Thane Ruthenis2y

My position is that there are many widespread phenomena in human cognition that are expected according to my model, and which can only be explained by the more mainstream ML models either if said models are contorted into weird shapes, or if they engage in denialism of said phenomena. Again, the drive for consistent decision-making is a good example. Common-sensically, I don't think we'd disagree that humans want their decisions to be consistent. They don't want to engage in wild mood swings, they don't want to oscillate wildly between which career they want to pursue or whom they want to marry: they want to figure out what they want and who they want to be with, and then act consistently with these goals in the long term. Even when they make allowances for changing their mind, they try to consistently optimize for making said allowances: for giving their future selves freedom/optionality/resources. Yet it's not something e. g. the Shard Theory would naturally predict out-of-the-box, last I checked. You'd need to add structures on top of it until it basically replicates my model (which is essentially how I arrived at my model, in fact – see this historical artefact).

3Prometheus2y

"My position is that there are many widespread phenomena in human cognition that are expected according to my model, and which can only be explained by the more mainstream ML models either if said models are contorted into weird shapes, or if they engage in denialism of said phenomena." Such as? I wouldn't call Shard Theory mainstream, and I'm not saying mainstream models are correct either. On humans trying to be consistent decision-makers, I have some theories about that (and some of which are probably wrong). But judging by how bad humans are at it, and how much they struggle to do it, they probably weren't optimized too strongly biologically to do it. But memetically, developing ideas for consistent decision-making was probably useful, so we have software that makes use of our processing power to be better at this, even if the hardware is very stubborn at times. But even that isn't optimized too hard toward coherence. Someone might prefer pizza to hot dogs, but they probably won't always choose pizza over any other food, just because they want their preference ordering of food to be consistent. And, sure, maybe what they "truly" value is something like health, but I imagine even if they didn't, they still wouldn't do this. But all of this is still just one piece on the Jenga tower. And we could debate every piece in the tower, and even get 90% confidence that every piece is correct... but if there are more than 10 pieces on the tower, the whole thing is still probably going to come crashing down. (This is the part where I feel obligated to say, even though I shouldn't have to, that your tower being wrong doesn't mean "everything will be fine and we'll be safe", since the "everything will be fine" towers are looking pretty Jenga-ish too. I'm not saying we should just shrug our shoulders and embrace uncertainty. What I want is to build non-Jenga-ish towers.)
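The arithmetic behind the "more than 10 pieces" remark can be checked directly. This is a minimal sketch, assuming (the comment does not say so) that the pieces are independent and that the tower stands only if every piece is correct; the function name `tower_confidence` is made up for illustration.

```python
def tower_confidence(per_piece: float, pieces: int) -> float:
    """Probability that every piece holds, assuming the pieces are independent."""
    if not 0.0 <= per_piece <= 1.0:
        raise ValueError("per_piece must be a probability in [0, 1]")
    if pieces < 0:
        raise ValueError("pieces must be non-negative")
    # Independent events: multiply the per-piece probabilities together.
    return per_piece ** pieces


for n in (1, 5, 10, 20):
    print(f"{n:>2} pieces at 90% each -> {tower_confidence(0.9, n):.3f}")
```

Under those assumptions, ten pieces at 90% each leave roughly a 35% chance that the whole tower stands. Correlated pieces would change the number, so this is an illustration of the shape of the argument rather than an estimate.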

2Thane Ruthenis2y

Fair. What would you call a "mainstream ML theory of cognition", though? Last I checked, they were doing purely empirical tinkering with no overarching theory to speak of (beyond the scaling hypothesis[1]). [...] Roughly agree, yeah. [...] I kinda want to push back against this repeat characterization – I think quite a lot of my model's features are "one storey tall", actually – but it probably won't be a very productive use of the time of either of us. I'll get around to the "find papers empirically demonstrating various features of my model in humans" project at some point; that should be a more decent starting point for discussion. [...] Agreed. Working on it. 1. ^ Which, yeah, I think is false: scaling LLMs won't get you to AGI. But it's also kinda unfalsifiable using empirical methods, since you can always claim that another 10x scale-up will get you there.

3Prometheus2y

It tends not to get talked about much today, but there was the PDP (connectionist) camp of cognition vs. the camp of "everything else" (including ideas such as symbolic reasoning, etc). The connectionist camp created a rough model of how they thought cognition worked, a lot of cognitive scientists scoffed at it, Hinton tried putting it into actual practice, but it took several decades for it to be demonstrated to actually work. I think a lot of people were confused by why the "stack more layers" approach kept working, but under the model of connectionism, this is expected. Connectionism is kind of too general to make great predictions, but it doesn't seem to allow for FOOM-type scenarios. It also seems to favor agents as local optima satisficers, instead of greedy utility maximizers.

8habryka2y

Hmm, I feel sad about this kind of critique. Like, this comment invokes some very implicit standard for posts, without making it at all explicit. Of course neither this post nor the posts they link to are literally "not based on anything". My guess is you are invoking an implicit standard for work to be "empirical" in order to be "anything", but that also doesn't really make sense since there are a lot of empirical arguments in this article and in the linked articles. I think highlighting any specific assumption, or even some set of assumptions that you think is fragile would be helpful. Or being at all concrete about what you would consider work that is "anything". But I think as it stands I find it hard to get much out of comments like this.

5habryka2y

(Please don't leave both top-level reacts and inline reacts of the same type on comments; that produces somewhat confusing summary statistics. We might make it literally impossible, but until then, pick one and stick to it.)

Even after thinking through these issues in SERI-MATS, and already agreeing with at least most of this post, I was surprised upon reading it how many new-or-newish-to-me ideas and links it contained.

I'm not sure if that's more of a failure of me, or of the alignment field to notice "things that are common between a diverse array of problems faced". Kind of related to my hunch that multiple alignment concepts ("goals", "boundaries", "optimization") will turn out to be isomorphic to the same tiny-handful of mathematical objects.

On this take, especially with your skepticism of LLM fluid intelligence and generality, is there much reason to expect AGI to be coming any time soon? Will it require design breakthroughs?

3Thane Ruthenis3y

I expect it to require a non-trivial design breakthrough, yes. I do not expect it to require many breakthroughs, or for it to take much longer once the breakthrough is made — see the "agency overhang" concerns. And there's a lot of money being poured into AI now and a lot of smart people tirelessly looking for insights... <10 years, I'd expect, assuming no heavy AI regulation/nuclear war/etc. Though, for all I know, some stupidly simple tweak to the current paradigm will be sufficient, and it may already be published in a paper somewhere, and now that OpenAI has stopped playing with scale and is actively looking for new ideas — for all I know, they may figure it out tomorrow.

1MichaelStJules3y

If they have zero fluid intelligence now, couldn't it be that building fluid intelligence is actually very hard and we're probably a long way off, maybe decades? It sounds like we've made almost no progress on this, despite whatever work people have been doing. There could still be a decent probability of AGI coming soon, and that could be enough to warrant acting urgently (or so could non-AGI, e.g. more task-specific AI used to engineer pathogens).

2Thane Ruthenis3y

Suppose that some technology requires 10 components to get to work. Over the last decades, you've seen people gradually figure out how to build each of these components, one by one. Now you're looking at the state of the industry, and see that we know how to build 9 of them. Do you feel that the technology is still a long time away, because we've made "zero progress" towards figuring out that last component? Advancements along the ML paradigm were not orthogonal to progress to AGI. On the contrary: they've set up things so that figuring out fluid intelligence/agency is potentially the last puzzle piece needed. A different angle: these advancements have lowered the bar for how well we need to understand fluid intelligence to get to AGI. If before, we would've needed to develop a full formal theory of cognition that we may leverage to build a GOFAI-style AGI, now maybe just some regularizer applied to a transformer on a "feels right" hunch will suffice.

1MichaelStJules3y

This seems pretty underspecified, so I don't know, but I wouldn't be very confident it's close: 1. Am I supposed to assume the difficulty of the last component should reflect the difficulty of the previous ones? 2. I'm guessing you're assuming the pace of building components hasn't been decreasing significantly. I'd probably grant you this, based on my impression of progress in AI, although it could depend on what specific components you have in mind. 3. What if the last component is actually made up of many components? I agree with the rest of your comment, but it doesn't really give me much reason to believe it's close, rather than just closer than before/otherwise.

2Thane Ruthenis3y

Yeah, it was pretty underspecified, I was just gesturing at the idea. Even more informally: Just look at GPT-4. Imagine that you're doing it with fresh eyes, setting aside all the fancy technical arguments. Does it not seem like it's almost there? Whatever the AI industry is doing, it sure feels like it's moving in the right direction, and quickly. And yes, it's possible that the common sense is deceptive here; but it's usually not. Or, to make a technical argument: The deep-learning paradigm is a pretty broad-purpose trick. Stochastic gradient descent isn't just some idiosyncratic method of training neural networks; it's a way to automatically generate software that meets certain desiderata. And it's compute-efficient enough to generate software approaching human brains in complexity. Thus, I don't expect that we'll need to move beyond it to get to AGI — general intelligence is reachable by doing SGD over some architecture. I expect we'll need advancement(s) on the order of "fully-connected NN -> transformers", not "GOFAI -> DL".

3MichaelStJules3y

I would say it seems like it's almost there, but it also seems to me to already have some fluid intelligence, and that might be why it seems close. If it doesn't have fluid intelligence, then my intuition that it's close may not be very reliable.

Might this paradigm be tested by measuring LLM fluid intelligence?

I predict that a good test would show that current LLMs have modest amounts of fluid intelligence, and that LLM fluid intelligence will increase in ways that look closer to continuous improvement than to a binary transition from nothing to human-level.

I'm unclear whether it's realistic to get a good enough measure of fluid intelligence to resolve this apparent crux, but I'm eager to pursue any available empirical tests of AI risk.

I agree with some of this, although I'm doubtful that the transition from sub-AGI to AGI is as sharp as outlined. I don't think that's impossible though, and I'd rather not take the risk. I do think it's possible to dumb down an AGI if you still have enough control over it to do things like inject noise into its activations between layers...

I'm hopeful that we can solve alignment iff we can contain and study a true AGI. Here's a comment I wrote on another post about the assumptions which give me hope we might manage alignment...

It seems to me like one of t...

I see some value in the framing of "general intelligence" as a binary property, but it also doesn't quite feel as though it fully captures the phenomenon. Like, it would seem rather strange to describe GPT4 as being a 0 on the general intelligence scale.

I think maybe a better analogy would be to consider the sum of a geometric sequence.

Consider the sum for a few values of r as it increases at a steady rate.

r = 0.5: 2a
r = 0.6: 2.5a
r = 0.7: 3.3a
r = 0.8: 5a
r = 0.9: 10a
r = 1: diverges to infinity
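
The figures above match the closed form for an infinite geometric series with first term a and ratio r, namely a / (1 - r) for 0 <= r < 1. Here is a minimal sketch that reproduces them with a = 1; the function name `geometric_sum` is illustrative and not from the comment, and the sketch deliberately covers only a > 0 and r >= 0.

```python
import math


def geometric_sum(r: float, a: float = 1.0) -> float:
    """Sum of a + a*r + a*r**2 + ... for a > 0 and r >= 0."""
    if a <= 0 or r < 0:
        raise ValueError("this sketch only covers a > 0 and r >= 0")
    if r >= 1:
        return math.inf  # the partial sums grow without bound
    return a / (1 - r)


for r in (0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    print(f"r = {r:.1f}: sum = {geometric_sum(r):.2f}a")
```

Equal steps in r buy ever larger jumps in the sum as r approaches 1, which is the behaviour the analogy leans on.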

What we see then is quite significant returns to increases in r and then a sudden d...

I think this is insightful, pointing correctly to a major source of bifurcation in p(doom) estimates. I view this as the old guard vs. new wave perspectives on alignment.

Unfortunately, I mostly agree with these positions. I'm afraid a lack of attention to these claims may be making the new wave of alignment thinkers more optimistic than is realistic. I do partially disagree with some of these, and that makes my p(doom) a good bit lower than the MIRI 99%. But it's not enough to make me truly optimistic. My p(doom) is right around the 50% "who knows" mark.

I'l...

1RogerDearnaley3y

This is a major crux for me, and one of the primary reasons my P(DOOM) isn't >90%. If you use value learning, you only need to get your value learner aligned well enough for it to a) start inside the region of convergence to true human values (i.e. it needs some passable idea what the words 'human' and 'values' mean and what the definition of "human values" is, like any small LLM has), and b) not kill everyone while it's learning the details, and it will do its research and Bayesianly converge on human values (and if it's not capable of competently being Bayesian enough to do that, it's not superhuman, at least at STEM). So, if you use value learning, the only piece you need to get exactly right (for outer alignment) is the phrasing of the terminal goal saying "Use value learning". For something containing an LLM, I think that might be about one short paragraph of text, possibly with one equation in it. The prospect of getting one paragraph of text and one equation right, with enough polishing and peer review, doesn't actually seem that daunting to me.

Your definition of general intelligence would include SGD on large neural networks. It is able to generalize from very few examples, learn and transform novel mathematical objects, be deployed on a wide variety of problems, and so on. Though it seems a pretty weak form of general intelligence, like evolution or general function optimization algorithms. Though perhaps it's less general than evolution and less powerful than function optimization algorithms.

If we take this connection at face-value, we can maybe use SGD as a prototypical example for general int...

2Thane Ruthenis3y

I don't count it in, actually. In my view, the boundaries of the algorithm here aren't "SGD + NN", but "the training loop" as a whole, which includes the dataset and the loss/reward function. A general intelligence implemented via SGD, then, would correspond to an on-line training loop that can autonomously (without assistance from another generally-intelligent entity, like a human overseer) learn to navigate any environment. I don't think any extant training-loop setup fits this definition. They all need externally-defined policy gradients. If the distribution on which they're trained changes significantly, the policy gradient (loss/reward function) would need to be changed to suit — and that'd need to be done by something external to the training loop, which already understands the new environment (e. g., the human overseer) and knows how the policy gradient needs to be adapted to keep the system on-target. (LLMs trained via SSL are a degenerate case: in their case the prediction gradient = the policy gradient. They also can't autonomously generalize to generating new classes of text without first being shown a carefully curated dataset of such texts. They're not an exception.)

2Garrett Baker3y

I’m skeptical that locating the hyperparameters you mention is an AGI-complete task.

I agree completely about AGI being like Turing completeness, that there's a threshold. However, there are programming languages that are technically Turing complete, but that only a masochist would actually try to use. So there could be a fire alarm, while the AGI is still writing all the (mental analogs of) domain-specific languages and libraries it needs. My evidence for this is humans: we're over the threshold, but barely so, and it takes years and years of education to turn us into quantum field theorists or aeronautical engineers.

But my main crux is that I t...

In-context learning in LLMs maps fairly well onto the concept of fluid intelligence. There are several papers now indicating that general learning algorithms emerge in LLMs to facilitate in-context learning.

2Thane Ruthenis3y

I assume you're talking about things like that? * These papers probably don't mean what they seem to. * Even if they did, it's not the right type of "general learning algorithm", in my view. See here, plus a paragraph in Section 6 about how "general in the limit" doesn't mean "actually reaches generality in finite time with finite data". I'll grant that it does have a spooky vibe.

Do you think you could find or develop a test of fluid intelligence that LLMs would fail to demonstrate any fluid intelligence in and generally do worse than the vast majority of humans on?

See here, starting from "consider a scheme like the following". In short: should be possible, but seems non-trivially difficult.

Do you think LLMs haven't developed general problem-solving heuristics by seeing lots and lots of problems across domains as well as plenty of fluid intelligence test questions and answers? Wouldn't that count as fluid intelligence?

I think forci...

3MichaelStJules3y

(Your reply is in response to a comment I deleted, because I thought it was basically a duplicate of this one, but I'd be happy if you'd leave your reply up, so we can continue the conversation.) [...] That seems like a high bar to me for testing for any fluid intelligence, though, and the vast majority of humans would do about as bad or worse (but possibly because of far worse crystallized intelligence). Similarly, in your post, "No scientific breakthroughs, no economy-upturning startup pitches, certainly no mind-hacking memes." I would say to look at it based on definitions and existing tests of fluid intelligence. These are about finding patterns and relationships between unfamiliar objects and any possible rules relating to them, applying those rules and/or inference rules with those identified patterns and relationships, and doing so more or less efficiently. More fluid intelligence means noticing patterns earlier, taking more useful steps and fewer useless steps. Some ideas for questions: 1. Invent new games or puzzles, and ask it to achieve certain things from a given state. 2. Invent new mathematical structures (e.g. new operations on known objects, or new abstract algebraic structures based on their axioms) and ask the LLM to reason about them and prove theorems (that weren't too hard to prove yourself or for someone else to prove). 3. Ask it to do hardness proofs (like NP-hardness proofs), either between two new problems, or just with one problem (e.g. ChatGPT proved a novel problem was NP-hard here). 4. Maybe other new discrete math problems. 5. EDIT: New IMO and Putnam problems. My impression is that there are few cross-applicable techniques in these areas, and the ones that exist often don't get you very far to solving problems. To do NP-hardness proofs, you need to identify patterns and relationships between two problems. The idea of using "gadgets" is way too general and hides all of the hard work, which is finding the right gadget to us

2Thane Ruthenis3y

Yup, that's also my current best guess for what this sort of test must look like.

* Choose some really obscure math discipline D, one that we're pretty sure lacks many practical applications (i. e., won't be convergently learned from background data about the world).
* Curate the AI's dataset to only include information up to some point in time T1.
* Guide the AI step-by-step (as in, via chain-of-thought prompting or its equivalent) through replicating all discoveries made in D between T1 and the present T2.

Pick the variables such that the inferential gap between D(T1) and D(T2) is large (can't be cleared by a non-superintelligent logical leap), but the gaps between individual insights are tiny. This would ensure that our AI would only be able to reach D(T2) if it's able to re-use its insights (i. e., build novel abstractions, store them in the context window/short-term memory, fluidly re-use them when needed), while not putting onerous demands on how good each individual insight must be. See also. I should probably write up a post about it, and maybe pitch this project to the LTFF or something.
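
The stepwise protocol above can be sketched as a small evaluation loop. This is an illustration only: `Insight`, `split_at_cutoff`, `stepwise_replication`, and the `attempt` callback (standing in for prompting a model) are hypothetical names, not an existing benchmark or API, and the question of how to judge whether a step was genuinely replicated is left open.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Insight:
    name: str
    year: int  # when the result first appeared in discipline D


# attempt(context, target) -> True if the model re-derives `target` given only
# the insights in `context`. Hypothetical stand-in for an actual model call.
Attempt = Callable[[Sequence[Insight], Insight], bool]


def split_at_cutoff(
    history: Sequence[Insight], t1: int
) -> Tuple[List[Insight], List[Insight]]:
    """Split D's history into what the model may see (<= T1) and what it must rediscover."""
    known = [i for i in history if i.year <= t1]
    held_out = sorted((i for i in history if i.year > t1), key=lambda i: i.year)
    return known, held_out


def stepwise_replication(
    history: Sequence[Insight], t1: int, attempt: Attempt
) -> List[Insight]:
    """Walk the held-out insights in order, feeding each success back into the context."""
    known, held_out = split_at_cutoff(history, t1)
    context = list(known)
    replicated: List[Insight] = []
    for target in held_out:
        if not attempt(context, target):
            break  # the chain is broken; later insights depend on this one
        # Re-use: a replicated insight becomes available for the next step.
        context.append(target)
        replicated.append(target)
    return replicated
```

The design choice that matters is the `context.append(target)` line: reaching D(T2) requires every small step to build on the previous ones, so a system that cannot store and re-use its own intermediate results stalls early even if each individual step is easy.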

It's not possible to let an AGI keep its capability to engineer nanotechnology while taking out its capability to deceive and plot, any more than it's possible to build an AGI capable of driving red cars but not blue ones. They're "the same" capability in some sense, and our only hope is to make the AGI want to not be malign.

Seems very overconfident if not plain wrong; consider as an existence proof that 'mathematicians score higher on tests of autistic traits, and have higher rates of diagnosed autism, compared with people in the general population' and c...

5localdeity3y

Interesting point. Though I suspect—partly using myself as an example (I scored 33 on the Autism Spectrum Quotient, and for math I'll mention qualifying for USAMO 3 times)—that these autistic mathematician types, while disinclined to be deceptive (likely finding it abhorrent, possibly having strong ethical stances about it), are still able to reason about deception in the abstract: e.g. if you give them logic puzzles involving liars, or detective scenarios where someone's story is inconsistent with some of the evidence, they'll probably do well at them. Or, if you say "For April Fool's, we'll pretend we're doing X", or "We need to pretend to the Nazis that we're doing X", they can meticulously figure out all the details that X implies and come up with plausible justifications where needed. In other words, although they're probably disinclined to lie and unpracticed at it, if they do decide to do it, I think they can do it, and there are aspects of constructing a plausible, mostly-consistent lie that they're likely extremely good at.

3Bogdan Ionut Cirstea3y

Thanks for your comment and your perspective, that's an interesting hypothesis. My intuition was that worse performance at false belief inference -> worse at deception, manipulation, etc. As far as I can tell, this seems mostly borne out by a quick Google search, e.g. Autism and Lying: Can Autistic Children Lie?, Exploring the Ability to Deceive in Children with Autism Spectrum Disorders, People with ASD risk being manipulated because they can't tell when they're being lied to, Strategic Deception in Adults with Autism Spectrum Disorder.

2Thane Ruthenis3y

My opinion is that it's caused by internal limitations placed on the general-intelligence component (see footnote 2). Autistic people can reason about deception formally, same as anybody, but they can't easily translate that understanding into practical social acumen, because humans don't have write-access to their instincts/shards/System 1. And they have worse instincts in the social domain to begin with because of... genes that codify nonstandard reward/reinforcement circuitry, I assume? Suppose that in a median person, there's circuitry that reinforces cognition that is upstream of some good social consequences, like making a person smile. That gradually causes the accumulation of crystallized-intelligence structures/shards specialized for social interactions. Autistic people lack this signal, or receive weaker reinforcement from it[1]. Thus, by default, they fail to develop much System-1 expertise for this domain. They can then compensate for it by solving the domain "manually" using their fully general intelligence. They construct good heuristics, commit them to memory, and learn to fire them when appropriate — essentially replicating by-hand the work that's done automatically in the neurotypical people's case. Or so my half-educated guess goes. I don't have much expertise here, beyond reading some Scott Alexander. @cfoster0, want to weigh in here? As to superintelligent AGIs, they would be (1) less limited in their ability to directly rewrite their System-1-equivalent (their GI components would have more privileges over their minds), (2) much better at solving domains "manually" and generating heuristics "manually". So even if we do hamstring our AGI's ability to learn e. g. manipulation skills, it'll likely be able to figure them out on its own, once it's at human+ level of capabilities. 1. ^ Reminder that reward is not the optimization target. What I'm stating here is not exactly "autistic people don't find social interactions pleasant so they

5localdeity3y

I've seen several smart autistic people on the internet say some variation of "they learn to emulate in software what normal people do in hardware, and that's how they manage to navigate social life well enough". Essentially as you describe. And I'd add that a major component of high intelligence likely means being good at doing things "in software" (approx. "system 2").

Upvoted for clarifying a possibly important crux. I still have trouble seeing a coherent theory here.

I can see a binary difference between Turing-complete minds and lesser minds, but only if I focus on the infinite memory and implicitly infinite speed of a genuine Turing machine. But you've made it clear that's not what you mean.

When I try to apply that to actual minds, I see a wide range of abilities at general-purpose modeling of the world.

Some of the differences in what I think of as general intelligence are a function of resources, which implies a fair...

2Thane Ruthenis3y

Hm, I think your objections are mostly similar to the objections cfoster0 is raising in this thread, so in lieu of repeating myself, I'll just link there. Do point out if I misunderstood and some of your points are left unaddressed.

What ties it all together is the belief that the general-intelligence property is binary.

Do any humans have the general-intelligence property?

If yes, after the "sharp discontinuity" occurs, why won't the AGI be like humans (in particular: generally not able to take over the world?)

If no, why do we believe the general-intelligence property exists?

7Thane Ruthenis3y

Yes, ~all of them. Humans are not superintelligent because despite their minds embedding the algorithm for general intelligence, that algorithm is still resource-constrained (by the brain's compute) and privilege-constrained within the mind (e. g., it doesn't have full write-access to our instincts). There's no reason to expect that AGI would naturally "stall" at the exact same level of performance and restrictions. On the contrary: even if we resolve to check for "AGI-ness" often, with the intent of stopping the training the moment our AI becomes true AGI but still human-level or below it, we're likely to miss the right moment without advanced interpretability tools, and scale it past "human-level" straight to "impossible-to-ignore superintelligent". There would be no warning signs, because "weak" AGI (human-level or below) can't be clearly distinguished from a very capable pre-AGI system, based solely on externally-visible behaviour. See Section 5 for more discussion of all of that. [...] Quoting from my discussion with cfoster0: [...]

9Rohin Shah3y

Sorry, I seem to have missed the problems mentioned in that section on my first read. [...] I'm not claiming the AGI would stall at human level, I'm claiming that on your model, the discontinuity should have some decent likelihood of ending at or before human level. (I care about this because I think it cuts against this point: We only have one shot. There will be a sharp discontinuity in capabilities once we get to AGI, and attempts to iterate on alignment will fail. Either we get AGI right on the first try, or we die. In particular it seems like if the discontinuity ends before human level then you can iterate on alignment.) [...] Why isn't this also true of the weak AGI? Current models cannot autonomously get more compute (humans have to give it to them) or perform gradient descent on their own weights (unless the humans specifically try to make that happen); most humans placed in the models' position would not be able to do that either. It sounds like your answer is that the development of AGI could lead to something below-human-level, that wouldn't be able to get itself more compute / privileges, but we will not realize that it's AGI, so we'll give it more compute / privileges until it gets to "so superintelligent we can't do anything about it". Is that correct? [...] ... Huh. How do you know that humans are generally intelligent? Are you relying on introspection on your own cognitive process, and extrapolating that to other humans? What if our policy is to scale up resources / privileges available to almost-human-level AI very slowly? Presumably after getting to a somewhat-below-human-level AGI, with a small amount of additional resources it would get to a mildly-superhuman-level AI, and we could distinguish it then? Or maybe you're relying on an assumption that the AGI immediately becomes deceptive and successfully hides the fact that it's an AGI?

4Thane Ruthenis3y

Hm? "Stall at the human level" and "the discontinuity ends at or before the human level" reads like the same thing to me. What difference do you see between the two? [...] Basically, except instead of directly giving it privileges/compute, I meant that we'd keep training it until the SGD gives the GI component more compute and privileges over the rest of the model (e. g., a better ability to rewrite its instincts). The strategy of slowly scaling our AI up is workable at the core, but IMO there are a lot of complications:

* A "mildly-superhuman" AGI, or even just a genius-human AGI, is still an omnicide risk (see also). I wouldn't want to experiment with that; I would want it safely at average-human-or-below level. It's likely hard to "catch" it at that level by inspecting its external behavior, though: that can only be reliably done via advanced interpretability tools.
* Deceptiveness (and manipulation) is a significant factor, as you've mentioned. Even just a mildly-superhuman AGI will likely be very good at it. Maybe not implacably good, but it'd be like working bare-handed with an extremely dangerous chemical substance, with all of humanity at stake.
* The problem of "iterating" on this system. If we have just a "weak" AGI on our hands, it's mostly a pre-AGI system, with a "weak" general-intelligence component that doesn't control much. Any "naive" approaches, like blindly training interpretability probes on it or something, would likely ignore that weak GI component, and focus mainly on analysing or shaping heuristics/shards. To get the right kind of experience from it, we'd have to very precisely aim our experiments at the GI component, which, again, likely requires advanced interpretability tools.

Basically, I think we need to catch the AGI-ness while it's at an "asymptomatic" stage, because the moment it becomes visible it's likely already incredibly dangerous (if not necessarily maximally dangerous). [...] More or less, plus the theoretical arg

Hm? "Stall at the human level" and "the discontinuity ends at or before the human level" reads like the same thing to me. What difference do you see between the two?

Discontinuity ending (without stalling):

Stalling:

Basically, except instead of directly giving it privileges/compute, I meant that we'd keep training it until the SGD gives the GI component more compute and privileges over the rest of the model (e. g., a better ability to rewrite its instincts).

Are you imagining systems that are built differently from today? Because I'm not seeing how SGD could give the GI component an ability to rewrite the weights or get more compute given today's architectures and training regimes.

(Unless you mean "SGD enhances the GI component until the GI component is able to hack into the substrate it is running on to access the memory containing its own weights, which it can then edit", though I feel like it is inaccurate to summarize this as "SGD give it more privileges", so probably you don't mean that)

(Or perhaps you mean "SGD creates a set of weights that effectively treats the input English tokens as a programming language by which the network's behavior can be controlled, and the GI componen... (read more)

4Thane Ruthenis3y

Ah, makes sense. [...] I do expect that some sort of ability to reprogram itself at inference time will be ~necessary for AGI, yes. But I also had in mind something like your "SGD creates a set of weights that effectively treats the input English tokens as a programming language" example. In the unlikely case that modern transformers are AGI-complete, I'd expect something on that order of exoticism to be necessary (but it's not my baseline prediction). [...] "Doing science" is meant to be covered by "lack of empirical evidence that there's anything in the universe that humans can't model". Doing science implies the ability to learn/invent new abstractions, and we're yet to observe any limits to how far we can take it / what that trick allows us to understand. [...] Mmm. Consider a scheme like the following:

* Let T2 be the current date.
* Train an AI on all of humanity's knowledge up to a point in time T1, where T1<T2.
* Assemble a list D of all scientific discoveries made in the time period (T1;T2].
* See if the AI can replicate these discoveries.

At face value, if the AI can do that, it should be considered able to "do science" and therefore AGI, right? I would dispute that. If the period (T1;T2] is short enough, then it's likely that most of the cognitive work needed to make the leap to any discovery in D is already present in the data up to T1. Making a discovery from that starting point doesn't necessarily require developing new abstractions/doing science: it's possible that it may be done just by interpolating between a few already-known concepts. And here, some asymmetry between humans and e. g. SOTA LLMs becomes relevant:

* No human knows everything that humanity as a whole knows. Imagine if making some discovery in D by interpolation required combining two very "distant" concepts, like a physics insight and advanced biology knowledge. It's unlikely that there'd be a human with sufficient expertise in both, so a human will likely do it by actual-s

6Rohin Shah3y

Okay, this mostly makes sense now. (I still disagree but it no longer seems internally inconsistent.) Fwiw, I feel like if I had your model, I'd be interested in:

1. Producing tests for general intelligence. It really feels like there should be something to do here, that at least gives you significant Bayesian evidence. For example, filter the training data to remove anything talking about <some scientific field, e.g. complexity theory>, then see whether the resulting AI system can invent that field from scratch if you point it at the problems that motivated the development of the field.
2. Identifying "dangerous" changes to architectures, e.g. inference time reprogramming. Maybe we can simply avoid these architectures and stick with things that are more like LLMs.
3. Hardening the world against mildly-superintelligent AI systems, so that you can study them / iterate on them more safely. (Incidentally, I don't buy the argument that mildly-superintelligent AI systems could clearly defeat us all. It's not at all clear to me that once you have a mildly-superintelligent AI system you'll have a billion mildly-superintelligent-AI-years worth of compute to run them.)
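As a rough illustration of the held-out-field test in item 1, here is a minimal sketch. It assumes a plain keyword filter is an acceptable stand-in for "remove anything talking about the field", which is almost certainly too weak in practice (paraphrases and indirect references would leak through). `Document`, `HELD_OUT_TERMS`, `filter_corpus` and `score_reinvention` are hypothetical names invented for this example, and the training and prompting steps are left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    text: str


# Hypothetical keyword list standing in for "anything about complexity theory".
HELD_OUT_TERMS = ("complexity theory", "np-complete", "polynomial time")


def mentions_held_out_field(doc, terms=HELD_OUT_TERMS):
    """True if the document contains any held-out term (case-insensitive)."""
    lowered = doc.text.lower()
    return any(term in lowered for term in terms)


def filter_corpus(corpus, terms=HELD_OUT_TERMS):
    """Keep only documents that never mention the held-out field."""
    return [doc for doc in corpus if not mentions_held_out_field(doc, terms)]


def score_reinvention(answers, rubric):
    """Fraction of rubric concepts that appear somewhere in the model's answers.

    Substring matching is an assumption of this sketch; a real evaluation
    would need human or model-based grading.
    """
    if not rubric:
        return 0.0
    joined = " ".join(answers).lower()
    hits = sum(1 for concept in rubric if concept.lower() in joined)
    return hits / len(rubric)
```

The intended use would be: train on `filter_corpus(corpus)`, prompt the resulting system with the problems that historically motivated the field, and pass its answers to `score_reinvention`. A high score is only Bayesian evidence, as the comment says, since leakage through the filter is hard to rule out.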

2Thane Ruthenis3y

I agree that those are useful pursuits. [...] Mind gesturing at your disagreements? Not necessarily to argue them, just interested in the viewpoint.

7Rohin Shah3y

Oh, I disagree with your core thesis that the general intelligence property is binary. (Which then translates into disagreements throughout the rest of the post.) But experience has taught me that this disagreement tends to be pretty intractable to talk through, and so I now try just to understand the position I don't agree with, so that I can notice if its predictions start coming true. You mention universality, active adaptability and goal-directedness. I do think universality is binary, but I expect there are fairly continuous trends in some underlying latent variables (e.g. "complexity and generality of the learned heuristics"), and "becoming universal" occurs when these fairly continuous trends exceed some threshold. For similar reasons I think active adaptability and goal-directedness will likely increase continuously, rather than being binary. You might think that since I agree universality is binary that alone is enough to drive agreement with other points, but:

1. I don't expect a discontinuous jump at the point you hit the universality property (because of the continuous trends), and I think it's plausible that current LLMs already have the capabilities to be "universal". I'm sure this depends on how you operationalize universality, I haven't thought about it carefully.
2. I don't think that the problems significantly change character after you pass the universality threshold, and so I think you are able to iterate prior to passing it.

4Thane Ruthenis3y

Interesting, thanks. [...] Agreed that this point (universality leads to discontinuity) probably needs to be hashed out more. Roughly, my view is that universality allows the system to become self-sustaining. Prior to universality, it can't autonomously adapt to novel environments (including abstract environments, e. g. new fields of science). Its heuristics have to be refined by some external ground-truth signals, like trial-and-error experimentation or model-based policy gradients. But once the system can construct and work with self-made abstract objects, it can autonomously build chains of them — and that causes a shift in the architecture and internal dynamics, because now its primary method of cognition is iterating on self-derived abstraction chains, instead of using hard-coded heuristics/modules.

4Rohin Shah3y

I agree that there's a threshold for "can meaningfully build and chain novel abstractions" and this can lead to a positive feedback loop that was not previously present, but there will already be lots of positive feedback loops (such as "AI research -> better AI -> better assistance for human researchers -> AI research") and it's not clear why to expect the new feedback loop to be much more powerful than the existing ones. (Aside: we're now talking about a discontinuity in the gradient of capabilities rather than of capabilities themselves, but sufficiently large discontinuities in the gradient of capabilities have much of the same implications.)

4Thane Ruthenis3y

Yeah, the argument here would rely on the assumption that e. g. the extant scientific data already uniquely constrain some novel laws of physics/engineering paradigms/psychological manipulation techniques/etc., and we would be eventually able to figure them out even if science froze right this moment. In this case, the new feedback loop would be faster because superintelligent cognition would be faster than real-life experiments. And I think there's a decent amount of evidence for this. Consider that there are already narrow AIs that can solve protein folding more efficiently than our best manually-derived algorithms, which suggests that better algorithms are already uniquely constrained by the extant data, and we've just been unable to find them. The same may be true for all other domains of science, and thus a superintelligence iterating on its own cognition would be able to outspeed human science.

2TAG3y

It's still quite possible that we are smarter than octopi, but not at some ceiling of intelligence.

4Thane Ruthenis3y

In the hypothetical where there's no general intelligence, there's no such thing as "smarter", only "has a cognitive algorithm specialized for X". If so, it's weird that there are no animals with cognitive algorithms that we lack; it's weird that we can model any animal's cognition, that we basically have duplicates of all of their cognitive machinery. On the other hand, if there is such a thing as general intelligence in the sense of "can model anything", the explanation of why we can model any animal is straightforward.

3PeterMcCluskey3y

It sure looks like many species of animals can be usefully compared as smarter than others. The same is true of different versions of LLMs. Why shouldn't I conclude that most of those have what you call general intelligence?

About your opinion on LLMs probably not scaling to general intelligence:

What if the language of thought hypothesis is correct, human intelligence can be represented as rules that manipulate natural language, the context window of LLMs is going to become long enough to match a human's "context window", and LLM training is able to find the algorithm?

How does this view fit into your model? What probabilities do you assign to the various steps?

language of thought hypothesis is correct
language of thought close enough to natural language
context window becom

[...]

3Thane Ruthenis3y

* I do think human thought can be represented as language-manipulation rules, but that's not a very interesting claim. Natural language is Turing-complete; of course anything can be approximated as rules for manipulating it. The same is true for chains of NAND gates. p→1.
* I don't think it's close to natural language in the meaningful sense. E. g., you can in fact think using raw abstractions, without an inner monologue, and it's much faster (= less compute-intensive) in some ways. I expect that's how we actually think, and the more legible inner monologue is more like a trick we're using to be able to convey our thoughts to other humans on the fly. A communication tool, not a cognition tool. Trying to use it for actual cognition will be ruinously compute-intensive. p>0.98.
* "Is the context window long enough?" seems like the wrong way to think about it. If we're to draw a human analogue, the context window would mirror working memory, and in this case, I expect it's already more "roomy" than human working memory (in some sense). The issue is that LLMs can't update their long-term memory (and no, on-line training ain't the answer to it). If we're limited to using the context window, then its length would have to be equivalent to a human's life... In which case, sure, interesting things may happen in an LLM scaled so far, but this seems obviously computationally intractable. p→1.
* Inasmuch as NNs can approximate any continuous function (and chain-of-thought prompting can allow arbitrary-depth recursion), sure, transformers have general intelligence in their search-space, p→1.
* ... but the current training schemes, or any obvious tweaks to them, won't be able to find it. This one I'm actually uncertain about, p≈0.7.

3rotatingpaguro3y

I know very logorrheic people who claim to think mostly verbally. Personally, I do a small amount of verbal thought, but sometimes resort to explicit verbal thinking on purpose to tackle problems I'm confused about. I think it would be sufficient that there exist some people who mostly reason verbally for the thesis to be valid for the purpose of guessing if LLMs are a viable path to intelligence. Do you think that even the most verbally-tuned people are actually doing the heavy lifting of their high-level thinking wordlessly? [...] I expect that "plug-ins" that give a memory to the LLM, as people are already trying to develop, are viable. Do you expect otherwise? (Although they would not allow the LLM to learn new "instincts".)

3Thane Ruthenis3y

Yes. It's a distinction similar to whatever computations happen in LLM forward-passes vs. the way Auto-GPT exchanges messages with its subagents. Maybe it's also a memory aid, such that memorizing the semantic representation of a thought serves as a shortcut to the corresponding mental state; but it's not the real nuts-and-bolts of cognition. The heavy lifting is done by whatever process figures out what word to put next in the monologue; not by the inner monologue itself. [...] I think the instincts are the more crucial part, yes; perhaps I should've said "long-term adaptation" rather than "long-term memory". I do suspect the current training processes fundamentally shape LLMs' architecture the wrong way, and not in a way that's easy to fix with fine-tuning, or conceptually-small architectural adjustments, or plug-ins. But that's my weakest claim, the one I'm only ~70% confident in. We'll see, I suppose.

3rotatingpaguro3y

It seems you use "monologue" in this sentence to refer to the sequence of words only, and then say that of course the monologue is not the cognition. With this I agree, but I don't think that's the correct interpretation of the combo "language of thought hypothesis" + "language of thought close to natural language". Having a "language of thought" means that there is a linear stream of items, and that your abstract cognition works only by applying some algorithm to the stream buffer to append the next item. The tape is not the cognition, but the cognition can be seen as acting (almost) only on the tape. Then "language of thought close to natural language" means that the language of thought has a short encoding in natural language. You can picture this as the language of thought of a verbal thinker being a more abstract version of natural language, similarly to when you feel what to say next but lack the word.

3Thane Ruthenis3y

... If not for the existence of non-verbal cognition, which works perfectly well even without a "tape". Suggesting that the tape isn't a crucial component, that the heavy lifting can be done by the abstract algorithm alone, and therefore that even in supposed verbal thinkers, that algorithm is likely what's doing the actual heavy lifting. In my view, there's an actual stream of abstract cognition, and a "translator" function mapping from that stream to human language. When we're doing verbal thinking, we're constantly running the translator on our actual cognition, which has various benefits (e. g., it's easier to translate our thoughts to other humans); but the items in the natural-language monologue are compressed versions of the items in the abstract monologue, and they're strictly downstream of the abstract stream.

3rotatingpaguro3y

So you think

1. There's a "stream" of abstract thought, or "abstract monologue"
2. The cognition algorithm operates on/produces the abstract stream
3. Natural language is a compressed stream of the abstract stream

Which seems to me the same thing I said above, unless maybe you are also implying either or both of these additional statements:

a) The abstract cognition algorithm cannot be seen as operating mostly autoregressively on its "abstract monologue";
b) The abstract monologue cannot be translated to a longer, but boundedly longer, natural language stream (without claiming that this is what happens typically when someone verbalizes).

Which of (a), (b) do you endorse, possibly with amendments?

5Thane Ruthenis3y

I don't necessarily endorse either. But "boundedly longer" is what does a lot of work there. As I'd mentioned, cognition can also be translated into a finitely long sequence of NAND gates. The real question isn't "is there a finitely-long translation?", but how much longer that translation is. And I'm not aware of any strong evidence suggesting that natural language is close enough to human cognition that the resultant stream would not be much longer. Long enough to be ruinously compute-intensive (effectively as ruinous as translating it into NAND-gate sequences). Indeed, I'd say there's plenty of evidence to the contrary, given how central miscommunication is to the human experience.

Human intelligence is Turing-complete

That may be true, but it isn't an argument for general intelligence in itself.

There's a particular problem in that the more qualitatively flexible part of the mind (the conscious mind, or System 2) is very limited in its ability to follow a programme, only being able to follow tens of steps reliably, whereas System 1 is much more powerful but much less flexible.

A general intelligence may also be suppressed by an instinct firing off, as sometimes happens with humans. But that’s a feature of the wider mind the GI is embedded in, not of general intelligence itself.

I actually think you should count that as evidence against your claim that humans are General Intelligences.

Qualitatively speaking, human cognition is universally capable.

How would we know if this wasn't the case? How can we test this claim?
My initial reaction here is to think "We don't know what we don't know".

9Thane Ruthenis3y

Thanks! Appreciate that you were willing to go through with this exercise.

I would expect to observe much greater diversity in cognitive capabilities of animals, for humans to generalize more poorly, and for the world overall to be more incomprehensible to us.
[...]
we'd look at the world, and see some systemic processes that are not just hard to understand, but are fundamentally beyond reckoning.

5Thane Ruthenis3y

I think I am confused where you're thinking the "binary/sharp threshold" is.

If you're talking about...

... an architectural change → Turing machines and their neural equivalents, for example, over, say, DFAs and simple associative memories. There is a binary threshold going from non-general to general architectures, where the latter can support programs/algorithms that the former cannot emulate. This includes whatever programs implement "understanding an arbitrary new domain" as you mentioned. But once we cross that very minimal threshold (namely, combining memory with finite state control), remaining improvements come mostly from increasing memory capacity and finding better algorithms to run, neither of which are a single binary threshold. Humans and many non-human animals alike seem to have similarly general architectures, and likewise general artificial architectures have existed for a long time, so I would say "there indeed is a binary/sharp threshold [in architectures] but it

[...]

an architectural change → Turing machines and their neural equivalents

This, yes. I think I see where the disconnect is, but I'm not sure how to bridge it. Let's try...

To become universally capable, a system needs two things:

1. "Turing-completeness": A mechanism by which it can construct arbitrary mathematical objects to describe new environments (including abstract environments).
2. "General intelligence": an algorithm that can take in any arbitrary mathematical object produced by (1), and employ it for planning.

[...]

Ok I think this at least clears things up a bit.

To become universally capable, a system needs two things:
1. "Turing-completeness": A mechanism by which it can construct arbitrary mathematical objects to describe new environments (including abstract environments).
2. "General intelligence": an algorithm that can take in any arbitrary mathematical object produced by (1), and employ it for planning.
General intelligence isn't Turing-completeness itself. Rather, it's a planning algorithm that has Turing-completeness as a prerequisite. Its binariness is inherited from the binariness of Turing-completeness.
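The architectural threshold discussed in this sub-thread (finite-state machines such as DFAs versus machines with unbounded memory) can be made concrete with a toy example. The sketch below is an illustration added here, not something either commenter wrote: it uses balanced brackets, a standard non-regular language, and the function names are hypothetical. Note that a single unbounded counter is strictly more powerful than a DFA but still well short of Turing-completeness, so this shows only that a sharp jump in expressiveness exists, not the full jump.

```python
def bounded_recognizer(s: str, max_depth: int) -> bool:
    """Finite-state check: nesting depth is one of max_depth + 1 states."""
    depth = 0
    for ch in s:
        if ch == "(":
            if depth == max_depth:
                return False  # no state left to represent deeper nesting
            depth += 1
        elif ch == ")":
            if depth == 0:
                return False  # closing bracket with nothing open
            depth -= 1
        else:
            return False  # the alphabet is just "(" and ")"
    return depth == 0


def counter_recognizer(s: str) -> bool:
    """Same check with an unbounded counter (Python ints do not overflow)."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                return False
            depth -= 1
        else:
            return False
    return depth == 0
```

However large `max_depth` is made, some balanced string nests deeper and gets rejected by `bounded_recognizer`, while `counter_recognizer` accepts every balanced string. Past that point, as the comment above argues, further gains come from more memory and better algorithms rather than from another binary switch.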

5Thane Ruthenis3y

I think what I'm trying to get at, here, is that the ability to use these better, self-derived abstractions for planning is nontrivial, and requires a specific universal-planning algorithm to work. Animals et al. learn new concepts and their applications simultaneously: they see e. g. a new fruit, try eating it, their taste receptors approve/disapprove of it, and they simultaneously learn a concept for this fruit and a heuristic "this fruit is good/bad". They also only learn new concepts downstream of actual interactions with the thing; all learning is implemented by hard-coded reward circuitry.

Humans can do more than that. As in my example, you can just describe to them e. g. a new game, and they can spin up an abstract representation of it and derive heuristics for it autonomously, without engaging hard-coded reward circuitry at all, without doing trial-and-error even in simulations. They can also learn new concepts in an autonomous manner, by just thinking about some problem domain, finding a connection between some concepts in it, and creating a new abstraction/chunking them together.

Hmm I feel like you're underestimating animal cognition / overestimating how much of what human... (read more)

3interstice3y

Non-sequitur, the no-free-lunch theorems don't have anything to do with the physical realizability of hypercomputers.

1Noosphere893y

2interstice3y

What do you mean? The output of any Turing machine is computable by definition. Do you mean solving the halting problem for a random Turing machine? Or a random oracle?

3cfoster03y

I'm very sympathetic to this view, but I disagree. It is based on a wealth of empirical evidence that we have: on data regarding human cognition and behavior.

I think my main problem with this is that it isn't based on anything

5habryka2y

On this take, especially with your skepticism of LLM fluid intelligence and generality, is there much reason to expect AGI to be coming any time soon? Will it require design breakthroughs?

3MichaelStJules3y

Might this paradigm be tested by measuring LLM fluid intelligence?

I'm unclear whether it's realistic to get a good enough measure of fluid intelligence to resolve this apparent crux, but I'm eager to pursue any available empirical tests of AI risk.

I'm hopeful that we can solve alignment iff we can contain and study a true AGI. Here's a comment I wrote on another post about the assumptions which give me hope we might manage alignment...

It seems to me like one of t... (read more)

Consider the sum for a few values of r as it increases at a steady rate.

0.5 - 2a
0.6 - 2.5a
0.7 - 3.3a
0.8 - 5a
0.9 - 10a
1 - Diverges to infinity
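The figures listed above match the closed form of a geometric series, a + a·r + a·r² + ... = a/(1 − r). The truncated comment does not state this explicitly, so that reading is an assumption. Under it, a short sketch (with a hypothetical helper name, restricted to a > 0 and r ≥ 0) reproduces the list:

```python
import math


def geometric_sum(a: float, r: float) -> float:
    """Sum of a + a*r + a*r**2 + ... for a > 0 and r >= 0."""
    if a <= 0 or r < 0:
        raise ValueError("this sketch only covers a > 0 and r >= 0")
    if r >= 1:
        return math.inf  # partial sums grow without bound
    return a / (1 - r)


def multipliers(rs):
    """Map each ratio r to the total expressed as a multiple of a."""
    return {r: geometric_sum(1.0, r) for r in rs}
```

Calling `multipliers([0.5, 0.6, 0.7, 0.8, 0.9, 1.0])` gives values close to 2, 2.5, 3.3, 5 and 10, followed by infinity. Equal steps in r buy increasingly large jumps in the total as r approaches 1.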

What we see then is quite significant returns to increases in r and then a sudden d... (read more)

I think this is insightful, correctly pointing to a major source of bifurcation in p(doom) estimates. I view this as the old guard vs. new wave perspectives on alignment.

I'l... (read more)

1RogerDearnaley3y

If we take this connection at face-value, we can maybe use SGD as a prototypical example for general int... (read more)

2Thane Ruthenis3y

2Garrett Baker3y

I’m skeptical that locating the hyperparameters you mention is an AGI-complete task.

2Thane Ruthenis3y

Do you think you could find or develop a test of fluid intelligence that LLMs would fail to demonstrate any fluid intelligence in and generally do worse than the vast majority of humans on?

See here, starting from "consider a scheme like the following". In short: should be possible, but seems non-trivially difficult.

Do you think LLMs haven't developed general problem-solving heuristics by seeing lots and lots of problems across domains as well as plenty of fluid intelligence test questions and answers? Wouldn't that count as fluid intelligence?

I think forci... (read more)

3MichaelStJules3y

2Thane Ruthenis3y

It's not possible to let an AGI keep its capability to engineer nanotechnology while taking out its capability to deceive and plot, any more than it's possible to build an AGI capable of driving red cars but not blue ones. They're "the same" capability in some sense, and our only hope is to make the AGI want to not be malign.

5localdeity3y

3Bogdan Ionut Cirstea3y

2Thane Ruthenis3y

5localdeity3y

Upvoted for clarifying a possibly important crux. I still have trouble seeing a coherent theory here.

When I try to apply that to actual minds, I see a wide range of abilities at general-purpose modeling of the world.

Some of the differences in what I think of as general intelligence are a function of resources, which implies a fair... (read more)

2Thane Ruthenis3y

What ties it all together is the belief that the general-intelligence property is binary.

Do any humans have the general-intelligence property?

If yes, after the "sharp discontinuity" occurs, why won't the AGI be like humans (in particular: generally not able to take over the world?)

If no, why do we believe the general-intelligence property exists?
