Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

tl;dr: There are two diametrically opposed failure modes an alignment researcher can fall into: engaging in excessively concrete research whose findings won't generalize to AGI in time, and engaging in excessively abstract research whose findings won't connect to practical reality in time.

Different people's assessments of what research is too abstract/concrete differ significantly based on their personal AI-Risk models. One person's too-abstract can be another's too-concrete.

The meta-level problem of alignment research is to pick a research direction that, on your subjective model of AI Risk, strikes a good balance between the two – and thereby arrives at the solution to alignment in as few steps as possible.


Suppose that you're interested in solving AGI Alignment. There's a dizzying plethora of approaches to choose from.

So... How the hell do you pick what to work on?

The starting point, of course, would be building up your own model of the problem. What's the nature of the threat? What's known about how ML models work? What's known about agents, and cognition? How does any of that relate to the threat? What are all extant approaches? What's each approach's theory-of-impact? What model of AI Risk does it assume? Does it agree with your model? Is it convincing? Is it tractable?

Once you've done that, you'll likely have eliminated a few approaches as obvious nonsense. But even afterwards, there might still be multiple avenues left that all seem convincing. How do you pick between those?

Personal fit might be one criterion. Choose the approach that best suits your skills and inclinations and opportunities. But that's risky: if you make a mistake, and end up working on something irrelevant just because it suits you better, you'll have multiplied your real-world impact by zero. Conversely, contributing to a tractable approach would be net-positive, even if you'd be working at a disadvantage. And who knows, maybe you'll find that re-specializing is surprisingly easy!

So what further objective criteria can you evaluate?

Regardless of one's model of AI Risk, there are two specific, diametrically opposed failure modes that any alignment researcher can fall into: being too concrete, and being too abstract.

The approach to choose should be one that maximizes the distance from both failure modes.

The Scylla: Atheoretic Empiricism

One pitfall would be engaging in research that doesn't generalize to aligning AGI.

An ad-absurdum example: You pick some specific LLM, then start exhaustively investigating how it responds to different prompts, and what quirks it has. You're building giant look-up tables of "query, response", with no overarching structure and no attempt to theorize about the model's internals.

A more realistic example: You've decided to build a detailed understanding of a specific LLM's functionality, i.e., you're exhaustively focusing on that one LLM. You're building itemized lists of its neurons, investigating what inputs seem to activate each the strongest, what functions they implement; you're looking for quirks in its psychology, and trying to build a full understanding of it.

Now, certainly, you're uncovering some findings that'd generalize to all LLMs. But there'd be some point at which more time spent investigating this specific model wouldn't yield much data about other LLMs; only data about this one. Thus, inasmuch as you'd be spending time on that, you'd be wasting the time you could be spending actually working on alignment.

A fairly controversial take: Studying LLMs-in-general might, likewise, fall prey to that. Studying them reveals some information about AIs-in-general, and cognitive-systems-in-general. But if LLMs aren't already AGIs, there would be a point at which more time spent studying LLM cognition, instead of searching for a new research topic, would only yield you information about LLMs; not about AGIs.

A fairly implausible possibility: Likewise, it's not entirely certain that Deep Learning is AGI-complete. If we live in a world where it isn't, then studying DL is worthwhile inasmuch as it yields information about cognitive-systems-shaped-by-selection-pressures-in-general. But at some point, you'll have learned everything DL can teach you about whatever paradigm would be AGI-complete. So the additional time spent researching DL would only yield information about an irrelevant AGI-incomplete paradigm.

The Charybdis: Out-of-Touch Theorizing

The diametrically opposite pitfall is engaging in overly theoretical research that will never connect to reality.

Ad absurdum: You might decide to start with the fundamental philosophical problems. Why does anything exist? What is the metaphysical nature of reality? Is reductionism really true? What's up with qualia? That line of research will surely eventually meander down to reality! It aims to answer all questions it is possible to answer, after all, and "how can I align an AGI?" is a question. Hence, you'll eventually solve alignment.

More realistically: You might decide to work on formalizing the theory of algorithms-in-general. How can those be embedded into other algorithms? How can they interact, and interfere with each other?

Since AGI agents could be viewed as algorithms, once you have a gears-level model of this topic – once you properly understand what an "embedded algorithm" is, say – you'll be able to tell what the hell an "AGI" is, as well. You'll be able to specify it in your framework, define proper constraints on what an algorithm implementing an "aligned" "AGI" would look like, then just incrementally narrow down the space of algorithms. Eventually, you'll arrive at one that corresponds to an aligned AGI – and then it's just a matter of typing up the code.

Controversial example: Agency-foundations research might be this. Sure, the AGI we'll get on our hands might end up approximately isomorphic to an idealized game-theoretic agent. But that "approximately" might be doing a lot of heavy lifting. It might be that idealized-agent properties correspond to real-AGI properties so tenuously as to yield no useful information, such that you would have been better off studying LLMs.

Implausible example: Actually, GPT-5, stored deep inside OpenAI's data centers, already reached AGI. It'll take off before this year is up. Everyone should focus on trying to align this specific model; aiming for the general understanding of agents or AI cognition or LLMs is excessive and wasteful.

The Shortest Path

As you can see, the failures lie on a spectrum, and they're model-dependent to boot.

That is: Depending on how you think AIs/cognition/AI risks work, the same approach could be either hopelessly non-generalizable, or concerned with generalities too vacuous to ever matter.

As an example, consider my own favoured agenda, building a theory of embedded world-models. If you think LLMs have already basically reached AGI, and just need to be scaled up in order to take off, I'm being out-of-touch: whatever results I'm arriving at will not connect to reality in time for the takeoff. Conversely, if you're skeptical that "train up a world-model and align it by retargeting the search" would suffice to yield robust alignment, if you think we'll need much more manual control over the design for alignment to hold, then I'm basically playing with toys.

I, however, obviously think that I'm striking just the right balance. An approach that is as concrete as possible while still being AGI Alignment-complete.

That's the target you should be aiming to hit, as well. A lowest-scope project that's nevertheless sufficient.

Let's take a step back. In theory, given unlimited time, basically-all approaches would actually converge to an AGI Alignment solution:

  • If you're starting bottom-up, from the most concrete problems, like studying a specific LLM... Well, eventually you'll have itemized all of its properties and grown bored, so you'll move on to a different LLM. Upon doing so, you'll discover that a lot of your previous findings generalize. You'll comprehend the second LLM much more quickly. Repeat a few times, and you'll have built up a solid understanding of the whole scope of what the LLM architecture permits. So you'll do the obvious next thing, and move on to studying some different architecture. That'll be easier, with your mastery of LLMs. Once you've iterated on this pattern some more, and gone through a few different architectures, and generalized from them – why, you'll likely end up understanding an AGI-complete architecture somewhere along the way as well.
  • If you're starting top-down, from the most abstract problems: Well, as I outlined in the ad-absurdum example above, you'll eventually reconnect to reality even if starting from the fundamental philosophy. Existential questions to phenomena-in-general to cognition-in-general to AGI Alignment, say.

The issue? Choosing the wrong starting point would lengthen your journey immensely. And the timer's ticking.

Our goal isn't just to solve AGI Alignment, it's to solve it as quickly as possible.

So be sure to deeply consider all options available, and make your choice wisely. And once you've made it, stay ready to pivot at a moment's notice if you spy an even shorter pathway.


From the position of uncertainty, there is no optimal direction, only a model of good distribution of global efforts among all directions. A marginal choice of a spherical researcher without specific preferences should be based on identifying relatively neglected directions. A choice of an actual researcher with specific preferences should give weight to those preferences, which might greatly improve productivity.

"A marginal choice of a spherical researcher without specific preferences should be based on identifying relatively neglected directions"

Inside-view convincingness of these directions still has to be weighted in. E.g., "study the Bible for alignment insights" is a relatively neglected direction (just Unsong on it, really?), but that doesn't mean it'd be sensible to focus on it just because it's neglected. And even if your marginal contributions to the correct approach would be minimal because so many other people are working on it, that may still be more expected impact than setting off on a neglected (and very likely incorrect) one.

"A choice of an actual researcher with specific preferences should give weight to those preferences"

Oh, I'm not saying entirely ignore your preferences/comparative advantages. But if you're looking at a bunch of plausible directions, you can pick between them not solely based on your comparative advantages.

"A marginal choice of a spherical researcher without specific preferences should be based on identifying relatively neglected directions"

"Inside-view convincingness of these directions still has to be weighted in."

I mean directions neglected relative to estimated good distribution of global effort. If I estimate good distribution of effort towards searching The Silmarillion for insights relevant to mechanistic interpretability to be zero, then it's not a relatively neglected direction.

"A choice of an actual researcher with specific preferences should give weight to those preferences"

"if you're looking at a bunch of plausible directions, you can pick between them not solely based on your comparative advantages"

Sure, by "give weight" I mean take into account, not take as the sole basis for a decision. The other major factor is that relative neglectedness I mentioned (in the sense I hopefully now clarified).
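The notion of "neglected relative to an estimated good distribution of effort" from this exchange can be made concrete with a toy calculation. The sketch below is only an illustration: the direction names, the share numbers, and the `relative_neglectedness` helper are all hypothetical, and a real estimate would have to come from your own model of AI Risk.

```python
def relative_neglectedness(target_share, current_share):
    """Gap between the effort share you think a direction deserves
    and the share it currently receives (positive = neglected)."""
    return {
        direction: target_share[direction] - current_share.get(direction, 0.0)
        for direction in target_share
    }


# Hypothetical numbers, chosen purely for illustration.
# "target" is your subjective estimate of a good global distribution of effort.
target = {
    "interpretability": 0.5,
    "agent_foundations": 0.3,
    "silmarillion_exegesis": 0.0,  # judged worthless, so its good share is zero
}
# "current" is your estimate of where effort actually goes today.
current = {
    "interpretability": 0.7,
    "agent_foundations": 0.1,
    "silmarillion_exegesis": 0.0,
}

gaps = relative_neglectedness(target, current)
most_neglected = max(gaps, key=gaps.get)
```

Under these made-up numbers, a direction nobody works on but whose estimated good share is zero has a gap of zero, so it does not count as relatively neglected; only directions that are underfunded relative to their estimated worth do.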


"As you can see, the failures lie on a spectrum, and they're model-dependent to boot."

And we can go further and say that the failures lie in a high-dimensional space, and that the apparent tradeoff is more a matter of finding the directions in which to pull the rope sideways. Propagating constraints between concepts and propositions is a way to go that seems hopeworthy to me. One wants to notice commonalities in how each of one's plans is doomed, and then address the common blockers / missing ideas. In other words, recurse to the "abstract" as much as is called for, even if you get really abstract; but treat [abstracting more than what you can directly see/feel as being demanded by your thinking] as a risky investment with opportunity cost.

Great post. I think this type of strategic thinking is too rare in alignment and other academic disciplines.

I'd just modify your bottom line a little bit. The goal isn't quite to "solve alignment as quickly as possible". The goal is to maximize the odds that we've solved alignment in time to prevent human disempowerment. That's importantly different.

It means solving alignment for the first type of AGI we build, before it's deployed.

Having an alignment solution that applies to some type of AGI nobody is building doesn't help. Yet a lot of otherwise brilliant alignment work goes in that direction, and IMO gets that big zero impact multiplier.

I think if you could demonstrably "solve alignment" for any architecture, you'd have a decent chance of convincing people to build it as fast as possible, in lieu of other avenues they had been pursuing.

Some people. But it would depend on what the prospects were for that type of AGI. Because I don't think you could convince everyone else to stop working on other types of AGI. So it would be a race between the new "more alignable" type and the currently-leading types. If the "more alignable" type seemed guaranteed to lose that race, I'm not sure many people would even try building it.

I love this framing, particularly regarding the "shortest path". Reminds me of the "perfect step" described in the Kingkiller books:

Nothing I tried had any effect on her. I made Thrown Lightning, but she simply stepped away, not even bothering to counter. Once or twice I felt the brush of cloth against my hands as I came close enough to touch her white shirt, but that was all. It was like trying to strike a piece of hanging string.

I set my teeth and made Threshing Wheat, Pressing Cider, and Mother at the Stream, moving seamlessly from one to the other in a flurry of blows.

She moved like nothing I had ever seen. It wasn’t that she was fast, though she was fast, but that was not the heart of it. Shehyn moved perfectly, never taking two steps when one would do. Never moving four inches when she only needed three. She moved like something out of a story, more fluid and graceful than Felurian dancing.

Hoping to catch her by surprise and prove myself, I moved as fast as I dared. I made Maiden Dancing, Catching Sparrows, Fifteen Wolves . . .

Shehyn took one single, perfect step.


As I watched, gently dazed by the motion of the tree, I felt my mind slip lightly into the clear, empty float of Spinning Leaf. I realized the motion of the tree wasn’t random at all, really. It was actually a pattern made of endless changing patterns.

And then, my mind open and empty, I saw the wind spread out before me. It was like frost forming on a blank sheet of window glass. One moment, nothing. The next, I could see the name of the wind as clearly as the back of my own hand.

I looked around for a moment, marveling in it. I tasted the shape of it on my tongue and knew if desired I could stir it to a storm. I could hush it to a whisper, leaving the sword tree hanging empty and still.

But that seemed wrong. Instead I simply opened my eyes wide to the wind, watching where it would choose to push the branches. Watching where it would flick the leaves.

Then I stepped under the canopy, calmly as you would walk through your own front door. I took two steps, then stopped as a pair of leaves sliced through the air in front of me. I stepped sideways and forward as the wind spun another branch through the space behind me.

I moved through the dancing branches of the sword tree. Not running, not frantically batting them away with my hands. I stepped carefully, deliberately. It was, I realized, the way Shehyn moved when she fought. Not quickly, though sometimes she was quick. She moved perfectly, always where she needed to be.