Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Quick note: I occasionally run into arguments of the form "my research advances capabilities, but it advances alignment more than it advances capabilities, so it's good on net". I do not buy this argument, and think that in most such cases, this sort of research does more harm than good. (Cf. differential technological development.)

For a simplified version of my model as to why:

  • Suppose that aligning an AGI requires 1000 person-years of research.
    • 900 of these person-years can be done in parallelizable 5-year chunks (e.g., by 180 people over 5 years — or, more realistically, by 1800 people over 10 years, with 10% of the people doing the job correctly half the time).
    • The remaining 100 of these person-years factor into four chunks that take 25 serial years apiece (so that you can't get any of those four parts done in less than 25 years).

In this toy model, a critical resource is serial time: if AGI is only 26 years off, then shortening overall timelines by 2 years is a death sentence, even if you're getting all 900 years of the "parallelizable" research done in exchange.
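To make the arithmetic of the toy model concrete, here is a minimal Python sketch. The numbers come from the bullets above; the function names and the staffing rule (one researcher tied up on each serial chunk, everyone else on parallel chunks) are assumptions added for illustration, not part of the model as stated.

```python
from math import ceil

# Numbers taken from the toy model above.
PARALLEL_PERSON_YEARS = 900   # parallelizable research
PARALLEL_CHUNK_YEARS = 5      # each parallel chunk takes one person 5 years
SERIAL_CHUNKS = 4             # four serial chunks...
SERIAL_CHUNK_YEARS = 25       # ...of 25 calendar years each (100 person-years)


def years_needed(researchers: int) -> int:
    """Minimum calendar years to finish all the research (hypothetical helper).

    Assumption (for illustration only): one researcher is tied up on each
    serial chunk, and everyone else works through the parallel chunks.
    """
    parallel_workers = researchers - SERIAL_CHUNKS
    if parallel_workers < 1:
        raise ValueError("need at least one researcher beyond the serial chunks")
    parallel_chunks = PARALLEL_PERSON_YEARS // PARALLEL_CHUNK_YEARS  # 180 chunks
    rounds = ceil(parallel_chunks / parallel_workers)
    parallel_years = rounds * PARALLEL_CHUNK_YEARS
    # The serial chunks cannot be compressed, so they set a floor of 25 years.
    return max(SERIAL_CHUNK_YEARS, parallel_years)


def done_in_time(years_until_agi: int, researchers: int) -> bool:
    """True if all the research finishes before AGI arrives."""
    return years_needed(researchers) <= years_until_agi


print(years_needed(184))            # 25: the serial floor binds
print(years_needed(1_000_000))      # 25: extra people do not help
print(done_in_time(26, 184))        # True
print(done_in_time(24, 1_000_000))  # False: two years lost is fatal
```

Under these assumptions, adding researchers shortens only the parallel part; past 184 people the answer stays at 25 years, which is why losing two years of a 26-year timeline cannot be bought back with headcount.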

My real model of the research landscape is more complex than this toy picture, but I do in fact expect that serial time is a key resource when it comes to AGI alignment.

The most blatant case of alignment work that seems parallelizable to me is that of "AI psychologizing": we can imagine having enough success building comprehensible minds, and enough success with transparency tools, that with a sufficiently large army of people studying the alien mind, we can develop a pretty good understanding of what and how it's thinking. (I currently doubt we'll get there in practice, but if we did, I could imagine most of the human-years spent on alignment-work being sunk into understanding the first artificial mind we get.)

The most blatant case of alignment work that seems serial to me is work that requires having a theoretical understanding of minds/optimization/whatever, or work that requires having just the right concepts for thinking about minds. Relative to our current state of knowledge, it seems to me that a lot of serial work is plausibly needed in order for us to understand how to safely and reliably aim AGI systems at a goal/task of our choosing.

A bunch of modern alignment work seems to me to sit in some middle-ground. As a rule of thumb, alignment work that is closer to behavioral observations of modern systems is more parallelizable (because you can have lots of people making those observations in parallel), and alignment work that requires having a good conceptual or theoretical framework is more serial (because, in the worst case, you might need a whole new generation of researchers raised with a half-baked version of the technical framework, in order to get people who both have enough technical clarity to grapple with the remaining confusions, and enough youth to invent a whole new way of seeing the problem—a pattern which seems common to me in my read of the development of things like analysis, meta-mathematics, quantum physics, etc.).

As an egregious and fictitious (but "based on a true story") example of the arguments I disagree with, consider the following dialog:


Uncharacteristically conscientious capabilities researcher: Alignment is made significantly trickier by the fact that we do not have an artificial mind in front of us to study. By doing capabilities research now (and being personally willing to pause when we get to the brink), I am making it more possible to do alignment research.

Me: Once humanity gets to the brink, I doubt we have much time left. (For a host of reasons, including: simultaneous discovery; the way the field seems to be on a trajectory to publicly share most of the critical AGI insights, once it has them, before wisening up and instituting closure policies after it's too late; Earth's generally terrible track-record in cybersecurity; and a sense that excited people will convince themselves it's fine to plow ahead directly over the cliff-edge.)

Uncharacteristically conscientious capabilities researcher: Well, we might not have many sidereal years left after we get to the brink, but we'll have many, many more researcher years left. The top minds of the day will predictably be much more interested in alignment work when there's an actual misaligned artificial mind in front of them to study. And people will take these problems much more seriously once they're near-term. And the monetary incentives for solving alignment will be much more visibly present. And so on and so forth.

Me: Setting aside how I believe that the world is derpier than that: even if you were right, I still think we'd be screwed in that scenario. In particular, that scenario seems to me to assume that there is not much serial research labor needed to do alignment research.

Like, I think it's quite hard to get something akin to Einstein's theory of general relativity, or Grothendieck's simplification of algebraic geometry, without having some researcher retreat to a mountain lair for a handful of years to build/refactor/distill/reimagine a bunch of the relevant concepts.

And looking at various parts of the history of math and science, it looks to me like technical fields often move forwards by building up around subtly-bad framings and concepts, so that a next generation can be raised with enough technical machinery to grasp the problem and enough youth to find a whole new angle of attack, at which point new and better framings and concepts are invented to replace the old. "A new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die and a new generation grows up that is familiar with it" (Max Planck) and all that.

If you need the field to iterate in that sort of way three times before you can see clearly enough to solve alignment, you're going to be hard-pressed to do that in five years no matter how big and important your field seems once you get to the brink.

(Even the 25 years in the toy model above feels pretty fast, to me, for that kind of iteration, and signifies my great optimism in what humanity is capable of doing in a rush when the whole universe is on the line.)


It looks to me like alignment requires both a bunch of parallelizable labor and a bunch of serial labor. I expect us to have very little serial time (a handful of years if we're lucky) after we have fledgling AGI.

When I've heard the “two units of alignment progress for one unit of capabilities progress” argument, my impression is that it's been made by people who are burning serial time in order to get a bit more of the parallelizable alignment labor done.

But the parallelizable alignment labor is not the bottleneck. The serial alignment labor is the bottleneck, and it looks to me like the benefits of burning serial time to get more parallelizable labor done are nowhere near worth the cost in practice.


Some nuance I'll add:

I feel relatively confident that a large percentage of people who do capabilities work at OpenAI, FAIR, DeepMind, Anthropic, etc. with justifications like "well, I'm helping with alignment some too" or "well, alignment will be easier when we get to the brink" (more often EA-adjacent than centrally "EA", I think) are currently producing costs that outweigh the benefits.

Some relatively niche and theoretical agent-foundations-ish research directions might yield capabilities advances too, and I feel much more positive about those cases. I’m guessing it won’t work, but it’s the kind of research that seems positive-EV to me and that I’d like to see a larger network of researchers tackling, provided that they avoid publishing large advances that are especially likely to shorten AGI timelines.

The main reasons I feel more positive about the agent-foundations-ish cases I know about are:

  • The alignment progress in these cases appears to me to be much more serial, compared to the vast majority of alignment work the field outputs today.
  • I’m more optimistic about the total amount of alignment progress we’d see in the worlds where agent-foundations-ish research so wildly exceeded my expectations that it ended up boosting capabilities. Better understanding optimization in this way really would seem to me to take a significant bite out of the capabilities generalization problem, unlike most alignment work I’m aware of.
  • The kind of people working on agent-foundations-y work aren’t publishing new ML results that break SotA. Thus I consider it more likely that they’d avoid publicly breaking SotA on a bunch of AGI-relevant benchmarks given the opportunity, and more likely that they’d only direct their attention to this kind of intervention if it seemed helpful for humanity’s future prospects.[1]
  • Relatedly, the energy and attention of ML is elsewhere, so if they do achieve a surprising AGI-relevant breakthrough and accidentally leak bits about it publicly, I put less probability on safety-unconscious ML researchers rushing to incorporate it.

I’m giving this example not to say “everyone should go do agent-foundations-y work exclusively now!”. I think it’s a neglected set of research directions that deserves far more effort, but I’m far too pessimistic about it to want humanity to put all its eggs in that basket.

Rather, my hope is that this example clarifies that I’m not saying “doing alignment research is bad” or even “all alignment research that poses a risk of advancing capabilities is bad”. I think that in a large majority of scenarios where humanity’s long-term future goes well, it mainly goes well because we made major alignment progress over the coming years and decades.[2] I don’t want this post to be taken as an argument against what I see as humanity’s biggest hope: figuring out AGI alignment.


 

  1. ^

    On the other hand, weirder research is more likely to shorten timelines a lot, if it shortens them at all. More mainstream research progress is less likely to have a large counterfactual impact, because it’s more likely that someone else has the same idea a few months or years later.

    “Low probability of shortening timelines a lot” and “higher probability of shortening timelines a smaller amount” both matter here, so I advocate that both niche and mainstream researchers be cautious and deliberate about publishing potentially timelines-shortening work.

  2. ^

    "Decades" would require timelines to be longer than my median. But when I condition on success, I do expect we have more time.


26 comments

First, I totally agree, and in general think that if the AI risk community can't successfully adopt a norm of "don't advance capabilities research" we basically can't really collaborate on anything.

I'm saddened to hear you mention Anthropic. I was holding out hope after SBF invested and looking at their site that they were the only present AGI company that was serious about not advancing the public state of capabilities research. Was that naive? Where do they fall along the Operational Adequacy spectrum?

I'd also be interested in hearing which parts of Anthropic's research output you think burns our serial time budget. If I understood the post correctly, then OP thinks that efforts like transformer circuits are mostly about accelerating parallelizable research.

Maybe OP thinks that

  • mechanistic interpretability has little value in terms of serial research
  • RLHF does not give us alignment (because it doesn't generalize beyond the "sharp left turn" which OP thinks is likely to happen)
  • therefore, since most of Anthropic's alignment-focused output has little value in terms of serial research, and it does somewhat enhance present-day LLM capabilities/usability, it is net negative?

But I'm very much unsure whether OP really believes this -- would love to hear him elaborate.

ETA: It could also be the case that OP was exclusively referring to the part of Anthropic that is about training LLMs efficiently as a pre-requisite to study those models?

My mental filter for inclusion on that list was apparent prevalence of the "we can't do alignment until we have an AGI in front of us" meme. If a researcher has that meme and their host org is committed to not advancing the public capabilities frontier, that does ameliorate the damage, and Anthropic does seem to me to be doing the best on that front (hooray for Anthropic!). That said, my impression is that folks at Anthropic are making the tradeoffs differently from how I would, and my guess is that this is in part due to differences in our models of what's needed for alignment, in a fashion related to the topic of the OP.

Thanks for elaborating! Insofar as your assessment is based on in-person interactions, I can't really comment since I haven't spoken much with people from Anthropic.

I think there are degrees to believing this meme you refer to, in the sense of "we need an AI of capability level X to learn meaningful things". And I would guess that many people at Anthropic do believe this weaker version -- it's their stated purpose after all. And for some values of X this statement is clearly true, e.g. learned filters by shallow CNNs trained on MNIST are not interpretable, whereas the filters of deep Inception-style CNNs trained on ImageNet are (mostly) interpretable.

One could argue that parts of interpretability do need to happen in a serial manner, e.g. finding out the best way to interpret transformers at all, the recent SoLU finding, or just generally building up knowledge on how to best formalize or go about this whole interpretability business. If that is true, and furthermore interpretability turns out to be an important component in promising alignment proposals, then the question is mostly about what level of X gives you the most information to advance the serial interpretability research in terms of how much other serial budget you burn.

I don't know whether people at Anthropic believe the above steps or have thought about it in these ways at all but if they did this could possibly explain the difference in policies between you and them?

I disagree with that view, primarily because I believe the sharp left turn is a remnant of the hard-takeoff view, which was always physically problematic. Now that we actually have AI in a limited form, there does seem to be a discontinuity at first, but it rarely gets you all the way, and once that first jump is behind us, progress is much smoother and slower. So slow takeoff is my mainline scenario for AI.

Finally, I think we will ultimately have to experiment, because, to be blunt, humans are quite bad at reasoning from first principles or priors. Without feedback from reality, reasoning like a formal mathematician tends to be wildly wrong in real life.

I found this very persuasive and convincing provided that success looks only like a comprehensive theory of alignment.

In worlds where a non-rigorous "craft" of alignment is "good enough" for capabilities within a certain range, I don't think this model/your conclusions necessarily hold.

And the dynamics of takeoff will also determine if having a craft of alignment that's good enough for par human AI but not superhuman AI is success (if takeoff is slow enough, you can use your aligned HLAI to help you figure out alignment for superhuman AI).

You can generalise this further to a notion of alignment escape velocity. You don't yet have a comprehensive theory of alignment that's robust to arbitrary capability amplification, but you have a craft of alignment that's robust to current/near term capabilities and can use current aligned AI to help develop your alignment craft for more capable agents, before those agents are then developed (this lead of alignment craft over capabilities needs to hold for "alignment escape velocity". In practice it probably looks like coordination to not develop more powerful AI than we know how to align.)

I expect much slower takeoff than EY (at least a few years between a village idiot and John von Neumann, and even longer to transition from John von Neumann to strongly superhuman intelligence [I expect marginal returns to cognitive investment to diminish considerably as you advance along the capabilities frontier]), so I'm very sympathetic to alignment escape velocity.

 
(All of the above said, I'm still young [24], so I'll probably devote my career [at least the first decade of it] to the "figure out a comprehensive theory of alignment" thing. It optimises for personal impact on a brighter future [which is very important to me], and it sounds like a really fun problem [I like abstract thinking].)

Dan Hendrycks (note: my employer) and Mantas Mazeika have recently covered some similar ideas in their paper X-Risk Analysis for AI Research, which is aimed at the broader ML research community. Specifically, they argue for the importance of having minimal capabilities externalities while doing safety research and address some common counterarguments to this view, including ones similar to those you've described. The reasons given for this are somewhat different from your serial/parallel characterization, so I think they serve as a good complement. The relevant part is here.

I kind of want you to get quantitative here? Like pretty much every action we take has some effect on AI timelines, but I think effect-on-AI-timelines is often swamped by other considerations (like effects on attitudes around those who will be developing AI).

Of course it's prima facie more plausible that the most important effect of AI research is the effect on timelines, but I'm actually still kind of sceptical. On my picture, I think a key variable is the length of time between when-we-understand-the-basic-shape-of-things-that-will-get-to-AGI and when-it-reaches-strong-superintelligence. Each doubling of that length of time feels to me like it could be worth order of 0.5-1% of the future. Keeping implemented-systems close to the technological-frontier-of-what's-possible could help with this, and may be more affectable than the

Note that I don't think this really factors into an argument in terms of "advancing alignment" vs "aligning capabilities" (I agree that if "alignment" is understood abstractly the work usually doesn't add too much to that). It's more like a DTD argument about different types of advancing capabilities.

I think it's unfortunate if that strategy looks actively bad on your worldview. But if you want to persuade people not to do it, I think you either need to persuade them of the whole case for your worldview (for which I've appreciated your discussion of the sharp left turn), or to explain not just that you think this is bad, but also how big a deal you think it is. Is this something your model cares about enough to trade for in some kind of grand inter-worldview bargaining? I'm not sure. I kind of think it shouldn't be (that relative to the size of ask it is, you'd get a much bigger benefit from someone starting to work on things you cared about than stopping this type of capabilities research), but I think it's pretty likely I couldn't pass your ITT here.

On my picture, I think a key variable is the length of time between when-we-understand-the-basic-shape-of-things-that-will-get-to-AGI and when-it-reaches-strong-superintelligence. Each doubling of that length of time feels to me like it could be worth order of 0.5-1% of the future.

Amusingly, I expect that each doubling of that time is negative EV. Because that time is very likely negative.

Question, are you assuming time travel or acausality being a thing in the next 20-30 years due to FTL work? Because that's the only way that time from AGI understanding to superintelligence is negative at all.

No, I expect (absent agent foundations advances) people will build superintelligence before they understand the basic shape of things of which that AGI will consist. An illustrative example (though I don't think this exact thing will happen): if the first superintelligence popped out of a genetic algorithm, then people would probably have no idea what pieces went into the thing by the time it exists.

On my picture, I think a key variable is the length of time between when-we-understand-the-basic-shape-of-things-that-will-get-to-AGI and when-it-reaches-strong-superintelligence.

I don't understand why you think the sort of capabilities research done by alignment-conscious people contributes to lengthening this time. In particular, what reason do you have to think they're not advancing the second time point as much as the first? Could you spell that out more explicitly?

How do you think about empirical work on scalable oversight? A lot of scalable oversight methods do result in capabilities improvements if they work well. A few concrete examples where this might be the case:

  1. Learning from Human Feedback
  2. Debate
  3. Iterated Amplification
  4. Imitative Generalization

I'm curious which of the above you think it's net good/bad to get working (or working better) in practice. I'm pretty confused about how to think about work on the above methods; they're on the mainline path for some alignment agendas but also advance capabilities / reduce the serial time available to work on the other alignment agendas.

FWIW, I had a mildly negative reaction to this title. I agree with you, but I feel like the term "PSA" should be reserved for things that are really very straightforward and non-controversial, and I feel like it's a bit of a bad rhetorical technique to frame your arguments as a PSA. I like the overall content of the post, but feel like a self-summarizing post title like "Most AI Alignment research is not parallelizable" would be better.

(done)

I agree, and this assumes that the bottleneck is serial time, when this may not be the case.

I tend to agree that burning up the timeline is highly costly, but more because Effective Altruism is an Idea Machine that has only recently started to really crank up. There's a lot of effort being directed towards recruiting top students from uni groups, but these projects require time to pay off.

I’m giving this example not to say “everyone should go do agent-foundations-y work exclusively now!”. I think it’s a neglected set of research directions that deserves far more effort, but I’m far too pessimistic about it to want humanity to put all its eggs in that basket.

If it is the case that more people should go into Agent Foundations research then perhaps MIRI should do more to enable it?

I think the word "capabilities" is actually not quite the word we're looking for here, because the part of "capabilities" that negatively affects timelines is pretty much just scaling work. You can make a reasonable argument that the doomsday clock is based on how fast we get to an AI that's smarter than we are; which is to say, how quickly we end up being able to scale to however many neurons it takes to become smarter than humans. And I don't expect that prompt research or interpretability research or OOD research is really going to meaningfully change that, or at least I don't recall having read any argument to the contrary.

Put another way: do you think empirical alignment research that doesn't increase capabilities can ever exist, even in principle?

EDIT: On reflection, I guess in addition to scaling that "agent goal-directedness" work would fall under the heading of things that are both (1) super dangerous for timelines and (2) super necessary to get any kind of aligned AI ever.  I do not know what to do about this.

Suppose that aligning an AGI requires 1000 person-years of research.

  • 900 of these person-years can be done in parallelizable 5-year chunks (e.g., by 180 people over 5 years — or, more realistically, by 1800 people over 10 years, with 10% of the people doing the job correctly half the time).
  • The remaining 100 of these person-years factor into four chunks that take 25 serial years apiece (so that you can't get any of those four parts done in less than 25 years).

 

Do you have a similar model for just building (unaligned) AGI? Or is the model meaningfully different? On a similar model for just building AGI, timelines would mostly be shortened by progressing through the serial research-person-years instead of the parallelisable research-person-years. If researchers who are progressing both capabilities and alignment are doing both in the parallelisable part, then this would be less worrying, as they're not actually shortening timelines meaningfully.

 

Unfortunately I imagine you think that building (unaligned) AGI quite probably doesn't have many more serial person-years of research required, if any. This is possibly another way of framing the prosaic AGI claim: "we expect we can get to AGI without any fundamentally new insights on intelligence, using (something like) current methods."

I like the distinction between parallelizable and serial research time, and agree that there should be a very high bar for shortening AI timelines and eating up precious serial time.

One caveat to the claim that we should prioritize serial alignment work over parallelizable work, is that this assumes an omniscient and optimal allocator of researcher-hours to problems. Insofar as this assumption doesn't hold (because our institutions fail, or because the knowledge about how to allocate researcher-hours itself depends on the outcomes of parallelizable research) the distinction between parallelizable and serial work breaks down and other considerations dominate.

One caveat to the claim that we should prioritize serial alignment work over parallelizable work, is that this assumes an omniscient and optimal allocator of researcher-hours to problems.

Why do you think it assumes that?

Sorry, I didn't mean to imply that these are logical assumptions necessary for us to prioritize serial work; but rather insofar as these assumptions don't hold, prioritizing work that looks serial to us is less important at the margin.

Spelling out the assumptions more:

  1. Omniscient meaning "perfect advance knowledge of what work will turn out to be serial vs parallelizable." In practice I think this is very hard to know beforehand - a lot of work that turned out to be part of the "serial bottleneck" looked parallelizable ex ante.
  2. Optimal meaning "institutions will actually allocate enough researchers to the problem in time for the parallelizable work to get done". Insofar as we don't expect this to hold, we will lose even if all the serial work gets done in time.

Also, on a re-read I notice that all the examples given in the post relate to mathematics or theoretical work, which is almost uniquely serial among human activities. By contrast, engineering disciplines are typically much more parallelizable, as evidenced by the speedup in technological progress during war-time.

This isn't a coincidence; the state of alignment knowledge is currently "we have no idea what would be involved in doing it even in principle, given realistic research paths and constraints", very far from being a well-specified engineering problem. Cf. https://intelligence.org/2013/11/04/from-philosophy-to-math-to-engineering/.

If you succeed at the framework-inventing "how does one even do this?" stage, then you can probably deploy an enormous amount of engineering talent in parallel to help with implementation, small iterative improvements, building-upon-foundations, targeting-established-metrics, etc. tasks.

I agree if the criterion is to get us to utopia, it's a problem (maybe not even that, but whatever), but if we instead say that it has to avoid x-risk, then we do have some options. My favorite research directions are IDA and HCH, with ELK a second option for alignment. We aren't fully finished on those ideas, but we do have at least some idea about what we can do.

Also, it's very unlikely theoretical work or mathematical work like provable alignment will do much, beyond toy problems here.

I agree if the criterion is to get us to utopia, it's a problem (maybe not even that, but whatever), but if we instead say that it has to avoid x-risk

This seems to misunderstand my view? My goal is to avoid x-risk, not to get us to utopia. (Or rather, my proximate goal is to end the acute risk period; my ultimate goal is utopia, but I want to punt nearly all of the utopia-work to the period after we've ended the acute risk period.)

Cf., from Eliezer:

"When I say that alignment is lethally difficult, I am not talking about ideal or perfect goals of 'provable' alignment, nor total alignment of superintelligences on exact human values, nor getting AIs to produce satisfactory arguments about moral dilemmas which sorta-reasonable humans disagree about, nor attaining an absolute certainty of an AI not killing everyone. When I say that alignment is difficult, I mean that in practice, using the techniques we actually have, "please don't disassemble literally everyone with probability roughly 1" is an overly large ask that we are not on course to get. So far as I'm concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering task, with a less than fifty percent chance of killing more than one billion people, I'll take it. Even smaller chances of killing even fewer people would be a nice luxury, but if you can get as incredibly far as "less than roughly certain to kill everybody", then you can probably get down to under a 5% chance with only slightly more effort. Practically all of the difficulty is in getting to "less than certainty of killing literally everyone". Trolley problems are not an interesting subproblem in all of this; if there are any survivors, you solved alignment. At this point, I no longer care how it works, I don't care how you got there, I am cause-agnostic about whatever methodology you used, all I am looking at is prospective results, all I want is that we have justifiable cause to believe of a pivotally useful AGI 'this will not kill literally everyone'. Anybody telling you I'm asking for stricter 'alignment' than this has failed at reading comprehension. The big ask from AGI alignment, the basic challenge I am saying is too difficult, is to obtain by any strategy whatsoever a significant chance of there being any survivors."

mathematical work like provable alignment will do much

I don't know what you mean by "do much", but if you think theory/math work is about "provable alignment" then you're misunderstanding all (or at least the vast majority?) of the theory/math work on alignment. "Is this system aligned?" is not the sort of property that admits of deductive proof, even if the path to understanding "how does alignment work?" better today routes through some amount of theorem-proving on more abstract and fully-formalizable questions.
