Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The slow takeoff hypothesis predicts that AGI emerges in a world where powerful but non-AGI AI is already a really big deal. Whether AI is a big deal right before the emergence of AGI determines many super basic things about what we should think our current job is. I hadn’t fully appreciated the size of this effect until a few days ago.

In particular, in a fast takeoff world, AI takeover risk never looks much more obvious than it does now, and so x-risk-motivated people should be assumed to cause the majority of the research on alignment that happens. In contrast, in a slow takeoff world, many aspects of the AI alignment problems will already have shown up as alignment problems in non-AGI, non-x-risk-causing systems; in that world, there will be lots of industrial work on various aspects of the alignment problem, and so EAs now should think of themselves as trying to look ahead and figure out which margins of the alignment problem aren’t going to be taken care of by default, and try to figure out how to help out there.

In the fast takeoff world, we’re much more like a normal research field–we want some technical problem to eventually get solved, so we try to solve it. But in the slow takeoff world, we’re basically in a weird collaboration across time with the more numerous, non-longtermist AI researchers who will be in charge of aligning their powerful AI systems but who we fear won’t be cautious enough in some ways or won’t plan ahead in some other ways. Doing technical research in the fast takeoff world basically just requires answering technical questions, while in the slow takeoff world your choices about research projects are closely related to your sociological predictions about what things will be obvious to whom when.

I think that these two perspectives are extremely different, and I think I’ve historically sometimes had trouble communicating with people who held the slow takeoff perspective because I didn’t realize we disagreed on basic questions about the conceptualization of the question. (These miscommunications persisted even after I was mostly persuaded of slow takeoffs, because I hadn’t realized the extent to which I was implicitly assuming fast takeoffs in my picture of how AGI was going to happen.)

As an example of this, I think I was quite confused about what genre of work various prosaic alignment researchers think they’re doing when they talk about alignment schemes. To quote a recent AF shortform post of mine:

Something I think I’ve been historically wrong about:

A bunch of the prosaic alignment ideas (eg adversarial training, IDA, debate) now feel to me like things that people will obviously do the simple versions of by default. Like, when we’re training systems to answer questions, of course we’ll use our current versions of systems to help us evaluate, why would we not do that? We’ll be used to using these systems to answer questions that we have, and so it will be totally obvious that we should use them to help us evaluate our new system.

Similarly with debate--adversarial setups are pretty obvious and easy.

In this frame, the contributions from Paul and Geoffrey feel more like “they tried to systematically think through the natural limits of the things people will do” than “they thought of an approach that non-alignment-obsessed people would never have thought of or used”.

It’s still not obvious whether people will actually use these techniques to their limits, but it would be surprising if they weren’t used at all.

I think the slightly exaggerated slogan for this update of mine is “IDA is futurism, not a proposal”.

My current favorite example of the thinking-on-the-margin version of alignment research strategy is in this comment by Paul Christiano.

24 comments

I agree with the basic difference you point to between fast- and slow-takeoff worlds, but disagree that it has important strategic implications for the obviousness of takeover risk.

In slow takeoff worlds, many aspects of the alignment problem show up well before AGI goes critical. However, people will by default train systems to conceal those problems. (This is already happening: RL from human feedback is exactly the sort of strategy which trains systems to conceal problems, and we've seen multiple major orgs embracing it within the past few months.) As a result, AI takeover risk never looks much more obvious than it does now.

Concealed problems look like no problems, so there will in general be economic incentives to train in ways which conceal problems. The most-successful-looking systems, at any given time, will be systems trained in ways which incentivize hidden problems over visible problems.

I expect that people will find it pretty obvious that RLHF leads to somewhat misaligned systems, if they are widely used by the public. Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment. And so I do think that this will make AI takeover risk more obvious.

Examples of small AI catastrophes will also probably make takeover risk more obvious.

I guess another example of this phenomenon is that a bunch of people are more worried about AI takeover than they were five years ago, because they’ve seen more examples of ML systems being really smart, even though they wouldn’t have said five years ago that ML systems could never solve those problems. Seeing the ways things happen is often pretty persuasive to people.

Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment.

This prediction feels like... it doesn't play out the whole game tree? Like, yeah, Facebook releases one algorithm optimizing for clicks in a way people are somewhat unhappy about. But the customers are unhappy about it, which is not an economically-stable state of affairs, so shortly thereafter Facebook switches to a different metric which is less click-centric. (IIRC this actually happened a few years ago.)

On the other hand, sometimes Facebook's newsfeed algorithm is bad in ways which are not visible to individual customers. Like, maybe there's an echo chamber problem, people only see things they agree with. But from an individual customer's perspective, that's exactly what they (think they) want to see, they don't know that there's anything wrong with the information they're receiving. This sort of problem does not actually look like a problem from the perspective of any one person looking at their own feed; it looks good. So that's a much more economically stable state; Facebook is less eager to switch to a new metric.

... but even that isn't a real example of a problem which is properly invisible. It's still obvious that the echo-chamber-newsfeed is bad for other people, and therefore it will still be noticed, and Facebook will still be pressured to change their metrics. (Indeed that is what happened.) The real problems are problems people don't notice at all, or don't know to attribute to the newsfeed algorithm at all. We don't have a widely-recognized example of such a thing and probably won't any time soon, precisely because most people do not notice it. Yet I'd be surprised if Facebook's newsfeed algorithm didn't have some such subtle negative effects, and I very much doubt that the subtle problems will go away as the visible problems are iterated on.

If anything, I'd expect iterating on visible problems to produce additional subtle problems - for instance, in order to address misinformation problems, Facebook started promoting an Official Narrative which is itself often wrong. But that's much harder to detect, because it's wrong in a way which the vast majority of Official Sources also endorse. To put it another way: if most of the population can be dragged into a single echo chamber, all echoing the same wrong information, that doesn't make the echo chamber problem less bad, but it does make the echo chamber problem less visible.

Anyway, zooming out: solve for the equilibrium, as Cowen would say. If the problems are visible to customers, that's not a stable state. Organizations will be incentivized to iterate until problems stop being visible. They will not, however, be incentivized to iterate away the problems which aren't visible.
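
The equilibrium argument above can be made concrete with a toy sketch. Everything in it is an illustrative assumption rather than a model of any real system: the `Problem` class, the `iterate_until_stable` function, and the example problem names are hypothetical. The only point is that if fixes are triggered solely by visible problems, the loop stops as soon as nothing is visible, regardless of what remains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    name: str
    visible: bool  # assumption: customers only complain about visible problems


def iterate_until_stable(problems):
    """Fix one visible problem per round until no visible problems remain.

    Hidden problems generate no pressure, so they are never selected.
    """
    remaining = list(problems)
    rounds = 0
    while any(p.visible for p in remaining):
        fixed = next(p for p in remaining if p.visible)
        remaining.remove(fixed)
        rounds += 1
    return remaining, rounds


# Hypothetical starting state, loosely following the newsfeed discussion.
problems = [
    Problem("click-centric feed", visible=True),
    Problem("echo chamber noticed in other people's feeds", visible=True),
    Problem("subtle harm nobody attributes to the feed", visible=False),
]

remaining, rounds = iterate_until_stable(problems)
print(rounds, [p.name for p in remaining])
```

In this sketch the stable state is reached after both visible problems are fixed, and the hidden problem is still there. The sketch leaves out the further worry raised above, that fixing visible problems may itself create new hidden ones.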

I can’t tell which of two arguments you’re making: that there are unknown unknowns, or that myopia isn’t a complete solution.

This is a good argument that all metrics are Goodhartable, and that if takeover occurs and the AI is incorrigible, that’ll cause suboptimal value lock-in (i.e. unknown unknowns).

I agree myopia isn’t a complete solution, but it seems better for preventing takeover risk than for preventing social media dysfunction? It seems more easily definable in the worst case (“don’t do something nearly all humans really dislike” rather than “make the public square function well”).

Can you talk more about why RLHF is “concealing problems”? Do you mean “attempting alignment” in a way that other people won’t, or something else?

Roughly, "avoid your actions being labelled as bad by humans [or models of humans]" is not quite the same signal as "don't be bad".

Ah ok, so you’re saying RLHF is bad if it’s the action model. But it seems fine if it’s done to the reward model, right?

What do you mean by “RLHF is done to the reward model”, and why would that be fine?

You can use an LLM to ask what actions to take, or you can use an LLM to ask “hey is this a good world state?” The latter seems like it might capture a lot of human semantics about value given RLHF.

I guess it depends on “how fast is fast and how slow is slow”, and what you say is true on the margin, but here's my plea that the type of thinking that says “we want some technical problem to eventually get solved, so we try to solve it” is a super-valuable type of thinking right now even if we were somehow 100% confident in slow takeoff. (This is mostly an abbreviated version of this section.)

  1. Differential Technological Development (DTD) seems potentially valuable, but is only viable if we know what paths-to-AGI will be safe & beneficial really far in advance. (DTD could take the form of accelerating one strand of modern ML relative to another—e.g. model-based RL versus self-supervised language models etc.—or it could take the form of differentially developing ML-as-a-whole compared to, I dunno, something else.) Relatedly, suppose (for the sake of argument) that someone finds an ironclad proof that safe prosaic AGI is impossible, and the only path forward is a global ban on prosaic AGI research. It would be way better to find that proof right now than finding it in 5 years, and better in 5 years than 10, etc., and that's true no matter how gradual takeoff is.
  2. We don't know how long safety research will take. If takeoff happens over N years, and safety research takes N+1 years, that's bad even if N is large.
    1. Maybe you'll say that almost all of the person-years of safety research will happen during takeoff, and any effort right now is a drop in the ocean compared to that. But I really think wall-clock time is an important ingredient in research progress, not just person-years. (“Nine women can't make a baby in a month.”)
  3. We don't just need to figure out the principles for avoiding AGI catastrophic accidents. We also need every actor with a supercomputer to understand and apply these principles. Some ideas take many decades to become widely (let alone universally) accepted—famous examples include evolution and plate tectonics. It takes wall-clock time for arguments to be refined. It takes wall-clock time for evidence to be marshaled. It takes wall-clock time for nice new pedagogical textbooks to be created. And of course, it takes wall-clock time for the stubborn holdouts to die and be replaced by the next generation. :-P
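
The wall-clock point in item 2 can be sketched as a toy calculation. The split of research into a strictly serial part and a parallelizable part is an assumption made for illustration, not something the comment specifies, and the function names and numbers are hypothetical.

```python
def years_to_finish(serial_years: float,
                    parallel_person_years: float,
                    researchers: int) -> float:
    """Wall-clock years under a toy model.

    Assumption: the serial part cannot be sped up by adding people,
    while the parallel part divides evenly among researchers.
    """
    return serial_years + parallel_person_years / researchers


def finishes_in_time(takeoff_years: float,
                     serial_years: float,
                     parallel_person_years: float,
                     researchers: int) -> bool:
    """True if the research completes within the takeoff window."""
    return years_to_finish(serial_years, parallel_person_years,
                           researchers) <= takeoff_years


# Hypothetical numbers: takeoff lasts N = 10 years, and the serial part of
# safety research alone takes N + 1 = 11 years.
print(finishes_in_time(10, 11, 1000, 10))       # few researchers
print(finishes_in_time(10, 11, 1000, 100_000))  # very many researchers
print(finishes_in_time(10, 5, 1000, 1000))      # shorter serial part
```

Under these assumptions, adding researchers never rescues the case where the serial part exceeds the takeoff window, which is why starting the clock earlier matters even if most person-years arrive later.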

Some ideas take many decades to become widely (let alone universally) accepted—famous examples include evolution and plate tectonics.

One example that an AI policy person mentioned in a recent Q&A is "bias in ML", which is already pretty much a consensus issue in ML and AI policy. I guess this happened in 5ish years?

I certainly wouldn't say that all correct ideas take decades to become widely accepted. For example, often somebody proves a math theorem, and within months there's an essentially-universal consensus that the theorem is true and the proof is correct.

Still, "bias in ML" is an interesting example. I think that in general, "discovering bias and fighting it" is a thing that everyone feels very good about doing, especially in academia and tech which tend to be politically left-wing. So the deck was stacked in its favor for it to become a popular cause to support and talk about. But that's not what we need for AGI safety. The question is not “how soon will almost everyone be saying feel-good platitudes about AGI safety and bemoaning the lack of AGI safety?”; the question is “how soon will AGI safety be like bridge-building safety, where there are established, universally-agreed-upon, universally-followed, legally-mandated, idiot-proof best practices?”. I don't think the "bias in ML" field is there yet. I'm not an expert, but my impression is that there is a lot of handwringing about bias in ML, and not a lot of established universal idiot-proof best practices about bias in ML. I think a lot of the discourse is bad or confused—e.g. people continue to cite the ProPublica report as a prime example of "bias in ML" despite the related impossibility theorem (see Brian Christian book chapter 2). I'm not even sure that all the currently-popular best practices are good ideas. For example, if there's a facial recognition system that's worse at black faces than white faces, my impression is that best practices are to diversify the training data so that it gets better at black faces. But it seems to me that facial recognition systems are just awful, because they enable mass surveillance, and the last thing we should be doing is making them better, and if they're worse at identifying a subset of the population then maybe those people are the lucky ones.

So by my standards, "bias in ML" is still a big mess, and therefore 5ish years hasn't been enough.

I think the ML bias folks are stuck with too hard a problem, since they’ve basically decided that all of justice can/should (or should not) be remedied through algorithms. As a result the technical folks have run into all the problems philosophy never solved, and so “policy” can only do the most obvious interventions (limit use of inaccurate facial recognition) which get total researcher consensus. (Not to mention the subfield is left-coded and thus doesn’t win the bipartisan natsec-tech crowd.) That said, 5 years was certainly enough to get their scholars heavily embedded throughout a presidential administration.

In particular, in a fast takeoff world, AI takeover risk never looks much more obvious than it does now, and so x-risk-motivated people should be assumed to cause the majority of the research on alignment that happens.

I strongly disagree with that and I don't think it follows from the premise. I think by most reasonable definitions of alignment it is already the case that most of the research is not done by x-risk-motivated people.

Furthermore, I think it reflects poorly on this community that this sort of sentiment seems to be common. 

It's possible that a lot of our disagreement is due to different definitions of "research on alignment", where you would only count things that (e.g.) 1) are specifically about alignment that likely scales to superintelligent systems, or 2) are motivated by X safety.

To push back on that a little bit...
RE (1): It's not obvious what will scale, and I think historically this community has been too pessimistic (i.e. almost completely dismissive) about approaches that seem hacky or heuristic.
RE (2): This is basically circular.

I disagree, so I'm curious: what are, for you, great examples of good research on alignment that is not done by x-risk-motivated people? (Not being dismissive, I'm genuinely curious, and discussing specifics sounds more promising than downvoting you to oblivion and not having a conversation at all.)

Examples would be interesting, certainly. Concerning the post's point, I'd say the relevant claim is that [type of alignment research that'll be increasingly done in slow takeoff scenarios] is already being done by non-x-risk-motivated people.

I guess the hope is that at some point there are clear-to-everyone problems with no hacky solutions, so that incentives align to look for fundamental fixes - but I wouldn't want to rely on this.

I also stumbled over this sentence.

1) I think even non-obvious issues can get much more research traction than AI safety does today. And I don't even think that catastrophic risks from AI are particularly non-obvious?

2) Not sure how broadly "cause the majority of research" is defined here, but I have some hope we can find ways to turn money into relevant research.

In contrast, in a slow takeoff world, many aspects of the AI alignment problems will already have shown up as alignment problems in non-AGI, non-x-risk-causing systems; in that world, there will be lots of industrial work on various aspects of the alignment problem, and so EAs now should think of themselves as trying to look ahead and figure out which margins of the alignment problem aren’t going to be taken care of by default, and try to figure out how to help out there.

Let's consider the opposite. Imagine you are programming a self-driving car in a simulated environment. You notice it Goodharting your metrics, so you tweak them and try again. You build up a list of 1001 ad hoc patches that make your self-driving car behave reasonably most of the time.

The object-level patches only really apply to self-driving cars. They include things like a small intrinsic preference towards looking at street signs. The meta-level strategy of patching it until it works isn't very relevant either.

Imagine a world with many AIs like this, all with ad hoc kludges of hard-coded utility functions. AI is becoming increasingly economically important and getting close to AGI. Slow takeoff. All the industrial work is useless.

In contrast, in a slow takeoff world, many aspects of the AI alignment problems will already have shown up as alignment problems in non-AGI, non-x-risk-causing systems; in that world, there will be lots of industrial work on various aspects of the alignment problem, and so EAs now should think of themselves as trying to look ahead and figure out which margins of the alignment problem aren’t going to be taken care of by default, and try to figure out how to help out there.

TLDR: I think an important sub-question is 'how fast is agency takeoff' as opposed to economic/power takeoff in general.

There are a few possible versions of this in slow takeoff which look quite different IMO.

  1. Agentic systems show up before the end of the world and industry works to align these systems. Here's a silly version of this:

GPT-n prefers writing romance to anything else. It's not powerful enough to take over the world but it does understand its situation, what training is, etc. And it would take over the world if it could, and this is somewhat obvious to industry. In practice it mostly tries to detect when it isn't in training and then steer outputs in a more romantic direction. Industry would like to solve this, but finetuning isn't enough and each time they've (naively) retrained models they just get some other 'quirky' behavior (but at least soft-core romance is better than that AI which always asks for crypto to be sent to various addresses). And adversarial training just results in getting other strange behavior.

Industry works on this problem because it's embarrassing and it costs them money to discard 20% of completions as overly romantic. They also foresee the problem getting worse (even if they don't buy x-risk).

  2. Systems that are not obviously agentic have alignment problems, but we don't see obvious, near-human-level agency until the end of the world. This is a slow takeoff world, so these systems are taking over a larger and larger fraction of the economy despite not being very agentic. These alignment issues could be reward hacking or just general difficulty getting language models to follow instructions to the best of their ability (as shows up currently).

I'd claim that in a world which is more centrally scenario (2), industrial work on the 'alignment problem' might not be very useful for reducing existential risk, in the same way that I think a lot of current 'applied alignment'/instruction following/etc. isn't very useful. So, this world goes similarly to fast takeoff in terms of research prioritization. But in something like scenario (1), industry has to do more useful research and problems are more obvious.

while in the slow takeoff world your choices about research projects are closely related to your sociological predictions about what things will be obvious to whom when.

Example?

I’m not that excited for projects along the lines of “let’s figure out how to make human feedback more sample efficient”, because I expect that non-takeover-risk-motivated people will eventually be motivated to work on that problem, and will probably succeed quickly given motivation. (Also I guess because I expect capabilities work to largely solve this problem on its own, so maybe this isn’t actually a great example?) I’m fairly excited about projects that try to apply human oversight to problems that the humans find harder to oversee, because I think that this is important for avoiding takeover risk but that the ML research community as a whole will procrastinate on it.

in a slow takeoff world, many aspects of the AI alignment problems will already have shown up as alignment problems in non-AGI, non-x-risk-causing systems; in that world, there will be lots of industrial work on various aspects of the alignment problem, and so EAs now should think of themselves as trying to look ahead and figure out which margins of the alignment problem aren’t going to be taken care of by default, and try to figure out how to help out there.

I agree with this, and I think it extends beyond what you're describing here. In a slow takeoff world, the aspects of the alignment problem that show up in non-AGI systems will also provide EAs with a lot of information about what's going on, and I think we should try to do things now that will help us to notice those aspects and act appropriately. (I'm not sure what this looks like; maybe we want to build relationships with whoever will be building these systems, or maybe we want to develop methods for figuring things out and fixing problems that are likely to generalize.)