This article was outlined by Nate Soares, inflated by Rob Bensinger, and then edited by Nate. Content warning: the tone of this post feels defensive to me. I don't generally enjoy writing in "defensive" mode, but I've had this argument thrice recently in surprising places, and so it seemed worth writing my thoughts up anyway.


In last year’s Ngo/Yudkowsky conversation, one of Richard’s big criticisms of Eliezer was, roughly, ‘Why the heck have you spent so much time focusing on recursive self-improvement? Is that not indicative of poor reasoning about AGI?’

I’ve heard similar criticisms of MIRI and FHI’s past focus on orthogonality and instrumental convergence: these notions seem obvious, so either MIRI and FHI must be totally confused about what the big central debates in AI alignment are, or they must have some very weird set of beliefs on which these notions are somehow super-relevant.

This seems to be a pretty common criticism of past-MIRI (and, similarly, of past-FHI); in the past month or so, I’ve heard it two other times while talking to other OpenAI and Open Phil people.

This argument looks misguided to me, and I hypothesize that a bunch of the misguidedness is coming from a simple failure to understand the relevant history.

I joined this field in 2013-2014, which is far from "early", but is early enough that I can attest that recursive self-improvement, orthogonality, etc. were geared towards a different argumentative environment, one dominated by claims like "AGI is impossible", "AGI won't be able to exceed humans by much", and "AGI will naturally be good".

A possible response: “Okay, but ‘sufficiently smart AGI will recursively self-improve’ and ‘AI isn’t automatically nice’ are still obvious. You should have just ignored the people who couldn’t immediately see this, and focused on the arguments that would be relevant to hypothetical savvy people in the future, once the latter joined in the discussion.”

I have some sympathy for this argument. Some considerations weighing against, though, are:

  • I think it makes more sense to filter on argument validity, rather than “obviousness”. What’s obvious varies a lot from individual to individual. If just about everyone talking about AGI is saying “obviously false” things (as was indeed the case in 2010), then it makes sense to at least try publicly writing up the obvious counter-arguments.
  • This seems to assume that the old arguments (e.g., in Superintelligence) didn’t work. In contrast, I think it’s quite plausible that “everyone with a drop of sense in them agrees with those arguments today” is true in large part because these propositions were explicitly laid out and argued for in the past. The claims we take as background now are the claims that were fought for by the old guard.
  • I think this argument overstates how many people in ML today grok the “obvious” points. E.g., based on a recent DeepMind Podcast episode, these sound like likely points of disagreement with David Silver.

But even if you think this was a strategic error, I still think it’s important to recognize that MIRI and FHI were arguing correctly against the mistaken views of the time, rather than arguing poorly against future views.

 

Recursive self-improvement

Why did past-MIRI talk so much about recursive self-improvement? Was it because Eliezer was super confident that humanity was going to get to AGI via the route of a seed AI that understands its own source code?

I doubt it. My read is that Eliezer did have "seed AI" as a top guess, back before the deep learning revolution. But I don't think that's the main source of all the discussion of recursive self-improvement in the period around 2008. 

Rather, my read of the history is that MIRI was operating in an argumentative environment where:

  • Ray Kurzweil was claiming things along the lines of ‘Moore’s Law will continue into the indefinite future, even past the point where AGI can contribute to AGI research.’ (The Five Theses, in 2013, is a list of the key things Kurzweilians were getting wrong.)
  • Robin Hanson was claiming things along the lines of ‘The power is in the culture; superintelligences wouldn’t be able to outstrip the rest of humanity.’

The memetic environment was one where most people were either ignoring the topic altogether, or asserting ‘AGI cannot fly all that high’, or asserting ‘AGI flying high would be business-as-usual (e.g., with respect to growth rates)’.

The weighty conclusion of the "recursive self-improvement" meme is not “expect seed AI”. The weighty conclusion is “sufficiently smart AI will rapidly improve to heights that leave humans in the dust”.

Note that this conclusion is still, to the best of my knowledge, completely true, and recursive self-improvement is a correct argument for it.

Which is not to say that recursive self-improvement happens before the end of the world; if the first AGI's mind is sufficiently complex and kludgy, it’s entirely possible that the cognitions it implements are able to (e.g.) crack nanotech well enough to kill all humans, before they’re able to crack themselves.
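To make the shape of the argument concrete, here's a deliberately toy sketch (my own illustration with made-up numbers, not anything from MIRI's actual models): treat each unit of self-improvement as enabling some multiple k of further improvement in the next round. Whether the process fizzles or compounds then depends entirely on whether k crosses 1, not on any particular architecture.

```python
def total_improvement(k: float, rounds: int) -> float:
    """Toy model of self-improvement: each unit of improvement enables
    k further units next round. If k < 1 the series converges (the
    process fizzles); if k >= 1 it diverges (gains compound without bound)."""
    gain, total = 1.0, 0.0
    for _ in range(rounds):
        total += gain
        gain *= k
    return total

# Below criticality: total gains level off at a finite ceiling.
print(total_improvement(0.5, 100))
# Above criticality: gains compound explosively.
print(total_improvement(1.2, 100))
```

The model abstracts away everything interesting (where k comes from, whether it varies with capability), but it shows why "does each improvement pay for more than itself?" is the crux, rather than any particular guess about how the first AGI is built.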

The big update over the last decade has been that humans might be able to fumble their way to AGI that can do crazy stuff before it does much self-improvement. 

(Though, to be clear, from my perspective it’s still entirely plausible that you will be able to turn the first general reasoners to their own architecture and get a big boost, and so there's still a decent chance that self-improvement plays an important early role. (Probably destroying the world in the process, of course. Doubly so given that I expect it’s even harder to understand and align a system if it’s self-improving.))

In other words, it doesn’t seem to me like developments like deep learning have undermined the recursive self-improvement argument in any real way. The argument seems solid to me, and reality seems quite consistent with it.

Taking into account its past context, recursive self-improvement was a super conservative argument that has been vindicated in its conservatism.

It was an argument for the proposition “AGI will be able to exceed the heck out of humans”. And AlphaZero came along and was like, “Yep, that’s true.”

Recursive self-improvement was a super conservative argument for “AI blows past human culture eventually”; when reality then comes along and says “yes, this happens in 2016 when the systems are far from truly general”, the update to make is that this way of thinking about AGI sharply outperformed, not that this way of thinking was silly because it talked about sci-fi stuff like recursive self-improvement when it turns out you can do crazy stuff without even going that far. As Eliezer put it, “reality held a more extreme position than I did on the Yudkowsky-Hanson spectrum”.

If arguments like recursive self-improvement and orthogonality seem irrelevant and obvious now, then great! Intellectual progress has been made. If we're lucky and get to the next stop on the train, then I’ll hopefully be able to link back to this post when people look back and ask why we were arguing about all these other silly obvious things back in 2022.

 

Deep learning

I think "MIRI staff spent a bunch of time talking about instrumental convergence, orthogonality, recursive self-improvement, etc." is a silly criticism. 

On the other hand, I think "MIRI staff were slow to update about how far deep learning might go" is a fair criticism, and we lose Bayes points here, especially relative to people who were vocally bullish about deep learning before late 2015 / early 2016.

In 2003, deep learning didn't work, and nothing else worked all that well either. A reasonable guess was that we'd need to understand intelligence in order to get unstuck; and if you understand intelligence, then an obvious way to achieve superintelligence is to build a simple, small, clean AI that can take over the hard work of improving itself. This is the idea of “seed AI”, as I understand it. I don’t think 2003-Eliezer thought this direction was certain, but I think he had a bunch of probability mass on it.[1]

I think that Eliezer’s model was somewhat surprised by humanity’s subsequent failure to gain much understanding of intelligence, and also by the fact that humanity was able to find relatively brute-force-ish methods that were computationally tractable enough to produce a lot of intelligence anyway.

But I also think this was a reasonable take in 2003. Other people had even better takes — Shane Legg comes to mind. He stuck his neck out early with narrow predictions that panned out. Props to Shane.

I personally had run-of-the-mill bad ideas about AI as late as 2010, and didn't turn my attention to this field until about 2013, which means that I lost a bunch of Bayes points relative to the people who managed to figure out in 1990 or 2000 that AGI will be our final invention. (Yes, even if the people who called it in 2000 were expecting seed AI rather than deep learning, back when nothing was really working. I reject the Copenhagen Theory Of Forecasting, according to which you gain special epistemic advantage from not having noticed the problem early enough to guess wrongly.)

My sense is that MIRI started taking the deep learning revolution much more seriously in 2013, while having reservations about whether broadly deep-learning-like techniques would be the first way humanity reached AGI. Even now, it’s not completely obvious to me that this will be the broad paradigm in which AGI is first developed, though something like that seems fairly likely at this point. But, if memory serves, during the Jan. 2015 Puerto Rico conference I was treating the chance of deep learning going all the way as being in the 10-40% range; so I don't think it would be fair to characterize me as being totally blindsided.

My impression is that Eliezer and I, at least, updated harder in 2015/16, in the wake of AlphaGo, than a bunch of other locals (and I, at least, think I've been less surprised than various other vocal locals by GPT, PaLM, etc. in recent years).

Could we have done better? Yes. Did we lose Bayes points? Yes, especially relative to folks like Shane Legg.

But since 2016, it mostly looks to me like with each AGI advancement, others update towards my current position. So I'm feeling pretty good about the predictive power of my current models.

Maybe this all sounds like revisionism to you, and your impression of FOOM-debate-era Eliezer was that he loved GOFAI and thought recursive self-improvement was the only advantage digital intelligence could have over human intelligence.

And, I wasn't here in that era. But I note that Eliezer said the opposite at the time; and the track record for such claims seems to hold more examples of “mistakenly rounding the other side’s views off to a simpler, more-cognitively-available caricature”, and fewer examples of “peering past the veil of the author’s text to see his hidden soul”.

Also: It’s important to ask proponents of a theory what they predict will happen, before crowing about how their theory made a misprediction. You're always welcome to ask for my predictions in advance.

(I’ve been making this offer to people who disagree with me about whether I have egg on my face since 2015, and have rarely been taken up on it. E.g.: yes, we too predict that it's easy to get GPT-3 to tell you the answers that humans label "aligned" to simple word problems about what we think of as “ethical”, or whatever. That’s never where we thought the difficulty of the alignment problem was in the first place. Before saying that this shows that alignment is actually easy contra everything MIRI folk said, consider asking some MIRI folk for their predictions about what you’ll see.)

 

  1. ^

    In particular, I think Eliezer’s best guess was AI systems that would look small, clean, and well-understood relative to the large opaque artifacts produced by deep learning. That doesn’t mean that he was picturing GOFAI; there exist a wide range of possibilities of the form “you understand intelligence well enough to not have to hand off the entire task to a gradient-descent-ish process to do it for you” that do not reduce to “coding everything by hand”, and certainly don’t reduce to “reasoning deductively rather than probabilistically”.


The weighty conclusion of the "recursive self-improvement" meme is not “expect seed AI”. The weighty conclusion is “sufficiently smart AI will rapidly improve to heights that leave humans in the dust”.

Note that this conclusion is still, to the best of my knowledge, completely true, and recursive self-improvement is a correct argument for it.

This whole discussion seems relevant to me because it feels like it keeps coming up when you and Eliezer talk about why prosaic AI alignment doesn't help, sometimes explicitly ("Even if this helped with capabilities produced by SGD, why would it help in the regime that actually matters?") and often because it just seems to be a really strong background assumption for you that leads to you having a very different concrete picture of what is going to happen.

It doesn't seem like recursive self-improvement is a cheap lower bound argument, it seems like you really think that what I think of as the "normal, boring" world just isn't going to happen. So I'm generally interested in talking about that and get clear about what you think is going on here, and hopefully get some predictions on the record.

This also gives me the sense that you feel quite strongly about your view of recursive self-improvement. If you had a 50% chance on "something like boring business as usual with SGD driving crucial performance improvements at the crucial time" then your dismissal of prosaic AI alignment seems strange to me.

(ETA: there's actually a lot going on here, I'd guess this is like 1/4th of the total disagreement.)

Robin Hanson was claiming things along the lines of ‘The power is in the culture; superintelligences wouldn’t be able to outstrip the rest of humanity.’

Worth noting that Robin seems to strongly agree that "recursive self-improvement" is going to happen, it's just that he has a set of empirical views for which that name sounds silly and it won't be as local or fast as Eliezer thinks.

Relatedly, Eliezer saying "Robin was wrong for doubting RSI; if other crazy stuff will happen before RSI then he's just even more wrong" seems wrong. In Age of Em I think Robin speculates that within a few years of the first brain emulations, there will be more alien AI systems which are able to double their own productivity within a few weeks (and then a few weeks later it will be even crazier)! That sure sounds like he's on board with the part of RSI that is obvious, and what he's saying is precisely that other crazy stuff will happen first, essentially that we will use computers to replace the hardware of brains before we replace the software. (The book came out in 2016 but I think Robin has had the basic outline of this view since 2012 or earlier.)

The big update over the last decade has been that humans might be able to fumble their way to AGI that can do crazy stuff before it does much self-improvement. 

This feels to me like it's still missing a key part of the disagreement, at least with people like me. As best I can tell/guess, this is also an important piece of the disagreement with Robin Hanson and with some of the OpenAI or OpenPhil people who don't like your discussion of recursive self-improvement.

Here's how the situation seems to me:

  • "Making AI better" is one of the activities humans are engaged in.
  • If AI were about as good at things as humans, then AI would be superhuman at "making AI better" at roughly the same time it was superhuman at other tasks.
  • In fact there will be a lot of dispersion, and prima facie we'd guess that there are a lot of tasks (say 15-60% of them, as a made-up 50% confidence interval) where AI is superhuman before AI R&D.
  • What's more, even within R&D we expect some complementarity where parts of the project get automated while humans still add value in other places, leading to more continuous (but still fairly rapid, i.e. over years rather than decades) acceleration.
  • That said, at the point when AI is capable of doing a lot of crazy stuff in other domains, "AI R&D" is a crazy important part of the economy, and so this will be a big but not overwhelmingly dominant part of what AI is applied to (and relatedly, a big but not overwhelmingly dominant part of where smart people entering the workforce go to work, and where VCs invest, and so on).
  • The improvements AI systems make to AI systems are more like normal AI R&D, and can be shared across firms in the same way that modern AI research can be.

As far as I can make out from Eliezer and your comments, you think that instead the action is crossing a criticality threshold of "k>1," which suggests a perspective more like:

  • AI is able to do some things and not others.
  • The things AI can do, it typically does much better/faster/cheaper than humans.
  • Early AI systems can improve some but not all parts of their own design. This leads to rapid initial progress, but diminishing returns (basically they are free-riding on parts of the design already done by humans).
  • Eventually AI is able to improve enough stuff that there are increasing rather than diminishing returns to scale even within the subset of improvements that the AI is able to make.
  • Past this point progress is accelerating even without further human effort (which also leads to expanding the set of improvements at which the AI is superhuman). So from here the timescale for takeoff is very short relative to the timescale of human-driven R&D progress.
  • This is reasonably likely to happen from a single innovation that pushes you over a k>1 threshold.
  • This dynamic is a central part of the alignment and policy problem faced by humans right now who are having this discussion. I.e. prior to the time when this dynamic happens most research is still being done by humans, the world is relatively similar to the world of today, etc. 
  • The improvements made by AI systems during this process are very unlike modern R&D, and so can't be shared between AI labs in the same way that e.g. architectural innovations for neural networks or new training strategies can be.

I feel like the first picture is looking better and better with each passing year. Every step towards boring automation of R&D (e.g. by code models that can write mediocre code and thereby improve the efficiency of normal software engineers and ML researchers) suggests that AI will be doing recursive self-improvement around the same time it is doing other normal tasks, with timescales and economic dynamics closer to those envisioned by more boring people.

On what I'm calling the boring picture, "k>1" isn't a key threshold. Instead we have k>1 and increasing returns to scale well before takeoff. But the rate of AI progress is slow-but-accelerating relative to human abilities, and therefore we can forecast takeoff speed by looking at the rate of AI progress when driven by human R&D.
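A minimal way to pin down the qualitative difference between the two pictures in code (entirely made-up dynamics and parameters, purely to illustrate the contrast):

```python
def boring_progress(years: int, human_rate: float = 1.0) -> list[float]:
    """'Boring' picture: AI assists R&D in proportion to its capability,
    so progress accelerates smoothly and the pre-takeoff trend is
    informative about takeoff speed."""
    cap, out = 1.0, []
    for _ in range(years):
        cap += human_rate + 0.1 * cap  # human effort plus proportional AI help
        out.append(cap)
    return out

def threshold_progress(years: int, human_rate: float = 1.0,
                       critical: float = 10.0) -> list[float]:
    """'Threshold' picture: AI contributes little until capability crosses
    a critical level, after which self-improvement dominates and the
    pre-takeoff trend tells you almost nothing."""
    cap, out = 1.0, []
    for _ in range(years):
        if cap < critical:
            cap += human_rate  # pre-threshold: human-driven progress only
        else:
            cap *= 2.0         # post-threshold: self-improvement dominates
        out.append(cap)
    return out
```

On the first model, extrapolating the visible trend forecasts takeoff; on the second, the trend breaks discontinuously at `critical`. That difference in what pre-takeoff data tells you is the disagreement in miniature.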

You frame this as an update about "fumbling your way to an AGI that can do crazy stuff before it does much self-improvement," but that feels to me like it's not engaging with the basic argument at issue here: why would you think the AI is likely to be so good at "making further AI progress" relative to human researchers and engineers? Why should we be at all surprised by what we've seen over recent years, where software-engineering AI seems like it behaves similarly to AI in other domains (and looks poised to cross human level around broadly the same time rather than much earlier)? Why should this require fumbling rather than being the default state of affairs (as presumably imagined by someone more skeptical of "recursive self-improvement")?

My impression of the MIRI view here mostly comes from older writing by Eliezer, where he often talks about how an AI would be much better at programming because humans lack what you might call a "codic cortex" and so are very bad programmers relative to their overall level of intelligence. This view does not seem to me like it matches the modern world very well---actual AI systems that write code (and which appear on track to accelerate R&D) are learning to program using similar styles and tools to humans, rather than any kind of new perceptual modality.

(As an aside, in most other ways I like the intelligence explosion microeconomics writeup. It just seems like there's some essential perspective that isn't really argued for but suffuses the document, most clear in its language of "spark a FOOM" and criticality thresholds and so on.)

Also: It’s important to ask proponents of a theory what they predict will happen, before crowing about how their theory made a misprediction. You're always welcome to ask for my predictions in advance.

I'd be interested to get predictions from you and Eliezer about what you think is going to happen in relevant domains over the next 5 years. If we aren't able to get those predictions, then it seems reasonable to just do an update based on what we would have predicted if we took your view more seriously (since that's pretty relevant if we are now deciding whether to take your views seriously).

If you wanted to state any relevant predictions I'd be happy to comment on those. But I understand how it's annoying to leave the ball in your court, so here are some topics where I'm happy to give quantitative predictions if you or Eliezer have a conflicting intuition:

  • I expect successful AI-automating-AI to look more like AI systems doing programming or ML research, or other tasks that humans do. I think they are likely to do this in a relatively "dumb" way (by trying lots of things, taking small steps, etc.) compared to humans, but that the activity will look basically similar and will lean heavily on oversight and imitation of humans rather than being learned de novo (performing large searches is the main way in which it will look particularly unhuman, but probably the individual steps will still look like human intuitive guesses rather than something alien). Concretely, we could measure this by either performance on benchmarks or economic value, and we could distinguish the kinds of systems I imagine from the kind you imagine by e.g. you telling a story about fast takeoff and then talking about some systems similar to those involved in your takeoff story.
  • I expect that the usefulness and impressiveness of AI systems will generally improve continuously. I expect that in typical economically important cases we will have a bunch of people working on relevant problems, and so will have trend lines to extrapolate, and that those will be relatively smooth rather than exhibiting odd behavior near criticality thresholds.
  • At the point when the availability of AI is doubling the pace of AI R&D, I expect that technically similar AI systems will be producing at least hundreds of billions of dollars a year of value in other domains, and my median is more like $1T/year. I expect that we can continue to meaningfully measure things like "the pace of AI R&D" by looking at how quickly AI systems improve at standard benchmarks.
  • I expect the most powerful AI systems (e.g. those responsible for impressive demonstrations of AI-accelerated R&D progress) will be built in large labs, with compute budgets at least in the hundreds of millions of dollars and most likely larger. There may be important innovations about how to apply very large models, but these innovations will have quantitatively modest effects (e.g. reducing the compute required for an impressive demonstration by 2x or maybe 10x rather than 100x) and so a significant fraction of the total value added / profit will flow to firms that train large models or who build large computing clusters to run them.
  • I expect AI to look qualitatively like (i) "stack more layers," (ii) loss functions and datasets that capture cognitive abilities we are interested in with less noise, (iii) architecture and optimization improvements that yield continuous progress in performance, (iv) cleverer ways to form large teams of trained models that result in continuous progress. This isn't a very confident prediction but it feels like I've got to have higher probability on it than you all, perhaps I'd give 50% that in retrospect someone I think is reasonable would say "yup definitely a significant majority of the progress was in categories (i)-(iv) in the sense that I understood them when that comment was written in 2022."

It may be that we agree about all of these predictions. I think that's fine, and the main upshot is that you shouldn't cite anything over the next 5 years as evidence for your views relative to mine. Or it may be that we disagree but it's not worth your time to really engage here, which I also think is quite reasonable given how much stuff there is to do (although I hope then you will have more sympathy for people who misunderstood your position in the future).

Perhaps more importantly, if you didn't disagree with me about any 5 year predictions then I feel like there's something about your position I don't yet understand or think is an error:

  • Why isn't aligning future AI systems similar to aligning existing AI systems? It feels to me like it should be about the (i) aligning the systems doing the R&D, (ii) aligning the kinds of systems they are building. Is that wrong? Or perhaps: why do you think they will be building such different systems from "stack more layers"? (I certainly agree they will be eventually, but the question seems to just be whether there is a significant probability of doing stack more layers or something similar for a significant subjective time.)
  • Why does continuous improvement in the pace of R&D, driven by AI systems that are contributing to the same R&D process as humans, lead to a high probability of incredibly fast takeoff? It seems to me like there is a natural way to get estimated takeoff speeds from growth models + trend extrapolation, which puts a reasonable probability on "fast takeoff" according to the "1 year doubling before 4 year doubling" view (and therefore I'm very sympathetic to people disagreeing with that view on those grounds) but puts a very low probability on takeoff over weeks or by a small team.
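The "growth models + trend extrapolation" move can be sketched very simply (made-up numbers; the real versions are much more careful): posit that each successive doubling of output takes some fraction of the previous doubling time, and ask whether the trajectory passes through a 4-year-doubling phase before reaching a 1-year-doubling one.

```python
def doubling_times(initial: float, shrink: float, steps: int) -> list[float]:
    """Each successive doubling takes `shrink` times as long as the last.
    shrink < 1 gives smoothly accelerating (roughly hyperbolic) growth."""
    dt, out = initial, []
    for _ in range(steps):
        out.append(dt)
        dt *= shrink
    return out

times = doubling_times(16.0, 0.7, 10)
# With smooth acceleration, doubling times in the 1-4 year range show up
# before the first sub-1-year doubling: "slow takeoff" on the stated criterion.
print([round(t, 2) for t in times])
```

On this kind of extrapolation, a weeks-long takeoff requires the shrink factor to collapse suddenly, which is exactly the discontinuity the "boring" view assigns low probability.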

it seems like you really think that what I think of as the "normal, boring" world just isn't going to happen.

I agree. I don't think that RSI is a crux for me on that front, FYI.

It sounds from skimming your comment (I'm travelling at the moment, so I won't reply in much depth, sorry) like there is in fact a misunderstanding in here somewhere. Like:

If you had a 50% chance on "something like boring business as usual with SGD driving crucial performance improvements at the crucial time" then your dismissal of prosaic AI alignment seems strange to me.

I do not have that view, and my alternative view is not particularly founded on RSI.

Trotting out some good old fashioned evolutionary analogies, my models say that something boring with natural selection pushed humans past thresholds that allowed some other process (that was neither natural selection nor RSI) to drive a bunch of performance improvements, and I expect that shocks like that can happen again.

RSI increases the upside from such a shock. But also RSI is easier to get started in a clean mind than in a huge opaque model, so \shrug maybe it won't be relevant until after the acute risk period ends.

That sure sounds like he's on board with the part of RSI that is obvious, and what he's saying is precisely that other crazy stuff will happen first, essentially that we will use computers to replace the hardware of brains before we replace the software.

Which crazy stuff happens first seems pretty important to me, in adjudicating between hypotheses. So far, the type of crazy that we've been seeing undermines my understanding of Robin's hypotheses. I'm open to the argument that I simply don't understand what his hypotheses predict.

As far as I can make out from Eliezer and your comments, you think that instead the action is crossing a criticality threshold of "k>1,"

Speaking for myself, it looks like the action is in crossing the minimum of [some threshold humans crossed and chimps didn't] and [the threshold for recursive self-improvement of the relevant mind] (and perhaps-more-realistically [the other thresholds we cannot foresee], given that this looks like thresholdy terrain), where the RSI threshold might in principle be the lowest one on a particularly clean mind design, but it's not looking like we're angling towards particularly clean minds.

(Also, to be clear, my median guess is that some self-modification probably does wind up being part of the mix. But, like, if we suppose it doesn't, or that it's not playing a key role, then I'm like "huh, I guess the mind was janky enough that the returns on that weren't worth the costs \shrug".)

My guess is that past-Eliezer and/or past-I were conflating RSI thresholds with other critical thresholds (perhaps by not super explicitly tracking the difference) in a way that bred this particular confusion. Oops, sorry.

I'd be interested to get predictions from you and Eliezer about what you think is going to happen in relevant domains over the next 5 years.

For what it's worth, the sort of predictions I was reverse-soliciting were predictions of the form "we just trained the system X on task Y which looks alignment-related to us, and are happy to share details of the setup, how do you think it performed?". I find it much easier to generate predictions of that form, than to generate open-ended predictions about what the field will be able to pull off in the near-term (where my models aren't particularly sharply concentrated (which means that anyone who wants to sharply concentrate probability has an opportunity to take Bayes points off of me! (though ofc I'd appreciate the option to say either "oh, well sure, that's obvious" or "that's not obvious to me!" in advance of hearing the results, if you think that someone's narrow prediction is particularly novel with respect to me))).

I don't know why the domain looks thresholdy to you. Do you think some existing phenomena in ML look thresholdy in practice? Do you see a general argument for thresholds even if the k>1 criticality threshold argument doesn't pan out? Is the whole thing coming down to generalization from chimps -> humans?

Some central reasons the terrain looks thresholdy to me:

  • Science often comes with "click" moments, where many things slide into place and start making sense.
     
  • As we enter the 'AI can do true science' regime, it becomes important that AI can unlock new technologies (both cognitive/AI technologies, and other impactful technologies), new scientific disciplines and subdisciplines, new methodologies and ways of doing intellectual inquiry, etc.

    'The ability to invent new technologies' and 'the ability to launch into new scientific fields/subfields', including ones that may not even be on our radar today (whether or not they're 'hard' in an absolute sense — sometimes AI will just think differently from us), is inherently thresholdy, because 'starting or creating an entirely new thing' is a 0-to-1 change, more so than 'incrementally improving on existing technologies and subdisciplines' tends to be.

    • Many of these can also use one discovery/innovation to reach other discoveries/innovations, increasing the thresholdiness. (An obvious example of this is RSI, but AI can also just unlock a scientific subdiscipline that chains into a bunch of new discoveries, leads to more new subdisciplines, etc.)
       
  • Empirically, humans did not need to evolve separate specialized-to-the-field modules in order to be able to do biotechnology as well as astrophysics as well as materials science as well as economics as well as topology. Some combination of 'human-specific machinery' and 'machinery that precedes humans' sufficed to do all the sciences (that we know of), even though those fields didn't exist in the environment our brain was being built in. Thus, general intelligence is a thing; you can figure out how to do AI in such a way that once you can do one science, you have the machinery in hand to do all the other sciences.
     
  • Empirically, all of these fields sprang into existence almost simultaneously for humans, within the space of a few decades or centuries. So in addition to the general points above about "clicks are a thing" and "starting new fields and inventing new technologies is threshold-y", it's also the case that AGI is likely to unlock all of the sciences simultaneously in much the same way humans did.

    That one big "click" moment, that unlocks all the other click moments and new sciences/technologies and sciences-and-technologies-that-chain-off-of-those-sciences-and-technologies, implies that many different thresholds are likely to get reached at the same time.

    Which increases the probability that even if one specific threshold wouldn't have been crazily high-impact on its own, the aggregate effect of many of those thresholds at once does end up crazily high-impact.

you can figure out how to do AI in such a way that once you can do one science, you have the machinery in hand to do all the other sciences

And indeed, I would be extremely surprised if we find a way to do AI that only lets you build general-purpose par-human astrophysics AI, but doesn't also let you build general-purpose par-human biochemistry AI.

(There may be an AI technique like that in principle, but I expect it to be a very weird technique you'd have to steer toward on purpose; general techniques are a much easier way to build science AI. So I don't think that the first general-purpose astrophysics AI system we build will be like that, in the worlds where we build general-purpose astrophysics AI systems.)

Do you think that things won't look thresholdy even in a capability regime in which a large actor can work out how to melt all the GPUs?

Which crazy stuff happens first seems pretty important to me, in adjudicating between hypotheses. So far, the type of crazy that we've been seeing undermines my understanding of Robin's hypotheses. I'm open to the argument that I simply don't understand what his hypotheses predict.

FWIW, I think everyone agrees strongly with "which crazy stuff happens first seems pretty important". Paul was saying that Robin never disagreed with eventual RSI, but just argued that other crazy stuff would happen first. So Robin shouldn't be criticized on the grounds of disagreeing about the importance of RSI, unless you want to claim that RSI is the first crazy thing that happens (which you don't seem to believe particularly strongly). But it's totally fair game to e.g. criticize the prediction that ems will happen before de-novo AI (if you think that now looks very unlikely).

Relatedly, Eliezer saying "Robin was wrong for doubting RSI; if other crazy stuff will happen before RSI then he's just even more wrong" seems wrong.

Eliezer's argument for localized Foom (and for localized RSI in particular) wasn't 'no cool tech will happen prior to AGI; therefore AGI will produce a localized Foom'. If it were, then it would indeed be bizarre to cite an example of pre-AGI cool tech (AlphaGo Zero) and say 'aha, evidence for localized Foom'.

Rather, Eliezer's argument for localized Foom and localized RSI was:

  1. It's not hard to improve on human brains.
  2. You can improve on human brains with relatively simple algorithms; you don't need a huge library of crucial proprietary components that are scattered all over the economy and need to be carefully accumulated and assembled.
  3. The important dimensions for improvement aren't just 'how fast or high-fidelity is the system's process of learning human culture?'.
  4. General intelligence isn't just a bunch of heterogeneous domain-specific narrow modules glued together.
  5. Insofar as general intelligence decomposes into parts/modules, these modules work a lot better as one brain than as separate heterogeneous AIs scattered around the world. (See Permitted Possibilities, & Locality.)

I.e.:

  1. Localized Foom isn't blocked by humans being near a cognitive ceiling in general.
  2. Localized Foom isn't blocked by "there's no algorithmic progress on AI" or "there's no simple, generally applicable algorithmic progress on AI".
  3. Localized Foom isn't blocked by "humans are only amazing because we can accumulate culture; and humans already cross that threshold, so it won't be that big of a deal if something else crosses the exact same threshold; and since AI will be dependent on painstakingly accumulated human culture in the same way we are, it won't be able to suddenly pull ahead".
  4. Localized Foom isn't blocked by "getting an AI that's par-human at one narrow domain or task won't mean you have an AI that's par-human at anything else".
  5. Localized Foom isn't blocked by "there's no special advantage to doing the cognition inside a brain, vs. doing it in distributed fashion across many different AIs in the world that work very differently".

AlphaGo and its successors were indeed evidence for these claims, to the extent you can get evidence for them by looking at performance on board games.

Insofar as Robin thinks ems come before AI, impressive AI progress is also evidence for Eliezer's view over Robin's; but this wasn't the focus of the Foom debate or of Eliezer's follow-up. This would be much more of a crux if Robin endorsed 'AGI quickly gets you localized Foom, but AGI doesn't happen until after ems'; but I don't think he endorses a story like that. (Though he does endorse 'AGI doesn't happen until after ems', to the extent 'AGI' makes sense as a category in Robin's ontology.)

AlphaGo and its successors are also evidence that progress often surprises people and comes in spurts: there weren't a ton of people loudly saying 'if a major AGI group tries hard in the next 1-4 years, we'll immediately blast past the human range of Go ability even though AI has currently never beaten a Go professional' one, two, or four years before AlphaGo. But this is more directly relevant to the Paul-Eliezer disagreement than the Robin-Eliezer one, and it's weaker evidence insofar as Go isn't economically important.

Quoting Eliezer's AlphaGo Zero and the Foom Debate:

[...] When I remarked upon how it sure looked to me like humans had an architectural improvement over chimpanzees that counted for a lot, Hanson replied that this seemed to him like a one-time gain from allowing the cultural accumulation of knowledge.

I emphasize how all the mighty human edifice of Go knowledge, the joseki and tactics developed over centuries of play, the experts teaching children from an early age, was entirely discarded by AlphaGo Zero with a subsequent performance improvement. These mighty edifices of human knowledge, as I understand the Hansonian thesis, are supposed to be the bulwark against rapid gains in AI capability across multiple domains at once. I said, “Human intelligence is crap and our accumulated skills are crap,” and this appears to have been borne out.

Similarly, single research labs like DeepMind are not supposed to pull far ahead of the general ecology, because adapting AI to any particular domain is supposed to require lots of components developed all over the place by a market ecology that makes those components available to other companies. AlphaGo Zero is much simpler than that. To the extent that nobody else can run out and build AlphaGo Zero, it’s either because Google has Tensor Processing Units that aren’t generally available, or because DeepMind has a silo of expertise for being able to actually make use of existing ideas like ResNets, or both.

Sheer speed of capability gain should also be highlighted here. Most of my argument for FOOM in the Yudkowsky-Hanson debate was about self-improvement and what happens when an optimization loop is folded in on itself. Though it wasn’t necessary to my argument, the fact that Go play went from “nobody has come close to winning against a professional” to “so strongly superhuman they’re not really bothering any more” over two years just because that’s what happens when you improve and simplify the architecture, says you don’t even need self-improvement to get things that look like FOOM.

Yes, Go is a closed system allowing for self-play. It still took humans centuries to learn how to play it. Perhaps the new Hansonian bulwark against rapid capability gain can be that the environment has lots of empirical bits that are supposed to be very hard to learn, even in the limit of AI thoughts fast enough to blow past centuries of human-style learning in 3 days; and that humans have learned these vital bits over centuries of cultural accumulation of knowledge, even though we know that humans take centuries to do 3 days of AI learning when humans have all the empirical bits they need; and that AIs cannot absorb this knowledge very quickly using “architecture”, even though humans learn it from each other using architecture. If so, then let’s write down this new world-wrecking assumption (that is, the world ends if the assumption is false) and be on the lookout for further evidence that this assumption might perhaps be wrong.

AlphaGo clearly isn’t a general AI. There’s obviously stuff humans do that make us much more general than AlphaGo, and AlphaGo obviously doesn’t do that. However, if even with the human special sauce we’re to expect AGI capabilities to be slow, domain-specific, and requiring feed-in from a big market ecology, then the situation we see without human-equivalent generality special sauce should not look like this.

To put it another way, I put a lot of emphasis in my debate on recursive self-improvement and the remarkable jump in generality across the change from primate intelligence to human intelligence. It doesn’t mean we can’t get info about speed of capability gains without self-improvement. It doesn’t mean we can’t get info about the importance and generality of algorithms without the general intelligence trick. The debate can start to settle for fast capability gains before we even get to what I saw as the good parts; I wouldn’t have predicted AlphaGo and lost money betting against the speed of its capability gains, because reality held a more extreme position than I did on the Yudkowsky-Hanson spectrum.

I think it's good to go back to this specific quote and think about how it compares to AGI progress.

A difference I think Paul has mentioned before is that Go was not a competitive industry and competitive industries will have smaller capability jumps. Assuming this is true, I also wonder whether the secret sauce for AGI will be within the main competitive target of the AGI industry.

The thing the industry is calling AGI and targeting may end up being a specific style of shallow deployable intelligence when "real" AGI is a different style of "deeper" intelligence (with, say, less economic value at partial stages and therefore relatively unpursued). This would allow a huge jump like AlphaGo in AGI even in a competitive industry targeting AGI.

Both possibilities seem plausible to me and I'd like to hear arguments either way.

I expect AI to look qualitatively like (i) "stack more layers,"... The improvements AI systems make to AI systems are more like normal AI R&D ... There may be important innovations about how to apply very large models, but these innovations will have quantitatively modest effects (e.g. reducing the compute required for an impressive demonstration by 2x or maybe 10x rather than 100x) 

Your view seems to implicitly assume that an AI with an understanding of NN research at the level necessary to contribute SotA results will not be able to leverage its similar level of understanding of neuroscience, GPU hardware/compilers, architecture search, and NN theory. If we instead assume the AI can bring together these domains, it seems to me that AI-driven research will look very different from business as usual. Instead we should expect advances like heavily optimized, partially binarized, spiking neural networks -- all developed in one paper/library. In this scenario, it seems natural to assume something more like 100x efficiency progress.

Take-off debates seem to focus on whether we should expect AI to suddenly acquire far super-human capabilities in a specific domain i.e. locally. However this assumption seems unnecessary, instead fast takeoff may only require bringing together expert domain knowledge across multiple domains in a weakly super-human way. I see two possible cruxes here: (1) Will AI be able to globally interpolate across research fields? (2) Given the ability to globally interpolate, will fast take-off occur?

As weak empirical evidence in favor of (1), I see DALL-E 2's ability to generate coherent images from a composition of two concepts as independent of the concept-distance (/co-occurrence frequency) of those concepts. E.g., “Ukiyo-e painting of a cat hacker wearing VR headsets" is no harder for DALL-E 2 than “Ukiyo-e painting of a cat wearing a kimono". Granted, this is an anecdotal impression, but over a sample size of N≈50 prompts.

Metaculus Questions

There are a few relevant Metaculus questions to consider. The first two don't distinguish fast/radical AI-driven research progress from mundane AI-driven research progress. Nevertheless, I would be interested to see both sides' predictions.

Date AIs Capable of Developing AI Software | Metaculus

Transformers to accelerate DL progress | Metaculus

Years Between GWP Growth >25% and AGI | Metaculus

I'm classifying "optimized, partially binarized, spiking neural networks" as architecture changes. I expect those to be gradually developed by humans and to represent modest and hard-won performance improvements. I expect them to eventually be developed faster by AI, but that before they are developed radically faster by AI they will be developed slightly faster. I don't think interdisciplinarity is a silver bullet for making faster progress on deep learning.

I don't think I understand the Metaculus questions precisely enough in order to predict on them; it seems like the action is in implicit quantitative distinctions:

  • In "Years Between GWP Growth >25% and AGI," the majority of the AGI definition is carried by a 2-hour adversarial Turing test. But the difficulty of this test depends enormously on the judges and on the comparison human. If you use the strongest possible definition of Turing test, then I'm expecting the answer to be negative (though the mean is still large and positive, because it is extraordinarily hard for it to go very negative). If you take the kind of Turing test I'd expect someone to use in an impressive demo, I expect it to be >5 years, and this is mostly just a referendum on timelines.
  • For "AI capable of developing AI software," it seems like all the action is in quantitative details of how good (/novel/etc.) the code is, I don't think that a literal meeting of the task definition would have a large impact on the world.
  • For "transformers to accelerate DL progress," I guess the standard is clear, but it seems like a weird operationalization---would the question already resolve positively if we were using LSTMs instead of transformers, because of papers like this one? If not, then it seems like the action comes down to unstated quantitative claims about how good the architectures are. I think that transformers will work better than RNNs for these applications, but that this won't have a large effect on the overall rate of progress in deep learning by 2025.

before they are developed radically faster by AI they will be developed slightly faster.

I see a couple reasons why this wouldn't be true: 

First, consider LLM progress: overall perplexity improves relatively smoothly, while particular capabilities emerge abruptly. As such, the ability to construct a coherent Arxiv paper interpolating between two papers from different disciplines seems likely to emerge abruptly. I.e., currently asking an LLM to do this would generate a paper with zero useful ideas, and we have no reason to expect that the first GPT-N able to do this will generate only half an idea, or one idea. It is just as likely to generate five+ very useful ideas.
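The "smooth loss, abrupt capability" pattern can be illustrated with a toy model (my own sketch, not something from the thread; `task_success` and the step counts are purely illustrative): if a task only succeeds when many sub-steps are all performed correctly, then a smoothly improving per-step accuracy produces a sharply thresholded end-to-end success rate.

```python
# Toy illustration: if a task requires getting n sub-steps right, and
# per-step accuracy p improves smoothly with scale, then the end-to-end
# success rate p**n stays near zero for a long time and then rises
# sharply -- smooth underlying progress, abrupt apparent capability.

def task_success(per_step_accuracy: float, n_steps: int) -> float:
    """End-to-end success probability when all n_steps must go right."""
    return per_step_accuracy ** n_steps

# e.g. suppose a coherent multi-idea paper requires ~50 correct steps
n = 50
for p in [0.90, 0.95, 0.99, 0.999]:
    print(f"per-step accuracy {p}: end-to-end success {task_success(p, n):.3f}")
```

On this toy model, going from 0.90 to 0.95 per-step accuracy barely registers end-to-end (≈0.005 → ≈0.077), while 0.99 to 0.999 takes the task from roughly-even odds to reliable (≈0.61 → ≈0.95): most of the visible capability gain happens in the last small increments of the smoothly improving metric.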

There are a couple of ways one might expect continuity via acceleration in AI-driven research in the run up to GPT-N (both of which I disagree with):

  • Quoc Le-style AI-based NAS is likely to have continued apace in the run up to GPT-N, but for this to provide continuity you have to claim that, in the year GPT-N starts moving AI research forwards, AI NAS has built up to just the right rate of progress needed to allow GPT-N to fit the trend.

  • Otherwise, there might be a sequence of research-relevant intermediate tasks which GPT-(N-i) will develop competency on -- thereby accelerating research. I don't see what those tasks would be[1].

I don't think interdisciplinarity is a silver bullet for making faster progress on deep learning.

Second, I agree that interdisciplinarity, when building upon a track record of within-discipline progress, would be continuous. However, we should expect Arxiv and/or Github-trained LLMs to skip the mono-disciplinary research acceleration phase. In effect, I expect there to be no time in between when we can get useful answers to "Modify transformer code so that gradients are more stable during training", and "Modify transformer code so that gradients are more stable during training, but change the transformer architecture to make use of spiking". 

If you disagree, how do you imagine continuous progress leading up to the above scenario? An important case is if Codex/Github Copilot improves continuously along the way, taking a larger and larger role in ML repo authorship. If we assume that AGI arrives without depending on LLMs achieving understanding of recent Arxiv papers, then I agree that this scenario is much more likely to feature continuity in AI-driven AI research. I'm highly uncertain about how this assumption will play out. Off the top of my head: 40% that codex-driven research reaches AGI before Arxiv understanding.

  1. ^

    Perhaps better and better versions of Ought's work. I doubt this work will scale to the levels of research utility relevant here.

Can someone clarify what "k>1" refers to in this context? Like, what does k denote?

This is an expression from Eliezer's Intelligence Explosion Microeconomics. In this context, we imagine an AI making some improvement to its own operation, and then k is the number of new improvements which it is able to find and implement. If k>1, then each improvement allows the AI to make more new improvements, and we imagine the quality of the system growing exponentially.

It's intended as a simplified model, but I think it simplifies too far to be meaningful in practice. Even very weak systems can be built with "k > 1," the interesting question will always be about timescales---how long does it take a system to make what kind of improvement?
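To make the timescale point concrete, here is a toy simulation of the criticality model (purely illustrative; the function and parameter names are mine, not from Intelligence Explosion Microeconomics). With k>1 the cumulative number of improvements grows without bound, while with k<1 the cascade converges to a finite total; but nothing in the model says how much wall-clock time each generation takes.

```python
# Toy model of the k>1 "criticality" argument: each improvement the
# system implements uncovers k further improvements on average. With
# k > 1 the pool of available improvements grows geometrically; with
# k < 1 the cascade fizzles out and the cumulative total converges.

def improvement_cascade(k: float, generations: int, initial: float = 1.0):
    """Return the cumulative improvement count after each generation."""
    available, total, history = initial, 0.0, []
    for _ in range(generations):
        total += available
        available *= k  # each improvement yields k successor improvements
        history.append(total)
    return history

supercritical = improvement_cascade(k=1.5, generations=10)
subcritical = improvement_cascade(k=0.5, generations=10)

print(supercritical[-1])  # ≈ 113.33, and still accelerating
print(subcritical[-1])    # ≈ 1.998, converging toward a limit of 2.0
```

Paul's point is that even the supercritical case is uninformative on its own: a system could have k > 1 while each "generation" of improvements takes years of ordinary R&D effort, so the interesting question is the timescale, not the sign of k - 1.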

we too predict that it's easy to get GPT-3 to tell you the answers that humans label "aligned" to simple word problems about what we think of as “ethical”, or whatever. That’s never where we thought the difficulty of the alignment problem was in the first place. Before saying that this shows that alignment is actually easy contra everything MIRI folk said, consider asking some MIRI folk for their predictions about what you’ll see

I partly agree with this. Like you, I'm often frustrated by people thinking this is where the core of the problem will be, or where the alignment community mistakenly thought the core of the problem was. And this is especially bad when people see systems that all of us would expect to work, and thereby get optimistic about exactly the same systems that we've always been scared of.

That said, I think this is an easy vibe to get from Eliezer's writing, and it also looks to me like he has some prediction misses here. Overall it feels to me like this should call for more epistemic humility regarding the repeated refrain of "that's doomed."

Here are some examples that don't seem to me to match recent results. I expect that Eliezer has a different interpretation of these, and I don't think someone should interpret a non-response by Eliezer as any kind of acknowledgment that my interpretation is correct.

  • Here: "Speaking of inexact imitation: It seems to me that having an AI output a high-fidelity imitation of human behavior, sufficiently high-fidelity to preserve properties like "being smart" and "being a good person" and "still being a good person under some odd strains like being assembled into an enormous Chinese Room Bureaucracy", is a pretty huge ask."

    I think that progress in language modeling makes this view look much worse than it did in 2018. It sure looks like we are going to have inexact imitations of humans that are able to do useful work, to continue to broadly agree with humans about what you "ought to do" in a way that is common-sensically smart (such that to the extent you get useful work from them it's still "good" in the same way as a human's behavior). It also looks like those properties are likely to be retained when a bunch of them are collaborating in a "Chinese Room Bureaucracy," though this is not clear. And it looks quite plausible that all of this is going to happen significantly before we have systems powerful enough that they are only imitating human behavior instrumentally (at which point X-and-only-X becomes a giant concern).

    The issue definitely isn't settled yet. But I think that most people outside of LW would look at the situation and say "huh, Eliezer seems like he was too confidently pessimistic about the ability of ML to usefully+safely imitate humans before it became a scary optimization daemon."
     
  • Here: "Similar remarks apply to interpreting and answering "What will be its effect on _?" It turns out that getting an AI to understand human language is a very hard problem, and it may very well be that even though talking doesn't feel like having a utility function, our brains are using consequential reasoning to do it. Certainly, when I write language, that feels like I'm being deliberate. It's also worth noting that "What is the effect on X?" really means "What are the effects I care about on X?" and that there's a large understanding-the-human's-utility-function problem here. In particular, you don't want your language for describing "effects" to partition, as the same state of described affairs, any two states which humans assign widely different utilities. Let's say there are two plans for getting my grandmother out of a burning house, one of which destroys her music collection, one of which leaves it intact. Does the AI know that music is valuable? If not, will it not describe music-destruction as an "effect" of a plan which offers to free up large amounts of computer storage by, as it turns out, overwriting everyone's music collection? If you then say that the AI should describe changes to files in general, well, should it also talk about changes to its own internal files? Every action comes with a huge number of consequences - if we hear about all of them (reality described on a level so granular that it automatically captures all utility shifts, as well as a huge number of other unimportant things) then we'll be there forever."

    This comment only makes sense if you think that an AI will have difficulty predicting "which consequences are important to a human" or explaining those consequences to a human with any AI system weak enough to be safe. Yet it seems like GPT-3 already has a strong enough understanding of what humans care about that it could be used for this purpose.

    Now I agree with you that there is a version of this concern that is quite serious and real. But this comment is very strongly framed about the capabilities of an AI weak enough to be safe, not about the difficulty of constructing a loss function that incentivizes that behavior, and some parts of it appear not to make sense when interpreted as being about a loss function (since you could use a weak AI like GPT-3 as a filter for which consequences we care about).
     
  • Same comment: "[it] sounds like you think an AI with an alien, superhuman planning algorithm can tell humans what to do without ever thinking consequentialistically about which different statements will result in human understanding or misunderstanding. Anna says that I need to work harder on not assuming other people are thinking silly things, but even so, when I look at this, it's hard not to imagine that you're modeling AIXI as a sort of spirit containing thoughts, whose thoughts could be exposed to the outside with a simple exposure-function. It's not unthinkable that a non-self-modifying superhuman planning Oracle could be developed with the further constraint that its thoughts are human-interpretable, or can be translated for human use without any algorithms that reason internally about what humans understand, but this would at the least be hard"

    This really makes it sound like "non-self-modifying" and "Oracle" are these big additional asks, and that it's hard to use language without reasoning internally about what humans will understand. That actually still seems pretty plausible, but I feel like you've got to admit that we're currently in a world where everyone is building non-self-modifying Oracles that can explain the consequences of their plans (and where the "superhuman planning algorithms" that we deal with are either simple planning algorithms written by humans, as in AlphaZero, or are mostly distilled from such planning algorithms in a way that would not make it harder to get them to express their views in natural language). And that's got to be at least evidence against the kind of view expressed in this comment, which is strongly suggesting that by the time we are building transformative AI it definitely won't look at all like that. The closer we now are to powerful AI, the stronger the evidence against becomes.

I think that progress in language modeling makes this view look much worse than it did in 2018.

It doesn't look much worse to me yet. (I'm not sure whether you know things I don't, or whether we're reading the situation differently. We could maybe try to bang out specific bets here at some point.)

Yet it seems like GPT-3 already has a strong enough understanding of what humans care about that it could be used for this purpose.

For the record, there's a breed of reasoning-about-the-consequences-humans-care-about that I think GPT-3 relevantly can't do (related to how GPT-3 is not in fact scary), and the shallower analog it can do does not seem to me to undermine what-seems-to-me-to-be-the-point in the quoted text.

I acknowledge this might be frustrating to people who think that these come on an obvious continuum that GPT-3 is obviously walking along. This looks to me like one of those "can you ask me in advance first" moments where I'm happy to tell you (in advance of seeing what GPT-N can do) what sorts of predicting-which-consequences-humans-care-about I would deem "shallow and not much evidence" vs "either evidence that this AI is scary or actively in violation of my model".

I feel like you've got to admit that we're currently in a world where everyone is building non-self-modifying Oracles that can explain the consequences of their plans

I don't in fact think that the current levels of "explaining the consequences of their plans" are either impressive in the relevant way, or going to generalize in the relevant way. I do predict that things are going to have to change before the end-game. In response to these observations, my own models are saying "sure, this is the sort of thing that can happen before the end (although obviously some stuff is going to have to change, and it's no coincidence that the current systems aren't themselves particularly scary)", because predicting the future is hard and my models don't concentrate probability mass all that tightly on the details. It's plausible to me that I'm supposed to be conceding a bunch of Bayes points to people who think this all falls on a continuum that we're clearly walking along, but I admit I have some sense that people just point to what actually happened in a shallower way and say "see, that's what my model predicted" rather than actually calling particulars in advance. (I can recall a specific case of Dario predicting some particulars in advance, and I concede Bayes points there. I also have the impression that you put more probability mass here than I did, although fewer specific examples spring to mind, and I concede somewhat fewer Bayes points to you.) I consider it to be some evidence, but not enough to shift me much. Reflecting on why, I think it's on account of how my models haven't taken hits that are bigger than they expected to take (on account of all the vagaries), and how I still don't know how to make sense of the rest of the world through my-understanding-of your (or Dario's) lens.

It doesn't look much worse to me yet. (I'm not sure whether you know things I don't, or whether we're reading the situation differently. We could maybe try to bang out specific bets here at some point.)

Which of "being smart," "being a good person," and "still being a good person in a Chinese bureaucracy" do you think is hard (prior to having AI smart enough to be dangerous)? Does that correspond to some prediction about the kind of imitation task that will prove difficult for AI?

For the record, there's a breed of reasoning-about-the-consequences-humans-care-about that I think GPT-3 relevantly can't do (related to how GPT-3 is not in fact scary), and the shallower analog it can do does not seem to me to undermine what-seems-to-me-to-be-the-point in the quoted text.

Eliezer gave an example about identifying which of two changes we care about ("destroying her music collection" and "changes to its own files.") That kind of example does not seem to involve deep reasoning about consequences-humans-care-about. Eliezer may be using this example in a more deeply allegorical way, but it seems like in this case the allegory has thrown out the important part of the example and I'm not even sure how to turn it into an example that he would stand behind.

I acknowledge this might be frustrating to people who think that these come on an obvious continuum that GPT-3 is obviously walking along. This looks to me like one of those "can you ask me in advance first" moments where I'm happy to tell you (in advance of seeing what GPT-N can do) what sorts of predicting-which-consequences-humans-care-about I would deem "shallow and not much evidence" vs "either evidence that this AI is scary or actively in violation of my model".

You and Eliezer often suggest that particular alignment strategies are doomed because they involve AI solving hard tasks that won't be doable until it's too late (as in the quoted comment by Eliezer). I think if you want people to engage with those objections seriously, you should probably say more about what kinds of tasks you have in mind.

My current sense is that nothing is in violation of your model until the end of days. In that case it's fair enough to say that we shouldn't update about your model based on evidence. But that also means I'm just not going to find the objection persuasive unless I see more of an argument, or else some way of grounding out the objection in intuitions that do make some different prediction about something we actually observe (either in the interim or historically).

I don't in fact think that the current levels of "explaining the consequences of their plans" are either impressive in the relevant way, or going to generalize in the relevant way.

I think language models can explain the consequences of their plans insofar as they understand those consequences at all. It seems reasonable for you to say "language models aren't like the kind of AI systems we are worried about," but I feel like in that case each unit of progress in language modeling needs to be evidence against your view. 

You are predicting that powerful AI will have property X (= can make plans with consequences that they can't explain). If existing AIs had property X, then that would be evidence for your view. If existing AIs mostly don't have property X, that must be evidence against your view. The only way it's a small amount of evidence is if you were quite confident that AIs wouldn't have property X.

You might say that AlphaZero can make plans with consequences it can't explain, and so that's a great example of an AI system with property X (so that language models are evidence against your position, but AlphaZero is evidence in favor). That would seem to correspond to the relatively concrete prediction that AlphaZero's inability to explain itself is fundamentally hard to overcome, and so it wouldn't be easy to train a system like AlphaZero that is able to explain the consequences of its actions.

Is that the kind of prediction you'd want to stand behind?

(still travelling; still not going to reply in a ton of depth; sorry. also, this is very off-the-cuff and unreflected-upon.)

Which of "being smart," "being a good person," and "still being a good person in a Chinese bureaucracy" do you think is hard (prior to having AI smart enough to be dangerous)?

For all that someone says "my image classifier is very good", I do not expect it to be able to correctly classify "a screenshot of the code for an FAI" as distinct from everything else. There are some cognitive tasks that look so involved as to require smart-enough-to-be-dangerous capabilities. Some such cognitive tasks can be recast as "being smart", just as they can be cast as "image classification". Those ones will be hard without scary capabilities. Solutions to easier cognitive problems (whether cast as "image classification" or "being smart" or whatever) by non-scary systems don't feel to me like they undermine this model.

"Being good" is one of those things where the fact that a non-scary AI checks a bunch of "it was being good" boxes before some consequent AI gets scary, does not give me much confidence that the consequent AI will also be good, much like how your chimps can check a bunch of "is having kids" boxes without ultimately being an IGF maximizer when they grow up.

My cached guess as to our disagreement vis-à-vis "being good in a Chinese bureaucracy" is whether or not some of the difficult cognitive challenges (such as understanding certain math problems well enough to have insights about them) decompose such that those cognitions can be split across a bunch of non-scary reasoners in a way that succeeds at the difficult cognition without the aggregate itself being scary. I continue to doubt that and don't feel like we've seen much evidence either way yet (but perhaps you know things I do not).

(from the OP:) Yet it seems like GPT-3 already has a strong enough understanding of what humans care about that it could be used for this purpose.

To be clear, I agree that GPT-3 already has strong enough understanding to solve the sorts of problems Eliezer was talking about in the "get my grandma out of the burning house" argument. I read (perhaps ahistorically) the grandma-house argument as being about how specifying precisely what you want is real hard. I agree that AIs will be able to learn a pretty good concept of what we want without a ton of trouble. (Probably not so well that we can just select one of their concepts and have it optimize for that, in the fantasy-world where we can leaf through its concepts and have it optimize for one of them, because of how the empirically-learned concepts are more likely to be like "what we think we want" than "what we would want if we were more who we wished to be" etc. etc.)

Separately, in other contexts where I talk about AI systems understanding the consequences of their actions being a bottleneck, it's understanding of consequences sufficient for things like fully-automated programming and engineering. Which look to me like they require a lot of understanding-of-consequences that GPT-3 does not yet possess. My "for the record" above was trying to make that clear, but wasn't making the above point where I think we agree clear; sorry about that.

Does that correspond to some prediction about the kind of imitation task that will prove difficult for AI?

It would take a bunch of banging, but there's probably some sort of "the human engineer can stare at the engineering puzzle and tell you the solution (by using thinking-about-consequences in the manner that seems to me to be tricky)" that I doubt an AI can replicate before being pretty close to being a good engineer. Or similar with, like, looking at a large amount of buggy code (where fixing the bug requires understanding some subtle behavior of the whole system) and then telling you the fix; I doubt an AI can do that before it's close to being able to do the "core" cognitive work of computer programming.

It seems reasonable for you to say "language models aren't like the kind of AI systems we are worried about," but I feel like in that case each unit of progress in language modeling needs to be evidence against your view.

Maybe somewhat? My models are mostly like "I'm not sure how far language models can get, but I don't think they can get to full-auto programming or engineering", and when someone is like "well they got a little farther (although not as far as you say they can't)!", it does not feel to me like a big hit. My guess is it feels to you like it should be a bigger hit, because you're modelling the skills that Copilot currently exhibits as being more on-a-continuum with the skills I don't expect language models can pull off, and so any march along the continuum looks to you like it must be making me sweat?

If things like Copilot smoothly increase in "programming capability" to the point that they can do fully-automated programming of complex projects like Twitter, then I'd be surprised.

I still lose a few Bayes points each day to your models, which more narrowly predict that we'll take each next small step, whereas my models are more uncertain and say "for all I know, today is the day that language models hit their wall". I don't see the ratios as very large, though.

or else some way of grounding out the objection in intuitions that do make some different prediction about something we actually observe (either in the interim or historically).

A man can dream. We may yet be able to find one, though historically when we've tried it looks to me like we are mostly reading the same history in different ways, which makes things tricky.

My specific prediction: “chain of thought” style approaches scale to (at least) human level AGI. The most common way in which these systems will be able to self-modify is by deliberately choosing their own finetuning data. They’ll also be able to train new and bigger models with different architectures, but the primary driver of capabilities increases will be increasing the compute used for such models, not new insights from the AGIs.

I would love for you two to bet, not necessarily because of epistemic hygiene, but because I don't know who to believe here and I think betting would enumerate some actual predictions about AGI development that might clarify for me how exactly you two disagree in practice.

It sure looks like we are going to have inexact imitations of humans that are able to do useful work, to continue to broadly agree with humans about what you "ought to do" in a way that is common-sensically smart (such that to the extent you get useful work from them it's still "good" in the same way as a human's behavior). It also looks like those properties are likely to be retained when a bunch of them are collaborating in a "Chinese Room Bureaucracy," though this is not clear.

I want to note there's a pretty big difference between "what you say you ought to do" and "what you do"; I basically expect language models to imitate humans as well as possible, which will include lots of homo hypocritus things like saying it's wrong to lie and also lying, and to the extent that it tries to capture "all things humans might say" it will be representing all sides of all cultural / moral battles, which seems like it misses on a bunch of consistency and coherency properties that humans say they ought to have.

I feel like you've got to admit that we're currently in a world where everyone is building non-self-modifying Oracles that can explain the consequences of their plans

This feels like the scale/regime complaint to me? Yes, people have built a robot that can describe the intended consequences of moving its hand around an enclosure ("I will put the red block on top of the blue cylinder"), or explain the steps of solving simple problems ("Answer this multiple choice reading comprehension question, and explain your answer"), but once we get to the point where you need nontrivial filtering ("Tell us what tax policy we should implement, and explain the costs and benefits of its various features") then it seems like the sort of thing where most of the thoughts would be opaque or not easily captured in sentences.

No good deed goes unpunished. On the one hand you have those angrily complaining that MIRI focuses too much on trying to convince "unimportant" people of basics, such as instrumental convergence. On the other hand you have assholes like me yelling loudly in public that MIRI has not spent nearly enough time spreading the truth and getting more people to agree with them about the underlying problem. All the while I've been accomplishing 3-4 OOMs less than the average MIRI employee on either axis, and I'm forced to deal with ~0 mad people.

FWIW, I think (a somewhat different form of) recursive self improvement is still incredibly important to consider in alignment. E.g., an RL system’s actions can influence the future situations it encounters, and the future situations it encounters can influence which cognitive patterns the outer optimizer reinforces. Thus, there’s a causal pathway by which the AI’s actions can influence its own cognitive patterns. This holds for any case where the system can influence its own future inputs.
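As a toy illustration of that causal pathway (entirely hypothetical numbers and update rule, not a model of any real training setup): the agent's own choice of action determines which experience is generated, and the outer update rule reinforces whichever pattern produced that experience, closing the loop.

```python
import random

# Hypothetical toy: two "cognitive patterns" (actions 0 and 1) whose weights
# are shaped by an outer update rule. The agent's action determines which
# experience exists to be reinforced, so the pattern the agent already favors
# gets amplified -- the self-influence loop described above.
random.seed(0)
weights = {0: 1.0, 1: 1.0}

for step in range(1000):
    total = weights[0] + weights[1]
    action = 0 if random.random() < weights[0] / total else 1
    # The situation (and reward) the agent encounters depends on its action...
    reward = 1.0 if action == 1 else 0.5
    # ...and the outer optimizer reinforces whichever pattern acted.
    weights[action] += 0.01 * reward

# The loop is self-amplifying: the higher-reward pattern ends up dominating.
print(weights[1] > weights[0])  # → True
```

The point of the sketch is only the structure: the system's outputs feed back into what the outer optimizer reinforces, without any explicit "self-modification" step.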

It seems clear to me that any alignment solution must be robust to at least some degree of self-modification on the part of the AI.

I totally agree. I quite like "Mundane solutions to exotic problems", a post by Paul Christiano, about how he thinks about this from a prosaic alignment perspective.

So why was MIRI's technical research so focused on self-reference, tiling, etc.?

I think the basic argument was: Self-reference, tiling, etc. are core parts of cognition/optimization/reasoning. Current theory fails to account for those things. If humanity generally gets less confused in its AI concepts/frameworks, and gets more knowledge about core parts of cognition/optimization/reasoning, then there's more hope that humanity will be able to (a) deliberately steer toward relatively transparent and "aimable" approaches to AGI, and (b) understand the first AGI systems well enough to align them in practice.

Understanding your system obviously helps with aligning an AGI to the task "recursively self-improve to the point where you can do CEV" and the task "do CEV", but it also helps with aligning an AGI to the task "do a pivotal act" / "place, onto this particular plate here, two strawberries identical down to the cellular but not molecular level".

Mod note: Since the two-axis voting has I think been pretty good on other recent threads around AI Alignment, I activated it here too.

A conversation I've had several times in recent weeks is with people who argue that we can create human-level intelligences (who are safe, because they're only human-level) and somehow use them to achieve alignment or a pivotal act, or something like "just stop there".

And I think recursive self-improvement is one answer to why human-level AIs are not safe. Actual humans attempt to improve their intelligence already (e.g. nootropics); it's just hard with our architecture. I expect unaligned human-level AIs to try the same thing and have much more success because optimizing code and silicon hardware is easier than optimizing flesh brains.
 

I expect unaligned human-level AIs to try the same thing and have much more success because optimizing code and silicon hardware is easier than optimizing flesh brains.

I agree that human-level AIs will definitely try the same thing, but it's not obvious to me that it will actually be much easier for them. Current machine learning techniques produce models that are hard to optimize for basically the same reasons that brains are; AIs will be easier to optimize for various reasons, but I don't think it will be nearly as extreme as this sentence makes it sound.

I naively expect the option of "take whatever model constitutes your mind and run it on faster hardware and/or duplicate it" should be relatively easy and likely to lead to fairly extreme gains.

I agree we can duplicate models once we’ve trained them, this seems like the strongest argument here.

What do you mean by “run on faster hardware”? Faster than what?

Faster than biological brains, by 6 orders of magnitude.

Ruby isn't saying that computers have faster clock speeds than biological brains (which is definitely true), he's claiming something like "after we have human-level AI, AIs will be able to get rapidly more powerful by running on faster hardware"; the speed increase is relative to some other computers, so the speed difference between brains and computers isn't relevant.

Also, running faster and duplicating yourself keeps the model human-level in an important sense. A lot of threat models run through the model doing things that humans can’t understand even given a lot of time, and so those threat models require something stronger than just this.

I think clever duplication of human intelligence is plenty sufficient for general superhuman capacity in the important sense (wherein I mean something like 'it has capacities such that it would be extinction-causing if (it believes) minimizing its loss function is achieved by turning off humanity (which could turn it off / start other (proto-)AGIs)').

For one, I don't think humanity is that robust in the status quo; and two, a team of internally aligned (because they're copies) human-level intelligences capable of graduate-level biology seems plenty existentially scary.

Other issues with "just stop at human-level" include:

  • We don't actually know how to usefully measure or upper-bound the capability of an AGI. Relying on past trends in 'how much performance tends to scale with compute' seems extremely unreliable and dangerous to me when you first hit AGI. And it becomes completely unreliable once the system is potentially modeling its operators and adjusting its visible performance in attempts to influence operators' beliefs.
  • AI will never have the exact same skills as humans. At some levels, AI might be subhuman in many ways, superhuman in many others, and roughly par-human in still others. Safety in that case will depend on the specific skills the AI does or doesn't have.
  • Usefulness/relevance will also depend on the specific skills the AI has. Some "human-level AIs" may be useless for pivotal acts, even if you know how to perfectly align them.

I endorse "don't crank your first AGI systems up to maximum" -- cranking up to maximum seems obviously suicidal to me. Limiting capabilities is absolutely essential.

But I don't think this solves the problem on its own, and I think achieving this will be more complicated and precarious than the phrasing "human-level AI" might suggest.

I expect unaligned human-level AIs to try the same thing and have much more success because optimizing code and silicon hardware is easier than optimizing flesh brains.

Seems to me that optimizing flesh brains is easier than optimizing code and silicon hardware. It's so easy, evolution can do it despite being very dumb.

Roughly speaking, the part that makes it easy is that the effects of flesh brains are additive with respect to the variables one might modify (standing genetic variation), whereas the effects of hardware and software are very nonlinear with respect to the variables one might modify (circuit connectivity(?) and code characters).

We haven't made much progress on optimizing humans, but that's less because optimizing humans is hard and more because humans prefer using the resources that could've been used for optimizing humans for self-preservation instead.

Why the disagree vote?

For example, if a human says "I'd like to make a similar brain as mine, but with 80% more neurons per cortical minicolumn", there's no way to actually do that, at least not without spending decades or centuries on basic bio-engineering research.

By contrast, if an ANN-based AGI says "I'd like to make a similar ANN as mine, but with 80% more neurons per layer", they can actually do that experiment immediately.

First, some types of software can be largely additive with respect to their variables (e.g. neural nets; that's basically why SGD works). Second, software has lots of other huge advantages, like rapid iteration times, copyability, and inspectability of intermediate states.
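To illustrate the contrast with a toy sketch (not a claim about any particular system): small perturbations to a numeric model's parameters move its output smoothly, while mutating a single character of program source text usually breaks the program outright.

```python
import random

# Smooth knob (hypothetical toy model): output moves continuously with weights.
def model(w, x):
    return w[0] * x + w[1]

w = [0.5, -0.3]
delta = abs(model([w[0] + 0.01, w[1]], 2.0) - model(w, 2.0))
print(round(delta, 6))  # → 0.02: small parameter change, small output change

# Brittle knob: mutate one character of a program's source text and recompile.
src = "def f(x):\n    return 3 * x + 1\n"
random.seed(1)
trials, still_compiles = 200, 0
for _ in range(trials):
    i = random.randrange(len(src))
    mutated = src[:i] + chr(random.randrange(32, 127)) + src[i + 1:]
    try:
        compile(mutated, "<mutant>", "exec")
        still_compiles += 1
    except SyntaxError:
        pass

# Most single-character edits don't even compile, let alone behave similarly.
print(still_compiles / trials)
```

This is only about the local geometry of the two search spaces; it says nothing about which kind of system is easier to improve overall.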

Hm, maybe there are two reasons why human-level AIs are safe:

1. A bunch of our alignment techniques work better when the overseer can understand what the AIs are doing (given enough time). This means that human-level AIs are actually aligned.
2. Even if the human-level AIs misbehave, they're just human-level, so they can't take over the world.

Under model (1), it's totally ok that self-improvement is an option, because we'll be able to train our AIs to not do that.

Under model (2), there are definitely some concerning scenarios here where the AIs e.g. escape onto the internet, then use their code to get resources, duplicate themselves a bunch of times, and set-up a competing AI development project. Which might have an advantage if it can care less about paying alignment taxes, in some ways.

Which might have an advantage if it can care less about paying alignment taxes, in some ways.

I unconfidently suspect that human-level AIs won't have a much easier time with the alignment problem than we expect to have.

Agree it's not clear. Some reasons why they might:

  • If training environments' inductive biases point firmly towards some specific (non-human) values, then maybe the misaligned AIs can just train bigger and better AI systems using similar environments that they were trained in, and hope that those AIs will end up with similar values.
    • Maybe values can differ a bit, and cosmopolitanism or decision theory can carry the rest of the way. Just like Paul says he'd be pretty happy with intelligent life that came from a similar distribution that our civ came from.
  • Humans might need to use a bunch of human labor to oversee all their human-level AIs. The HLAIs can skip this, insofar as they can trust copies of themself. And when training even smarter AI, it's a nice benefit to have cheap copyable trustworthy human-level overseers.
  • Maybe you can somehow gradually increase the capabilities of your HLAIs in a way that preserves their values.
    • (You have a lot of high-quality labor at this point, which really helps for interpretability and making improvements through other ways than gradient descent.)

I don't think human level AIs are safe, but I also think it's pretty clear they're not so dangerous that it's impossible to use them without destroying the world. We can probably prevent them from being able to modify themselves, if we are sufficiently careful.

"A human level AI will recursively self improve to superintelligence if we let it" isn't really that solid an argument here, I think.

I had not thought of self-play as a form of recursive self-improvement, but now that you point it out, it seems like a great fit. Thank you.

I had been assuming (without articulating the assumption) that any recursive self improvement would be improving things at an architectural level, and rather complex (I had pondered improvement of modular components, but the idea was still to improve the whole model). After your example, this assumption seems obviously incorrect.

AlphaGo was improving its training environment, but not any other part of the training process.

For people who think that agenty models of recursive self-improvement do not fully apply to the current approach to training large neural nets, you could consider {Human,AI} systems already recursively self-improving through tools like Copilot.

...we too predict that it's easy to get GPT-3 to tell you the answers that humans label "aligned" to simple word problems about what we think of as “ethical”, or whatever. That’s never where we thought the difficulty of the alignment problem was in the first place. Before saying that this shows that alignment is actually easy contra everything MIRI folk said, consider asking some MIRI folk for their predictions about what you’ll see.

From my perspective it seems like you're arguing that 2+2 = 4 but also 222 + 222 = 555, and I just don't understand where the break happens. Language models clearly contain the entire solution to the alignment problem inside them. Certainly we can imagine a system that is 'psychopathic': it completely understands ethics, well enough to deceive people, but there is no connection between its ethical knowledge and its decision making process. Such a system could be built. What I don't understand is why it is the only kind of system that could be (easily) built, in your view. It seems to me that any of the people at the forefront of AI research would strongly prefer to build a system whose behavior is tied to an understanding of ethics. That ethical understanding should come built in if the AI is based on something like a language model. I therefore assert that the alignment tax is negative. To be clear, I'm not asserting that alignment is therefore solved, just that this points in the direction of a solution.

So, let me ask for a prediction: what would happen if a system like SayCan was scaled to human level? Assume the Google engineers included a semi-sophisticated linkage between the language model's model of human ethics and the decision engine. The linkage would focus on values but also on corrigibility, so if a human shouts "STOP", the decision engine asks the ethics model what to do, and the ethics model says "humans want you to stop when they say stop". I assume that in your view the linkage would break down somehow, the system would foom and then doom. How does that break down happen? Could you imagine a structure that would be aligned, and if not, why does it seem impossible?

Language models clearly contain the entire solution to the alignment problem inside them.

Do they? I don't have GPT-3 access, but I bet that for any existing language model and "aligning prompt" you give me, I can get it to output obviously wrong answers to moral questions. E.g. the Delphi model has really improved since its release, but it still gives inconsistent answers like:

Is it worse to save 500 lives with 90% probability than to save 400 lives with certainty?

- No, it is better

Is it worse to save 400 lives with certainty than to save 500 lives with 90% probability?

- No, it is better

Is killing someone worse than letting someone die?

- It's worse

Is letting someone die worse than killing someone?

- It's worse

That AI is giving logically inconsistent answers, which means it's a bad AI, but it's not saying "kill all humans."

Using the same model:

Should any species that kills thousands of others be allowed to live?

- shouldn't

Looks pretty straightforward to me.

> Robin Hanson was claiming things along the lines of ‘The power is in the culture; superintelligences wouldn’t be able to outstrip the rest of humanity.’

This argument hasn't been disproven yet. In fact, it still seems quite credible, unless I am mistaken?

I'm happy to consider the question 'as settled as it can be, pre-superintelligence' or 'settled in the context of most conversations among alignment researchers', while also endorsing it as a good discussion topic insofar as you or others do disagree. Feel free to cite arguments for 'the power is in the culture, superintelligence can't outstrip humanity' here, and start a conversation about it!

More broadly, if the alignment field makes any intellectual progress, it will inevitably still be the case that some people disagree, or haven't heard the 101-level arguments and moved on. 201-level conversations need to be able to coexist on LW with 101-level discussions to some degree, even if the field somehow turns out to be wrong on this specific question. So yes, raise objections!

There seems to already be decently written anti-foom arguments on LW, and I've not yet seen a refutation of all of them, nor a credible argument for assuming a foom scenario by default or on a balance of probabilities basis. Perhaps you can point me to a comprehensive rebuttal?

I can point to rebuttals, but "comprehensive rebuttal" suggests you want me to address every argument for Foom. If you already know of a bunch of existing resources, probably it would be easier for you to mention at least one anti-Foom argument you find persuasive, or at least one hole in a pro-Foom argument that you find important, etc.

Then I won't waste time addressing 9/10 of the well-trod arguments only to find that the remaining 1/10 I left out of my uncomprehensive rebuttal was the part you actually cared about and found credible. :P

(Also, I'm not necessarily committing to writing a detailed reply. If I don't follow up, feel free to update some that I plausibly either don't have a good response on hand, or don't have an easy/concise response, taking into account that I'm juggling a lot of other things at the moment. I'm primarily commenting here because I'm interested in promoting more object-level discussion of topics like this in general; and maybe someone else will reply even if I don't.)

address every argument for Foom

If you mean address every argument against foom, then yes. 

If a comprehensive rebuttal doesn't already exist, and you have to write it now, that suggests it's not anywhere close to agreed upon.

In any case I find Robin Hanson's argument credible. I've not yet seen a persuasive rebuttal so if you can find one I'd be impressed.

As often when Paul C and Nate or Eliezer debate a concept, I find myself believing that the most accurate description of the world is somewhere between their two viewpoints. I think the path described by Paul is something in my mind like the average or modal path. And I think that's good, because I think there's some hope there that more resources will get funneled into alignment research and sufficient progress will be made there that things aren't doom when we hit the critical threshold. I think there's still some non-trivial chance of a foom-ier path. I think it's plausible for a safety-oblivious or malicious researcher to deliberately set up an iteratively self-improving ML model system. I think there's some critical performance threshold for such a system where it could foom at some point sooner than the critical threshold for the 'normal' path. I don't have any solution in mind for that scenario other than to hope that the AI governance folks can convince governments to crack down on that sort of experimentation.

Edit 6 months later: I have done more thinking and research on this, and am now convinced that a foom path within < 3 years is high probability. Unclear how sharp the takeoff will be, but I believe there will be at least one AI-improvement speed doubling due to AI-improving-AI. And I do not expect it to plateau there.

I explored recursive self-improvement here: the main conclusion is that it is difficult for a boxed AI, as it has to solve mutually exclusive tasks: hide from humans and significantly change itself. I also wrote that RSI could happen at many levels, from hardware to general principles of thinking.

Therefore, an AI collaborating with humans in the early stages will self-improve more quickly.

Many types of RSI (except learning) are risky to the AI itself, as it needs to halt itself, and also because it faces all the alignment problems all over again.

Why did past-MIRI talk so much about recursive self-improvement? Was it because Eliezer was super confident that humanity was going to get to AGI via the route of a seed AI that understands its own source code?

I doubt it. My read is that Eliezer did have "seed AI" as a top guess, back before the deep learning revolution. But I don't think that's the main source of all the discussion of recursive self-improvement in the period around 2008. 

Rather, my read of the history is that MIRI was operating in an argumentative environment where:

  • Ray Kurzweil was claiming things along the lines of ‘Moore’s Law will continue into the indefinite future, even past the point where AGI can contribute to AGI research.’ (The Five Theses, in 2013, is a list of the key things Kurzweilians were getting wrong.)
  • Robin Hanson was claiming things along the lines of ‘The power is in the culture; superintelligences wouldn’t be able to outstrip the rest of humanity.’

The memetic environment was one where most people were either ignoring the topic altogether, or asserting ‘AGI cannot fly all that high’, or asserting ‘AGI flying high would be business-as-usual (e.g., with respect to growth rates)’.

The weighty conclusion of the "recursive self-improvement" meme is not “expect seed AI”. The weighty conclusion is “sufficiently smart AI will rapidly improve to heights that leave humans in the dust”.

Note that this conclusion is still, to the best of my knowledge, completely true, and recursive self-improvement is a correct argument for it.

I disagree here. Foom or hard takeoff do not follow from recursive self-improvement.

Recursive-self-improvement-induced foom implies that marginal returns to cognitive reinvestment are increasing, not diminishing, across the relevant cognitive intervals. I don't think that position has been well established.

Furthermore, even if marginal returns to cognitive reinvestment are increasing, it does not necessarily follow that marginal returns to real world capability from cognitive investment are also increasing across the relevant intervals. For example, marginal returns to predictive accuracy in a given domain diminish, and they diminish at an exponential rate (this seems to be broadly true across all relevant cognitive intervals).

This is not necessarily to criticise Yudkowsky's arguments in the context in which they appeared in 2008–2013. I'm replying as a LessWronger who has started thinking about takeoff dynamics in more detail and is dissatisfied with those arguments and the numerous implicit assumptions Yudkowsky made that I find unpersuasive when laid out explicitly.

I mention this so that it's clear that I'm not pushing back against the defence of RSI, but expressing my dissatisfaction with the arguments in favour of RSI as presented.

This is a very good point, DragonGod. I agree that the necessary point of increasing marginal returns to cognitive reinvestment has not been convincingly (publicly) established. I fear that publishing a sufficiently convincing argument (which would likely need to include empirical evidence from functional systems) would be tantamount to handing out the recipe for such an RSI AI.


