Steering toward world-states, taken literally, is impossible for a realistic agent, because an embedded agent cannot even contain a representation of a detailed world-state.
I'm not imagining AI steering toward a full specification of a physical universe; I'm imagining it steering toward a set of possible worlds. Sets of possible worlds can often be fully understood by reasoners, because you don't need to model every world in the set in perfect detail in order to understand the set; you just need to understand at least one high-level criterion (or set of criteria) that determines which worlds go in the set vs. not in the set.
E.g., consider the preference ordering "the universe is optimal if there's an odd number of promethium atoms within 100 light years of the Milky Way Galaxy's center of gravity, pessimal otherwise". Understanding this preference just requires understanding terms like "odd" and "promethium" and "light year"; it doesn't require modeling full universes or galaxies in perfect detail.
Similarly, "maximize the amount of diamond that exists in my future light cone" just requires you to understand what "diamond" is and what "the more X you have, the better" means. It doesn't require you to fully represent every universe in your head in advance.
(Note that selecting the maximizing action is computationally intractable; but you can have a maximizing goal even if you aren't perfectly succeeding in the goal.)
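To make that concrete, here's a minimal sketch (my own toy illustration; the `WorldSummary` type and its single field are hypothetical stand-ins, not anything from the post): the entire promethium preference is a one-line predicate over coarse world-summaries, and no detailed world model appears anywhere.

```python
# Toy sketch: a preference over possible worlds expressed as a predicate on a
# coarse world-summary. (`WorldSummary` and its field are made up for
# illustration; nothing here models a universe in detail.)

from dataclasses import dataclass

@dataclass
class WorldSummary:
    promethium_atoms_near_core: int  # count within 100 light years of the galactic center

def is_optimal(world: WorldSummary) -> bool:
    """The whole preference ordering: optimal iff the promethium count is odd."""
    return world.promethium_atoms_near_core % 2 == 1

# The set of optimal worlds is astronomically large and varied, but the
# criterion that picks them out fits in one line.
print(is_optimal(WorldSummary(promethium_atoms_near_core=3)))  # True
print(is_optimal(WorldSummary(promethium_atoms_near_core=4)))  # False
```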
The definition I give in the post is "AI that has the basic mental machinery required to do par-human reasoning about all the hard sciences". In footnote 3, I suggest the alternative definition "AI that can match smart human performance in a specific hard science field, across all the scientific work humans do in that field".
By 'matching smart human performance... across all the scientific work humans do in that field' I don't mean to require that there literally be nothing humans can do that the AI can't match. I do expect this kind of AI to quickly (or immediately) blow humans out of the water, but the threshold I have in mind is more like:
STEM-level AGI is AI that's at least as scientifically productive as a human scientist who makes a variety of novel, original contributions to a hard-science field that requires understanding the physical world well. E.g., it can go toe-to-toe with highly productive human scientists on applying its abstract theories to real-world phenomena, using scientific ideas to design new tech, designing physical experiments, operating equipment, and generating new ideas that turn out to be true and that importantly advance the frontiers of our knowledge.
The way I'm thinking about the threshold, AI doesn't have to be Nobel-prize-level, but it has to be "fully doing science". I'd also be happy with a definition like 'AI that can reason about the physical world in general', but I think that emphasizing hard-science tasks makes it clearer why I'm not thinking of GPT-4 as 'reasoning about the physical world in general' in the relevant sense.
For starters, you can have goal-directed behavior without steering the world toward particular states. Novelty seeking, for example, doesn't imply any particular world-state to achieve.
If you look from the outside like you're competently trying to steer the world into states that will result in you getting more novel experience, then this is "goal-directed" in the sense I mean, regardless of why you're doing that.
If you (e.g.) look from the outside like you're selecting the local action that's least like the actions you've selected before, regardless of how that affects you or your future novel experience, etc., then that's not "goal-directed" in the sense I mean.
The distinction isn't meant to be totally crisp (there are different degrees and dimensions of "goal-directedness"), but maybe these examples help clarify what I have in mind. "Maximize novel experience" is a pretty vague goal, but it's not so vague that I think it falls outside of what I had in mind -- e.g., I think the standard instrumental convergence concerns apply to "maximize novel experience".
"Steer the world toward there being an even number of planets in the Milky Way Galaxy" also encompasses a variety of possible world-states (more than half of the possible worlds where the Milky Way Galaxy exists are optimal), but I think the arguments in the OP apply just as well to this goal.
A sufficiently intelligent agent knows that its utility function is an approximation of the true preferences of its creator.
Nope! Humans were created by evolution, but our true utility function isn't "maximize inclusive genetic fitness" (nor is it some slightly tweaked version of that goal).
See also, in the OP: "Problem of Fully Updated Deference: Normative uncertainty doesn't address the core obstacles to corrigibility."
Dustin Moskovitz comments on Twitter:
The deployment problem is part of societal response to me, not separate.
[...] Eg race dynamics, regulation (including ability to cooperate with competitors), societal pressure on leaders, investment in watchdogs (human and machine), safety testing norms, whether things get open sourced, infohazards.
"The deployment problem is hard and weird" comes from a mix of claims about AI (AGI is extremely dangerous, you don't need a planet-sized computer to run it, software and hardware can and will improve and proliferate by default, etc.) and about society ("if you give a decent number of people the ability to wield dangerous AGI tech, at least one or them will choose to use it").
The social claims matter — two people who disagree about how readily Larry Page and/or Mark Zuckerberg would put the world at risk might as a result disagree about whether a Good AGI Project has median 8 months vs. 12 months to do a pivotal act.
When I say "AGI ruin rests on strong claims about the alignment problem and deployment problem, not about society", I mean that the claims you need to make about society in order to think the alignment and deployment problems are that hard and weird, are weak claims (e.g. "if fifty random large AI companies had the ability to use dangerous AGI, at least one would use it"), and that the other claims about society required for high p(doom) are weak too (e.g. "humanity isn't a super-agent that consistently scales up its rationality and effort in proportion to a problem's importance, difficulty, and weirdness").
Arguably the difficulty of the alignment problem itself also depends in part on claims about society. E.g., the difficulty of alignment depends on the difficulty of the task we're aligning, which depends on "what sort of task is needed to end the acute x-risk period?", which depends again on things like "will random humans destroy the world if you hand them world-destroying AGI?".
The thing I was trying to communicate (probably poorly) isn't "Alignment, Deployment, and Society partitions the space of topics", but rather:
Note that if it were costless to make the title way longer, I'd change this post's title from "AGI ruin mostly rests on strong claims about alignment and deployment, not about society" to the clearer:
The AGI ruin argument mostly rests on claims that the alignment and deployment problems are difficult and/or weird and novel, not on strong claims about society
One reason I like "the danger is in the space of action sequences that achieve real-world goals" rather than "the danger is in the space of short programs that achieve real-world goals" is that it makes it clearer why adding humans to the process can still result in the world being destroyed.
If powerful action sequences are dangerous, and humans help execute an action sequence (that wasn't generated by human minds), then it's clear why that is dangerous too.
If the danger instead lies in powerful "short programs", then it's more tempting to say "just don't give the program actuators and we'll be fine". The temptation is to imagine that the program is like a lion, and if you just keep the lion physically caged then it won't harm you. If you're instead thinking about action sequences, then it's less likely to even occur to you that the whole problem might be solved by changing the AI from a plan-executor to a plan-recommender. Thinking in terms of action sequences is a step in the right direction in terms of actually grokking the nature of the problem.
Thanks for the replies, Ryan!
I think the exact quantitative details make a big difference between "AGI ruin seems nearly certain in the absence of positive miracles" and "doom seems quite plausible, but we'll most likely make it through" (my probability of takeover is something like 35%).
I don't think the difference between 'the very first STEM-level AGI is smart enough to destroy the world if you relax some precautions' and 'we have 2.5 years to work with STEM-level AGI before any system is smart enough to destroy the world' changes my p(doom) much at all. (Though this is partly because I don't expect, in either of those worlds, that we'll be able to be confident about which world we're in.)
If we have 6 years to safely work with STEM-level AGI, that does intuitively start to feel like a significant net increase in p(hope) to me? Though this is complicated by the fact that such AGI probably couldn't do pivotal acts either, and having STEM-level AGI for a longer period of time before a pivotal act occurs means that the tech will be more widespread when it does reach dangerous capability levels. So in the endgame, you're likely to have a lot more competition, and correspondingly less time to spend on safety if you want to deploy before someone destroys the world.
I think you should probably note where people (who are still sold on AI risk) often disagree.
If I had a list of 5-10 resources that folks like Paul, Holden, Ajeya, Carl, etc. see as the main causes for optimism, I'd be happy to link those resources (either in a footnote or in the main body).
I'd definitely include something like 'survey data on the same population as my 2021 AI risk survey, saying how much people agree/disagree with the ten factors', though I'd guess this isn't the optimal use of those people's time even if we want to use that time to survey something?
One of the options in Eliezer's Manifold market on AGI hope is:
The tech path to AGI superintelligence is naturally slow enough and gradual enough, that world-destroyingly-critical alignment problems never appear faster than previous discoveries generalize to allow safe further experimentation.
When I split up probability mass a month ago between the market's 16 options, this one only got 1.5% of my probability mass (12th place out of the 16). This obviously isn't the same question we're discussing here, but it maybe gives some perspective on why I didn't single out this disagreement above the many other disagreements I could have devoted space to that strike me as way more relevant to hope? (For some combination of 'likelier to happen' and 'likelier to make a big difference for p(doom) if they do happen'.)
The rate of progress seems very fast, and it seems plausible that AI systems will race through the full range of human reasoning ability over the course of a few years. But this is hardly 'likely to blow human intelligence out of the water immediately, or very soon after its invention'.
... Wait, why not? If AI exceeds the human capability range on STEM four years from now, I would call that 'very soon', especially given how terrible GPT-4 is at STEM right now.
The thesis here is not 'we definitely won't have twelve months to work with STEM-level AGI systems before they're powerful enough to be dangerous'; it's more like 'we won't have decades'. Somewhere between 'no time' and 'a few years' seems extremely likely to me, and I think that's almost definitely not enough time to figure out alignment for those systems.
(Admittedly, in the minority of worlds where STEM-level AGI systems are totally safe for the first two years they're operational, part of why it's hard to make fast progress on alignment is that we won't know they're perfectly safe. An important chunk of the danger comes from the fact that humans have no clue where the line is between the most powerful systems that are safe, and the least powerful systems that are dangerous.)
Like, it's not clear to me that even Paul thinks we'll have much time with STEM-level AGI systems (in the OP's sense) before we have vastly superhuman AI. Unless I'm misunderstanding, Paul's optimism seems to have more to do with 'vastly superhuman AI is currently ~30 years away' and 'capabilities will improve continuously over those 30 years, so we'll have lots of time to learn more, see pretty scary failure modes, adjust our civilizational response, etc. before AI is competitive with the best human scientists'.
But capabilities gains still accelerate on Paul's model, such that as time passes we get less and less time to work with impressive new capabilities before they're blown out of the water by further advances (though Paul thinks other processes will offset this to produce good outcomes anyway); and these capabilities gains still end up stratospherically high before they plateau, such that we aren't naturally going to get a lull to safely work with smarter-than-human systems for a while before they're smart enough that a sufficiently incautious developer can destroy the world with them.
Maybe I'm misunderstanding something about Paul's view, or maybe you're pointing at other non-Paul-ish views...?
So, this argument seems mostly circular
I don't think your claim makes the argument circular / question-begging; it just means there's an extra step in explaining why and how a random action sequence destroys the world.
Maybe you mean that I'm putting the emphasis in the wrong place, and it would be more illuminating to highlight some specific feature of random smart short programs as the source of the 'instrumental convergence' danger? If so, what do you think that feature is?
From my current perspective I think the core problem really is that most random short plans that succeed in sufficiently-hard tasks kill us. If the causal process by which this happens includes building a powerful AI optimizer, or building an AI that builds an AI, or building an AI that builds an AI that builds an AI, etc., then that's interesting and potentially useful to know, but that doesn't seem like the key crux to me, and I'm not sure it helps further illuminate where the danger is ultimately coming from.
(That said, I don't expect the plan to necessarily literally kill all humans, just to take over the world, but this is due to galaxy-brained trade and common-sense morality arguments which are mostly out of scope and shouldn't be a thing people depend on.)
Very happy to hear someone with an idea like this who explicitly flags that we shouldn't gamble on this being true!
I partly answered that here, and I'll edit some of this into the post:
I'm not sure what the right percentile to target here is -- maybe we should be looking at the top 5% of Americans with STEM PhDs? Where Americans with STEM PhDs are maybe at the top 1% of STEM ability for Americans?
I want it to include the ability to run experiments and use physical tools.
I don't know what the "basic mental machinery" required is -- I think GPT-4 is missing some of the basic cognitive machinery top human scientists use to advance the frontiers of knowledge (as opposed to GPT-4 doing all the same mental operations as a top scientist but slower, or something), but this is based on a gestalt impression from looking at how different their outputs are in many domains, not based on a detailed or precise model of how general intelligence works.
One way of thinking about the relevant threshold is: if you gave a million chimpanzees billions of years to try to build a superintelligence, I think they'd fail, unless maybe you let them reproduce and apply selection pressure to them to change their minds. (But the latter isn't something the chimps themselves realize is a good idea.)
In contrast, top human scientists pass the threshold 'give us enough time, and we'll be able to build a superintelligence'.
If an AI system, given enough time and empirical data and infrastructure, would eventually build a superintelligence, then I'm mostly happy to treat that as "STEM-level AGI". This isn't a necessary condition, and it's presumably not strictly sufficient (since in principle it should be possible to build a very narrow and dumb meta-learning system that also bootstraps in this way eventually), but it maybe does a better job of gesturing at where I'm drawing a line between "GPT-4" and "systems in a truly dangerous capability range".
(Though my reason for thinking systems in that capability range are dangerous isn't centered on "they can deliberately bootstrap to superintelligence eventually". It's far broader points like "if they can do that, they can probably do an enormous variety of other STEM tasks" and "falling exactly in the human capability range, and staying there, seems unlikely".)
I tend to think of us that way, since top human scientists aren't a separate species from average humans, so it would be hard for them to be born with complicated "basic mental machinery" that isn't widespread among humans. (Though local mutations can subtract complex machinery from a subset of humans in one generation, even if they can't add complex machinery to a subset of humans in one generation.)
Regardless, given how I defined the term, at least some humans are STEM-level.
The weakest STEM-level AGIs couldn't do a pivotal act; the reason I think you can do a pivotal act within a few years of inventing STEM-level AGI is that I think you can quickly get to far more powerful systems than "the weakest possible STEM-level AGIs".
The kinds of pivotal act I'm thinking about often involve Drexler-style feats, so one way of answering "why can't humans already do pivotal acts?" might be to answer "why can't humans just build nanotechnology without AGI?". I'd say we can, and I think we should divert a lot of resources into trying to do so; but my guess is that we'll destroy ourselves with misaligned AGI before we have time to reach nanotechnology "the hard way", so I currently have at least somewhat more hope in leveraging powerful future AI to achieve nanotech.
(The OP doesn't really talk about this, because the focus is 'is p(doom) high?' rather than 'what are the most plausible paths to us saving ourselves?'.)
In an unpublished 2017 draft, a MIRI researcher and I put together some ass numbers regarding how hard (wet, par-biology) nanotech looked to us:
(500 VNG research years = 500 von-Neumann-group research years, defined as 'how much progress ten copies of John von Neumann would make if they worked together on the problem, hard, for 500 serial years'.)
This is also why I think humanity should probably put lots of resources into whole-brain emulation: I don't think you need qualitatively superhuman cognition in order to get to nanotech; I think we're just short on time, given how slowly whole-brain emulation has advanced thus far.
With STEM-level AGI I think we'll have more than enough cognition to do basically whatever we can align; but given how tenuous humanity's grasp on alignment is today, it would be prudent to at least take a stab at a "straight to whole-brain emulation" Manhattan Project. I don't think humanity as it exists today has the tech capabilities to hit the pause button on ML progress indefinitely, but I think we could readily do that with "run a thousand copies of your top researchers at 1000x speed" tech.
(Note that having dramatically improved hardware to run a lot of ems very fast is crucial here. This is another reason the straight-to-WBE path doesn't look hopeful at a glance, and seems more like a desperation move to me; but maybe there's a way to do it.)