The following are some comments I gave on Open Philanthropy Senior Research Analyst Joe Carlsmith’s Apr. 2021 “Is power-seeking AI an existential risk?”, published with permission and lightly edited. Joe replied; his comments are included inline. I gave a few quick replies in response that I didn't want to worry about cleaning up; Rob Bensinger has summarized a few of them and those have also been added inline.
I think Joe Carlsmith's report is clear, extensive, and well-reasoned. I also agree with his conclusion, that there's at least a 5% chance of catastrophic risk from AI by 2070. In fact, I think that number is much too low. I'll now attempt to pinpoint areas of disagreement I have with Joe, and put forth some counterarguments to Joe's position.
Warning: this is going to be a bit quick-and-dirty, and written in a colloquial tongue.
I'll start by addressing the object-level disagreements, and then I'll give a few critiques of the argument style.
On the object level, let's look at Joe's "shorter negative" breakdown of his argument in the appendix:
Shorter negative: By 2070:
1. It will become possible and financially feasible to build APS AI systems. 65%
2. It will be much more difficult to build APS AI systems that would be practically PS-aligned if deployed than to build APS systems that would be practically PS-misaligned if deployed, but which are at least superficially attractive to deploy anyway | 1. 35%
3. Deployed, practically PS-misaligned systems will disempower humans at a scale that constitutes existential catastrophe | 1-2. 20%
Implied probability of existential catastrophe from scenarios where all three premises are true: ~5%
My odds, for contrast, are around 85%, 95%, and 95%, for an implied 77% chance of catastrophe from these three premises, with most of our survival probability coming from "we have more time than I expect". These numbers in fact seem a bit too low to me, likely because in giving these very quick-and-dirty estimates I failed to account properly for the multi-stage fallacy (more on that later), and because I have some additional probability on catastrophe from scenarios that don't quite satisfy all three of these conjuncts. But the difference between 5% and 77% is stark enough to imply significant object-level disagreement, and so let's focus on that first, without worrying too much about the degree.
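A minimal sketch of the arithmetic behind the two implied figures: each premise is conditional on the ones before it, so the implied probability is just the product of the chain. The variable and function names here are illustrative assumptions, not anything from Joe's report.

```python
from math import prod

# Conditional probabilities for premises 1-3, as quoted in the text.
joe_premises = [0.65, 0.35, 0.20]
nate_premises = [0.85, 0.95, 0.95]


def implied_probability(premises):
    """Multiply a chain of conditional probabilities into a joint probability."""
    return prod(premises)


joe_total = implied_probability(joe_premises)
nate_total = implied_probability(nate_premises)

print(f"Joe's implied probability:  {joe_total:.4f}")
print(f"Nate's implied probability: {nate_total:.4f}")
```

The products come to roughly 0.05 and 0.77, matching the "~5%" and "77%" figures. Note that this only covers scenarios where all three premises hold; it says nothing about catastrophe via other routes.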
"we have more time than I expect"
Joe Carlsmith: I'd be curious how much your numbers would change if we conditioned on AGI, but after 2070.
I have some additional probability on catastrophe from scenarios that don't quite satisfy all three of these conjuncts
Joe Carlsmith: Would be curious to hear more about these scenarios. The main ones salient to me are "we might see unintentional deployment of practically PS-misaligned APS systems even if they aren’t superficially attractive to deploy" and "practically PS-misaligned APS systems might be developed and deployed even absent strong incentives to develop them (for example, simply for the sake of scientific curiosity)".
Maybe also cases where alignment is easy but we mess up anyway.
Before I dive into specific disagreements, a bit of background on my model of the world. Note that I'm not trying to make a large conjunctive argument here, these are just a bunch of background things that seem to be roughly true-ish to me and that inform where I'm coming from in my following criticisms. Note also that Joe mentioned / acknowledged many of these points (or related points).
1. Much of my strategic thinking about advanced AI systems revolves around the notion of "decisive strategic advantage" -- it's all well and good if your AI can make a bunch of money on the stock market, but one capable of wiping out the entirety of humanity in an afternoon is a different story altogether.
2. The bottleneck on decisive strategic advantages is very likely cognition (of a deep and high-quality variety).
3. The abilities of a cognitive system likely scale non-continuously with the depth and quality of the cognitions.
[Partial summary of Nate’s reply: Once dangerous AGI systems exist, if you lose control of one, it may well get scarier than you expect surprisingly quickly because of things like recursive self-improvement.
In worlds where AGI is developed in an ML paradigm like the current one, Nate's models do not strongly rely on (nor strongly rule out) the possibility that AGI development is sped up by the (pre-AGI) fruits of AI progress.]
if you can understand protein folding well enough to get 90% through the reasoning of how your nanomachines will operate in the real world, that doesn't let you build nanomachines that have 90% of the impact of ones that are successfully built to carry out a particular purpose.
Joe Carlsmith: Not sure about this -- feels plausible to me that being able to "almost" build nano-tech (at some specified level of quality) allows you to do some pretty cool stuff. Do you think there's some important difference here between e.g. nano-tech and e.g. microsoft word? It's true that there's a big difference between a 90% functional MS-word (depending on what 10% we remove), and a 100% functional one -- but you still see versions of MS-word steadily increasing in quality over time, rather than a discontinuous jump. You could make similar arguments about e.g. planes. In some sense there's a discontinuous difference between "can fly" and "can't fly," but you still start with pretty crappy planes (from AI Impacts, https://aiimpacts.org/discontinuous-progress-in-history-an-update/: "Powered heavier-than-air flight got started in 1903, but at first planes only traveled hundreds of feet, and it took time to expand that to the 1600 or so miles needed to cross the Atlantic in one hop"), which then get better over time.
One can claim that specific technologies will be especially amenable to discontinuities; but it felt like, here, you wanted to make a broader point about cognition per se.
I expect I could do a lot with 100,000 trained-software-engineer-hours, that I cannot do with 1,000,000 six-year-old hours.
Joe Carlsmith: I find this sort of example more convincing. In particular, it seems plausible to me that as a matter of empirical fact, there's a big gap in the quality of cognition in e.g. a 1e15 parameter model and a 1e16 parameter model, such that we move sufficiently fast across the gap that we see dramatic increases in capability across a given unit of scaling.
Joe Carlsmith: I'm most used to thinking about claims like 1-3 in the context of a picture where, shortly after we start developing better-than-human systems (or perhaps, shortly after we reach some-harder-to-pinpoint capability threshold?), there will be a single AI system that can kill all humans and take over the world fairly easily, even despite the level of resistance/competition coming from other parts of the world also scaling up their own AI capabilities, trying to defend themselves, trying to stabilize the situation, etc. Does that sound like a reasonable description of your mainline scenario?
If so, I'd be curious to break down your levels of optimism/pessimism in worlds where we vary some of these assumptions -- e.g., we make take-off slower, we make DSAs relative to the rest of the world harder to get, we make the gap between "AGI/TAI" and "can build killer nano-tech using mostly its mind and readily available tools" bigger, etc. I expect that stuff in this vicinity is an important source of disagreement.
Ok, great. Next, a caveat:
Joe focuses on systems with the following three properties:
Advanced capability: they outperform the best humans on some set of tasks which when performed at advanced levels grant significant power in today’s world (tasks like scientific research, business/military/political strategy, engineering, and persuasion/manipulation).
Agentic planning: they make and execute plans, in pursuit of objectives, on the basis of models of the world.
Strategic awareness: the models they use in making plans represent with reasonable accuracy the causal upshot of gaining and maintaining power over humans and the real-world environment.
I have my own concept of "AI systems worth worrying about", that does not quite fit cleanly into this definition. I am basically just going to run with my own concept, instead of trying to contort my thinking around Joe's. I expect this to mostly go fine, as my sense is that Joe was, with his definitions, mostly trying to wave in the direction of the cluster of things that my concept also covers. That said, some of our apparent disagreement might come from the fact that I'm using my concept, and he's using his narrower concept.
My broader concept, for the record, is something more like "AI systems that can carry out long and deep cognitive chains, sufficient for accessing the Scary Stuff", where the "AI systems" part connotes that it's more like an alien mind than like an extension of a human's will (which rules out, eg, whole brain emulation or genetically engineered super-smart human beings).
Some places where I suspect our concepts differ, in ways that are perhaps relevant:
1. I don't think strategic awareness matters all that much, from the perspective of Team Earth. Or in other words: the AIs that are likely to kill us, are likely strategically aware, but if we successfully deploy AIs in a way that ends the acute risk period, I think it's about even-odds as to whether they're strategically aware, and the question doesn't factor much into my capability estimations.
2. I suspect I think "agentic planning" is a much weaker condition than Joe does.
There are hugely many ways to reduce your predictive error when thinking about protein folding, only a few of which matter for the ability of the resulting nanosystem to carry out some desired behavior in the real world
Joe Carlsmith: Not sure I've understood this point. What's the difference here between the methods of reducing your predictive error that "matter" to the nanosystem's functioning, and the ones that don't? Maybe an example would help.
[Partial summary of Nate’s reply: Suppose that you're trying to design a protein factory so that you can build a nanofactory. This is a tricky engineering task. As a subproblem, you want to develop the ability to predict protein interactions, with the intent of being able to use your models when designing open-air nanotech.
You train a large model on protein interactions, using the mean-square distance of each modeled-atom-position from its corresponding real-world position. Most atoms, however, are moving rapidly in the middle of some enormous enzyme, while a couple are on the binding site.
Among the ones on the binding site, all that matters is figuring out whether the enzyme reacts with the catalyst you are considering using to carry out some specific function in your protein factory. Your model, however, doesn't care about that; it is minimizing the mean-square distance of all atoms, and therefore spends all day learning about the behaviors of atoms in the middle of the enzyme.
The world is extraordinarily detailed. On Nate's view, if we want a plan that works in real life, we have to concentrate cognition on the relevant questions. As Nate sees it, asking for a "non-consequentialist but working plan" is a bit like asking for a "way of drawing an accurate picture of a maze without observing it".]
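A toy sketch of the loss-function point in the summary above: when error is averaged uniformly over every atom, a model that is accurate on the bulk of the enzyme but wrong at the binding site can score better than one that gets the binding site right. All numbers (atom counts, error sizes) and names below are made-up assumptions for illustration, not real protein data or anything Nate specified.

```python
def mean_squared_error(errors):
    """Mean of squared per-atom position errors."""
    return sum(e * e for e in errors) / len(errors)


N_ATOMS = 1000   # hypothetical number of atoms in the enzyme
N_BINDING = 5    # hypothetical number of atoms on the binding site

# Per-atom position errors in arbitrary units; binding-site atoms come last.
# Model A: very good on the bulk, poor on the binding site.
model_a = [0.1] * (N_ATOMS - N_BINDING) + [2.0] * N_BINDING
# Model B: mediocre on the bulk, very good on the binding site.
model_b = [0.5] * (N_ATOMS - N_BINDING) + [0.1] * N_BINDING


def report(errors):
    """Return (overall MSE, binding-site-only MSE) for a list of errors."""
    overall = mean_squared_error(errors)
    binding = mean_squared_error(errors[-N_BINDING:])
    return overall, binding


a_overall, a_binding = report(model_a)
b_overall, b_binding = report(model_b)

print(f"Model A: overall={a_overall:.4f}, binding site={a_binding:.4f}")
print(f"Model B: overall={b_overall:.4f}, binding site={b_binding:.4f}")
```

Under these assumed numbers the uniform objective prefers Model A, even though Model B is the one that answers the question the engineer cares about. Doing well on the task requires weighting cognition toward the binding site, which is the sense in which the optimization has to be aimed at consequences.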
any AI system that is capable of designing a system that carries out a specific behavior in the real world, is somehow focusing its cognition on the questions of consequential behavior, in an action I'd call 'planning' 'in pursuit of an objective'.
Joe Carlsmith: I'm pretty sympathetic to this, provided that the "planning" in question allows for the type of predictions about the system's behavior required for the instrumental convergence argument to go through (the instrumental convergence arg is ultimately why I care about agentic planning and strategic awareness).
That is, I think the question is whether the type of planning strictly required to e.g. build a functioning nano-system supports the instrumental convergence arg (for example, can you build nano-tech in a way that is intuitively 'sphexish'? If not, is this centrally about the complexity of the task? Evolution can build squirrels, for example, that hide their nuts in trees for the winter, but that don't take other actions that might help them survive for the winter (or whatever; I don't actually know the squirrel science); we can train AlphaStar to execute cool plans in a StarCraft environment, but in a way that wouldn't generalize to strategic action in other contexts; and so on.)
unfettered access to a motor system
Joe Carlsmith: Fwiw, this isn't necessary for agentic planning in my sense: see e.g. the bullet points in 2.1.2: https://docs.google.com/document/d/1smaI1lagHHcrhoi6ohdq3TYIZv0eNWWZMPEy8C8byYg/edit#heading=h.70ean6ha5tu6.
producing schematics for nanomachines that you synthesize yourself
Joe Carlsmith: I do think there's an important question, though, about how exactly those schematics were optimized. That is, you can imagine an AI (system A) with the myopic goal of producing, on this single pass through the network, a plan for a nano-system that would receive high ratings according to some other system's (system B) predictions about whether that plan would result in a nanosystem that works. System A has an opportunity to exert causal influence on the world, yes, via these planned nano-systems getting built; but the nano-systems themselves aren't optimized directly for implementing some world-take-over scheme. Rather, they are optimized such that the plans they get built from receive high ratings from system B.
3. I suspect Joe has some concept of advanced AI systems that are not "agentic planners", and that have a different catastrophe-profile.
The most obvious example people in our neck of the woods often discuss is systems that are "deferential to humans" in some deep and persistent way.
Joe Carlsmith: I wouldn't be thinking of "deferential systems" as non-APS. A perfect butler, for example, makes plans in pursuit of objectives, with strategic awareness. Same with a perfect AI personal assistant.
To me, the most salient "no APS systems" scenarios involve intuitively "sphexish" systems whose 'planning' is sufficiently brittle/model-free/domain-limited that it doesn't generalize in response to power-seeking type incentives. I think of Drexler's "CAIS" vision as embodying something like this vibe. When I imagine this, I imagine the way I feel about image classifiers (e.g., "in a fairly structural way, this isn't the type of system we need to be worrying about" -- though this need not hold if the system is scaled up to arbitrary degrees) turning out to apply to many more systems than I'm inclined to expect by default.
I also have some probability on "this whole vaguely-rational-agent-EU-maximizer way of thinking about AI systems is misguided somehow." The report isn't explicitly framed in terms of maximizing utility functions, but I expect that similar background concepts/assumptions are sneaking in at various points, and I think this may well just result in confusion/mis-prediction at levels that can be difficult to anticipate/conceptualize.
it's possible that Joe does, and that as such, the second bullet point above is doing a bunch of work for him that it's not doing for me.
Joe Carlsmith: Optimism/pessimism about Paul's approach isn't playing a role in the probability I assign to "timelines" or "incentives to build APS systems". I think of Paul's work as centrally about building aligned APS systems, rather than as trying to avoid APS-ness. Related to the "perfect butlers are still agentic planners" point above.
Now, on to critiquing Joe's probabilities:
Re (1), on timelines, observe that my timelines are significantly more aggressive than Joe's. I'm not sure where the disagreement lies, but I can list a handful of things that drive my probability of AGI-in-the-next-half-century-or-so above 80%:
1. 50 years ago was 1970. The gap between AI systems then and AI systems now seems pretty plausibly greater than the remaining gap, even before accounting for the recent dramatic increase in the rate of progress, and potential future increases in rate-of-progress as it starts to feel within-grasp.
2. I observe that, 15 years ago, everyone was saying AGI is far off because of what it couldn't do -- basic image recognition, Go, StarCraft, Winograd schemas, simple programming tasks. But basically all that has fallen. The gap between us and AGI is made mostly of intangibles. (Computer programming that is Actually Good? Theorem proving? Sure, but on my model, "good" versions of those are a hair's breadth away from full AGI already. And the fact that I need to clarify that "bad" versions don't count, speaks to my point that the only barriers people can name right now are intangibles.) That's a very uncomfortable place to be!
3. When I look at the history of invention, and the various anecdotes about the Wright brothers and Enrico Fermi, I get an impression that, when a technology is pretty close, the world looks a lot like how our world looks.
4. Summarizing the above two points, I suspect that I'm in more-or-less the "penultimate epistemic state" on AGI timelines: I don't know of a project that seems like they're right on the brink; that would put me in the "final epistemic state" of thinking AGI is imminent. But I'm in the second-to-last epistemic state, where I wouldn't feel all that shocked to learn that some group has reached the brink. Maybe I won't get that call for 10 years! Or 20! But it could also be 2, and I wouldn't get to be indignant with reality. I wouldn't get to say "but all the following things should have happened first, before I made that observation!". Those things have happened. I have made those observations.
5. It seems to me that the Cotra-style compute-based model provides pretty conservative estimates. For one thing, I don't expect to need human-level compute to get human-level intelligence, and for another I think there's a decent chance that insight and innovation have a big role to play, especially on 50 year timescales.
6. There has been a lot of AI progress recently. When I tried to adjust my beliefs so that I was positively surprised by AI progress just about as often as I was negatively surprised by AI progress, I ended up expecting a bunch of progress.
There are other arguments against short timelines that I don't find very compelling (eg "society might collapse"), there are other arguments in favor that I don't find super compelling (eg "also things might speed up a bunch when we get close & can build tools that help us speed up"). \shrug.
My overall state on timelines, having meditated upon it, is basically "wanna bet?". I'm not sure what else I could say to drill into the crux of the issue, without knowing more where Joe and I disagree.
The gap between AI systems then and AI systems now seems pretty plausibly greater than the remaining gap
Joe Carlsmith: I'm sympathetic at an intuitive level, but I also have wide error bars at this level of abstraction. In particular, I have in mind worlds where some task turns out to be "very hard," despite cool progress. When I try doing this sort of extrapolation with brain emulation, for example, I think it's pretty plausible that despite >50 years of pretty cool neuroscience progress, we're still less than halfway (at least in researcher-years) to emulating human brains (modulo some other automation of scientific progress more broadly), because it turns out the brain is pretty gnarly, difficult to understand, and just requires a large amount of grunt-work, data-gathering in the face of technical/logistical limitations, etc. And especially if I set aside specific views about ML being promising, it seems plausible to me that AI turns out to be like brain emulation in this respect.
See e.g. Ajeya's stuff on the limitations of "subjective impressiveness extrapolation": https://docs.google.com/document/d/1cCJjzZaJ7ATbq8N2fvhmsDOUWdm7t3uSSXv6bD0E_GM/edit#heading=h.njuz93bimqty.
[gap between us and] AGI is made mostly of intangibles.
Joe Carlsmith: Though I think there are some selection effects, where we have incentives to develop better tangible benchmarks re: what AI systems can't do, as they become more capable and can do more stuff.
And in a basic sense, if you look out at the space of tasks humans actually perform in the world, there's a huge amount of stuff that AIs can't do. Here I'm expecting you'll lean on all of these tasks showing up when we're a hair's breadth away from full AGI already, but it seems like this requires substantive background hypotheses about the trajectory of AI capability development, such that you expect human cognitive labor to mostly get automated "all at once" rather than in a gradually increasing way. And there I'd want to know what's underlying those hypotheses. For example, on Ajeya's "horizon lengths"-centric picture, everything doesn't come "all at once" -- rather, you get "short" tasks substantially before you get "longer" ones. And we can imagine further limitations on e.g. data, environments, compute, etc, even if in some sense we know how to do it "in principle."
it looks like experts pointing to specific technical hurdles
Joe Carlsmith: Though it feels like the ability to point to technical hurdles benefits significantly from understanding what you need to do well enough to point to specific hurdles that could get in your way, and it's not clear to me that we're there with AI. E.g., with something like, say, very fast interstellar space-craft, we have enough understanding of the technology and basic dynamics to say stuff about required fuel, materials, etc. But with AI it feels like we're groping around to a degree that I wouldn't trust to recognize technical hurdles even if they were there. And when we do condition on a specific pathway we think will get us there -- e.g., "X method of scaling up ML systems will basically work" -- there are indeed specific candidate hurdles re: compute, data, environments, etc that emerge from rough extrapolation from current scaling laws and other assumptions (though it's definitely possible to debate exactly what to expect in these respects).
exited that regime a few years ago.
Joe Carlsmith: What are the technical hurdles you think we cleared a few years ago? Stuff like image recognition and Go? I would've thought that you could've made arguments like this for decades, and that re: image recognition and go, you would've just said "we don't know of any specific technical hurdles to automating image recognition and go, so who knows when it'll get done."
Human-brain level hardware is one candidate, but how much this is really a technical barrier has always been an open question (as has the relevant quantity). And from your comments below, it doesn't sound like it's your focus.
"but all the following things should have happened first, before I made that observation!".
Joe Carlsmith: Before e.g. image recognition and Go, would you have said "but image recognition and Go should have happened first"?
In that case, I start to wonder about how to tell the difference between image recognition/go, and the plethora of tasks we can't currently do.
Overall though I'm sympathetic to "I just wouldn't be that shocked by 5 years" (for me these are centrally "scaling up GPT-3 type systems works" worlds). 2 is pushing it for me, especially for "can easily nano-tech you" systems.
[Partial summary of Nate’s reply: Nate does think that "all this 'AI' field can do is brute-search look a bunch of ply deeper into a game tree than a human; it can't even parse up a visual field into shapes and edges in a manner that allows it to practically use that information" is a reasonably solid "we aren't on the brink of AGI" argument. Nate doesn't feel that the remaining hurdles have the same caliber of "I was supposed to see that happen many years before I saw AGI".
Nate concedes that his statement was slightly exaggerated -- there are certain brands of theorem-proving such that Nate would be willing to stick his neck out and say "this comes just before the end".
In 2009, however, Nate reports that you could find people who would scoff and say "AI can't even X, it must be far". Now, if you press people for analogous hurdles, people's response is to try to weasel out of it, because they just don't know. Nate is trying to point to a different social atmosphere, where if you were at conferences before 2014 and after 2014, you would notice a shift from it being cool to boldly claim that true AI is obviously far off because of X, Y, and Z, to people nervously avoiding being pinned down on the topic of what will not be doable in a few years' time.
None of this, on Nate's view, means that AGI is definitely right around the corner. But Nate takes it to mean that he doesn't get to be indignant if reality says "and all the rest of the stuff falls very quickly once you have one or two more key insights".]
there's a decent chance that insight and innovation have a big role to play
Joe Carlsmith: Though note that Ajeya's report includes an adjustment for algorithmic progress: https://docs.google.com/document/d/1cCJjzZaJ7ATbq8N2fvhmsDOUWdm7t3uSSXv6bD0E_GM/edit#heading=h.epn531rebzyy
There has been a lot of AI progress recently. When I tried to adjust my beliefs so that I was positively surprised by AI progress just about as often as I was negatively surprised by AI progress, I ended up expecting a bunch of progress.
Joe Carlsmith: Broadly sympathetic to this.
"society might collapse"
Joe Carlsmith: Fwiw, this isn't playing any role for me.
also things might speed up a bunch when we get close & can build tools that help us speed up
Joe Carlsmith: Interesting. As I mentioned above, something like this plays an important role in some models OP/our advisors are interested in, and it feels to me like an important question re: take-off dynamics more broadly. Not sure it's the best use of time re: this report in particular, but would be interesting to hear your take in more detail at some point.
Joe Carlsmith: My guess is that debating timelines isn't the best use of our time for present purposes, as I didn't spend much time on them in the report, relative to arguments for doom conditional on timelines. That said, to the extent that you're at 85% on a pretty demanding conception of AGI (e.g., "a single system that can nano-tech all humans very easily, starting from not-especially-different-from-today levels of knowledge and tech") by 2070, I do feel open to betting. I'm hoping to work out my own take on timelines better, either in the process of revising the report or sometime after (in the current version I'm mostly leaning on work by other folks), though, so would prefer to wait to see where I end up after that.
Re (2), on it being comparatively difficult to build aligned systems, I suspect a bunch of our disagreement is in the "at least superficially attractive to deploy" clause. In particular, I basically expect the "endgame" to be a shitshow, full of events at least as crazy-seeming-to-us as:
And more realistically, quite plausibly as crazy as "running the AGI on internet-connected computers, unfettered".
Like, on my model, the AI doesn't need to look all that superficially good. The relevant actors will feel forced into corners where the good-looking-ness is only tangentially relevant. The social, political, and psychological forces pushing people to persuade themselves that things look good despite the superficial evidence to the contrary will likely be large. The future is likely to look derpy rather than competent; I predict it will be more like a WWII story from Catch 22 than a WWII story from Saving Private Ryan (eg, a story where everyone is fumbling and bumbling, rather than a story in which everyone is dutifully and honorably carrying out a well-defined purpose).
I suspect we have some object-level disagreement on this point, though I'm not quite sure where. Perhaps Joe disagrees with the view that the modern world is more like Catch 22 than Saving Private Ryan? Perhaps he expects it to change in the endgame? My counters are roughly "behold the 2016 elections" and "behold the world response to the coronavirus". To get into more detail, I'd want some back-and-forth.
The AI system is drawing up plans to Kill All Humans. The decision is to retrain the system until it doesn't seem to be drawing up such plans. Our community cautions that this will not train out the behavior, but only its appearance in our shallow diagnostic tools. The plan goes forward anyway.
Joe Carlsmith: I think maybe we see stuff analogous to this; but I think we do have some different levels of pessimism here. That is, I think that if e.g. Demis/Sam literally observes an AI system drawing up plans that literally involve killing all the humans (e.g., "step 3: here all the nano-bots burst forth from the human bloodstreams"), I think there would be a substantially bigger freak-out than just "ok let's retrain it until it doesn't make plans like this." For less blatant forms of bad-intended-behavior, though, it's less clear. The type of race dynamic we're in also matters here.
The AI system is clearly not safe for being deployed, but a state actor already has a copy of the code, and they've deployed it in a risky way, and will probably deploy it in an even more risky way next week, and the team decides that their deployment is overall less risky than that.
And more realistically, quite plausibly as crazy as "running the AGI on internet-connected computers, unfettered".
Joe Carlsmith: This seems pretty imaginable.
Perhaps Joe disagrees with the view that the modern world is more like Catch 22 than Saving Private Ryan?
Joe Carlsmith: I think we probably do have some background disagreement here. E.g., I expect that you are more broadly pessimistic about the world's competence than I am, even though I don't think the world is a saving-private-ryan situation. Might be interesting to try to pin down the disagreement into specific predictions. The "what do people do if they literally see the AI system trying to kill people" case is maybe one example.
And note that some things one might class as "incompetence" can push in the good direction in the case of AI risk: e.g., people being naturally suspicious of new/scary/ethically weird technology (cf. nuclear power, cloning), very strong backlash to small incidents (e.g., nuclear power, accidents with self-driving cars), intense regulation, etc.
"behold the 2016 elections" and "behold the world response to the coronavirus"
Joe Carlsmith: Agree that these are instructive data points.
Another point of potential disagreement is that I don't think AI alignment needs to be all that hard to make proposition (2) true-enough-to-kill-us. Like, I'm not expecting alignment to be a feat of herculean difficulty unlike any technical obstacle humanity has ever faced in the past. I'm just expecting it to be hard enough that we can't pull it off between the time the first team gets a sufficiently capable AGI, and the time that someone else deploys one (or they do, either out of derpy incompetence or crappy game theoretic response to an unwinnable gameboard).
From my perspective, the claim "maybe it won't be very hard to align these things" sounds a lot like "maybe my code won't have bugs". It seems like sheer blind optimism, in the face of evidence like "if you train humans real hard for reproductive fitness they still invent condoms", to expect alignment to be so easy that it can be carried out last-minute, in the window between having-AGI and the end of the world.
Perhaps Joe expects alignment to be solved before that time window? (I don't; it hasn't been going very well so far, from my perspective. Perhaps we have a disagreement about how promising the research currently being done is, or how promising it's going to get before the endgame.)
Perhaps Joe expects that the time window we'll have is significantly and relevantly longer than I expect? (I expect that would take a huge effort of civilization-wide coordination that our civilization seems categorically incapable of organizing.)
Perhaps Joe thinks the relevant time window will be long enough without a massive coordination effort, because he expects all relevant researchers will understand that they're handling a civilization-ending catastrophe device rather than a petulant system that will turn into a pot of gold if it gets shaken up a bit? (I don't; I expect the AI researcher culture to be about as competent around alignment issues as they are now, or near enough as makes no difference.)
Perhaps Joe thinks that alignment is so easy that it can be solved in a short time window?
My main guess, though, is that Joe is coming at things from a different angle altogether, and one that seems foreign to me.
Attempts to generate such angles along with my corresponding responses:
I dunno, I might be able to generate more by thinking hard about it, but it would be much easier to find someone who disagrees (perhaps Joe) and have a bit of a back-and-forth.
Joe expects that the time window we'll have is significantly and relevantly longer than I expect?
Joe Carlsmith: Depending on "relevantly longer," I think there's some of this. By default I expect years of work with systems that are sufficiently APS that you're actually getting relevant empirical data about the real problem, learning important stuff, making progress, and so on.
That is, in my mainline model you don't need to wait to develop some super advanced system, then put things "on pause." Your empirical alignment work is getting incrementally more useful along the way (and you're getting incrementally more useful assistance from AI tools, weaker agents, etc).
thinks the relevant time window will be long enough without a massive coordination effort, because he expects all relevant researchers will understand that they're handling a civilization-ending catastrophe device rather than a petulant system that will turn into a pot of gold if it gets shaken up a bit?
Joe Carlsmith: I do tend to expect significant delays between development of a given AI system, and its large-scale intentional deployment in the real world, at least for many applications. For example, I don't think the military will just take the first advanced AI system it gets its hands on and put all the nukes in its hands. See also self-driving cars, delays in the adoption of electricity, etc. But I think this is centrally due to expecting various frictions/adoption delays/regulatory hurdles/bureaucratic problems, plus fear-of-new-tech dynamics, rather than "everyone is super responsible and cautious re: alignment X-risk in particular."
Perhaps Joe thinks that alignment is so easy that it can be solved in a short time window?
Joe Carlsmith: I do have some probability that the alignment ends up being pretty easy. For example, I have some probability on hypotheses of the form "maybe they just do what you train them to do," and "maybe if you just don't train them to kill you, they won't kill you." E.g., in these worlds, non-myopic consequentialist inner misalignment doesn't tend to crop up by default, and it's not that hard to find training objectives that disincentivize problematically power-seeking forms of planning/cognition in practice, even if they're imperfect proxies for human values in other ways.
Joe Carlsmith: My main guess is more like: I place more weight than you on comparatively optimistic settings for a variety of different variables -- e.g., alignment easiness, timelines, take-off speed/discontinuity, possibility of non-APS systems being the main thing, ease of DSA/nano-teching, civilizational competence (or over-caution-type incompetence that ends up being helpful), correction ability, some crucial aspect of this discourse being confused/mistaken -- such that I end up with significant credence on "no" for lots of premises where you're at >90%, and this adds up.
perhaps it's just not that hard to train an AI system to be "good" in the human sense? Like, maybe it wouldn't have been that hard for natural selection to train humans to be fitness maximizers, if it had been watching for goal-divergence and constructing clever training environments?
Joe Carlsmith: I think something like this is in the mix for me. That is, I don't see the evolution example as especially strong evidence for how hard inner alignment is conditional on actually and intelligently trying to avoid inner misalignment (especially in its scariest forms).
Change-Of-Distribution to operating in the real world
Joe Carlsmith: One question I have here is whether "operating in the real world" really ends up as a binary/irrevocable switch. That is, it seems like all along the way, we are probably going to be getting data about what it's like to deploy systems in the real world, learning about what goes wrong, deploying them in limited/controlled contexts, revoking their spheres of influence when we see problems, etc.
Of course, for any given choice to develop an AI system or extend its influence, including e.g. letting beta-users access the GPT-3 API, there is some probability that "that step was the fuck-up and now everyone is going to die." But the question of what we should actually expect that probability to be at each actual stage, given the work done and understanding gained by that point, seems like a further question -- and I worry that imagining a "big red deploy button" that we press and then "cross our fingers because we basically have no data about how this is going to go" will end up a misleading image.
perhaps there's just a handful of relevant insights, and new ways of thinking about things, that render the problem easy?
Joe Carlsmith: This isn't a big part of it for me. I don't expect some conceptual "click" that solves the problem.
Re (3), on whether deployed misaligned systems are catastrophic, it looks to me like I have two points of disagreement with Joe:
1. I suspect I think that the capability band "do a trillion dollars worth of damage, but don't Kill All Humans" is narrower / harder to hit.
2. I suspect we disagree about how much warning shots help civilization get its act together and do better next time.
With respect to the first point, this might be because I'm somewhat more bullish about rapid capability gain, or it might be due to some of the points listed above about the usefulness of cognition changing sharply as the quality of cognition increases slowly (shitty analogy: if "1 million 6-year old hours" is still useless, but "1 million 14-year old hours" is not, and we spend all our time going from rat-level capabilities in some domain to monkey-level capabilities in that domain, we might blow straight past the relevant threshold in a weekend). As such, on my picture, trillion-dollar warning shots just don't happen all that often, and so even if society would get its act together in the face of one, we won't have the opportunity.
(Or, well, a trillion dollars of damage is not all that huge a number these days, but whatever, we can work with 10 trillion instead.)
Joe Carlsmith: I feel sympathetic to points in this vein, and have been thinking for a bit about revising my probability on premise 5 to reflect greater correlation between "trillion dollars of damage worlds" and "full disempowerment worlds."
And of course I also disagree that society would get its act together in the face of warning shots. As case-in-point, I exhibit the global response to the coronavirus: it was a 10 trillion dollar warning shot about pandemics. Is society now going to get its act together with regards to biological risks? Is gain-of-function research going to be stopped, conclusively, world-wide? Is machinery for rapid development and deployment of vaccines going to be built and maintained? Get your bets in now!
Because my bet is: lol no. Not even close. And if civilization can't ban gain-of-function research, when it has only very dubious benefits and huge risks, and when the forces arrayed in favor are a mere handful of academics, then why should I expect that civilization will coordinate around making sure that AI research is safe, when AI technology is having large and visceral positive impacts on society (or at least the economy), and the forces arrayed in favor are enormous business interests with huge amounts of money to gain?
In short, if we are lucky enough for our first AI failure to be a mere warning shot rather than a civilization-ending catastrophe, I expect our civilization to do with it the same thing they do with every other warning shot: squander it completely. It probably won't even make a dent in the institutional inertia behind pushing research forward, and even if it did then the business interests arrayed behind AI research proceeding, and the psychological forces pushing researchers to believe in their research, and the sheer ease of saying things like "well that system only made that mistake because it was too stupid", will be more than enough to overpower any resistance. Or so I predict.
And if civilization can't ban gain-of-function research, when it has only very dubious benefits and huge risks, and when the forces arrayed in favor are a mere handful of academics, then why should I expect that civilization will coordinate around making sure that AI research is safe, when AI technology is having large and visceral positive impacts on society (or at least the economy), and the forces arrayed in favor are enormous business interests with huge amounts of money to gain?
Joe Carlsmith: I find this example fairly compelling.
every other warning shot
Joe Carlsmith: Civilizational reactions to much smaller nuclear disasters like Chernobyl and Three Mile Island seem like an instructive data point here.
It probably won't even make a dent in the institutional inertia behind pushing research forward, and even if it did then the business interests arrayed behind AI research proceeding, and the psychological forces pushing researchers to believe in their research, and the sheer ease of saying things like "well that system only made that mistake because it was too stupid", will be more than enough to overpower any resistance.
Joe Carlsmith: I think our intuitions here are different. I think that if a rogue AI system, for example, crashed the financial system and tried to bioweapon everyone, but only killed 50M people instead of everyone, and then was finally gotten under control via some extreme civilizational effort like turning off the entire internet and destroying tons of computers, and this was known and vivid to the world, this would put a very significant dent in the institutional/research inertia and business interest pushing for just continuing forward in scaling up similar systems. And I really don't expect "that just happened because it wasn't smart enough let's just make it smarter."
Indeed, this is the kind of thing I can readily imagine leading to really extreme worldwide backlash, intense international coordination, research bans, nations threatening to bomb other nations if they build sufficiently big compute clusters, etc.
Now, question (3) might be a place where I'm getting a little bitten by my substitution of Joe's "APS systems" and my "Scary systems" (in which case more of our disagreement is shunted into question (2), b/c not only do I have a higher probability of AI systems being sufficiently-superficially-aligned that some fool is deploying them, but I also believe the deployed systems are drawn from a scarier class), but I also get the sense that we have a disagreement about general civilizational competence, and its ability to react in sane and reasonable-seeming ways when the stakes are high and a bunch of value is on the line.
As usual, I could dig deeper into various points where I suspect disagreement, but I'm not sure where the real disagreements lie, so for now I'll desist.
Now, critiquing the argument style: I worry that Joe's style of argumentation, of breaking a proposition into a series of (allegedly conditional) conjunctive steps and then multiplying to attain an estimate, has a tendency to give answers that are much too low. (Joe acknowledges this point himself in the report, though does not combat it to my satisfaction. This is known as the "multi-stage fallacy", and its namesake -- and an example of its use in the wild -- is exhibited by Nate Silver here: https://fivethirtyeight.com/features/donald-trumps-six-stages-of-doom/.)
Speaking roughly, if we break an argument into n conjunctive steps, and try to look "reasonably uncertain" about each step, this will tend to drive our probability of any event happening to around 1 in 2^n. In particular, with a six-step breakdown (as in the main document) the "reasonable uncertainty" drives the answer towards 1 in 64 or about 1.5%, and a three-step breakdown (as in the appendix) drives the answer towards 1 in 8 or about 12.5%. 5% is comfortably in the middle of those two ranges (and strikingly close to their geometric mean), and I worry that the probabilities assigned in the report are mostly an exercise in deploying the multi-stage fallacy.
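The arithmetic in the paragraph above can be checked with a minimal sketch. It assumes that "reasonably uncertain" means assigning roughly 50% to every stage, which is a simplification made here for illustration; the function name `conjunctive_baseline` is likewise hypothetical.

```python
import math


def conjunctive_baseline(n_stages: int, per_stage: float = 0.5) -> float:
    """Probability that all n stages hold if each stage gets the same probability.

    Assumption for illustration: "reasonable uncertainty" is modeled as 50% per stage.
    """
    return per_stage ** n_stages


six = conjunctive_baseline(6)      # 1 in 64
three = conjunctive_baseline(3)    # 1 in 8
geo_mean = math.sqrt(six * three)  # geometric mean of the two baselines

print(f"six-stage baseline:   {six:.4f}")       # 0.0156
print(f"three-stage baseline: {three:.4f}")     # 0.1250
print(f"geometric mean:       {geo_mean:.4f}")  # 0.0442
```

Under this assumption the geometric mean comes out at roughly 4.4%, which is the sense in which 5% sits close to it.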
One intuition for combating this fallacy is that we're supposed to make heavier use of disjunction as well as conjunction in our models. Another is that, when conditioning on earlier stages, our probabilities are supposed to get so extreme that our overall probability could not be driven down further by someone decomposing the claim into further conjunctions that apparently partition the space.
One intuitive motivation for the latter is that the realities that managed to get one or two surprises deep into our list of stages, likely contain some underlying force driving a correlation between all the stages, that spikes the conditional probabilities in the later stages. (In particular, this is how I diagnose Nate Silver's error in the post linked above.)
A further intuition-pump here is that the future has a good chance of surprising us or going sideways, and if it instead falls neatly into the first one or two stages we named clearly in advance, then whatever process picked out the whole series of stages was probably onto something, and conditional on the first few stages, "model uncertainty" (and other forces driving our error bars to be "virtuously wide") is much lower in the later stages.
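One way to make the correlation point concrete is a toy model with a single hidden factor. This is only a sketch under made-up assumptions: a binary factor that is true with probability 0.5, and six stages that each hold with probability 0.9 when the factor is true and 0.1 when it is false. None of these numbers, nor the function names, come from the report.

```python
def joint(k: int, p_factor: float, hi: float, lo: float) -> float:
    """P(first k stages all hold), with stages independent given a hidden binary factor."""
    return p_factor * hi ** k + (1 - p_factor) * lo ** k


def conditional_next(k: int, p_factor: float, hi: float, lo: float) -> float:
    """P(stage k+1 holds | first k stages hold)."""
    return joint(k + 1, p_factor, hi, lo) / joint(k, p_factor, hi, lo)


# Illustrative, made-up parameters (assumptions, not estimates).
P_FACTOR, HI, LO, N = 0.5, 0.9, 0.1, 6

marginal = joint(1, P_FACTOR, HI, LO)       # each stage looks like a coin flip on its own
naive = marginal ** N                       # multiplying marginals as if independent
true_joint = joint(N, P_FACTOR, HI, LO)     # actual probability that all six hold
after_two = conditional_next(2, P_FACTOR, HI, LO)

print(f"marginal per stage:            {marginal:.3f}")
print(f"naive product over {N} stages:   {naive:.4f}")
print(f"joint with hidden factor:      {true_joint:.4f}")
print(f"P(stage 3 | stages 1-2 held):  {after_two:.3f}")
```

In this toy setup every stage has a 50% marginal, so the naive product is 1 in 64, yet the joint probability is about 27%, and the conditional probability of the third stage given the first two is about 89%. The specific values depend entirely on the assumed parameters; the sketch only shows the direction of the effect.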
(Indeed, my guess is that I myself have failed to account entirely for this phenomenon, as evidenced by a guess that my quick-and-dirty numbers on Joe's six-stage argument would yield a lower implied probability than 77%. Let's check! Eyeballing the six stages and choosing numbers off the cuff to 5% precision, I assign... 85%, 100%, 90%, 95%, 95%, 100%, for a total of around 69% -- a decent difference. As I said before, my actual probability of catastrophe is larger, both from leakage to disjuncts technically excluded by Joe's breakdown, and due to a sense that my conditional probabilities would likely drift higher on reflection as I account for the multi-stage fallacy.)
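For readers who want to check the product, a minimal sketch using the six off-the-cuff numbers quoted above (the variable names are hypothetical):

```python
import math

# The six off-the-cuff stage probabilities quoted in the text.
stage_probs = [0.85, 1.00, 0.90, 0.95, 0.95, 1.00]

# Conjunctive estimate: multiply the (conditional) stage probabilities together.
implied = math.prod(stage_probs)

print(f"implied probability: {implied:.3f}")  # 0.690
```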
It seems to me that Joe perhaps attempted to combat the multi-stage fallacy (perhaps due to critiques given by my colleagues and me, on a draft version of this report). In particular, in his appendix, Joe not only makes a 3-stage rather than 6-stage argument, but also considers a "positively-phrased" argument (in both a 3 and 6 stage breakdown).
The three-stage argument does a bit to address my concerns -- it is better to use an argument style that drives all probabilities to 12.5% than 1.5%. The "positive framing", however, does not at all address my critique. In particular, Joe's positively-phrased argument is not conjunctive, but rather disjunctive!
Just as a multi-stage conjunctive argument of length n drives all "reasonably uncertain" assignments of probabilities towards 1 in 2^n, a multi-stage disjunctive argument of length n drives all "reasonably-uncertain" assignments of probabilities towards (2^n - 1) in 2^n, ie 63/64 for 6-stage arguments and 7/8 for 3-stage arguments! So it is no consolation to me that his conjunctive 5% turns into a disjunctive 95%.
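The disjunctive figure follows from the same simplifying assumption, namely that each of the n disjuncts is independently given probability 1/2 (an idealization of "reasonable uncertainty", not something stated in the report):

```latex
P(\text{at least one of } n \text{ disjuncts})
  = 1 - \prod_{i=1}^{n} \left(1 - \tfrac{1}{2}\right)
  = 1 - \frac{1}{2^{n}}
  = \frac{2^{n} - 1}{2^{n}},
\qquad
n = 3:\ \tfrac{7}{8} = 87.5\%,
\qquad
n = 6:\ \tfrac{63}{64} \approx 98.4\%.
```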
(Of course, perhaps Joe was merely attempting to combat certain framing effects, and not also the multi-stage fallacy, in which case my critique still stands, but critiques the argument proper rather than some flawed response to prior criticism.)
An example of a conjunctive, positively framed argument might be:
For humanity to make it to 2070 alive, we need all three of:
Someone attempting to be virtuously uncertain might assign probabilities like, say, 50%, 75%, and 60%, implying a mere 22.5% chance of survival. Now, I'm not saying those are my numbers (they aren't); I'm saying that this is what a "positively framed" conjunctive argument feels like. And, from my perspective, the way one counteracts the multi-stage fallacy is not to simply shorten their conjunctive breakdown or explicate its disjunctive counterpart, but rather to consider also conjunctive breakdowns of the counterpoint. The difficulty is not in reconciling one's conjunctive account of catastrophe with their disjunctive account of survival, but in reconciling a conjunctive account of catastrophe with a conjunctive account of survival.
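The tension between the two conjunctive framings can be seen by putting the numbers side by side. This sketch uses the 65% / 35% / 20% figures from Joe's "shorter negative" breakdown and the illustrative 50% / 75% / 60% survival numbers above. Treating "1 minus survival" as directly comparable to Joe's catastrophe estimate is a simplifying assumption made only for this comparison.

```python
import math

# Joe's "shorter negative" premises (conjunctive account of catastrophe).
catastrophe_stages = [0.65, 0.35, 0.20]

# Illustrative "virtuously uncertain" numbers for a conjunctive account of survival.
# These are explicitly not the author's own estimates.
survival_stages = [0.50, 0.75, 0.60]

p_catastrophe = math.prod(catastrophe_stages)  # catastrophe via conjunctive framing
p_survival = math.prod(survival_stages)        # survival via conjunctive framing
p_catastrophe_from_survival = 1 - p_survival   # simplification: non-survival as catastrophe

print(f"conjunctive catastrophe estimate:      {p_catastrophe:.4f}")
print(f"conjunctive survival estimate:         {p_survival:.4f}")
print(f"catastrophe implied by survival frame: {p_catastrophe_from_survival:.4f}")
```

Both sets of stage probabilities look moderate in isolation, yet the implied chance of catastrophe moves from under 5% to over 75% depending on which side is given the conjunctions.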
The fact that Joe frames his catastrophe estimates conjunctively, and his survival estimates disjunctively, does little to assuage my fears that the final probability in his report is driven ultimately by his choice of which side gets the conjunctions. Joe admits in his report that these are lower bounds, but seems to feel they are not too far off. By contrast, I fear that his style of argumentation has driven them quite a bit too low.
and I worry that the probabilities assigned in the report are mostly an exercise in deploying the multi-stage fallacy.
Joe Carlsmith: A bit about my backdrop model here, which I think may differ from yours. On my general model, arguments have a kind of "true conjunctiveness/disjunctiveness," which it is the task of argument formulation to capture. Thus, for example, the argument that "next new years day you will get hit by lightning while getting eaten by a shark, shortly after winning the lottery" is genuinely conjunctive: you really should be breaking this into conjuncts and estimating their probability ~independently. And there are other arguments -- "here is my overly-specific forecast about how Y technology-i-have-no-control-over will get developed in the next 5 years" -- that are more subtly like this.
Thus, it's not enough to say "thou shalt not break an argument into many stages" or "thou shalt focus on disjunctive formulations rather than conjunctive formulations." Rather, there is a kind of discriminating taste involved in knowing how to carve an argument "at the joints," in a way that brings out the amount of conjunctive-ness it actually involves, but not more, and which makes it easy to capture the correlations at stake in the premises (including the correlations implied by the fact that the person making the argument is offering it at all; e.g., if a CEO tells you an overly-specific plan that they will use to get their company to succeed, there are indeed correlations at stake re: their being able to push for the plan, their being able to identify it ahead of time, etc).
To the extent that there are underlying factors driving correlations between the premises, a good formulation brings those out, and makes them into premises in themselves. Thus, for example, if "Trump is actually a really effective campaigner" is the thing that ultimately drives his probability of getting the nomination, we should bring that out in particular and ask what probability we place on it (though you do also want to capture the worlds where trump gets the nomination without being an effective campaigner -- so you need some disjunction in that sense).
(And it's true that in the lightning/lottery example above, if you do in fact win the lottery shortly before new years day, you should update hard in favor of whoever made this argument having some sort of spooky foresight/power or whatever, thereby introducing new correlations.)
disjunction as well as conjunction in our models
Joe Carlsmith: Fwiw, I don't put a lot of weight on the idea that my premises actively miss a lot of disjunctive routes to power-seeking X-catastrophe, such that we should have a lot of probability mass on power-seeking X-catastrophe coming from worlds where one of my premises is false. In particular, it feels to me like my timelines condition is fairly minimal (and close to required for the instrumental convergence argument to go through); like the "trillion dollars of damage", "full scale disempowerment," and "disempowerment = catastrophe" are implied by basically any X-catastrophe story, and that basically all mainline doom stories involve significant incentives to build the relevant systems, and hard alignment problems.
My main candidates for scenarios that "slip through the cracks" are ones where such incentives aren't there to the right degree, and/or where alignment is easy but we mess up anyway -- but I don't feel like these scenarios are driving close to >50% of the probability mass on power-seeking doom.
That said, I do think it's worth playing around with versions of that argument that focus less on things that strictly need to happen, and more on candidate factors that could drive correlations between premises (e.g., timelines, civilizational competence, take-off, etc), but where there is still significant p(doom) even if they're false (and so the argument ends up disjunctive in that sense). I'm hoping to do more of this going forward.
A further intuition-pump here is that the future has a good chance of surprising us or going sideways, and if it instead falls neatly into the first one or two stages we named clearly in advance, then whatever process picked out the whole series of stages was probably onto something, and *conditional* on the first few stages, "model uncertainty" (and other forces driving our error bars to be "virtuously wide") is much lower in the later stages.
Joe Carlsmith: I agree with this, and would like to think more about what might play this sort of role. The main salient candidates in my mind are something like: "Eliezer is right about stuff in general," "fast/concentrated take-off," and maybe "general civilizational incompetence."
Would be curious if you think there are others worth highlighting.
That said, variables like "absolute alignment easiness" seem pretty uncorrelated, in a metaphysical sense, with variables like "timelines" and "quality of civilizational response to the problem." So my best candidate correlation mechanisms there are more epistemically-flavored: e.g., "Eliezer is right about stuff."
The three-stage argument does a bit to address my concerns -- it is better to use an argument style that drives all probabilities to 12.5% than 1.5%.
Joe Carlsmith: Fwiw, to me it feels like the intuitive argument has at least two separate stages in its deep structure: e.g., AI needs to happen, and we need to all die as a result. And I'm inclined to think three is pretty minimal as well. That is, it feels to me like something like "AI will happen by 2070," "Alignment is a problem and non-trivial to solve" and "We will fail and all die" are pretty importantly separate claims that the argument basically requires, and that reasonable people can have significantly uncorrelated uncertainties about (I also feel this way about the 6 premise argument). And I feel like three premises is sufficiently short that the charge of "artificially" extending the argument so as to drive the probability lower rings false, at least to my ear.
perhaps Joe was merely attempting to combat certain framing effects, and not also the multi-stage fallacy
Joe Carlsmith: Indeed; the positively framed version was centrally meant to address framing effects where people will think it's more virtuous to be "skeptical" in some sense, and so to put low probabilities on claims, whatever they are.
Joe Carlsmith: Thanks for offering this, I found it helpful.
Would also be interested in any other formulations you have up your sleeve, when you're trying to boil down the basic case.
Joe Carlsmith: I think this version of the argument is going to be driven centrally by the probability one assigns to the overall claim that in order for humanity to survive, one needs all of these things -- probability which pretty clearly shouldn't be 100%. For example, there is surely some probability of survival if timelines are within 20 years; some probability that a not-that-competent research community could solve a "pretty easy" version of the alignment problem; and so on.
I'd like to think more about what a version of this argument I'd endorse would look like, and what probabilities I'd assign. One nitpicky issue is that it's not immediately clear to me what "probability you need all three of these things" actually means. E.g., is it something like "probability that absent these three things, my probability on doom should be ~100%"? But given that I'm assigning some substantive probability to worlds where all three things aren't true, it seems unclear what sort of probability is at stake in the "should" above (in general, I don't like assigning subjective probabilities to subjective probabilities -- e.g., "I'm 50% that my probability is >80% that p"). That said, could maybe formulate in terms of "if I thought about it X amount more, I'd get to >Y% on 'you need all three of these things, else doom.'"
Joe Carlsmith: An alternative argument:
(1) AGI by 2070
(2) Eliezer and others at MIRI think >X% doom, conditional on AGI by 2070. What's your probability that they're right?
And, from my perspective, the way one counteracts the multi-stage fallacy is not to simply shorten their conjunctive breakdown or explicate its disjunctive counterpart, but rather to consider also conjunctive breakdowns of the counterpoint. The difficulty is not in reconciling one's conjunctive account of catastrophe with their disjunctive account of survival, but in reconciling a conjunctive account of catastrophe with a conjunctive account of survival.
Joe Carlsmith: I found this general framing helpful, thanks. That said, as mentioned above, I do feel like the right approach to this type of dynamic is specific-argument-dependent: e.g., some arguments just are more conjunctive in one formulation vs. another (see the "hit by lightning while winning the lottery" argument above).
In closing, I again note that I found Joe's report to be remarkably well-reasoned and thorough. I have harped on a variety of points of disagreement, but let us not lose sight of how much we do agree on, such as the overall framing of the problem, and what sorts of questions to be asking and arguments to be making. I see the analysis he uses to support his conclusions as a good breed of analysis; it has a form that takes into account many of the features of the strategic landscape that seem salient to me.
Great and extremely valuable discussion! There's one part that I really wished had been explored further - the fundamental difficulty of inner alignment:
Joe Carlsmith: I do have some probability that the alignment ends up being pretty easy. For example, I have some probability on hypotheses of the form "maybe they just do what you train them to do," and "maybe if you just don't train them to kill you, they won't kill you." E.g., in these worlds, non-myopic consequentialist inner misalignment doesn't tend to crop up by default, and it's not that hard to find training objectives that disincentivize problematically power-seeking forms of planning/cognition in practice, even if they're imperfect proxies for human values in other ways....
Nate: ...maybe it wouldn't have been that hard for natural selection to train humans to be fitness maximizers, if it had been watching for goal-divergence and constructing clever training environments?
I would very much like to see expansion (from either Nate/MIRI or Joe) on these points because they seem crucial to me. My current epistemic situation is (I think) similar to Joe's. Different views about the fundamental difficulty of inner alignment seem to be a (the?) major driver of differences in views about how likely AI X risk is overall. I see lots of worrisome signs from indirect lines of evidence - some based on intuitions about the nature of intelligence, some from toy models and some from vague analogies to e.g. evolution. But what I don't see is a slam dunk argument that inner misalignment is an extremely strong attractor for powerful models of the sort we're actually going to build.
That also goes for many of the specific reasons given for inner misalignment - they often just seem to push the intuition one step further back. E.g. these from Eliezer Yudkowsky's recent interview:
I predict that deep algorithms within the AGI will go through consequentialist dances, and model humans, and output human-manipulating actions that can't be detected as manipulative by the humans, in a way that seems likely to bypass whatever earlier patch was imbued by gradient descent, because I doubt that earlier patch will generalize as well as the deep algorithms.
attempts to teach corrigibility in safe regimes are unlikely to generalize well to higher levels of intelligence and unsafe regimes (qualitatively new thought processes, things being way out of training distribution, and, the hardest part to explain, corrigibility being "anti-natural" in a certain sense that makes it incredibly hard to, eg, exhibit any coherent planning behavior ("consistent utility function") which corresponds to being willing to let somebody else shut you off, without incentivizing you to actively manipulate them to shut you off).
seem like world models that make sense to me, given the surrounding justifications, and I wouldn't be amazed if they were true, and I also place a decent amount of credence on them being true. But I can't pass an ideological Turing test for someone who believes the above propositions with > 95% certainty, given the massive conceptual confusion involved with all of these concepts and the massive empirical uncertainty.
Statements like 'corrigibility is anti-natural in a way that can't easily be explained' and 'getting deep enough patches that generalize isn't just difficult but almost impossibly difficult' when applied to systems we don't yet know how to build at all, don't seem like statements about which confident beliefs either way can be formed. (Unless there's really solid evidence out there that I'm not seeing)
This conversation seemed like another such opportunity to provide that slam-dunk justification for the extreme difficulty of inner alignment, but as in many previous cases Nate and Joe seemed happy to agree to disagree and accept that this is a hard question about which it's difficult to reach any clear conclusion - which if true should preclude strong confidence in disaster scenarios.
(FWIW, I think there's a good chance that until we start building systems that are already quite transformative, we're probably going to be stuck with a lot of uncertainty about the fundamental difficulty of inner alignment - which from a future planning perspective is worse than knowing for sure how hard the problem is.)
Different views about the fundamental difficulty of inner alignment seem to be a (the?) major driver of differences in views about how likely AI X risk is overall.
I strongly disagree with inner alignment being the correct crux. It does seem to be true that this is in fact a crux for many people, but I think this is a mistake. It is certainly significant. But I think optimism about outer alignment and global coordination ("Catch-22 vs. Saving Private Ryan") is a much bigger factor, and optimists are badly wrong on both points here.
Strong upvote. I would also love to see more discussion of the difficulty of inner alignment.
which if true should preclude strong confidence in disaster scenarios
Though only for disaster scenarios that rely on inner misalignment, right?
... seem like world models that make sense to me, given the surrounding justifications
FWIW, I don't really understand those world models/intuitions yet:
(jtbc, I think inner misalignment might be a big problem, I just haven't seen any good argument for it plausibly being the main problem)
Re: corrigibility being "anti-natural" in a certain sense - I think I have a better understanding of this now:
or honestly panic about not having achieved it and halt, by which point a runner-up who doesn’t understand the importance of alignment/corrigibility/obedience deploys their system which destroys the world
Note that this is still better than 'honestly panic about not having achieved it and throw caution to the wind / rationalize reasons they don't need to halt'!
I really appreciated this -- it felt like better access to the 'MIRI strategic viewpoint' than I think I've had in the past.
I found it more clarifying than updating for me, with a couple of tangible exceptions:
First, I found that I really like the example of the response to coronavirus as an example of trillion-dollar warning shots. I think I've also previously agreed that responses to past disasters have been better, but the more recent example should be (all else equal) more informative than the other ones.
Second, this point about factored cognition:
Paul Christiano occasionally floats proposals of (what looks to me like) deferential cognitive systems that are too incapable to be scary, being composed into highly capable cognitive systems that inherit a deference property from their parts. (Paul might not endorse this gloss.) I basically expect the cognition to not compose to something capable, and insofar as it does I basically expect it not to inherit the deference property, and so I have little optimism for such approaches. But it's possible that Joe does, and that as such, the second bullet point above is doing a bunch of work for him that it's not doing for me.
I think this is pretty crux-y for me, and I wonder if it's crux-y for MIRI. This feels very close to the heart of one of my research questions, and if there were strong cases for this, I'd like to hear them.
(My research is less about the whole inheriting deference from the parts, but instead inheriting transparency/interpretability -- I expect them to basically be the same with regard to this non-combination.)
Thanks for putting this stuff online!
FWIW I agree with Nate (and my opinions were largely independent, having read the report and written a response before seeing this). Happy to discuss with anyone interested.
(I posted this for Nate.)
I think this is an excellent response (I'd even say, companion piece) to Joe Carlsmith's also-excellent report on the risk from power-seeking AI. On a brief re-skim I think I agree with everything Nate says, though I'd also have a lot more to add and I'd shift emphasis around a bit. (Some of the same points I did in fact make in my own review of Joe's report.)
Why is it important for there to be a response? Well, the 5% number Joe came to at the end is just way too low. Even if you disagree with me about that, you'll concede that a big fraction of the rationalist community -- including some very well-respected, knowledgeable members -- thinks 5% is way too low. So it's important for their view to be at least partially represented.
Beyond that, I think this post presents some good ideas clearly, and is worth reading in its own right even if you never read Joe's report. I just randomly scrolled to a section of it, "Difficulty of alignment", and right off the bat there are three bullet points that are unfortunately all too plausible & worth visualizing.
Planned summary for the Alignment Newsletter:
This response to the report above touches on many topics, but has three main object-level disagreements and one meta-level disagreement:
1. The author has significantly shorter timelines, though this is based on a very different argument structure than the one presented in the report above, and so it is hard to turn this into more concrete disagreements with the report.
2. The author expects that alignment is hard enough that we won’t solve it in time (which is not to say that it is harder than every other technical problem humanity has ever faced). It’s also not clear how to turn this into more concrete disagreements with the report.
3. The author does not expect to have warning shots where misaligned AI systems cause trillions of dollars of damage but _don’t_ cause an existential catastrophe, because this seems like too narrow a capability range for us to hit in practice. Even if there are warning shots, he expects that civilization will continue to deploy risky AI systems anyway, similarly to how we are not banning gain-of-function research despite the warning shot of COVID-19.
4. On the meta level, the author expects that the decomposition of the AI risk argument into six conjunctive claims will typically bias you towards giving too low a probability on the overall conjunction.
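To make the meta-level point concrete, here is a minimal sketch of how a conjunctive decomposition turns into a headline number. The first list uses the three conditional probabilities from the report's "shorter negative" breakdown quoted at the top of the post (65%, 35%, 20%). The second list is purely hypothetical (six claims at 80% each, not numbers from the report or from Nate) and is only there to show how quickly a product shrinks as more conjuncts are added.

```python
from math import prod

# Conditional probabilities from the "shorter negative" breakdown
# quoted at the top of the post: premise 1, premise 2 | 1, premise 3 | 1-2.
shorter_negative = [0.65, 0.35, 0.20]
joint = prod(shorter_negative)
print(f"three-premise product: {joint:.4f}")

# HYPOTHETICAL illustration only: six conjunctive claims, each judged
# fairly likely (80%) on its own, still multiply out to a low number.
hypothetical_six = [0.8] * 6
hypothetical_joint = prod(hypothetical_six)
print(f"six claims at 80% each: {hypothetical_joint:.4f}")
```

The three-premise product comes out a bit under 5%, which matches the "~5%" in the quoted breakdown. The sketch says nothing about whether the individual estimates are right; it only shows that the final figure is very sensitive to how many conjuncts there are and how each one is set.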
Random nitpick, doesn't seem that central to anything:
Couldn't you spend 200,000 hours (= 23 years) to have the 6yo grow up and train to be a software engineer, and then have 800,000 trained-software-engineer-hours?
(Maybe the point is that you still need the whole infrastructure of school and so on for the 6yo, and if you just imagine giving the 6yo a goal and saying "go", you don't expect them to do well?)
(One way this could be central is that I think this sort of "slow recursive improvement" is totally plausible while Nate thinks it isn't?)
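The arithmetic in the nitpick above can be checked with a short sketch. The 1,000,000-hour total is an inference from the comment's own figures (200,000 + 800,000), and the conversion assumes wall-clock hours in 365-day years; both are assumptions of this sketch rather than anything stated explicitly.

```python
HOURS_PER_YEAR = 24 * 365  # wall-clock hours, ignoring leap years

# Assumed budget: inferred from 200,000 training hours + 800,000 remaining.
total_hours = 1_000_000
training_hours = 200_000

years = training_hours / HOURS_PER_YEAR
remaining_hours = total_hours - training_hours

print(f"training period: {years:.1f} years")
print(f"hours left as a trained engineer: {remaining_hours:,}")
```

On these assumptions the training period is just under 23 years, consistent with the "= 23 years" in the comment.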
condoms aren't a failure of reproductive fitness, they're a success - planned children fare better [citation needed, please search]. more generally, evolution favors total maximum-term inclusive fitness, and the fact that I want to have no kids for a long time, and instead preserve my DNA within my own cells by not dying, does not feel like I've dismissed my built-in objectives, quite the opposite in fact. the only constraint that defines fitness in evolution is "does your pattern of physical interaction survive enough to maintain its existence indefinitely".
evolution is not reliably consistent about finding solutions to this constraint, and certainly not ones that never change, but it's quite rare for humans to be catastrophically misaligned with reproducing their own patterns. I know of very few humans who want none of [genetic reproduction, immortality, altruism-for-sibling, upload-to-documentation, bulk-altruism], all of which are attempts to maintain a subset of the being's patterns of interaction, plausibly in descending order of pattern-completeness.
certainly humans do sometimes decide that the important pattern that identifies their reproduceable traits isn't genetic but rather memetic, which may be considered a failure of alignment in some metrics, but I don't clearly see how, given evolution's fitness function is only definable in terms of which chemical/energy processes replicate their own behaviors reliably. and indeed, our species' success is heavily dependent on accumulation of knowledge from people in the past who "uploaded" small fragments of their own insights in various ways, like books. "living on through your influence" is actually meaningful, in that encoding your ideas into language that can be read and processed does meaningfully reproduce your pattern as long as the language will be read by creatures similar enough to yourself. memes are in effect additional genes that can be output at any time, and sturdy memes like math survive to be appended to the brains of all those whose genes-and-available-resources cause them to end up being able to use those useful memes.
so I guess, I claim, "minimize death and data loss" is a fully general solution to evolution that generalizes by maximizing the area of space considered "self". this is in no way at all a solution to the safety problem, just a rephrase of the question to point out that a solution almost cannot be contrary to fitness and that alignment to evolution is the only way it could have gone, best as I can tell, because evolution only exists in terms of how much further forward in time your patterns maintain identity. it doesn't care if you fork or merge, as long as the net sum of your patterns exists in self-maintaining form. life is an energy gradient that maintains itself.
“stuff the human brain does easily in a half-second”
This is ambiguous between tasks the brain does in a half-second, vs. everything the brain does in a half-second. In a half-second the brain does a bunch of stuff to perform well in the current instance of the half-second-long task it's doing, and it's also doing other stuff to e.g. learn how to perform well in future instances of the task, and to "understand" the elements of the task insofar as those elements will also appear in other tasks. AFAIK ML is systematically more convincing about task performance than about transfer.
My basic take on this question is "that's doubtful (that humanity will be able to pull off such a thing in the relevant timeframes)". It seems to me that making a system "deferential all the way down" would require a huge feat of mastery of AI internals that we're nowhere close to.
We build deferential systems all the time and seem to be pretty good at it. For example, nearly 100% of the individuals in the US military are capable of killing Joe Biden (mandatory retirement age for the military is 62). But nonetheless Joe Biden is the supreme commander of the US armed forces.