A low quality prior on odds of lucky alignment: we can look at the human intelligence sharp left turn from different perspectives
Worst case scenario S risk: pigs, chickens, cows
X risk: Homo floresiensis, etc
Disastrously unaligned but then the superintelligence inexplicably started to align itself instead of totally wiping us out: Whales, gorillas
unaligned but that's randomly fine for us: raccoons, rats
Largely aligned: Housecats
X risk would be passenger pigeons, no?
Anyway, your comment got me thinking. So far it seems the territory colonized by humans is a subset of the territory previously colonized by life, not stretching beyond it. And the territory covered by life is also not all of Earth, never mind the universe. So we can imagine AI occupying the most "cushy" subset of former human territory, with most humans removed from there, some subsisting as rats, some as housecats, some as wild animals periodically hit by incomprehensible dangers coming from the AI zone (similar to oil spills and habitat destruction), and some in S-risk type situations due to the AI remaining concerned with humans in some way.
Though this "concentric circles" model is maybe a bit too neat to imagine, and too similar to existing human myths about gods and so on. So let's not trust it too much.
So we can imagine AI occupying the most "cushy" subset of former human territory
We can definitely imagine it - this is a salience argument - but why is it at all likely? Also, this argument is subject to reference class tennis: humans have colonized much more and more diverse territory than other apes, or even all other primates.
Once AI can flourish without ongoing human support (building and running machines, generating electricity, reacting to novel environmental challenges), what would plausibly limit AI to human territory, let alone "cushy" human territory? Computers and robots can survive in any environment humans can, and in some where we at present can't.
Also: the main determinant of human territory is inter-human social dynamics. We are far from colonizing everywhere our technology allows, or (relatedly) breeding to the greatest number we can sustain. We don't know what the main determinant of AI expansion will be; we don't even know yet how many different and/or separate AI entities there are likely to be, and how they will cooperate, trade or conflict with each other.
I think “Luck could be enough” should be the strong default on priors,2 so in some sense I don’t think I owe tons of argumentation here (I think the burden is on the other side).
I agree with this being the default and the burden being on the other side. At the same time, I don't think of it as a strong default.
Here's a frame that I have that already gets me to a more pessimistic (updated) prior:
It has almost never happened that people who developed and introduced a revolutionary new technology displayed a lot of foresight about its long-term consequences. For instance, there were comparatively few efforts at major social media companies to address ways in which social media might change society for the worse. The same goes for the food industry and the obesity epidemic or online dating and its effects on single parenthood rates. When people invent cool new technology, it makes the world better on some metrics but creates new problems on its own. The whole thing is accelerating and feels out of control.
It feels out of control because even if we get cool new things from tech progress, we don't seem to be getting any better at fixing the messiness that comes with it (misaligned incentives/Goodharting, other Molochian forces, world-destroying tech becoming ever more accessible). Your post says "a [] story of avoiding catastrophe by luck." This framing makes it sound like things would be fine by default if it weren't for some catastrophe happening. However, humans have never seemed particularly "in control" over technological progress. For things to go well, we need the opposite of a catastrophe – a radical change towards the upside. We have to solve massive coordination problems and hope for a technology that gives us god-like power, finally putting sane and compassionate forces in control over the future. It so happens that we can tell a coherent story about how AI might do this for us. But to say that it might go right just by luck – I don't know, that seems far-fetched!
All of that said, I don't think we can get very far arguing from priors. What carries by far the most weight are arguments about alignment difficulty, takeoff speeds, etc. And I think it's a reasonable view to say that it's very unlikely that any researchers currently know enough to make highly confident statements about these variables. (Edit: So, I'm not sure we disagree too much – I think I'm more pessimistic about the future than you are, but I'm probably not as pessimistic as the position you're arguing against in this post. I mostly wanted to make the point that I think the "right" priors support at least moderate pessimism, which is a perspective I find oddly rare among EAs.)
FWIW, it's not obvious to me that slow takeoff is best. Fast takeoff at least gives you god-like abilities early on, which are useful from a perspective of "we were never particularly in control over history; lots of underlying problems need fixing before we pass a point of no return." By contrast, with slow takeoff, coordination problems seem more difficult because (at least by default) there will be more actors using AIs in some ways or other and it's not obvious that the AIs in a slow-takeoff scenario will be all that helpful at facilitating coordination.
My view is that we've already made some significant progress on alignment, compared to say where we were O(15) years ago, and have also had some unexpectedly lucky breaks. Personally I'd list:
This is a personal list and I'm sure will be missing some items.
That we've made some progress and had some lucky breaks doesn't guarantee that this will continue, but it's unsurprising to me that
One of the most dangerous things that even one misaligned AI could theoretically pull off is to successfully launch a misaligned Von Neumann probe, because it would then be extremely hard to track it down in space and stop it before it does its thing.
What about quickly launching a missile following its trajectory using the same technology? The probe eventually needs to slow down to survive impact and the missile doesn't so preventing Von Neumann probes seems fairly straightforward to me. My understanding is that tracking objects in space is very easy unless they've had time to cool to near absolute zero.
On the other hand, this requires that a misaligned AI be able to build such a probe and get it onto a rocket it built or commandeered without being detected or stopped. That rules out safety via monitoring (and related approaches), so we would need to rely on it being essentially aligned anyway (such as via the "natural generalizations" Holden mentioned).
I think interpretability looks like a particularly promising area for “automated research” - AIs might grind through large numbers of analyses relatively quickly and reach a conclusion about the thought process of some larger, more sophisticated system.
Arguably, this is already starting to happen (very early, with obviously-non-x-risky systems) with interpretability LM agents like in FIND and MAIA.
Related, from Advanced AI evaluations at AISI: May update:
Short-horizon tasks (e.g., fixing a problem on a Linux machine or making a web server) were those that would take less than 1 hour, whereas long-horizon tasks (e.g., building a web app or improving an agent framework) could take over four (up to 20) hours for a human to complete.
[...]
The Purple and Blue models completed 20-40% of short-horizon tasks but no long-horizon tasks. The Green model completed less than 10% of short-horizon tasks and was not assessed on long-horizon tasks. We analysed failed attempts to understand the major impediments to success. On short-horizon tasks, models often made small errors (like syntax errors in code). On longer horizon tasks, models devised good initial plans but did not sufficiently test their solutions or failed to correct initial mistakes. Models also sometimes hallucinated constraints or the successful completion of subtasks.
Summary: We found that leading models could solve some short-horizon tasks, such as software engineering problems. However, no current models were able to tackle long-horizon tasks.
I’ve been trying to form a nearcast-based picture of what it might look like to suffer or avoid an AI catastrophe. I’ve written a hypothetical “failure story” (How we might stumble into AI catastrophe) and two “success stories” (one presuming a relatively gradual takeoff, one assuming a more discontinuous one).
Those success stories rely on a couple of key actors (a leading AI lab and a standards-and-monitoring organization) making lots of good choices. But I don’t think stories like these are our only hope. Contra Eliezer, I think we have a nontrivial1 chance of avoiding AI takeover even in a “minimal-dignity” future - say, assuming essentially no growth from here in the size or influence of the communities and research fields focused specifically on existential risk from misaligned AI, and no highly surprising research or other insights from these communities/fields either. (There are further risks beyond AI takeover; this post focuses on AI takeover.)
This is not meant to make anyone relax! Just the opposite - I think we’re in the “This could really go lots of different ways” zone where marginal effort is most valuable. (Though I have to link to my anti-burnout take after saying something like that.) My point is nothing like “We will be fine” - it’s more like “We aren’t stuck at the bottom of the logistic success curve; every bit of improvement in the situation helps our odds.”
I think “Luck could be enough” should be the strong default on priors,2 so in some sense I don’t think I owe tons of argumentation here (I think the burden is on the other side). But in addition to thinking “I haven’t heard knockdown arguments for doom,” I think it’s relevant that I feel like I can at least picture success with minimal dignity (while granting that many people will think my picture is vague, wishful and wildly unrealistic, and they may be right). This post will try to spell that out a bit.
It won’t have security mindset, to say the least - I’ll be sketching things out that “could work,” and it will be easy (for me and others) to name ways they could fail. But I think having an end-to-end picture of how this could look might be helpful for understanding my picture (and pushing back on it!)
I’ll go through:
As with many of my posts, I don’t claim personal credit for any new ground here. I’m leaning heavily on conversations with others, especially Paul Christiano and Carl Shulman.
The initial alignment problem
What happens if you train an AI using the sort of process outlined here - essentially, generative pretraining followed by reinforcement learning, with the latter refereed by humans?
I think danger is likely by default - but not assured. It seems to depend on a number of hard-to-predict things:
For what it’s worth, it seems to me like the very shortest, straightest-line imaginable path to transformative AI from today looks relatively favorable on the above dimensions. Today’s most capable models mostly look like “lots of generative pretraining, tiny amounts of reinforcement learning,” which seems like a good thing at least from the point of view of (a) developing a large suite of human-like concepts; (b) a good chance that the RL part of the training can be highly “accurate” (few opportunities to get better performance by deceiving, manipulating or overpowering supervisors). It’s imaginable that AI models could become transformative with a handful of additions (see footnote 6) that - while greatly changing capabilities and introducing serious alignment risks - wouldn’t clearly change what I just said.
I’d guess that the situation will get worse (e.g., more outcome-based training) compared to that picture, but I don’t know that. And even if it does, it still seems like we could end up with models that have human-level and human-like capabilities and lack ambitious goals of their own. (Even without a special effort.)
For an opposing view, see my dialogue with Nate Soares. I could imagine being about 70% confident of Nate’s view if I’d done a lot of thinking I haven’t done, but I doubt most people are in that sort of position or should trust Nate (or Eliezer) to be, to the point of being certain of danger by default.
Basic countermeasures
There are also possible countermeasures to alignment risk, identifiable today, that I can imagine going quite well, without needing new breakthrough insights (we would need luck, though). Examples:
Simple checks and balances. An example: we might train “AI watchdogs” to make the best arguments they can (ideally including discussions of internal states) that other AIs are dangerous, that specific measures can make them less dangerous, etc. Gradient descent could be used to both push against spurious arguments (as assessed by randomized deep dives) while pushing toward more important+correct (according to human judges and auditors) critiques. There are tons of reasons this might not work, but I don’t think we should be shocked if it produces huge dividends, even coming from simply human-level AIs that are well-suited to finding flaws and dangers in models and outputs similar to themselves. These dividends could lead to highly accurate training that makes it hard to learn unintended generalizations, and/or to deployment scenarios where it’s overly hard for human-level-ish AIs to coordinate against humans.
Intense red-teaming. I’ve written about how adversarial training could “undo” a good deal of misalignment that has popped up during training, especially if something like relaxed adversarial training (simulating inputs that one can’t actually produce) turns out to be feasible. It’s plausible to me that AI companies might invest heavily in this kind of work, without needing to be mostly motivated by existential risk reduction (they might be seeking intense guarantees against e.g. lawsuit-driving behavior by AI systems).
Training on internal states. I think interpretability research could be useful in many ways, but some require more “dignity” than I’m assuming here7 and/or pertain to the “continuing alignment problem” (next section).8 If we get lucky, though, we could end up with some way of training AIs on their own internal states that works at least well enough for the initial alignment problem.
Training AIs on their own internal states risks simply training them to manipulate and/or obscure their own internal states, but this may be too hard for human-level-ish AI systems, so we might at least get off the ground with something like this.
A related idea is finding a regularizer that penalizes e.g. dishonesty, as in Eliciting Latent Knowledge.
It’s pretty easy for me to imagine that a descendant of the Burns et al. 2022 method, or an output of the Eliciting Latent Knowledge agenda, could fit this general bill without needing any hugely surprising breakthroughs. I also wouldn’t feel terribly surprised if, say, 3 more equally promising approaches emerged in the next couple of years.
The deployment problem
Once someone has developed safe, powerful (human-level-ish) AI, the threat remains that:
The situation has now changed in a few ways:
It’s hard to say how all these factors will shake out. But it seems plausible that one of these things will happen:
Any of these could lead to a world in which misaligned AI in the wild is at least rare relative to aligned AI. The advantage for humans+aligned-AIs could be self-reinforcing, as they use their greater numbers to push measures (e.g., standards and monitoring) to suppress misaligned AI systems.
I concede that we wouldn’t be totally out of the woods in this case - things might shake out such that highly-outnumbered misaligned AIs can cause existential catastrophe. But I think we should be optimistic by default from such a point. A footnote elaborates on this, addressing Steve Byrnes’s discussion of a related topic (which I quite liked and think raises good concerns, but isn’t decisive for the scenario I’m contemplating).10
More generally, I think it’s very hard to reason about a world with human-level-ish aligned AIs widely available (and initially outnumbering comparably powerful misaligned AIs), so I think we should not be too confident of doom starting from that point.
Some objections to this picture
The most common arguments I’ve heard for why this picture is hopeless involve some combination of:
I think all of these arguments are plausible, but very far from decisive (and indeed each seems individually <50% likely to me).
Success without dignity
This section is especially hand-wavy and conversational. I probably don’t stand by what you’d get from reading any particular sentence super closely and taking it super seriously. I stand by some sort of vague gesture that this section is trying to make.
I have a high-level intuition that most successful human ventures look - from up close - like dumpster fires. I’m thinking of successful organizations - including those I’ve helped build - as well as cases where humans took highly effective interventions against global threats, e.g. smallpox eradication; recent advances in solar power that I’d guess are substantially traceable to subsidy programs; whatever reasons we haven’t had a single non-test nuclear detonation since 1945.
I expect the way AI risk is “handled by society” to look like a dumpster fire, in the sense that lots of good interventions will be left on the table, lots of very silly things will be done, and no intervention will be satisfyingly robust. Alignment measures will be fallible, standards regimes will be gameable, security setups will be imperfect, and even the best AI labs will have lots of incompetent and/or reckless people inside them doing scary things.
But I don’t think that automatically translates to existential catastrophe, and this distinction seems important. (An analogy: “that bednet has lots of gaping holes in it” vs. “That bednet won’t help” or “That person will get malaria.”) The future is uncertain; we could get lucky and stumble our way into a good outcome.
Furthermore, there are a number of interventions that could interact favorably with some baseline good luck. (I’ll discuss this more in a future post.)
One key strategic implication of this view that I think is particularly worth noting:
I don’t feel emotionally attached to my headspace. It’s nice to not think we’re doomed, but not a very big deal for me,14 and I think I’d enjoy work premised on the first headspace above at least as much as work premised on the second one.
The second headspace is just what seems right at the moment. I haven’t seen convincing arguments that we won’t get lucky, and it seems to me like lots of things can amplify that luck into better odds of success. If I’m missing something correctible, I hope this will prompt discussion that leads there.
Notes
Like >10% ↩
Since another way of putting it is: “AI takeover (a pretty specific event) is not certain (conditioned on the ‘minimal-dignity’ conditions above, which don’t seem to constrain the future a ton).” ↩
Phase 1 in this analysis ↩
Phase 2 in this analysis ↩
I think there are alternative ways things could go well, which I’ll cover in the relevant section, so I don’t want to be stuck with a “pivotal acts” frame. ↩
Salient possible additions to today’s models:
It’s not out of the question to me that we could get to transformative AI with additions like this, and with the vast bulk of the training still just being generative pretraining. ↩
E.g., I think interpretability could be very useful for demonstrating danger, which could lead to a standards-and-monitoring regime, but such a regime would be a lot more “dignified” than the worlds I’m picturing in this post. ↩
I think interpretability is very appealing as something that large numbers of relatively narrow “automated alignment researchers” could work on. ↩
Debate-type setups seem like they would get harder for humans to adjudicate as AI systems advance; more advanced AI seems harder to red-team effectively without its noticing “tells” re: whether it’s in training; internal-state-based training seems more likely to result in “manipulating one’s own internal states” for more advanced AI. ↩
Byrnes’s post seems to assume there are relatively straightforward destruction measures that require draconian, scary “plans” to stop. (Contrast with my discussion here, in which AIs can be integrated throughout the economy in ways that make it harder for misaligned AIs to “get off the ground” with respect to being developed, escaping containment and acquiring resources.)
The latter, more dangerous possibility seems more likely to me, but it seems quite hard to say. (There could also of course be a hybrid situation, as the number and capabilities of AI grow.) ↩
I think optimizing for community epistemics has real downsides, both via infohazards/empowering bad actors and via reputational risks/turning off people who could be helpful. I wish this weren’t the case, and in general I heuristically tend to want to value epistemic virtue very highly, but it seems like it’s a live issue - I (reluctantly) don’t think it’s reasonable to treat “X is bad for community epistemics” as an automatic argument-ender about whether X is bad (though I do think it tends to be a very strong argument). ↩
E.g., working for an AI lab and speeding up AI (I plan to write more about this).
More broadly, it seems to me like essentially all attempts to make the most important century go better also risk making it go a lot worse, and for anyone out there who might’ve done a lot of good to date, there are also arguments that they’ve done a lot of harm (e.g., by raising the salience of the issue overall).
Even “Aligned AI would be better than misaligned AI” seems merely like a strong bet to me, not like a >95% certainty, given what I see as the appropriate level of uncertainty about topics like “What would a misaligned AI actually do, incorporating acausal trade considerations and suchlike?”; “What would humans actually do with intent-aligned AI, and what kind of universe would that lead to?”; and “How should I value various outcomes against each other, and in particular how should I think about hopes of very good outcomes vs. risks of very bad ones?”
To reiterate, on balance I come down in favor of aligned AI, but I think the uncertainties here are massive - multiple key questions seem broadly “above our pay grade” as people trying to reason about a very uncertain future. ↩
I’m a person who just doesn’t pretend to be emotionally scope-sensitive or to viscerally feel the possibility of impending doom. I think it would be hard to do these things if I tried, and I don’t try because I don’t think that would be good for anyone.
I like doing worthy-feeling work (I would be at least as happy with work premised on a “doomer” worldview as on my current one) and hanging out with my family. My estimated odds that I get to live a few more years vs. ~50 more years vs. a zillion more years are quite volatile and don’t seem to impact my daily quality of life much. ↩