"The MIRI types" were very explicit that they were doing security mindset thinking, trying to think about all the things that could possibly go wrong, in advance. This is entirely appropriate and reasonable when not only 8.3 billion lives, but also all their descendants for however long and far the human race would otherwise have got (at least several orders of magnitude more, quite possibly astronomically more), are on the line.
However, if you do it right, the most likely result of thinking long and hard about everything that could possibly go wrong and then publicly posting long lists of it is that, actually, nature doesn't throw you quite that many curveballs — and then people are surprised that things aren't turning out as badly as MIRI were concerned they might. Now, they did miss one or two (jailbreaks, for example: almost no-one saw that coming before it happened), but in general they managed to think of almost everything Reality has actually thrown at us, plus quite a bit more — and that was the goal.
I'm happy Reality hasn't been that sadistic with us so far. I try to remember this when updating my P(DOOM). But I'm still not going to rely on this going forwards — complacency is not an appropriate response to existential risk.
yeah I am very grateful for MIRI, and I don't think we should be complacent about existential risks (e.g. 50% P(doom) seems totally reasonable to me)
I think people should be jumping up and down and yelling if their P(DOOM) is even 0.01%. Extinction is far worse than just killing almost all of the 8.2 billion of us: it also destroys all our potential descendants. Even if you were absolutely certain that we were never going to go to the stars, the average mammalian species lasts O(1 million years), i.e. O(10,000) lifetimes. So a 0.01% P(DOOM) is at least as bad as a certainty of killing almost everyone alive now (but leaving a few to rebuild) — or far worse if there's any chance of us going to the stars and multiplying the loss astronomically by the volume of our forward lightcone.
There is a VAST difference in severity between merely "kill almost everyone" and "make the human species extinct". Like, at a minimum, 4 orders of magnitude, and quite possibly something more like 10-15 orders of magnitude. (That's why the DOOM is in capital letters.)
(This is a point that I think some people outside the LW/EA community miss about MIRI — they're techno-optimists. They're confident we're going to the stars, if we can just avoid killing ourselves first. That's why they keep talking about lightcones. So they believe it's at least 10^~14 times as bad, not 10^4: they're also multiplying by the number of habitable planets in the galaxy, and allowing for the fact that a more widespread species is less likely to get entirely wiped out, so is likely to last longer. Can't say as I disagree with them, and even if you think the chance of us going to the stars is only, say, 0.1%, that still utterly dominates the calculation.)
[P.S. Footnote added since Eli Tyre questioned my MIRI-hatted very-rough Fermi estimate: O(10^11) stars in the galaxy, say O(0.1%) have a habitable planet, but being widely distributed we last O(100) times as long = extra factor of O(10^~10). Yes, I am ignoring Dyson Swarms, Grabby Aliens, and a lot of other possibilities.]
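(For anyone who wants the arithmetic spelled out, here is that very-rough Fermi estimate as a few lines of Python; every constant is just the order-of-magnitude guess above, nothing more.)

```python
import math

# Very rough Fermi arithmetic for the factors above.
# Every constant is an order-of-magnitude guess, not a measurement.

species_lifetime_years = 1e6    # typical mammalian species lifespan
human_lifetime_years   = 1e2    # roughly one human lifetime
future_lifetimes = species_lifetime_years / human_lifetime_years    # ~1e4

stars_in_galaxy    = 1e11
habitable_fraction = 1e-3       # say ~0.1% have a habitable planet
longevity_bonus    = 1e2        # a widely-spread species lasts ~100x longer
galactic_factor = stars_in_galaxy * habitable_fraction * longevity_bonus  # ~1e10

total_multiplier = future_lifetimes * galactic_factor    # ~1e14

print(f"earth-bound factor: ~10^{math.log10(future_lifetimes):.0f}")
print(f"galactic factor:    ~10^{math.log10(galactic_factor):.0f}")
print(f"combined:           ~10^{math.log10(total_multiplier):.0f}")
```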
I think there's an implicit assumption of tiny discount rates here, which the majority of the human population probably doesn't share. If your utility function is such that you care very little about what happens after you die, and/or you mostly care about people in your immediate surroundings, your P(DOOM) needs to be substantially higher before you start caring significantly.
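(A toy sketch of what that discounting does to the far future; the 3% annual rate below is purely illustrative, not a claim about anyone's actual preferences.)

```python
# Toy illustration of how quickly exponential discounting erases the far future.
# The 3% annual rate is an arbitrary example, not anyone's actual number.
annual_rate = 0.03

for years in (10, 100, 1_000, 10_000):
    weight = (1 - annual_rate) ** years   # weight given to welfare `years` from now
    print(f"{years:>6} years out: weight {weight:.1e}")
```

With any everyday discount rate, everything past a few centuries rounds to zero, which is why astronomical-stakes arguments don't move someone with that utility function.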
This is not to mention Pascal's-mugging-type arguments, on which you shouldn't be moved to make significant life choices by a small and unreliable probability of some very large thing.
This is not to say that I'm against x-risk research – my P(DOOM) is about 60% or so. This is more just to say that I'm not sure people with a non-EA worldview should necessarily be convinced by your arguments.
Discounting is a cheap stand-in for three effects, none of which applies to P(DOOM):
a) difficulty of predicting the future. That extinction is forever is not a difficult prediction. (In other news, Generalissimo Francisco Franco is still dead.)
b) someone closer to the time (possibly even me) may handle that. But not if everybody is dead.
c) GDP growth rates. Which are zero if everybody is dead.
(Or to quote a bald ASI, even three million years into the future it remains true that: Everybody is dead, Dave.)
But yes, I should have pointed out that in this particular case, the normal assumption that you can safely ignore the far future and it will take care of itself does not apply.
Hmm, perhaps. My intuition behind discounting is different, but I'm not sure it's a crux here. I agree that extinction leads to 0 utility for everyone everywhere, but the point I was making was more that with a low discount rate the massive potential of humanity carries significant weight, while a high discount rate sends it to near zero.
In this worldview, near-extinction is no longer significantly better than extinction.
That aside, I think the stronger point is that if you only care about people near to you, spatially and temporally (as I think most people implicitly do), the thing you end up caring about is the death of maybe 10–1,000 people (discounted by your familiarity with them, so probably at most equivalent to ~100 deaths of nearby family) rather than 8 billion.
Some napkin maths as to how much someone with that sort of worldview should care: a 0.01% chance of doom in the next ~20 years, times those ~100 equivalent deaths, gives ~1% of an equivalent expected death in the next 20 years. 20 years is ~175,000 hours, which would make it about 7.5x less worrisome than driving according to this infographic.
Again, very napkin maths, but I think my basic point is that a 0.01% P(Doom) coupled with a non-longtermist, non-cosmopolitan view seems very consistent with "who gives a shit".
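(Here's that napkin maths spelled out; note the per-hour driving fatality rate is my own rough stand-in of ~4 × 10^-7, not a figure read off the infographic.)

```python
# Napkin maths, spelled out. The driving fatality rate per hour at the wheel
# is my own rough stand-in (~4e-7), not a figure taken from the infographic.
p_doom_20yr            = 1e-4           # 0.01% chance of doom in ~20 years
equivalent_near_deaths = 100            # "deaths of nearby family" equivalent
hours_in_20_years      = 20 * 365 * 24  # ~175,000

expected_near_deaths = p_doom_20yr * equivalent_near_deaths      # ~0.01
doom_risk_per_hour   = expected_near_deaths / hours_in_20_years  # ~6e-8

driving_risk_per_hour = 4e-7            # assumed, for comparison only
print(f"doom-equivalent risk per hour: {doom_risk_per_hour:.1e}")
print(f"driving is ~{driving_risk_per_hour / doom_risk_per_hour:.0f}x riskier per hour")
```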
Such a person is very badly miscalculating their evolutionary fitness — but then, what else is new?
Number of relations grows exponentially with distance, while genetic relatedness shrinks exponentially, so if you assume you have e.g. 1 sibling, 2 cousins, 4 second cousins, etc., each layer makes an equivalent fitness contribution and the total grows with the log of the population. log2(8 billion) ≈ 33. So a Fermi estimate of ~100 seems around right?
If anything, I get the impression this is overestimating how much people actually care, because there's probably an upper bound somewhere before this point.
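(The layer-by-layer version of the arithmetic above, under the same toy doubling/halving assumptions:)

```python
import math

# Layer-by-layer sketch of the relatedness arithmetic above. The "count doubles,
# relatedness halves, per layer" model is the same toy assumption as in the
# comment, not real demography.
population = 8e9
layers = math.ceil(math.log2(population))   # ~33 layers to cover everyone

total_sibling_equivalents = 0.0
for k in range(layers):
    relatives_at_layer = 2 ** k     # 1 sibling, 2 cousins, 4 second cousins, ...
    relatedness_vs_sib = 0.5 ** k   # relatedness relative to a sibling
    total_sibling_equivalents += relatives_at_layer * relatedness_vs_sib  # +1 per layer

print(f"layers needed: {layers}")
print(f"whole population ~ {total_sibling_equivalents:.0f} sibling-equivalents")
```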
If your species goes extinct, your genetic fitness just went to 0, along with everyone else's. Species-level evolution is also a thing.
Is the implication here that you should also be caring about genetic fitness as carried into the future? My basic calculation here was that in purely genetic terms, you should care about the entire earth's population ~33x as much as a sibling (modulo family trees are a bunch messier at this scale, so you probably care about it more than that).
I feel like at this scale the fundamental thing is that we are just straight up misaligned with evolution (which I think we agree on).
Indeed. I'm enough of a sociobiologist to sometimes put some intellectual effort into trying to be aligned with evolution, but I attempt not to overdo it.
Far more likely, they're not calculating their evolutionary fitness at all. Our having emotions and values that are downstream of evolution doesn't imply that we have a deeper goal of maximising fitness.
There were some old ideas about "tool AI", "oracle AI", and "myopic AI" as being "less dangerous" forms of AI. What we actually have, for now, is "AI that is bad at long-range tasks, and especially planning", plus a tremendous economic incentive to make that hole in its capabilities go away as fast as possible, and realistic graphs suggesting that hole may take another 2-5 years to close completely.
That's… not ideal, but better than the "no warning shots at all" worst case.
Yeah I mostly agree.
It's not that I expect the AIs of the next 2-5 years to be myopic in some strict sense, but rather that (relative to reasonable pre-LLM priors) I expect their capabilities to arise more out of (generalized) imitation, and to still be sort of globally incoherent (i.e. pursuing different conflicting objectives).
but this source of optimism gets weaker as RL becomes more important, and it sure does seem to be becoming more important.
Suppose AI assistants similar to Claude transform the economy. Now what? How is the risk of human extinction reduced?
Yes, alignment researchers have become more capable, and yes, the people trying to effect an end or a long pause of AI "progress" have become more capable, but so have those trying to effect more AI "progress".
Also, rapid economic change entails rapid changes in most human interpersonal relationships, which is hard on people and, according to some, is the main underlying cause of addictions. Addicts aren't likely to come to understand that AI "progress" is very dangerous even if they are empowered by AI assistants.
Your argument supports the assertion that if AI "progress" stopped now then we'd be better off than we would've been without any AI progress (in the past), but of course that is very different from being optimistic about the outcome of AI tech's continuing to "progress".
the hope is that
a) we get the transformative AI to do our alignment homework for us, and
b) that companies / society will become more concerned about safety (such that the ratio of safety to capabilities research increases a lot)
Increasing the volume of alignment research helps only if alignment research does not help capabilities researchers, but my impression is that most alignment research that has been done so far has helped capabilities researchers approximately as much as it has helped alignment researchers. Just because a line of research is described as "alignment research" does not automatically mean it helps alignment researchers more than capability researchers.
In summary, I don't consider your (a) a cause for hope, because the main problem is that increasing capabilities to the point of disaster (extinction or such) is easier than solving the alignment problem, and your (a) does not ameliorate that problem much, if at all, for the reason I just explained.
Slow takeoff implies that we'll get the stupidest possible transformative AI first.
I just want to say this is an amazing turn of phrase; I assume it's borrowed from the common phrasing that humans are the stupidest possible general intelligence?
Not quite. We're about 300,000 years of evolution past that, so call it 12,000 generations, under fairly strong evolutionary pressures.
...and I think this is the positive update. It feels very plausible, in a visceral way, that the first economically transformative AI systems could be, in many ways, really dumb.
I think this is right, but I think it's misleadingly encouraging.
I keep having to remind myself: Most of the risk does not come from the early transformative AIs. Most of the risk comes from the overwhelming superintelligences that come only a few years later.
Maybe we can leverage our merely transformative capabilities into a way to stick the landing with the overwhelming superintelligences, but that's definitely not a foregone conclusion.
yeah I agree, I think the update is basically just "AI control and automated alignment research seem very viable and important", not "Alignment will be solved by default"
Some people have been getting more optimistic about alignment. But from a skeptical / high p(doom) perspective, justifications for this optimism seem lacking.
"Claude is nice and can kinda do moral philosophy" just doesn't address the concern that lots of long horizon RL + self-reflection will lead to misaligned consequentialists (c.f. Hubinger)
So I think the casual alignment optimists aren't doing a great job of arguing their case. Still, it feels like there's an optimistic update somewhere in the current trajectory of AI development.
It really is kinda crazy how capable current models are, and how much I basically trust them. Paradoxically, most of this trust comes from lack of capabilities (current models couldn't seize power right now if they tried).
...and I think this is the positive update. It feels very plausible, in a visceral way, that the first economically transformative AI systems could be, in many ways, really dumb.
Slow takeoff implies that we'll get the stupidest possible transformative AI first. Moravec's paradox leads to a similar conclusion. Calling LLMs a "cultural technology" can be a form of AI denialism, but there's still an important truth there. If the secret of our success is culture, then maybe culture(++) is all you need.
Of course, the concern is that soon after we have stupid AI systems, we'll have even less stupid ones. But on my reading, the MIRI types were skeptical about whether we could get the transformative stuff at all without the dangerous capabilities coming bundled in. I think LLMs and their derivatives have provided substantial evidence that we can.