Epistemic status: n=3; this is more about building a culture of sharing and reasoning through numbers than about any particular estimates.
I've been somewhat surprised by how substantially the probability estimates on AI risk differ among smart, well-informed people, even when asked discreetly in high-trust settings. Among equally smart and highly informed people, I hear numbers ranging from as low as "rounds to 0%" to as high as 80%+.
In the spirit of discretion and not being alarmist in public, I've mostly kept to private discussions with smart people, exchanging information and estimates. But given that public discussion is now happening more broadly, I thought I'd make a quick post summarizing a recent conversation. I'm not sure the exact numbers or reasoning are particularly valuable, but maybe it's a helpful jumping-off point for others having these conversations and sharing estimates.
Last night I had a long conversation with two smart friends. The question, roughly: what odds do you put on AI causing a catastrophic outcome for humanity?
Interestingly, both people, before giving explicit odds, said some variant of "not very likely". But in the spirit of "words of estimative probability" (a useful read, by the way), "not very likely" turned out to mean 10% and 15%!
Here were our odds:
Very smart, very well informed ML engineer: 10%
Very smart, well informed for general public but non-programmer non-researcher businesswoman: 15%
My personal odds: 23%
I won't summarize the others' positions and reasoning without checking with them (they gave me permission to share their odds, but we didn't write down our discussion and it would take a while to coordinate on that).
But my position broke down like this:
1% chance of major industrial accident
2% chance of strong version of IIT is true and outcome is bad
20% chance of instrumental convergence
To break those down a bit...
1% chance of major industrial accident
Humanity has probably done somewhere between 2 and 6 major experiments that had some very small a priori chance to destroy all life on Earth.
The Trinity Test is probably the most famous:
Enrico Fermi offered to take wagers among the top physicists and military present on whether the atmosphere would ignite, and if so whether it would destroy just the state, or incinerate the entire planet. This last result had been previously calculated by Bethe to be almost impossible, although for a while it had caused some of the scientists some anxiety.
Some people speculated similarly that the Large Hadron Collider could have had a catastrophic outcome.
The experiments at the Large Hadron Collider sparked fears that the particle collisions might produce doomsday phenomena, involving the production of stable microscopic black holes or the creation of hypothetical particles called strangelets. Two CERN-commissioned safety reviews examined these concerns and concluded that the experiments at the LHC present no danger and that there is no reason for concern, a conclusion endorsed by the American Physical Society. The reports also noted that the physical conditions and collision events that exist in the LHC and similar experiments occur naturally and routinely in the universe without hazardous consequences, including ultra-high-energy cosmic rays observed to impact Earth with energies far higher than those in any human-made collider.
Obviously, both of those turned out fine. But with the increasing amount of computation and new capabilities made possible by machine learning and possible next paradigms for artificial intelligence, (1) we're collectively going to be able to run a lot more experiments along these lines, and (2) even if they're highly likely a priori to be safe, being wrong once is game over.
I put the odds that people will run such dangerous experiments more frequently as high, and the odds of any given individual experiment going off catastrophically as low; but multiplying them together, we might wind up with something like a 1% extinction risk from an industrial or experimental accident.
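One way to see how that multiplication plays out: even if each individual experiment is extremely safe a priori, running many of them compounds the risk. A minimal sketch, where both numbers are purely illustrative assumptions and not estimates from this post:

```python
# Sketch: cumulative risk from many individually-safe experiments.
# Both numbers below are illustrative assumptions, not real estimates.
p_per_experiment = 1e-4   # a priori chance a single experiment goes catastrophically wrong
n_experiments = 100       # number of such experiments run over some period

# Probability that at least one experiment goes wrong:
p_cumulative = 1 - (1 - p_per_experiment) ** n_experiments
print(f"{p_cumulative:.4f}")  # ≈ 0.0100, i.e. roughly 1%
```

The point is just that a one-in-ten-thousand risk, repeated a hundred times, already lands near 1% cumulatively.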
2% chance of strong version of IIT is true and outcome is bad
Integrated Information Theory is a theory about how consciousness is formed. There's been a lot of back-and-forth on it. Some very smart people (e.g. Tegmark) seem to think it's very possible. Some other very smart people (e.g. Aaronson) seem to be opposed.
I personally refer to a "weak version of IIT" and a "strong version of IIT" — the weak version is that a high IIT score is a necessary condition for consciousness. My personal odds on that aren't quantified, but are very high. It seems intuitively correct. The strong version of the hypothesis, in my phrasing, is that a high IIT score is both necessary and sufficient. I think that's far more unlikely, maybe around 3% likelihood. Arbitrarily, I then think the odds that "if strong IIT is true, the results are very bad and a subsequent agent would be unaligned on the current trajectory" is around 2/3rds. Hence, 2%.
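The arithmetic behind that 2% is just the product of the two subjective probabilities stated above:

```python
# Odds stated in the paragraph above:
p_strong_iit = 0.03       # strong IIT (high score both necessary and sufficient) is true
p_bad_given_true = 2 / 3  # given strong IIT is true, the outcome is bad / agent unaligned

p_iit_risk = p_strong_iit * p_bad_given_true
print(f"{p_iit_risk:.0%}")  # 2%
```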
20% chance of instrumental convergence
While not everyone has quantified their odds and evaluation process, my impression is that the general public, while not knowing the formal theory, seems far more worried about the IIT hypothesis being true and creating a Skynet-like entity, as a result of science fiction and film in popular culture.
Whereas serious researchers seem to think instrumental convergence doesn't require IIT to be true, and is far more likely.
One hypothetical example of instrumental convergence is provided by the Riemann hypothesis catastrophe. Marvin Minsky, the co-founder of MIT's AI laboratory, has suggested that an artificial intelligence designed to solve the Riemann hypothesis might decide to take over all of Earth's resources to build supercomputers to help achieve its goal.
Additionally, "odds of instrumental convergence creating a catastrophic outcome" seems to be the estimate on which smart people disagree most broadly.
Much smarter and more well-informed people than I have written about instrumental convergence.
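Summing the three components recovers the 23% headline number. Note the implicit assumption: adding the probabilities treats the three failure modes as roughly disjoint, which is only a fair approximation if they don't substantially overlap.

```python
# The three components of my breakdown above:
p_industrial = 0.01    # industrial / experimental accident
p_iit = 0.02           # strong IIT is true and the outcome is bad
p_instrumental = 0.20  # catastrophe via instrumental convergence

# Adding them treats the failure modes as (roughly) mutually exclusive.
p_total = p_industrial + p_iit + p_instrumental
print(f"{p_total:.0%}")  # 23%
```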
Additionally: Timelines on "Meaningful Recursive Self-Improvement"
Two final thoughts.
There seems to be wide disagreement on what year "meaningful recursive self-improvement" will start happening. And yes, "meaningful" is unfortunately qualitative. Trivial recursive self-improvement has been possible for a while. Perhaps someone could formalize this.
I'm in the "sooner rather than later, but not imminently" camp on this. I could explore it further. But anecdotally, people who think meaningful recursive self-improvement will happen later tend to be more optimistic, and people who think it will happen sooner tend to be more pessimistic.
Final thought: I think the industrial accident case is under-discussed...
If I were given conclusive evidence that IIT was false, and that extraordinarily robust safeguards against catastrophic outcomes from instrumental convergence were in place, such that the odds of both of those were 0%, I'd still be at a 1% catastrophic risk from the industrial case.
Stronger ML and AI systems will absolutely enable more dangerous and powerful scientific experiments, along the lines of the Trinity Test, the Large Hadron Collider, gain-of-function research done at rapid speed, and so on.
If the protocols or assumptions behind such an experiment are wrong even once, that could be the end of it. And in the world where IIT is false and instrumental convergence doesn't happen, we're likely not getting the highest theoretical power out of ML/AI systems; humans would still be firmly in charge, and this seems like something we could very much screw up.
So: n=3 odds, plus my personal breakdown of mine. While my odds are probably much better informed than the general public's, I don't think they're per se better than those of other well-informed people. But I see a lack of numbers and reasoning shared in public, and thought it would be useful to share these.
Just a reminder that Scott Aaronson pointed out 8 years ago that strong IIT is nonsense (https://scottaaronson.blog/?p=1799).
I've read it. There was some back-and-forth between him and Tegmark on the topic.
Aaronson calculates, Tegmark speculates.