Paul Christiano paints a vivid and disturbing picture of how AI could go wrong, not through a sudden violent takeover, but through a gradual loss of human control as AI systems optimize for the wrong things and develop influence-seeking behaviors.
The original draft of Ajeya Cotra's report on biological anchors for AI timelines. The report includes quantitative models and forecasts, though the specific numbers were still in flux at the time. Ajeya cautions against wide sharing of specific conclusions, as they don't yet reflect Open Philanthropy's official stance.
"Wait, dignity points?" you ask. "What are those? In what units are they measured, exactly?"
And to this I reply: "Obviously, the measuring units of dignity are over humanity's log odds of survival - the graph on which the logistic success curve is a straight line. A project that doubles humanity's chance of survival from 0% to 0% is helping humanity die with one additional information-theoretic bit of dignity."
"But if enough people can contribute enough bits of dignity like that, wouldn't that mean we didn't die at all?" "Yes, but again, don't get your hopes up."
A story in nine parts about someone creating an AI that predicts the future, and multiple people who wonder about the implications. What happens when the predictions influence what future happens?
Will AGI progress gradually or rapidly? I think the disagreement is mostly about what happens before we build powerful AGI.
I think weaker AI systems will already have radically transformed the world. This is strategically relevant because I'm imagining AGI strategies playing out in a world where everything is already going crazy, while other people are imagining AGI strategies playing out in a world that looks kind of like 2018 except that someone is about to get a decisive strategic advantage.
Historically, people worried about extinction risk from artificial intelligence have not seriously considered deliberately slowing down AI progress as a solution. Katja Grace argues that this strategy should be considered more seriously, and that common objections to it are incorrect or exaggerated.
An open letter called for “all AI labs to immediately pause for at least 6 months the training of AI more powerful than GPT-4.” This 6-month moratorium would be better than no moratorium. I have respect for everyone who stepped up and signed it.
I refrained from signing because I think the letter is understating the seriousness of the situation and asking for too little to solve it.
A "good project" in AGI research needs:1) Trustworthy command, 2) Research closure, 3) Strong operational security, 4) Commitment to the common good, 5) An alignment mindset, and 6) Requisite resource levels.
The post goes into detail on what minimal, adequate, and good performance looks like.
A fictional story about an AI researcher who leaves an experiment running overnight.
Imagine if all computers in 2020 suddenly became 12 orders of magnitude faster. What could we do with AI then? Would we achieve transformative AI? Daniel Kokotajlo explores this thought experiment as a way to get intuition about AI timelines.
Ajeya Cotra, Daniel Kokotajlo, and Ege Erdil discuss their differing AI forecasts. Key topics include the importance of transfer learning, AI's potential to accelerate R&D, and the expected trajectory of AI capabilities. They explore concrete scenarios and how observations might update their views.
Daniel Kokotajlo presents his best attempt at a concrete, detailed guess of what 2022 through 2026 will look like, as an exercise in forecasting. It includes predictions about the development of AI, alongside changes in the geopolitical arena.
"Human feedback on diverse tasks" could lead to transformative AI, while requiring little innovation on current techniques. But it seems likely that the natural course of this path leads to full blown AI takeover.
Tom Davidson analyzes AI takeoff speeds – how quickly AI capabilities might improve as they approach human-level AI. He puts ~25% probability on takeoff lasting less than 1 year, and ~50% on it lasting less than 3 years. But he also argues we should assign some probability to takeoff lasting more than 5 years.
Richard Ngo lays out the core argument for why AGI could be an existential threat: we might build AIs that are much smarter than humans and that act autonomously to pursue large-scale goals; if those goals conflict with ours, they may take control of humanity's future. He aims to defend this argument in detail from first principles.
In the span of a few years, some minor European explorers (later known as the conquistadors) encountered, conquered, and enslaved several huge regions of the world. Daniel Kokotajlo argues this shows the plausibility of a small AI system rapidly taking over the world, even without overwhelming technological superiority.
In thinking about AGI safety, I’ve found it useful to build a collection of different viewpoints from people that I respect, such that I can think from their perspective. I will often try to compare what an idea feels like when I put on my Paul Christiano hat, to when I put on my Scott Garrabrant hat. Recently, I feel like I’ve gained a "Chris Olah" hat, which often looks at AI through the lens of interpretability.
The goal of this post is to try to give that hat to more people.
A vignette in which AI alignment turns out to be hard, society handles AI more competently than expected, and the outcome is still worse than hoped.
Katja Grace provides a list of counterarguments to the basic case for existential risk from superhuman AI systems. She examines potential gaps in arguments about AI goal-directedness, AI goals being harmful, and AI superiority over humans. While she sees these as serious concerns, she doesn't find the case for overwhelming likelihood of existential risk convincing based on current arguments.
Eric Drexler's CAIS model suggests that before we get to a world with monolithic AGI agents, we will already have seen an intelligence explosion due to automated R&D. This reframes the problems of AI safety and has implications for what technical safety researchers should be doing. Rohin reviews and summarizes the model.
GDP isn't a great metric for AI timelines or takeoff speed because the relevant events (like AI alignment failure or progress towards self-improving AI) could happen before GDP growth accelerates visibly. Instead, we should focus on things like warning shots, heterogeneity of AI systems, risk awareness, multipolarity, and overall "craziness" of the world.
This post tells a few different stories in which humanity dies out as a result of AI technology, but where no single source of human or automated agency is the cause.
The field of AI alignment is growing rapidly, attracting more resources and mindshare each year. As it grows, more people will be incentivized to misleadingly portray themselves or their projects as more alignment-friendly than they are. Adam proposes "safetywashing" as the term for this practice.
If we're thinking about AI risk clearly now, we shouldn't expect to predictably become much more worried as capabilities increase. Joe discusses why this predictable updating happens anyway, and how to avoid it.
Some people believe AI development is extremely dangerous, but are hesitant to directly confront or dissuade AI researchers. The author argues we should be more willing to engage in activism and outreach to slow down dangerous AI progress. They give an example of their own intervention with an AI research group.
The plan of "use AI to help us navigate superintelligence" is not just technically hard, but organizationally hard. If you're building AGI, your company needs a culture focused on high reliability (as opposed to, say, "move fast and break things."). Existing research on "high reliability organizations" suggests this culture requires a lot of time to develop. Raemon argues it needs to be one of the top few priorities for AI company leadership.
nostalgebraist argues that GPT-2 is a fascinating and important development for our understanding of language and the mind, despite its flaws. They're frustrated that many psycholinguists who previously studied language in detail now seem uninterested in looking at what GPT-2 tells us about language, instead focusing on whether it's "real AI".
Larger language models (LMs) like GPT-3 are certainly impressive, but nostalgebraist argues that their capabilities may not be quite as revolutionary as some claim. He examines the evidence around LM scaling and argues we should be cautious about extrapolating current trends too far into the future.
AI safety researchers have different ideas of what success would look like. This post explores five different AI safety "success stories" that researchers might be aiming for and compares them along several dimensions.
Nate Soares gives feedback to Joe Carlsmith on his paper "Is power-seeking AI an existential risk?". Nate agrees with Joe's conclusion that there is at least a 5% chance of catastrophe by 2070, but thinks 5% is much too low an estimate. Nate gives his own probability estimates and explains various points of disagreement.
We often hear "We don't trade with ants" as an argument against AI cooperating with humans. But we don't trade with ants because we can't communicate with them, not because they're useless – ants could do many useful things for us if we could coordinate. AI will likely be able to communicate with us, and Katja questions whether this analogy holds.
The date of AI takeover that matters is not the day an AI literally takes over. Instead, it's the point of no return: the day we AI risk reducers lose the ability to significantly reduce AI risk. This might happen years before classic milestones like "World GWP doubles in four years" and "Superhuman AGI is deployed."
Eliezer Yudkowsky recently criticized the OpenPhil draft report on AI timelines. Holden Karnofsky thinks Eliezer misunderstood the report in important ways, and defends the report's usefulness as a tool for informing (not determining) AI timelines.
The practice of extrapolating AI timelines based on biological analogies has a long history of not working. Eliezer argues that this is because the resource gets consumed differently, so base-rate arguments from resource consumption end up quite unhelpful in real life.
Timelines are inherently very difficult to predict accurately until we are much closer to AGI.
Orpheus16 shares his experience talking with ~60 congressional staffers about AI risk in May–June 2023. He found staffers were surprisingly open-minded about AI risks but often lacked knowledge. His guess is that the Overton window on AI policy is wide, that more coordination is needed on specific policy proposals, and that there are opportunities for more people to engage productively with policymakers on AI issues if done thoughtfully.