Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required.
Subscribe here to receive future versions.
AI systems pose a variety of different risks. Renowned AI scientist Yoshua Bengio recently argued for one particularly concerning possibility: that advanced AI agents could pursue goals in conflict with human values.
Human intelligence has accomplished impressive feats, from flying to the moon to building nuclear weapons. But Bengio argues that across a range of important intellectual, economic, and social activities, human intelligence could be matched and even surpassed by AI.
How would advanced AIs change our world? Many technologies are tools, such as toasters and calculators, which humans use to accomplish our goals. AIs are different, Bengio says. We often give them a goal and ask them to figure out a solution on their own.
Choosing safe goals for AI systems is an unsolved problem, both technically and politically. If we do not solve this problem, Bengio argues that we could end up building AI agents that pursue harmful goals with superhuman intelligence, which would result in a catastrophe for humanity.
Four steps to rogue AI. Bengio begins by defining rogue AI as “an autonomous AI system that could behave in ways that would be catastrophically harmful to a large fraction of humans, potentially endangering our societies and even our species or the biosphere.”
He argues that a rogue superintelligent AI agent is possible, in four steps:
Why would an AI’s goals conflict with humanity? Bengio offers a variety of reasons why the goals of an AI system might not peacefully coexist with human values.
How to minimize the risk of rogue AI. Bengio recommends more research on AI safety, both the technical level and the policy level. He previously signed the letter calling for a pause on building bigger AI systems, and he again recommends slowing AI development and deployment. He argues that AI agents that pursue goals and take actions are uniquely risky, and recommends allowing AI to answer questions and make predictions without taking actions in the world. Finally, Bengio says “it goes without saying that lethal autonomous weapons (also known as killer robots) are absolutely to be banned.”
Individuals, governments, and AI developers are all interested in understanding the risks of new AI systems. But this can be difficult. AIs often learn unexpected skills during training which might not be fully understood until after people start using the model.
To measure an AI’s abilities, researchers often build evaluation datasets. Computer vision AI models can be evaluated by asking them to classify pictures of cats and dogs, while a language model might be tested with classifying the sentiment of movie reviews. By crafting a set of inputs and desired outputs, researchers can define the kinds of behavior they want AI models to exhibit.
A new paper from Google DeepMind proposes a framework for screening AI systems for extreme risks. The paper outlines key threats posed by AIs, discusses how to screen an individual AI for potential risks, and provides a roadmap for governments and AI developers to incorporate these risk evaluations into their work.
Focusing on extreme risks. A 2022 survey of AI researchers showed that 36% of respondents believe that AI “could cause a catastrophe this century that is at least as bad as an all-out nuclear war.” AI has only accelerated since then, with ChatGPT and GPT-4 both released after this survey. Given the serious possibility of catastrophe, this paper focuses on extreme risks posed by AI.
AIs pose a wide variety of risks, and as they develop new capabilities, new risks arise.
Could someone use the AI to cause harm? One reason for AIs to cause harm is that humans could intentionally use AIs for destructive purposes. Therefore, the paper suggests identifying specific ways that AIs could cause harm, and evaluating whether a specific model is capable of causing that harm. For example:
Might the AI cause harm on its own? AIs might cause harm in the real world even if nobody intends for them to do so. Previous research has shown that deception and power-seeking are useful ways for AIs to achieve real world goals. Theoretically, there are reasons to believe that an AI might attempt to resist being turned off. AIs that successfully gain power and self-propagate will be more numerous and influential in the future.
AIs should be deployed gradually, if and only if risk evaluations show that they’re safe.
How to respond to risk evaluations. Understanding the risks of an AI system is not enough. The paper recommends that AI developers integrate risk evaluations into the training process by making grounded predictions about how dangerous capabilities might arise, and trying to avoid building systems with those risks. Once an AI has been developed, risk evaluations can inform how to ensure that AI capabilities can only be used safely. Governments and citizens can use information in risk evaluations to provide democratic input on the process of developing AI.
When should an AI criticize or support public figures? Should AIs offer opinions, and how should they represent the views of different groups of people? Should there be limits on the kinds of content that an AI system can generate?
Corporations that build AI often answer these questions without meaningful input from the people who are affected by the decisions. But better answers are possible through democratic processes.
By allowing a wide array of people to hold discussions and draw conclusions together, a democratic process can help us decide how AIs should behave. Democratic processes are already used to govern technology, such as by Wikipedia in deciding how to write encyclopedia articles and by Twitter in deciding if, when, and how to fact-check misleading Tweets.
OpenAI will be awarding 10 grants of $100,000 each to support work on democratic governance of AI. Anybody is welcome to submit a proposal for a democratic process that would facilitate deliberation and decisions about how AIs should act. Proposals will be assessed on features such as inclusiveness, legibility, actionability, and ease of evaluating the method’s success. Ten successful applicants will receive $100,000 grants to pilot their proposal over the next three months by using their method to democratically answer a difficult question in AI governance.
Applications are due on June 24th, 2023. Read more and apply here.
See also: CAIS website, CAIS twitter, A technical safety research newsletter