AISN #12: Policy Proposals from NTIA’s Request for Comment and Reconsidering Instrumental Convergence

Dan H

Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required.


Policy Proposals from NTIA’s Request for Comment

The National Telecommunications and Information Administration (NTIA) publicly requested comments from academics, think tanks, industry leaders, and concerned citizens on how to govern AI for the public benefit. It asked 34 questions and received more than 1,400 responses. This week, we cover some of the most promising proposals found in the NTIA submissions.

Technical Proposals for Evaluating AI Safety

Several NTIA submissions focused on the technical question of how to evaluate the safety of an AI system. We review two areas of active research: red-teaming and transparency.

Red Teaming: Acting like an Adversary

Several submissions proposed government support for evaluating AIs via red teaming. In this evaluation method, a “red team” deliberately tries to make an AI system fail or behave in dangerous ways. By identifying risks from AI models, red teaming helps AI developers decide priorities for safety research and whether or not to deploy a new model.

Red teaming is a dynamic process. Attackers can vary their methods and search for multiple different undesirable behaviors in order to explore new potential risks. Static benchmarks, on the other hand, evaluate well-known risks by giving models a written test. For example, CAIS built a benchmark that measures whether AIs behave immorally in text-based scenarios.

Red teaming can be performed by either internal or external teams. For internal red teaming to be effective, internal auditors need independence to report their results without fear of retribution or interference from company executives. External auditors can complement internal efforts, as suggested by several NTIA submissions. But collaborating with external researchers requires strong information security. Otherwise, AI systems could be leaked beyond their intended audience, as happened with Meta’s LLaMA model.

Transparency: Understanding AIs From the Inside

Other submissions advocated research on understanding how AIs make decisions. Known as transparency, interpretability, or explainability, this has been a perennial goal of AI research, but more advanced AI systems have largely become more inscrutable over time.

The field has often fallen prey to false hopes, such as the idea that we can gain valuable information from post-hoc explanations of AI decisions. Suppose a corporate chatbot says that you should drink Coca-Cola because it’s “delicious” and “an American classic.” This might seem like a plausible explanation for its recommendation. But if you later learn that the chatbot’s creator has an advertising partnership with Coca-Cola, you won’t have any doubt about the real reason it wanted you to drink Coke. These false explanations can be worse than treating AIs as a black box, because they encourage people to trust AI decisions even though the AI’s actual reasoning remains opaque.

Beyond the inner workings of an AI system, transparency about the training and deployment process can be helpful. Hugging Face recommended using data sheets and data statements to disclose the data used during model training, which could help ensure that developers do not use datasets with known bias, misinformation, or copyrighted material. Similarly, Microsoft argued that AI providers should always identify videos, images, or text produced by their AIs.

Governance Proposals for Improving Safety Processes

Getting AI developers to implement the most advanced safety methods will be a challenge in itself. Startups often have a “move fast, break things” mindset that might work for free-to-use websites and smartphone applications, but could prove dangerous when building a technology that poses societal-scale risks. Several NTIA submissions therefore proposed governance mechanisms to ensure that the organizations developing and deploying AIs adopt best practices for safety.

Requiring a License for Frontier AI Systems

At the Senate hearing on AI, OpenAI CEO Sam Altman recommended “a new agency that licenses any effort above a certain scale of capabilities and can take that license away and ensure compliance with safety standards.” Several submissions, including the Center for AI Safety’s submission, supported a licensing system that could reduce pressure on companies to race ahead while cutting corners on safety, instead promoting best practices in AI safety.

Startups and open source developers would likely be unaffected by these requirements, as current proposals only require licenses for a small handful of “frontier” AI systems such as OpenAI’s GPT-4 and Google’s PaLM. Before training a frontier model, companies could be required to strengthen their information security so that adversaries cannot steal their models, and improve their corporate governance with gradual deployment of models, incident response plans, and internal reviews of potentially dangerous research before publication.

Licensing could also encourage the development of better techniques for evaluating AI safety. After a company develops an AI system, it could be required to affirmatively demonstrate the system’s safety, an application of the precautionary principle resembling how drug developers must prove their products safe to the FDA. This would incentivize companies to invest in model evaluation techniques such as red teaming and transparency.

If the federal government would rather not directly license models itself, Anthropic suggested that it could depend upon the expertise of third-party auditors. Auditors like the Big Three credit rating agencies are relied upon by the SEC to produce accurate analysis of financial products. After the Enron scandal, Congress supported financial auditors by passing the Sarbanes-Oxley Act, which, among other things, made it illegal for corporate executives to falsify information submitted to auditors. Lawmakers could take similar steps to support AI auditors with the technical expertise to ensure that AI developers are meeting our public goals for safety.

Unifying Sector-Specific Expertise and General AI Oversight

Many federal agencies have long-standing expertise in regulating particular applications of AI. The FTC recently warned against using AI deceptively to trick consumers into making harmful purchases or decisions. Similarly, the National Institute of Justice recently hosted a research symposium on how to productively use AI algorithms in the criminal justice system. Several NTIA submissions, including those from OpenAI and Google, highlighted the critical need in AI governance for federal agencies with expertise in these specific sectors.

AI systems with a wide range of capabilities, such as recent language models, might stretch the limits of a sector-specific approach to AI governance. The European Union has been considering this challenge recently, with many parliamentarians calling for policies that specifically address general purpose AI systems.

Several NTIA submissions argued that the United States might need to similarly adapt to a world of more general AI systems. For example, the Center for Democracy and Technology wrote in their submission that “existing laws such as civil rights statutes provide basic rules that continue to apply, but those laws were not written with AI in mind and may require change and supplementation to serve as effective vehicles for accountability.”

Does Antitrust Prevent Cooperation Between AI Labs?

Competitive pressure between AI labs might lead them to release new models too quickly, or with dangerous capabilities. To reduce that risk, we may want AI labs to cooperate with each other on safe AI development. This might include sharing safety testing results and methods, or it might involve active collaboration. OpenAI’s charter even includes a clause committing it to stop competing with and start assisting another project if that project comes close to building AGI first.

However, as Anthropic notes in its comment, it’s possible that by cooperating to improve safety, AI labs would run afoul of existing antitrust laws. They suggest that regulators should clarify antitrust guidelines as to when coordination between AI labs should be allowed.

Reconsidering Instrumental Convergence

As the field of AI safety grows, it’s important to continue questioning and refining our beliefs on the topic. One common argument is the instrumental convergence thesis, which holds that regardless of an agent's final goal, it will likely be rational for the agent to pursue certain subgoals, such as power-seeking and self-preservation, in service of that goal.

A new draft paper from CAIS questions this claim. The paper recognizes that power and self-preservation are useful for achieving many goals that we might care about. But it does not logically follow that agents will pursue power and self-preservation in most (or any) circumstances. Pursuing them might carry costs, including the opportunity cost of time and effort not spent on other strategies, and success might not be guaranteed. Further, AI agents could have aversions to gaining power or preserving themselves, perhaps as the result of intentional design by AI developers. The paper shows mathematically that if the desires of an agent are initialized randomly (in line with the so-called orthogonality thesis, which claims that any goals are compatible with any level of intelligence), there is no reason to think that the agent will be power-seeking or act to preserve itself. A simple analogy to humans applies here: some of our goals would be easier to attain if we were immortal or omnipotent, but few of us choose to spend our lives pursuing immortality or omnipotence.

This is not an argument that AI agents will never pursue power: the goals of AI systems won't be randomly chosen. Empirically, research shows that AI agents trained to maximize performance in text-based games often lie, cheat, and steal to improve their scores. From a higher-level perspective, agents that successfully self-propagate will have more influence in the future than other agents. There are many reasons to believe that AIs will often pursue unexpected and even dangerous goals; this paper simply argues that this would not be true of agents with randomly initialized goals.
