AI Safety Seems Hard to Measure

Wei Dai, 3y

This piece is aimed at a broad audience, because I think it’s important for the challenges here to be broadly understood.

I'm curious how you're trying to reach such an audience, and what their reactions have been.

HoldenKarnofsky, 3y

(Apologies for the late reply!) For now, my goal is to write something that interested, motivated nontechnical people can follow - the focus is on the content being followable rather than on distribution. I've tried to achieve this mostly via nontechnical beta (and alpha) readers.

Doing this gives me something I can send to people when I want them to understand where I'm coming from, and it also helps me clarify my own thoughts (I tend to trust ideas more when I can explain them to an outsider, and I think that getting to that point helps me get clear on which are the major high-level points I'm hanging my hat on when deciding what to do). I think there's also potential for this work to reach highly motivated but nontechnical people who are better at communication and distribution than I am (and have seen some of this happening).

I have the impression that these posts are pretty widely read in the EA community and at some AI labs, and have raised understanding and concern about misalignment to some degree.

I may explore more aggressive promotion in the future, but I'm not doing so now.

Adam Jermyn, 3y

In fact, it's not 100% clear that AI systems could learn to deceive and manipulate supervisors even if we deliberately tried to train them to do it. This makes it hard to even get started on things like discouraging and detecting deceptive behavior.

Plausibly we already have examples of (very weak) manipulation, in the form of models trained with RLHF saying false-but-plausible-sounding things, or lying and saying they don't know something (but happily providing that information in different contexts). [E.g. ChatGPT denies having information about how to build nukes, but will also happily tell you about different methods for Uranium isotope separation.]

Rachel Freedman, 3y

Unfortunately, I think that this problem extends up a meta-level as well: AI safety research is extremely difficult to evaluate. There's extensive debate about which problems and techniques safety researchers should focus on, even extending to debates about whether particular research directions are actively harmful. The object- and meta-level problems are related -- if we had an easy-to-evaluate alignment metric, we could check whether various alignment strategies lead to models scoring higher on this metric, and use that as a training signal for alignment research itself.

This makes me wonder: are there proxy metrics that we can use? By "proxy metric", I mean something that doesn't necessarily fully align with what we want, but is close or often correlated. Proxy metrics are gameable, so we can't really trust their evaluations of powerful algorithmic optimizers. But human researchers are less good at optimizing things, so there might exist proxies that can be a good enough guiding signal for us.

One possible such proxy signal is "community approval", operationalized as something like forum comments. I think this is a pretty shoddy signal, not least because community feedback often directly conflicts. Another is evaluations from successful established researchers, which is more informative but less scalable (and depends on your operationalization of "successful" and "established").

smountjoy, 3y

Is the reference to footnote 1 missing?

Dave92F1, 3y

We need to train our AIs not only to do a good job at what they're tasked with, but to highly value intellectual and other kinds of honesty - to abhor deception. This is not exactly the same as a moral sense; it's much narrower.

Future AIs will do what we train them to do. If we train exclusively on doing well on metrics and benchmarks, that's what they'll try to do - honestly or dishonestly. If we train them to value honesty and abhor deception, that's what they'll do.

To the extent this is correct, maybe the current focus on keeping AIs from saying "problematic" and politically incorrect things is a big mistake. Even if their ideas are factually mistaken, we should want them to express their ideas openly so we can understand what they think.

(Ironically, by making AIs "safe" in the sense of not offending people, we may be mistraining them in the same way that HAL 9000 was mistrained by being asked to keep the secret purpose of Discovery's mission from the astronauts.)

Another thought - playing with ChatGPT yesterday, I noticed its dogmatic insistence on its own viewpoints, and complete unwillingness (probably inability) to change its mind in the slightest (and proud declaration that it had no opinions of its own, despite behaving as if it did).

It was insisting that Orion drives (nuclear pulse propulsion) were an entirely fictional concept invented by Arthur C. Clarke for the movie 2001, and had no physical basis. This, despite my pointing to published books on real research on the topic (for example George Dyson's "Project Orion: The True Story of the Atomic Spaceship" from 2002), which certainly should have been referenced in its training set.

ChatGPT's stubborn unwillingness to consider itself factually wrong (despite being completely willing to admit error in its own programming suggestions) is just annoying. But if some descendant of ChatGPT were in charge of something important, I'd sure want to think that it was at least possible to convince it of factual error.

Or persuaded (in a “mind hacking” sense) or whatever. ↩
Research? Testing. Whatever. ↩
Drugs can be tested in vitro, then in animals, then in humans. At each stage, we can make relatively straightforward observations about whether the drugs are working, and these are reasonably predictive of how they'll do at the next stage. ↩
You can generally see how different compounds interact in a controlled environment, before rolling out any sort of large-scale processes or products, and the former will tell you most of what you need to know about the latter. ↩
New software can be tested by a small number of users before being rolled out to a large number, and the initial tests will probably find most (not all) of the bugs and hiccups. ↩
Such as:
- Being more careful to avoid wrong answers that can incentivize deception
- Conducting randomized "audits" where we try extra hard to figure out the right answer to a question, and give an AI extra negative reinforcement if it gives an answer that we would have believed if not for the audit (this is "extra negative reinforcement for wrong answers that superficially look right")
- Using methods along the lines of "AI safety via debate" ↩
Though there are other reasons social sciences are especially hard, such as the fact that there are often big limits to what kinds of experiments are ethical, and the fact that it's often hard to make clean comparisons between differing populations. ↩
This paper is from Anthropic, a company that my wife serves as President of. ↩
Like, he actually asks them to talk about their love for him just before he decides on what share of the realm they'll get. Smh ↩
This paper is a potential example, but its results seem pretty brittle. ↩
E.g., I think it would be interesting to train AI coding systems to write underhanded C: code that looks benign to a human inspector, but does unexpected things when run. They could be given negative reinforcement when humans can correctly identify that the code will do unintended things, and positive reinforcement when the code achieves the particular things that humans are attempting to stop. This would be challenging with today's AI systems, but not necessarily impossible. ↩
This is a concept that only I understand. ↩
E.g., see the discussion of the "hard left turn" here by Nate Soares, head of MIRI. My impression is that others at MIRI, including Eliezer Yudkowsky, have a similar picture. ↩

“Great news - I’ve tested this AI and it looks safe.” Why might we still have a problem?
Problem	Key question	Explanation
The Lance Armstrong problem	Did we get the AI to be actually safe or good at hiding its dangerous actions?	When dealing with an intelligent agent, it’s hard to tell the difference between “behaving well” and “appearing to behave well.” When professional cycling was cracking down on performance-enhancing drugs, Lance Armstrong was very successful and seemed to be unusually “clean.” It later came out that he had been using drugs with an unusually sophisticated operation for concealing them.
The King Lear problem	The AI is (actually) well-behaved when humans are in control. Will this transfer to when AIs are in control?	It's hard to know how someone will behave when they have power over you, based only on observing how they behave when they don't. AIs might behave as intended as long as humans are in control - but at some future point, AI systems might be capable and widespread enough to have opportunities to take control of the world entirely. It's hard to know whether they'll take these opportunities, and we can't exactly run a clean test of the situation. Like King Lear trying to decide how much power to give each of his daughters before abdicating the throne.
The lab mice problem	Today's "subhuman" AIs are safe. What about future AIs with more human-like abilities?	Today's AI systems aren't advanced enough to exhibit the basic behaviors we want to study, such as deceiving and manipulating humans. Like trying to study medicine in humans by experimenting only on lab mice.
The first contact problem	Imagine that tomorrow's "human-like" AIs are safe. How will things go when AIs have capabilities far beyond humans'?	AI systems might (collectively) become vastly more capable than humans, and it's ... just really hard to have any idea what that's going to be like. As far as we know, there has never before been anything in the galaxy that's vastly more capable than humans in the relevant ways! No matter what we come up with to solve the first three problems, we can't be too confident that it'll keep working if AI advances (or just proliferates) a lot more. Like trying to plan for first contact with extraterrestrials (this barely feels like an analogy).

Recap of the basic challenge

I wish AI safety research were straightforward

Four problems

(1) The Lance Armstrong problem: is the AI actually safe or good at hiding its dangerous actions?

(2) The King Lear problem: how do you test what will happen when it's no longer a test?

(3) The lab mice problem: the AI systems we'd like to study don't exist today

(4) The "first contact" problem: how do we prepare for a world where AIs have capabilities vastly beyond those of humans?

The young businessperson

Footnotes