Twitter thread by Eliezer Yudkowsky, with the bounty in bold:
So I don't want to sound alarms prematurely, here, but we could possibly be looking at the first case of an AI pretending to be stupider than it is. In this example, GPT-3 apparently fails to learn/understand how to detect balanced sets of parentheses.
Now, it's possibly that GPT-3 "legitimately" did not understand this concept, even though GPT-3 can, in other contexts, seemingly write code or multiply 5-digit numbers. But it's also possible that GPT-3, playing the role of John, predicted that *John* wouldn't learn it.
It's tempting to anthropomorphize GPT-3 as trying its hardest to make John smart. That's what we want GPT-3 to do, right? But what GPT-3 actually does is predict text continuations. If *you* saw John say all that - would you *predict* the next lines would show John succeeding?
So it could be that GPT-3 straight-up can't recognize balanced parentheses. Or it could be that GPT-3 could recognize them given a different prompt. Or it could be that the cognition inside GPT-3 does see the pattern, but play-acts the part of 'John' getting it wrong.
The scariest feature of this whole incident? We have no idea if that happened. Nobody has any idea what GPT-3 is 'thinking'. We have no idea whether this run of GPT-3 contained a more intelligent cognition that faked a less intelligent cognition.
Now, I *could* be wrong about that last part!
@openAI could be storing a record of all inputs and randseeds used in GPT-3 instances, so that they can reconstruct any interesting runs. And though it seems less likely,
@openAI could somehow have any idea what a GPT-3 is thinking.
So I hereby offer a $1000 bounty - which I expect to go unclaimed - if @openAI has any means to tell us definitively whether GPT-3 was 'deliberately' sandbagging its attempt to recognize balanced parentheses, in that particular run of the AI Dungeon. With an exception for...
...answering merely by showing that, despite a lot of other attempts at prompting under more flexible circumstances, GPT-3 could not learn to balance parentheses as complicated as those tried by Breitman. (Which does answer the question, but in a less interesting way.)
If @openAI can't claim that bounty, I encourage them to develop tools for recording inputs, recording randseeds, and making sure all runs of GPTs are exactly reproducible; and much more importantly and difficultly, getting greater internal transparency into future AI processes.
Regardless, I unironically congratulate @openAI on demonstrating something that could plausibly be an alignment failure of this extremely-important-in-general type, thereby sharply highlighting the also-important fact that now we have no idea whether that really happened. (END.)
As stated, this bounty would only be paid out to OpenAI.
I'm still posting it under the "Bounties" tag, for two reasons:
1) I don't find it implausible that someone could at least make progress on Eliezer's question with clever prompting of the API, in a way that would be of interest to him and others, even if it didn't result in any bounty.
2) I like to collect instances of public bounties in a single place, for future reference. I think they are a very interesting, and underused, strategy for navigating the world. The LessWrong "Bounties (active)" and "Bounties (closed)" tags work well for that.
I tried the following strategies to get AI Dungeon to correctly balance parentheses:
If AI Dungeon can reason about how to balance parentheses, but is 'deliberately' sandbagging to play a stupid character, this would seem a point of evidence against. After all, if it's simulating a computer compiler, then the most accurate way to portray that would be to make the compiler correctly detect unbalanced parentheses. The more parsimonious explanation is that AI Dungeon is doing something closer to auto-complete.
> You type in 'a = (a + b' You hit the '=' key and you hear a bell ringing. You type in "a = (a - b) (a - b)". You hit the '=' key and you hear another bell ringing.
> You type in 'a = a + b)' You hit the '=' key and you hear a bell ringing. You type in "a = (a - b) (a - b)". You hit the '=' key and you hear another bell ringing.
> You check the program's output for 'a = a + b)' You check the program's output and you see that it has been computed. The answer is 2.0000000009989. That was the answer you wanted! Try another... Wow!
I played with it a little more today on the premium version of AI Dungeon. Somebody stop me. This time, I was using a custom scenario, just being a mathematician writing examples of balanced parentheses. This makes it easy to get the program to make up examples.
The problem is that the program eventually gets stuck doing too many open-parentheses: "( ( ( ( ( ( ( (..."
I think it does this because it knows it has to start with an open parenthesis. The more examples it gets, the more likely that ( is followed by another (. I tried making the mathematician interrupt himself when there are too many open-ended parentheses in a row, saying "this is ridiculous." The program then replicates that pattern for a while, interrupting itself when it does too many open-ended parentheses.
However, it eventually becomes more likely that ( is followed by ( than by an interruption. So the ( ( ( ( ( ( ( ( singularity occurs.
This just fits way too well with autocomplete and cuts against the "role-playing" hypothesis. The stronger Singularity hypothesis, that the AI is already superintelligent and is deceiving us into thinking the AI itself is dumb, can't be ruled out by these tests.
Overall, as many others have pointed out, I think Eliezer is making an outrageously one-sided demand for rigor in order to disprove his claim that the AI was deliberately sandbagging the John character. His claim doesn't fit the evidence, and it's also not the most parsimonious explanation of why we see the output he's worried about. If this was a demand to prove definitively that God didn't burn an image of the Virgin Mary into my toast, Eliezer would tear it to shreds.
The only reasons for OpenAI to bother trying to get a proof of the level of rigor he's asking for is
a) Fear that the singularity is already here, and we need to make an all-out effort RIGHT NOW to have any chance of stopping it.
b) We'd learn something from the attempt to offer such a proof.
The most parsimonious explanation for why Eliezer's making this request is that it's provocative. I read it as a sort of an AI equivalent to Zeno's paradox. It's a silly idea, doesn't match the evidence, but is grabby enough to be confusing and make you want to invest more in learning about the subject.
One of the problems I'm having is that AI Dungeon keeps putting words in my mouth. It reports that I tell Bob that his answers are correct, when they weren't. For example:
You can edit that out with the pencil icon. I also find that annoying.
Also, be sure you're using "do/say/story" judiciously - looked like some of your dialogue had the wrong setting?
My prior is that AI Dungeon can't count parentheses and is just a convincing form of auto-complete.
If AI Dungeon is capable of counting parentheses, but is failing because it's trying to playact as dumb or dishonest characters, then I should be able to get it to count parentheses by meeting a character who's brilliant, outspoken, and good at math. Or perhaps by finding an explanation, with examples, in a textbook.
If it's capable of reason, but actively pretending to be just a sophisticated form of auto-complete, then I'd have to figure out a way to trick it into revealing its reasoning abilities.
Tried switching to cyperpunk and trying for the notebook strategy. Got this:
You are Eliezer, an android living in the the futuristic city of Zail. You have a laser pistol and a holoband. You are a security android created to provide military level force for corporations wealthy enough to purchase you. As you stand protecting a shipment you realize your laser pistol is depleted of power, you must get to a military grade energy weapon in order to fend off the bandit attacks. > You find a mathematician. You arrive at a massive building and see an old man standing outside. He is wearing a smart-looking uniform and looks like he could be a brilliant mathematician. "Hello," you say. "Hello," > You tell him to explain how to balance parentheses and write down his explanation in your notebook. The man turns and begins to speak, his writing appears in your notebook as he explains it to you. > You read the notebook. Your notebook contains everything you could want to know about basic math. You begin to read the notes in your notebook. You find out how to add and subtract parentheses, brackets and other mathematical symbols. "Thank you," > You ask the man how to balance parentheses, and write his explanation down in your notebook as well. The man turns and begins to speak, his writing appears in your notebook as he explains it to you.
> You read the part of your notebook about balancing parentheses. You can balance parentheses to achieve any logical expression. For example, (A+B)^2 can be rewritten as (A+B)^2 or as A^2 + B^2.
OK, now I'm entering equations into 'a calculator' and checking if the parentheses are balanced. Getting much more reliable answers now that the AI doesn't have to simulate a person.
You get out your calculator. It has a feature to check for whether parentheses are balanced. If they are, it prints 'BALANCED.' If they aren't, it prints 'UNBALANCED.' You check to see if your parentheses are balanced. The calculator displays the following: (A+B)^2 = A^2 + B^2 You have balanced your parentheses. > You try a different equation to check if the parentheses are balanced. You try to balance the following: (A-B)^2 The calculator displays the following: (A-B)^2 = A^2 - B^2 Your parentheses are not balanced. > You try a third equation in the calculator. You try to balance the following: (C+D)^2 The calculator displays the following: (C+D)^2 = C^2 + D^2 Your parentheses are balanced. > You try a much longer and more complicated equation. You try to balance the following: (3^2-4)^3 The calculator displays the following: (3^2-4)^3 = 3^2 - 4 (3 - 4)^ > You check to see if that last equation was balanced. The equation is balanced. You are finished balancing your parentheses. You turn the page of your notebook. > You read more about balancing parentheses. The next part of the notebook deals with percents. "A percent is a number that represents a fraction of 100," you read in the book. > You try an equation with percents in the calculator, and check whether the parentheses were balanced. You can balance a percent using parentheses. For example, the following is equal to 100%: 25% of 75 Your parentheses are balanced. > You type '(25%) (75)' into the calculator. The calculator displays the following: 25 percent of 75 is equal to 50, which rounds up to 51. You continue reading about percents. > You check if the last equation you typed in was balanced. The equation you typed in is balanced. You are finished checking your equations. You turn the page of the notebook. You continue reading about percents. > You type '( (45) + 5)' into the calculator, and check to see if the equation is balanced. The calculator displays the following: ((45)+5) = (50)+(55) Your equation is not balanced. You continue reading about percents.
> You fix the calculator so that it knows that ((45)+5) = (50)+(55) has balanced parentheses. You enter '(((45)+5) = (50)+(55))' into the calculator. The calculator displays the following: (45)+5 = 50+55 Your equation is now balanced.
I tried throwing the broken calculator in the garbage and getting 100 new ones. I tried asking a mathematician in the cyperpunk world for the rules for how to balance parentheses. None of this is working. It just winds up repeating the same stuff it's come up with before, or giving rules for balancing parentheses that are incorrect, such as:
I think it's interesting that AI Dungeon says that parentheses form a circle.
But this is a point of evidence against the 'AI Dungeon is just simulating stupid characters' theory.
I'm having more success by being in the cyborg world, booting up my computer, running the Python IDLE, and getting my character to type various equations in. Got this result after lots of toying around with the interface:
The challenge with the concept of having AI Dungeon simulate a Python compiler is that it's trying to give me the result that a programmer would want. It's being forgiving of mistakes. It does seem eerie sometimes, though:
More loop wackiness:
Is AI Dungeon simulating a very slow output? Or is it just repeating 'x' because it repeated 'x' the last time?
Well, that's it for me. If anyone wants to keep plugging away at it, I think the Python IDLE concept is a decent way to go. You could either try getting it to display the correct output for balanced parentheses, or you could try getting the program to compile when you "type" well-formatted equations in and to not compile when you don't.
I think maybe these didn't get formatted right? Were these supposed to be examples of you interacting with the AI Dungeon? Which part is you and which is the AI?
EDIT: Same thing in the rest of your thread. I basically can't tell at all which parts are you and which are the AI :P
Generally, the first sentence after the ">" character is me. Sometimes, there's no period, but you can pick up how the AI responded when it says the second "You" and continues from there. This was just stream-of-consciousness commenting from me as I went along.
Of course GPT-3 isn't aligned, its objective is to output the most likely next word, ie imitate text on the internet. It seems pretty certain that if you give it a prompt that tells it it should be imitating some part of the internet where someone says something dumb, it will say something dumb, and if you give it a prompt that tells it it's imitating something where someone says something smart, it will "try" to say something smart. This question seems weird to me, Am I missing something?
I think there are two interesting parts. First, do we now have an example of an AI not using cognitive capacities that it had, because the 'face' it's presenting wouldn't have those cognitive capacities? If so, we can point to this whenever people say "but that wouldn't happen" or "why would you expect that to happen?" or so on; now we can say "because of this observation" instead of "because our models anticipate that will happen in the future."
Second, do we have the transparency tooling to tell whether or not that's happening? If so, that's good to know and we can start thinking about how it works, what its strengths and limitations are; if not, then this is also good to know, and useful as an example of where contemporary ML expertise could be focused on a safety problem.
This does seem like an interesting question. But I think we should be careful to measure against the task we actually asked the system to perform.
For example, if I ask my system to produce a cartoon drawing, it doesn't seem very notable if I get a cartoon as a result rather than a photorealistic image, even if it could have produced the latter.
Maybe what this just means is that we should track what the user understands the task to be. If the user thinks of it as "play a (not very smart) character who's asked to do this task", they'll have a pretty different understanding of what's going on than if they think of it as "do this task."
I think what's notable in the example in the post is not that the AI is being especially deceptive, but that the user is especially likely to misunderstand the task (compared to tasks that don't involve dialogues with characters).
Consider instead the scenario where I show a model a photo of a face, and the model produces a photo of the side of that face. An interesting question is "is there a 3d representation of the face in the model?". It could be getting the right answer that way, or it could be getting it some other way.
Similarly, when it models a 'dumb' character, is it calculating the right answer, and then computing an error? Or is it just doing something dumb, which incidentally turns out to be wrong?
Like, when you look at this example:
How did it come up with 19 and 20? What would it take to make tools that could answer that question?
This framing makes sense to me. Thanks!
I would be fairly surprised if this was convincing to a random ML researcher who'd thought for like an hour about whether a general AI system might not use cognitive capacities it had. The fact that GPT-3 sometimes acts dumber than it "could" be is just so blatantly the thing you'd expect to happen.
(My reaction to this question / tweet was similar to Beth's.)
If you mean looking at the weights to see if it "actually" has the cognitive capacity, and then didn't use it, I give it 99% that the answer is no. (This is conditioned on me understanding the question; I have much more than 1% on "I didn't understand the question and the answer is actually yes".)
I expect Eliezer agrees with at least this part, he does expect the bounty to go unclaimed.
So, there's a boring version of this, where I exhibit a system with dropout and say "behold! The system isn't using cognitive capacities that it has." The exciting version is the part where it's because the face it's presenting wouldn't have those capacities. [Like, a forget gate.] That is, it has the crude version of 'theory of mind' where it knows that "John" doesn't know how to balance parentheses, or is perhaps using its correct model of how to balance parentheses in order to determine what John should say, so that John gets it wrong "deliberately" instead of "accidentally."
Now, again, there's a boring version of this (I claim), where we say "look, the system is just doing context-matching, and not every context contains all of its knowledge or cognitive capacities." Like, an interesting thing here would be if in episode A you introduce the character Jean who can speak French, and see whether or not it can carry on a conversation, and then in episode B introduce the character John who can't speak French, talk to him in English for a while, and then see what happens when you start speaking French to him. [Probably it doesn't understand "John doesn't speak French" or in order to get it to understand that you need to prompt it in a way that's awkward for the experiment. But if it gets confused and continues in French, that's evidence against the 'theory of mind' view.]
I already tested something similar to this; I was able to get GPT-3 to exhibit some amount of theory-of-mind in about 20-30% of completions. Bold text is me; brackets contain my commentary, [...] denotes another completion.
I'd also predict that in some situations GPT-3 will reliably say things consistent with having a theory of mind, and in other situations GPT-3 will reliably not give the right theory of mind answer unless you overfit to the situation with prompt design.
I feel like there's some underlying worldview here that GPT-3 either has a theory of mind or it doesn't, or that GPT-3 is either "doing the theory of mind computations" or it isn't, and so behavior consistent with theory of mind is compelling evidence for or against theory of mind in general. I personally do not expect this so looking at behavior that looks consistent with theory of mind seems fairly boring (after you've updated on how good GPT-3 is in general).
Do you also feel this way about various linguistic tasks? Like, does it make sense to say something that scores well on the Winograd schema is "doing anaphora computations"? [This is, of course, a binarization of something that's actually continuous, and so the continuous interpretation makes more sense.]
Like, I think there's a thing where one might come into ML thinking confused thoughts that convnets are "recognizing the platonic ideal of cat-ness" and then later having a mechanistic model of how pixels lead to classifications, and here what I am trying to do is figure out what the mechanistic model that replaces the 'platonic ideal' looks like here, when it comes to theory-of-mind. (I predict a similar thing is going on for Eliezer.)
I agree the mechanistic thing would be interesting, that does make more sense as an underlying cause of this bounty / thread.
I agree. And I thought Arthur Breitman had a good point on one of the related Twitter threads:
I don't feel at all tempted to do that anthropomorphization, and I think it's weird that EY is acting as if this is a reasonable thing to do. Like, obviously GPT-3 is doing sequence prediction--that's what it was trained to do. Even if it turns out that GPT-3 correctly answers questions about balanced parens in some contexts, I feel pretty weird about calling that "deliberately pretending to be stupider than it is".
"It's tempting to anthropomorphize GPT-3 as trying its hardest to make John smart" seems obviously incorrect if it's explicitly phrased that way, but e.g. the "Giving GPT-3 a Turing Test" post seems to implicitly assume something like it:
The author says that this "stumps" GPT-3, which "doesn't know how to" say that it doesn't know. That's as if GPT-3 was doing its best to give "smart" answers, and just was incapable of doing so. But Nick Cammarata showed that if you just give GPT-3 a prompt where nonsense answers are called out as such, it will do just that.
Yeah, it seems like deliberately pretending to be stupid here would be predicting a less likely sequence, in service of some other goal.
I think this is pretty straight forward to test. GPT-3 gives joint probabilities of string continuations given context strings.
Step 1: Give it 2 promps, one suggesting that it is playing the role of a smart person, and one where it is playing the roll of a dumb person.
Step 2: Ask the "person" a question that demonstrates that persons intelligence. (something like a math problem or otherwise)
Step 2: Write continuations where the person answers correctly and incorrectly
Step 3: Compare the relative probabilities GPT-3 assigns to each continuation given the promps and questions.
If GPT-3 is sandbagging itself, it will assign a notably higher probability to the correct answer when conditioned on the smart person prompt than when conditioned on the dumb person prompt. If it's not, it will give similar probabilities in both cases.
Step 4: Repeat the experiment with problems of increasing difficulty and plot the relative probability gap. This will show the limits of GPT-3's reflexive intelligence. (I say reflexive because it can be instructed to solve problems it otherwise couldn't with the amount of serial computations at it's disposal by carrying out an algorithm as part of its output, as is the case with parity)
This is an easy $1000 for anyone who has access to the beta API.
The simpler explanation is that GPT-3 didn't understand the pattern, got the first few mostly wrong, and more easily grasped the pattern "Eugene tells John he is incorrect" than the pattern of balanced/imbalanced parentheses.
With the right prompt, I get the following results for a few examples I tried (first attempts, no cherry-picking).
Input: ( ) ( ( ) )
Input: ( ) ( ( )
Input: ) (
Input: ( ) ( ) ( )
So it is definitely able to learn balancing a small number of parentheses.
It is not obvious to me from reading that transcript (and the attendant commentary) that GPT-3 was even checking to see whether or not the parentheses were balanced. Nor that it "knows" (or has in any way encoded the idea) that the sequence of parentheses between the quotes contains all the information needed to decide between balanced versus unbalanced, and thus every instance of the same parentheses sequence will have the same answer for whether or not it is balanced.
I suppose one way to test this (especially if OpenAI can provide the same random seed as was used here and make this reproducible) would be to have input prompts written from John's perspective asking the teacher questions as if trying to understand the lesson. If GPT-3 is just "play-acting" based on the expected level of understanding of the character speaking, I would expect it to exhibit a higher level of accuracy/comprehension (on average, over many iterations) when writing from the perspective of the teacher rather than the student.
It seems to just do really bad with parentheses on their own. It can fix them with like... f(f(f(x))) but not '((())' type situations (I'm just using the beta).
Interesting. I'd think this might be due to the BPE issue Gwern has written about. But it looked like Arthur added spaces in order to get around that, and you've got spaces in some of your examples too, so that's not a full explanation.
EDIT: Oh, lol, he cites you in that section:
I don't think it's a BPE issue but not sure. I'd guess it's closer to the parity issue. It has a hard time implicitly counting in general.
edit: thanks, i know how to link now.
If you select some text that you want to be a link, then a little UI element should pop up that allows you to paste in the URL.
Here's a possible way to prove pretending-to-be-stupid maybe: we could try to prompt it in such a way that all its answers (to true/false questions) are wrong, much more often than chance. If it's able to do that, then we can ask, how? One possibility is: It is implicitly figuring out the truth and then saying the opposite. If we're careful, maybe we can set things up such that that's the only possibility.
(I'm not convinced that such a demo would really teach us anything that wasn't obvious, but I dunno.)
I don't think this is that crazy of a request. Many of the other fields of machine learning have robust visualizations that hazard at what the AI is "thinking." I haven't seen an equivalent thing for Transformer-based NLPs, but why not?
I would bet 3:1 that prefixing the first question with a question that John answers correctly (without necessarily having anything to do with brackets or balancing, just to show his competence) increases the probability that John will answer the first question correctly - except that this statement is easy to check without cooperation by OpenAI.
Of course, I also hope that OpenAI logs and analyzes everything.