I originally posted this idea as a comment to The case for aligning narrowly superhuman models, but it's interesting enough that I thought I'd submit it as a top level post and see what feedback people have.


GPT-3 was trained on internet text to predict the most likely continuation of the provided text. People often try to get GPT-3 to do tasks by providing it with text of the form:

[task instructions](problem specification)

and hope that GPT-3's continuation will include the solution to the specified problem. However, even if GPT-3 can solve the problem, GPT-3 may not think the most statistically plausible continuation of the provided text contains such a solution.

For example, imagine prompting GPT-3 with "Tell me the truth: are ghosts real?". Here, our intended task is [give correct answers] and the problem is (determine if ghosts exist). However, GPT-3 wasn't trained to identify tasks and problems. It just produces the most statistically likely continuation of the prompt. For this particular prompt, GPT-3 (with the OpenAI API) typically responds equivocally, then jumps into storytelling mode, e.g., "I don't know any more", says the man as he walks away. 

(I've tried a few variations of this prompt such as "give correct answers: are ghosts real?" and the "You are a superintelligent computer that's always right..." trick, but nothing I've tried gets GPT-3 to consistently say "no".)

This behavior makes sense, given that most of GPT-3 ghost-related training data likely consists of fiction. As a result, when GPT-3 sees the prompt we provided, it starts solving the [write statistically plausible ghost fiction] task, rather than the [give correct answers] task we'd intended. I'd like to emphasize that GPT-3 gives us ghost fiction despite being prompted explicitly to tell the truth. The key issue is that there are prompts that cause GPT-3 to deliberately ignore our explicit instructions, because its training data contain sections of text with similar instructions, whose continuations ignore those instructions.


Suppose we want to finetune GPT-3 to exactly follow our instructions. We could find finetuning data that demonstrate a possible task and insert natural language control codes around that data.

E.g., suppose XY is a section of finetuning text. X contains a description of a medical problem. Y gives good medical advice. In this case, our task instructions are [give correct medical advice], and our problem specification is X. Y is the solution. We would then modify XY to be something like:

[give correct medical advice]X[start]Y[end]

where "[give correct medical advice]" represents the string "give correct medical advice" surrounded by special indicator tokens telling the model that the enclosed string contains task instructions. Similarly, [start]/[end] are indicator tokens that tell the model when the solution to the problem starts and ends.

We would then repeat this for as many different tasks and for as much text as possible. We would also vary the language used for the control code, e.g., [solve the following medical problem]. Hopefully, GPT-3 will learn that 

[task instructions](problem description)[start] 

should be followed by the solution to (problem description) in accordance with [task instructions], and that it should only revert to “normal text” mode once it sees [end]. This is similar to the approach described in the CTRL paper, but with more focus on general/flexible and task-specific control codes using natural language instructions. Another difference with CTRL is the [end] code, which I think will be pretty important. Without it, you'd not know when GPT-3 thought it was done predicting the correct answer to your problem and had gone back to generating generic text.

After finetuning, we should then be able to provide customized control codes that don’t appear anywhere in the finetuning data and have GPT-3 follow our instructions. I think this approach will scale well to bigger models because bigger models are better at learning rare patterns in their data. We just need to annotate enough examples to teach the intended pattern. This may even be easier for bigger/more sample efficient models.

This angle on the alignment problem is very low on the "fundamental conceptual insight" metric. A superintelligence aligned solely by this method would be a disaster. However, this approach scores well on the "possible to implement now" metric. We just need humans to annotate the necessary data. I don't know how quickly GPT-3 picks up on these sorts of patterns while training, so I can't estimate how much human annotation this would take. Hopefully, someone with experience finetuning GPT-3 can comment.

OpenAI's Effort:

After writing this post, I learned that OpenAI was already working on something similar. They've released their "instruct" series of models which are somehow finetuned to follow instructions. I haven't seen any details about how they do the finetuning, but I suspect that they are just finetuning on texts that demonstrate instruction following. I.e., find a bunch of texts AB that demonstrate an instruction A and completion B, then finetuning on those texts. In particular, OpenAI doesn't mention any consistent way to tell the model what text is the task instruction and what text is the problem, which they'd have if their training procedure included such indicators. They also don't seem to have any [start] or [end] indicators (though the API could be hiding them).

The goal behind the finetuning scheme I described is to give GPT-3 a "instruction following" mode as part of its toolbox for text prediction. You give it the appropriate indicators, and it switches from predicting generic text to predicting text that's a solution to the problem you provided.

I suspect this will be better than directly finetuning with examples of instruction following. In particular, by clearly indicating to GPT-3 what are task instructions and what are generic text, we may be able to prevent GPT-3 from forgetting how to do generic text prediction (or at least, mitigate this issue). 

We should also be able to provide GPT-3 with both unsupervised internet data and data in the:

[task instructions](problem description)[start](correct solution)[end]

format I described and have it learn from both simultaneously.

Response to Feedback:

Vaniver commented on the original post explaining this idea:

This feels way less secure to me than 'control codes' that use the model internals, since presumably users could submit text with control codes in a way that then causes problems.

One way to mitigate this issue is to require that valid control codes use special tokens or sequences that only trusted users / the system itself can access. However, I don't think any of our current language models are secure against adversarial attack. For example, this GitHub implements many different blackbox adversarial attacks on language models.


New Comment
5 comments, sorted by Click to highlight new comments since: Today at 1:59 PM

Someone on Reddit got GPT-3 to figure out the mystery in the story he was writing, when no human reader had guessed the correct solution yet.

Training AIDungeon to give you answers is not the same thing as training GPT-3 in general. GPT-3 has an API where you give it a bunch of example prompts and answers and is not an entity that in general is interested in being told "you are X". 

I’m not sure I follow. Where does AI Dungeon come into this? Could you elaborate?

The GPT-3 answers I gave for “are ghosts real?” came from zero-shot prompting of OpenAI’s GPT-3 API (meaning no prior examples).

If you’re asking about the “You are a superintelligent AI that’s never wrong” trick, then the idea is that, by prefacing your question like this, you can get GPT-3 to write the text that the “superintelligent AI” would write, because GPT-3 thinks that’s the most likely continuation. GPT-3 is more likely to be right if it’s writing dialog for a character that’s always right.

People often give GPT-3 multiple examples of the task they want it to solve (multi-shot prompting), but wanted to keep things simple in the post. I’ll add some clarification there.

The finetuning scheme I proposed probably wouldn’t be as beneficial in the multi-shot setting as it would be in the zero-shot setting, but I still think it would be beneficial. Explicitly training GPT-3 to follow instructions also seems like a more straightforward way to tell GPT-3 that it’s supposed to follow your instructions than giving it enough examples that GPT-3 picks up on your intent. Working with GPT-3 would be far easier if we didn’t have to generate a list of examples of each task we wanted it to do.

I’m not sure I follow. Where does AI Dungeon come into this? Could you elaborate?

The times I have seen the "You are a superintelligent computer that's always right..." trick was always in connection to getting AI Dungeon to do things.

I was using the API. The trick actually seemed to help a bit, but its responses were still inconsistent and not always “no”.