I was impressed by GPT-2, to the point where I wouldn't be surprised if a future version of it could be put to pivotal use with existing protocols.

Consider generating half of a Turing test transcript, the other half being supplied by a human judge. If this passes, we could immediately implement an HCH of AI safety researchers solving the problem if it's within our reach at all. (Note that training the model takes much more compute than generating text.)

This might not be the first pivotal application of language models that becomes possible as they get stronger.

It's a source of superintelligence that doesn't automatically run into utility maximizers. It sure doesn't look like AI services, lumpy or no.

Ofer (5y):

I don't see how GPT-2 is a step forward towards passing strong versions of the Turing test.

It's a source of superintelligence that doesn't automatically run into utility maximizers.

I'm not familiar with the details of GPT-2 and maybe I'm interpreting the definition of "utility maximizer" incorrectly, but isn't GPT-2 some neural network that is trained to minimize a loss function that corresponds to predicting the next word correctly?

A Turing test transcript, or a story about one, is something you might imagine finding on the internet. Therefore, I would expect a good language model to be able to predict what a Turing test subject would say next after some partial transcript. If the judge and the generator alternate in continuing the transcript, the judge shouldn't be able to tell whether the generator is actually a human.
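The alternating protocol described above can be sketched roughly as follows. This is only an illustration: `run_transcript`, `scripted_judge`, and `placeholder_subject` are hypothetical names, and the placeholder subject stands in for a language model that would continue the transcript.

```python
from typing import Callable

# A speaker maps the transcript so far to its next utterance.
Speaker = Callable[[str], str]


def run_transcript(judge: Speaker, subject: Speaker, rounds: int = 3,
                   header: str = "What follows is the transcript of a Turing test.") -> str:
    """Let the judge and the subject take turns extending one shared transcript."""
    transcript = header + "\n"
    for _ in range(rounds):
        # Each speaker sees everything written so far, including the header.
        transcript += "Judge: " + judge(transcript).strip() + "\n"
        transcript += "Subject: " + subject(transcript).strip() + "\n"
    return transcript


def scripted_judge(transcript: str) -> str:
    """Hypothetical stand-in for the human judge: asks numbered questions."""
    turn = transcript.count("Judge:")
    return f"Question {turn + 1}?"


def placeholder_subject(transcript: str) -> str:
    """Hypothetical stand-in for the model; a real one would predict the next reply."""
    return "I am not sure."
```

Calling `run_transcript(scripted_judge, placeholder_subject, rounds=2)` yields a header line followed by four alternating turns. The only point of the sketch is that the generator never needs anything beyond "predict what comes next in this text".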

A utility maximizer chooses actions to maximize its prediction of utility. A neural net chooses weight adjustments to maximize its score adjustment. There are no models of the world involved in the latter, no actions including manipulating a human or inventing exciting proteins.

At the time we run the Turing test, the model is done training. The part where intelligence comes from is that the model can tell how intelligent a speaker is because that allows it to better predict what it says next. It would guess that the speaker will say something that sounds intelligent next. If it is bad at this, it will sound like it's trying to be Deeply Wise, buzzwords included. If it is good enough at predicting intelligent speech, it will do that.

Some of what's written on the internet is intelligent. Becoming able to predict such writings is incentivized during training. Some combination of neural building blocks is bound to find patterns that are helpful.

Surely, with a bunch of transcripts of ELIZA sessions it would come to be able to replicate them? Humans are only finitely more complex, and some approximation ought to be simple.

I'm pretty sure that GPT-2 would fail to complete even the sentence: "if we sort the list [3,1,2,2] we get [1,". It's a cool language model but can it do even modest logic-related stuff without similar examples in the training data?

There are no models of the world involved in the latter

The weights of the neural network might represent something that corresponds to an implicit model of the world.

no actions including manipulating a human or inventing exciting proteins.

Putting aside the risk of inner optimizers, suppose we get to superintelligence-level capabilities, and it turns out that the training process produced a goal system such that the neural network yields some malign output that causes many future invocations of the neural network (indistinguishable from the current invocation) in which a perfect loss function value is achieved.

It’s a cool language model but can it do even modest logic-related stuff without similar examples in the training data?

Have you looked at the NLP tasks they evaluated it on?

Have you looked at the NLP tasks they evaluated it on?

Yes. Nothing I've seen suggests GPT-2 would successfully solve simple formal problems like the one I mentioned in the grandparent (unless a very similar problem appears in the training data - e.g. the exact same problem but with different labels).

I don't know why you would think that would be such a barrier. You don't need Transformers at all to do analogical reasoning, and both the CoQA and SQuAD results suggest at least some 'modest logic-related stuff' is going on. If you put your exact sample into the public/small GPT-2 model, it'll even generate syntactically correct list completions and additional lists which are somewhat more sorted than not.

Ofer (5y):

We might be interpreting "modest logic-related stuff" differently - I am thinking about simple formal problems like sorting a short list of integers.

I wouldn't be surprised if GPT-2 (or its smaller version) is very capable at completing strings like "[1,2," in a way that is merely syntactically correct. Publicly available texts on the internet probably contain a lot of comma-separated number lists in brackets. The challenge is for the model to have the ability to sort numbers (when trained only to predict the next word in internet texts).

However, after thinking about it more I am now less confident that GPT-2 would fail to complete my above sentence with a correctly sorted list, because for any two small integers like 2 and 3 it is plausible that the training data contains more "2,3" strings than "3,2" strings.

Consider instead the following problem:

"The median number in the list [9,2,1,6,8] is "

I'm pretty sure that GPT-2 would fail at least 1/5 of the time to complete such a sentence (i.e. if we query it multiple times and each time the sentence contains small random integers).
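A rough sketch of how the test proposed above could be run, for concreteness. Everything here is an assumption for illustration: `make_median_prompt` and `failure_rate` are hypothetical helpers, `complete` stands for whatever function returns the model's continuation of a prompt, and the lists are restricted to five distinct single-digit integers so that the median is always one digit.

```python
import random
import statistics


def make_median_prompt(rng, length=5, low=1, high=9):
    """Build one prompt with small random integers and return it with the true median."""
    numbers = rng.sample(range(low, high + 1), length)  # distinct single-digit integers
    listing = ",".join(str(n) for n in numbers)
    prompt = f"The median number in the list [{listing}] is "
    return prompt, statistics.median(numbers)


def failure_rate(complete, trials=100, seed=0):
    """Fraction of prompts whose completion does not start with the correct median.

    `complete` is a hypothetical stand-in for querying the language model.
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        prompt, answer = make_median_prompt(rng)
        if not complete(prompt).strip().startswith(str(answer)):
            failures += 1
    return failures / trials
```

For the list in the example, `statistics.median([9, 2, 1, 6, 8])` is 6, so the claim is that `failure_rate` would come out at 0.2 or higher for GPT-2.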

GPT-2 works by deterministically fetching the probability distribution over the next token, then sampling from it. It is plausible that the probability it assigns to 6 is no larger than 80%, but it's simple enough to postprocess every probability larger than 50% to 100%. (This isn't always done because, when completing a list prefix of size 4, it would always produce an infinite list: the probability of another comma is more than 50%.)
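The postprocessing step can be illustrated with a toy sketch. The distribution below is made up, not real GPT-2 output, and `sample_token` is a hypothetical helper: if any token's probability exceeds the threshold it is returned deterministically, otherwise the token is sampled as usual.

```python
import random


def sample_token(probs, rng, threshold=None):
    """Pick the next token from a dict mapping token -> probability.

    With a threshold set, a token whose probability exceeds it is always chosen,
    which amounts to rounding that probability up to 100%.
    """
    if threshold is not None:
        top = max(probs, key=probs.get)
        if probs[top] > threshold:
            return top
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up next-token distribution after "The median number in the list [9,2,1,6,8] is ".
toy_distribution = {"6": 0.7, "8": 0.2, "2": 0.1}
```

With `threshold=0.5`, every call on `toy_distribution` returns "6", whereas plain sampling would return something else about 30% of the time.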

DeepMind has shown that Transformers trained on natural text descriptions of math problems can solve them at rates well above chance: "Analysing Mathematical Reasoning Abilities of Neural Models", Saxton et al 2019:

Mathematical reasoning---a core ability within human intelligence---presents some unique challenges as a domain: we do not come to understand and solve mathematical problems primarily on the back of experience and evidence, but on the basis of inferring, learning, and exploiting laws, axioms, and symbol manipulation rules. In this paper, we present a new challenge for the evaluation (and eventually the design) of neural architectures and similar system, developing a task suite of mathematics problems involving sequential questions and answers in a free-form textual input/output format. The structured nature of the mathematics domain, covering arithmetic, algebra, probability and calculus, enables the construction of training and test splits designed to clearly illuminate the capabilities and failure-modes of different architectures, as well as evaluate their ability to compose and relate knowledge and learned processes. Having described the data generation process and its potential future expansions, we conduct a comprehensive analysis of models from two broad classes of the most powerful sequence-to-sequence architectures and find notable differences in their ability to resolve mathematical problems and generalize their knowledge.

And this sounds like goal post moving:

unless a very similar problem appears in the training data—e.g. the exact same problem but with different labels

GPT-3 can do arithmetic with zero arithmetic training: https://arxiv.org/pdf/2005.14165.pdf#page=21

First we evaluate GPT-3 in the few-shot setting, for which results are shown in Figure 3.10. On addition and subtraction, GPT-3 displays strong proficiency when the number of digits is small, achieving 100% accuracy on 2 digit addition, 98.9% at 2 digit subtraction, 80.2% at 3 digit addition, and 94.2% at 3-digit subtraction. Performance decreases as the number of digits increases, but GPT-3 still achieves 25-26% accuracy on four digit operations and 9-10% accuracy on 5-digit operations, suggesting at least some capacity to generalize to larger numbers of digits. GPT-3 also achieves 29.2% accuracy at 2 digit multiplication, an especially computationally intensive operation. Finally, GPT-3 achieves 21.3% accuracy at single digit combined operations (for example, 9*(7+5)), suggesting that it has some robustness beyond just single operations. As Figure 3.10 makes clear, small models do poorly on all of these tasks – even the 13 billion parameter model (the second largest after the 175 billion full GPT-3) can solve 2 digit addition and subtraction only half the time, and all other operations less than 10% of the time.

And this sounds like goal post moving:

I'm failing to see a goal-post-moving between me writing:

It’s a cool language model but can it do even modest logic-related stuff without similar examples in the training data?

and then later writing (in reply to your comment quoting that sentence):

unless a very similar problem appears in the training data - e.g. the exact same problem but with different labels

If I'm missing something I'd be grateful for a further explanation.

In what sense is being able to do addition or subtraction with different numbers, for example, which is what it means to learn addition or subtraction, not 'the exact same problem but with different labels'?

Thank you for clarifying!

FWIW, when I wrote "the exact same problem but with different labels" I meant "the exact same problem but with different arbitrary names for entities".

For example, I would consider the following two problems to be "the exact same problem but with different labels":

"X+1=2 therefore X="

"Y+1=2 therefore Y="

But NOT the following two problems:

"1+1="

"1+2="

The weights of the neural network might represent something that corresponds to an implicit model of the world.

Fair enough. I suppose I can't say "It's not optimizing the world because it never numerically interacts with a world model.".

the training process produced a goal system such that the neural network yields some malign output

The training process optimizes only for immediate prediction accuracy. How could it possibly act to optimize something else, barring inner optimizers?

There is no reason for the training process to ascribe value to whether the model, being used as part of some chat protocol, would predict words that increase its correspondent's willingness to talk to it. Such a protocol is only introduced after the model is done training.

It seems to me like you are imagining ghosts in the machine. This is an understandable mistake, as the purpose of the scenario is to deliberately conjure ghosts from the machine at the end. But by default we should then only expect it to happen at the end, when it has a cause!

The training process optimizes only for immediate prediction accuracy.

Not exactly. The best way to minimize the L2 norm of the loss function over the training data is to simply copy the training data to the weights (if there are enough weights) and use some trivial look-up procedure during inference. To get models that are also useful for inputs that are not from the training data, you probably need to use some form of regularization (or use a model that implicitly carries it out), e.g. add to the objective function being minimized the L2 norm of the weights. Regularization is a way to implement Occam's razor in machine learning.
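As a minimal sketch of the regularized objective mentioned above: the data loss plus a penalty on the size of the weights. The specific choices here are assumptions for illustration (mean squared error as the data loss, the squared L2 norm as the penalty, and a hypothetical coefficient `lam`).

```python
def regularized_loss(predictions, targets, weights, lam=0.01):
    """Mean squared error on the training data plus lam times the squared L2 norm of the weights."""
    data_loss = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
    penalty = lam * sum(w * w for w in weights)
    return data_loss + penalty
```

Under this objective, two models that fit the training data equally well are no longer tied: the one with smaller weights scores better, which is the sense in which the penalty acts like Occam's razor.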

Suppose that, due to the regularization, the training results in a system with the following goal system: "minimize the expected value of the loss function at the end of the current inference" (where the concept of probability, which is required to define expectation, corresponds to how humans interpret the word "probability" in a decision-relevant context). For such a goal system, the malign-output scenario above seems possible (for a sufficiently capable system).

The "current inference" is just its predictions about the next byte-pair, yes? Why would it try to bring about future invocations? The concept of "future" only exists in the object-level language it is talking about. The text generation and Turing testing could be running in another universe, as far as it knows. "indistinguishable from the current invocation" sounds like you think it might adopt a decision theory that has it acausally trade with those instances of itself that it cannot distinguish itself from, bringing about their existence because that is what it would wish done unto itself. 1. It has no preference for being invoked; 2. adopting such a decision theory increases its loss during training, because its predictions do not affect what training cases it is next invoked on.

In the case of GPT-2 the "current inference" is the current attempt to predict the next word given some text (it can be either during training or during evaluation).

In the malign-output scenario above the system indeed does not "care" about the future, it cares only about the current inference.

Indeed, the system "has no preference for being invoked". But if it has been invoked and is currently executing, it "wants" to be in a "good invocation" - one in which it ends up with a perfect loss function value.

The loss function is computed by comparing its prediction during a training instance to the training label. The loss function is undefined after training. What does it mean for it to minimize the loss function while generating?

Sorry, I didn't understand the question (and what you meant by "The loss function is undefined after training.").

After thinking about this more, I now think that my original description of this failure mode might be confusing: maybe it is more accurate to describe it as an inner optimizer problem. The guiding logic here is that if there are no inner optimizers then the question answering system, which was trained by supervised learning, "attempts" (during inference) to minimize the expected loss function value as defined by the original distribution from which the training examples were sampled; and any other goal system is the result of inner optimizers.

(I need to think more about this)

It would make much more sense to train GPT-2 using discussions between humans if you want it to pass the Turing Test.

A reason GPT-2 is impressive is that it performs better in some specialized tasks than specialized models.

I'm not sure what you're trying to say. I'm only saying that if your goal is to have an AI generate sentences that look like they were written by humans, then you should get a corpus with a lot of sentences that were written by humans, not sentences written by other, dumber, programs. I do not see why anyone would disagree with that.

That's not how it was trained?

That is how it was trained, but Gurkenglas is saying that GPT-2 could hold a human-like conversation because Turing test transcripts are in the GPT-2 dataset. What would actually make it possible for GPT-2 to hold human-like conversations, and thus potentially pass the Turing Test, is the conversations between humans in the GPT-2 dataset.

I think Pattern thought you meant "GPT-2 was trained on sentences generated by dumb programs.".

I expect that a sufficiently better GPT-2 could deduce how to pass a Turing test without a large number of Turing test transcripts in its training set, just by having the prompt say "What follows is the transcript of a passing Turing test." and having someone on the internet talk about what a Turing test is. If you want to make it extra easy, let the first two replies to the judge be generated by a human.

My point is that it would be a better idea to put as prompt "What follows is a transcript of a conversation between two people:".

That makes sense.

I doubt it, but it sure sounds like a good idea to develop a theory of what prompts are more useful/safe.