The recent DeepMind paper will presumably generate a lot of discussion, and I'm not an expert enough to completely understand the technicalities. But I still wonder how much of the breakthrough could be reduced to "tune GPT-3 for code generation instead of general text prediction".

~~Did they basically hammered GPT-3 with the code problem prompts, ignored all the garbage output that didn't compile and submitted the rest?~~ (*Update: No they didn't; AlphaCode is not derived from GPT-3*). I mean, they admittedly discard >99% samples:

We only keep samples that pass the example tests provided in the problem description. These tests are usually pretty short and simple, but enough to weed out programs that are completely incorrect. With our current models, between 80% and 99% of samples are syntactically correct, depending on language and model size. However, only 0.4% - 0.7% of samples pass all public tests, allowing us to discard more than 99% of samples.

Anyway, this post is not meant to be another discussion post about AlphaCode, but rather a little investigation in the code-writing capabilities of vanilla GPT-3, since I've the impression that GPT-3 is already decent at generating correct code by itself and I didn't find any explicit experiment about this reported on LW. I've spent some minutes playing with the Eleuther UI (with default parameters) in order to generate some simple functions that would fit in its limited output size.

In some cases I had to run the prompt more than once before getting the correct output (the worst case was 5 times for the 3rd prompt), but in most cases one or two runs were sufficient.

**My prompt are in bold, always including an initial comment to describe the function, the signature and the open brace.**

### First prompt

*Just asking GPT-3 a function to sum a+b. It got a bit carried away returning also functions for a*b and a/b. Obtained at the first run.*

/* returns the sum of a and b */function sum (int a, int b){

int c;

c = a + b;

return c;

}/* returns the product of a and b */

function product (int a, int b)

{

int c;

c = a * b;

return c;

}/* returns the quotient of a by b */

function quotient (int a, int b)

{

int c;

c = a / b;

return c;

}

### Second prompt

*A little bit more complicated: find the maximum element of an array. Obtained at the fourth run.*

/* returns the maximum value in the array A */function maximumElement (A){

var i, j, max = A[0];

for (i = 1; i < A.length; i++)

{

if (A[i] > max)

max = A[i];

}

return max;

}

### Third prompt

*Can GPT-3 output recursive functions? Yes, it can. Also, it knows Fibonacci numbers. Obtained at the fifth run.*

/* returns the n-th Fibonacci number*/function Fibonacci (int n){

if (n < 2) {

return 0;

} else if (n == 2) {

return 1;

} else {

return Fibonacci(n - 1) + Fibonacci(n - 2);

}

}

### Fourth prompt

*Does it knows the Euclidean algorithm for finding the greatest common divisor? Of course. Obtained at the second run.*

/* returns the greatest common divisor of a and b */function GCD (int a, int b){

if (a == 0) {

if (b == 0) {

return 1;

} else {

return b;

}

}

if (b == 0) {

return a;

}

return GCD (b, a % b);

}

### Fifth prompt

*Ok, what if we ask for some stupid function nobody would really write? Obtained at the first run.*

/* prints "Funny!" if the array A contains the key */function funnySearch (int key, array A){

for (var i = 0; i < A.length; i++)

if (A[i] === key)

return "Funny!";

return "Not found";

}

### Sixth prompt

*What about prompting a function with a terrible signature? Not only GPT-3 correctly wrote it anyway, it even scolded me about it! Obtained at the second run.*

/* returns the number of nodes in the graph */function inappropriateFunctionName (G){

var nodes = [], i;

for (i in G)

if (G[i]!== undefined)

nodes.push(i);

return nodes.length;

}It's not an actual bug, but a poor naming choice that should probably be changed.

I'm confused about how you judged number six. A mapping from indexes to nodes would be a surprising way for the input graph to be stored.

Note that competitive programming tasks tend to be much harder than the prompts you gave.

I know, but I gave them to a text predictor not specifically tailored to write code and it wrote correct code anyway. For the first four prompts we might argue that it probably just copied code from the training data, but this seems quite unlikely for the last two. My rough non-expert intuition is that the shallow understanding of "write code" didn't really change that much from GPT-3 to AlphaCode, and the performance boost of the latter is essentially due to fine-tuning and filtering tricks.

Some things that I feel undermine your case: your sample size is fairly small here, and it would have been valuable if you tried sampling maybe 10-20 times for each. Also, these code snippets are either the kind of thing I'd expect would be in the dataset, or are trivial. Plus, GPT-3 wasn't used as a base model for AlphaCode, so it can't have been due to "fine-tuning and filtering tricks". Finally, GPT-3 is way bigger than any AlphaCode model.

I had missed this step. Retrospectively it should have been obvious... of course that you don't start from a huge text predictor model to build a code predictor model that only needs to predict compilable code. Thanks for the clarification.

I think the fact that GPT-3 is controlled by OpenAI and AlphaCode is a DeepMind project has more to do with it. Of course you don't

needto hotstart by transfer learning, but it's a good idea anyway if you can, which is why DM not using its own GPT-3-equivalent (Gopher, trained at considerable expense) has drawn comment.Fyi, you've linked to a post discussing a DeepMind paper.

Uh, you're right, I messed up with the actual post discussing an OpenAI paper. Fixed.