[AN #164]: How well can language models write code?

Thanks again for these newsletters and summaries! I'm excited about the flagship paper.

First comment: I don't think their experiment about code execution is much evidence re "true understanding."

Recall that GPT-3 has 96 layers and the biggest model used in this paper was smaller than GPT-3. Each pass through the network is therefore loosely equivalent to less than one second of subjective time, by comparison to the human brain which typically goes through something like 100 serial operations per second I think? Could be a lot more, I'm not sure. https://aiimpacts.org/rate-of-neuron-firing/#Maximum_neural_firing_rates

So, the relevant comparison should be: Give a human the same test. Show them some code and give them 1 second to respond with an answer (or the first token of an answer, and then 1 second for the second token, and so forth). See how well they do at predicting the code output. I predict that they'd also do poorly, probably <50% accuracy. In claim that this passage from the paper inadvertently supports my hypothesis:

Including test cases and natural language descriptions in the prompt lead to the highest overall performance—higher than using the code itself. Because the code unambiguously describes the semantics, whereas test cases do not, this suggests that models are in some sense not really “reading” the source code and using it to execute. Models trained on general text corpora may be better at inducing patterns from as few as two input-output examples than they are at predicting the execution of code.

Second comment: Speculation about scaling trends:

Extrapolating from Figure 3, it seems that an AI which can solve (via at least one sample) approximately 100% of coding tasks in this set, without even needing fine-tuning, will require +2 OOMs of parameters, which would probably cost about $5B to train when you factor in the extra data required but also the lower prices and algorithmic improvements since GPT-3. Being almost 2 OOMs bigger than GPT-3, it might be expected to cost $6 per 1000 tokens, which would make it pretty expensive to use (especially if you wanted to use it at full-strength where it makes multiple samples and then picks the best one) though I think it might still find an economic niche; you could have a system where first a smaller model attempts a solution and you only call up the big model if that fails, and then you keep generating samples till you get one that works so on average the number of samples you need to generate will be small, and only cost you multiple dollars for a the toughest few percentile of cases. Then this service could be used by well-paid programmers for whom the time savings are worth it.

Does this extrapolation/speculation seem right?

[-]Rohin Shah4yΩ450

First comment: I don't think their experiment about code execution is much evidence re "true understanding."

I agree that humans would do poorly in the experiment you outline. I think this shows that, like the language model, humans-with-one-second do not "understand" the code.

(Idk if you were trying to argue something else with the comparison, but I don't think it's clear that this is a reasonable comparison; there are tons of objections you could bring up. For example, humans have to work from pixels whereas the language model gets tokens, making its job much easier.)

Second comment: Speculation about scaling trends:

I didn't check the numbers, but that seems pretty reasonable. I think there's a question of whether it actually saves time in the current format -- it might be faster to simply write the program than to write down a clear natural language description of what you want along with test cases.

[-]Daniel Kokotajlo4yΩ050

I agree that humans would do poorly in the experiment you outline. I think this shows that, like the language model, humans-with-one-second do not "understand" the code.

Haha, good point -- yes. I guess what I should say is: Since humans would have performed just as poorly on this experiment, it doesn't count as evidence that e.g. "current methods are fundamentally limited" or "artificial neural nets can't truly understand concepts in the ways humans can" or "what goes on inside ANN's is fundamentally a different kind of cognition from what goes on inside biological neural nets" or whatnot.

[-]Rohin Shah4yΩ450

Oh yeah, I definitely agree that this is not strong evidence for typical skeptic positions (and I'd guess the authors would agree).

[-]tin4824y30

See also "Evaluating Large Language Models Trained on Code", OpenAI's contribution. They show progress on the APPS dataset (Intro: 25% pass, Comp: 3% pass @ 1000 samples), though note there was substantial overlap with the training set. They also only benchmark up to 12 billion params, but have also trained a related code-optimized model at GPT-3 scale (~100 billion).

Notice that technical details are having a large impact here:

GPT-3 saw a relatively small amount of code, only what was coincidentally in the dataset, and does poorly
GPT-J had Github as a substantial fraction of its training set
The dataset for Google's 137-billion model is not public but apparently "somewhat oversampled web pages that contain code". They also try fine-tuning on a very small dataset (374 items).
Codex takes a pre-trained GPT-3 model and fine-tunes on 159 GB of code from Github. They also do some light prompt engineering. Overall, they show progress on APPS
OpenAI's largest model additionally uses a BPE tokenization optimized for code, and may have other differences. It has not yet been publicly benchmarked

[-]Rohin Shah4y20

Thanks, I probably should have linked to my summary of that paper in this newsletter.

[-]Rohin Shah4yΩ220

I've heard rumors that people are interpreting the highlighted papers as "huh, large models aren't that good at writing code, they don't even solve introductory problems". (Note that these are only rumors, I don't know of any specific people who take this interpretation.)

I don't buy this interpretation, because these papers didn't do the biggest, most obvious improvement: to actually train on a large dataset of code (i.e. Github), as in Codex. My reaction to these papers is more like “wow, even models trained on language are weirdly good at writing code, given they were trained to produce language, imagine how good they must be when trained on Github”.

LESSWRONG
LW

LESSWRONG
LW

13

[AN #164]: How well can language models write code?

13

Ω 9

13

Ω 9

SECTIONS

HIGHLIGHTS

TECHNICAL AI ALIGNMENT

AGENT FOUNDATIONS

FIELD BUILDING

MISCELLANEOUS (ALIGNMENT)

NEWS

FEEDBACK

PODCAST