Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Most of my programmer friends believe that Language Models trained on code will not affect their day job anytime soon. In this post, I make the case that 1) code generation is already useful (assuming minimal prompt engineering skills), and 2) even if you do not believe 1), code generation will increase programmers' throughput long before it fully automates them.

Language Models trained on Code do not bring us closer to Full Code Automation

This misconception comes from thinking linearly instead of exponentially. Language models are good enough at generating code to make the very engineers building such models slightly more productive, for instance when dealing with a new API. In other words, the returns (aka the improvements in the algorithm) from investing more resources in code generation directly help (through better developer tools) create a better code-generating algorithm.

Code generation does not automate the part of my workday where I think hard

  • It still accelerates “glue code” or “API work”—a substantial fraction of large codebases.
  • Besides, only a set of privileged engineers get to think about the broad picture every day.
  • Plus, hard thinking is mostly required at the start, when designing the architecture.
  • And thinking seldom happens in a silo. It instead requires many iterations, through coding.

I asked a model to generate code but it doesn't seem to be able to solve my problem

More often than not, the issue is not about the model. Try another prompt. (Example)

The output is outdated code from average programmers

Code quality (length, variable naming, taste) is prompt and hyperparameter dependent. Generally, language models use variables from the prompt and you can rename those yourself.

Only developers who repeat the same tasks will be automated so it will not affect me

You might still see productivity gains from learning how to use a more advanced version.

My job does not involve solving simple coding tests from docstrings

You should be capable of separating your code into smaller functions and writing docstrings.
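As a hedged illustration of that decomposition (the task and all function names below are hypothetical, not taken from any real codebase), a larger job such as "report the most common words in a text" can be split into small functions whose docstrings double as prompts:

```python
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into words on whitespace."""
    return text.lower().split()


def count_words(words: List[str]) -> Counter:
    """Return a Counter mapping each word to its number of occurrences."""
    return Counter(words)


def top_words(text: str, n: int) -> List[Tuple[str, int]]:
    """Return the n most common words in text as (word, count) pairs."""
    return count_words(tokenize(text)).most_common(n)
```

Each body is short enough that a signature plus docstring is plausibly a sufficient prompt, which is the situation the coding-test demos resemble.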

Codex cannot solve my problem since it only has access to a limited training set

GitHub Copilot stores your data. Supposedly, the same applies to the Codex beta.

Current Language Models still make silly mistakes

If the mistake is silly, then fixing it is trivial.

Anyway, it is error prone so it cannot be used for critical software

It generates fewer errors than I do when writing code for the first time.

I would strongly suggest applying for GitHub Copilot or OpenAI Codex access to check for yourself, rather than relying on cherry-picked examples on the internet (good or bad). Indeed, if you search online, you might run into outdated reviews whose highlighted failures actually work now. If you cannot wait for beta access, I recommend asking a friend for a demo (I'm happy to showcase it to anyone), trying genji python or reading this up-to-date review.

More generally, programmers should seriously consider learning prompt engineering to avoid being left behind, and, I believe, any future forecast about AI progress should include this shorter loop between deep learning models and programmer productivity.


[Note: I use Copilot and like it. The 'aha' moment for me was when I needed to calculate the intersection of two lines, a thing that I would normally just copy/paste from Stack Overflow, and instead Copilot wrote the function for me. Of course I then wrote tests and it passed the tests, which seemed like an altogether better workflow.]
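For concreteness, here is a minimal sketch of what such a function and its tests might look like. This is not the actual Copilot output, which isn't reproduced here; the name `line_intersection` and the choice to return `None` for parallel lines are assumptions made for illustration. It uses the standard determinant formula for two infinite lines, each given by two points.

```python
from typing import Optional, Tuple

Point = Tuple[float, float]


def line_intersection(p1: Point, p2: Point, p3: Point, p4: Point) -> Optional[Point]:
    """Intersect the infinite line through p1, p2 with the one through p3, p4.

    Returns None when the lines are parallel or coincident.
    """
    x1, y1 = p1
    x2, y2 = p2
    x3, y3 = p3
    x4, y4 = p4
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if denom == 0:
        return None  # no unique intersection point
    det12 = x1 * y2 - y1 * x2
    det34 = x3 * y4 - y3 * x4
    x = (det12 * (x3 - x4) - (x1 - x2) * det34) / denom
    y = (det12 * (y3 - y4) - (y1 - y2) * det34) / denom
    return (x, y)


def test_crossing_diagonals():
    # The two diagonals of the square with corners (0, 0) and (2, 2) meet at its center.
    assert line_intersection((0, 0), (2, 2), (0, 2), (2, 0)) == (1.0, 1.0)


def test_parallel_lines():
    assert line_intersection((0, 0), (1, 1), (0, 1), (1, 2)) is None
```

Writing the tests yourself is what makes the workflow safer than pasting a snippet: the generated function is only trusted as far as the tests exercise it.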

Language models are good enough at generating code to make the very engineers building such models slightly more productive

How much of this is 'quality of code' vs. 'quality of data'? I would naively expect that the sort of algorithmic improvements generated from OpenAI engineers using Copilot/Codex/etc. are relatively low-impact compared to the sort of benefits you get from adding your company's codebase to the corpus (or whatever is actually the appropriate version of that). I'm somewhat pessimistic about the benefits of adding Copilot-generated code to the corpus as a method of improving Copilot.

I buy that "generated code" will not add anything to the training set, and that Copilot doesn't help for having good data or (directly) better algorithms. However, the feedback loop I am pointing at is when you accept suggestions on Copilot. I think it is learning from human feedback on what solutions people select. If the model is "finetuned" to the specific dev's coding style, I would expect Codex to suggest even better code (because of high quality of finetuning data) to someone at OAI than me or you.

How much of this is 'quality of code' vs. 'quality of data'?

I'm pointing at overall gains in devs' productivity. This could be used for collecting more data, which, AFAIK, happens by automatically collecting data from the internet using code (although possibly the business collaboration between OAI and GitHub helped). Most of the dev work would then be iteratively cleaning that data, running trainings, changing the architecture, etc. before getting to the performance they'd want, and those cycles would be a tiny bit faster using such tools.

To be clear, I'm not saying that talented engineers are coding much faster today. They're probably doing creative work at the edge of what Codex has seen. However, we're using the first version of something that, down the line, might end up giving us decent speed increases (I've been increasingly productive the more I've learned how to use it). A company owning such a model would certainly have private access to better versions to use internally, and there are some strategic considerations in not sharing the next version of its code-generating model to win a race, while collecting feedback from millions of developers.


What's your take on the licensing issue?  I know for sure that Codex won't affect my day job in the near term, because I'm not allowed to use it; I guess most large companies, and open-source projects large enough to care about IP assignment, will have the same problem.

This indirectly influences the speed of model improvement we should expect: the market for this kind of tool is smaller than you might think from the size of github's userbase, so there'll be less incentive to invest in it.

Wait, did they plainly forbid you from using it at all during work time, or did they forbid using its outputs because of IP issues? Surely, using Codex for inspiration, given a natural language prompt, and looking at what functions it calls does not seem to infringe any copyright rules?

  • 1) If you start with your own variable names, it would auto-complete with those, maybe using something it learned online. Would that count as plagiarism in your sense? How would that differ from copy-pasting from Stack Overflow and changing the variable names (I'm not an expert in SO copyright terms but you should probably quote SO if doing so and there might be some rules about distributing it commercially).
  • 2) Imagine you are using line-by-line auto-complete, and sometimes you re-arrange the ordering of the lines, adding your own code, even modifying it a bit. At what point does it become your own code?
  • 3) In cases 1) and 2) that I mentioned above, even if some of the outputs were verbatim (which apparently happens a tiny fraction of the time) and had exactly the same (probably conventional) variable names, would "I have some line of code with exactly the same normal naming of variables as something on the internet" be enough for going to court?
  • 4) Assuming that developers are, or will be, more productive using such tools, don't you think they would still use Copilot-like software to a) get inspiration b) copy-paste code that they would later modify to bypass IP infringements if they are smart enough about it, even though their companies "forbid" them from using it?

There is a new study out that found that 40% of Copilot's code contributions in high-risk scenarios were vulnerable:

I'm extremely keen to hear from people who have used Codex a decent amount (or tried to) and decided it isn't worth it. Specifically, people who wouldn't pay $15/mo for a subscription to it. Anyone?

For context, GitHub has 60,000,000 users. If 10% of them buy a $15/mo subscription, that's about a billion dollars in annual revenue. A billion dollars is about a thousand times more than the cost to create Codex. (The cost to train the model was negligible since it's only the 12B param version of GPT-3 fine-tuned. The main cost would be the salaries of the engineers involved, I imagine.)
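The arithmetic behind that estimate can be checked in a few lines. This is only a sketch of the figures quoted above; the "implied cost" is simply derived from the "thousand times" ratio and is not an independently sourced number.

```python
github_users = 60_000_000   # user count quoted above
subscriber_percent = 10     # hypothetical adoption rate
price_per_month_usd = 15    # hypothetical subscription price

subscribers = github_users * subscriber_percent // 100
annual_revenue_usd = subscribers * price_per_month_usd * 12

# Cost implied by "about a thousand times more than the cost to create Codex"
implied_cost_usd = annual_revenue_usd // 1000

print(f"{subscribers:,} subscribers")
print(f"${annual_revenue_usd:,} per year")
print(f"${implied_cost_usd:,} implied creation cost")
```

With these inputs the revenue comes to $1.08 billion per year, so "a billion dollars" is a fair rounding, and the implied creation cost is on the order of a million dollars.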

There is no possible way that 10% of GitHub's entire user base (mostly free) will pay $15/mo, which is more than GitHub's standard plan (team, $4/mo), and only slightly less than their most expensive plan (enterprise, $21/mo).  

A few tens of thousands of early adopters will probably do so, but tiered pricing will happen long before it becomes popular.  I predict there will be some use cases that justify $15/month, but the vast majority will pay less, and be charged by the resulting lines of code, the size/quantity of the prompts used, and/or the time consumed.

Thanks! Have you used Codex?

What are the main benefits people seek when they buy the more expensive plans? I don't understand the stuff on the page, but it looks like it's storage space + more features that make it easier to work in teams. I'm not sure how to compare that stuff to Codex but intuitively I feel like Codex is more valuable, because more people could benefit from Codex than are working in teams. I don't know what I'm talking about though, which is why I'm asking. :)

If the charge is per token... let me think... suppose Codex gets called up to write something 10 times per programmer work-hour (it would come in clumps probably, not evenly spaced. Sometimes it would not give you what you want and you'd retry a couple times). That's maybe 1000 tokens per work-hour, which (if it were GPT-3) would cost $0.06, so that's like $0.50 a day, which comes out to $15.00 a month... I swear I didn't plan that calculation to come out that way! (But of course it's just a fermi estimate, could be off by orders of magnitude. Also, the current version of Codex is the 12B param version which probably costs an OOM less than GPT-3)
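Spelled out as code, the Fermi estimate looks like the sketch below. The tokens per call, the 8-hour workday and the 30 days per month are assumptions chosen to match the rounded figures in the comment; they are not stated pricing or usage data.

```python
calls_per_hour = 10            # from the estimate above
tokens_per_call = 100          # assumption implied by ~1000 tokens per work-hour
price_cents_per_1k_tokens = 6  # $0.06 per 1000 tokens, the GPT-3 figure used above
hours_per_day = 8              # assumption: a standard workday
days_per_month = 30            # assumption: what the $15.00 figure implies

tokens_per_hour = calls_per_hour * tokens_per_call
cents_per_hour = tokens_per_hour * price_cents_per_1k_tokens // 1000
cents_per_day = cents_per_hour * hours_per_day
cents_per_month = cents_per_day * days_per_month

print(f"${cents_per_day / 100:.2f} per day")
print(f"${cents_per_month / 100:.2f} per month")
```

That gives $0.48 per day and $14.40 per month before rounding. Counting only working days, or assuming sporadic use, would lower the monthly figure.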

I've seen demos, but have not gotten direct access myself yet (and I'll gladly pay that to evaluate, and long-term if I end up actually integrating it into my workflow).  Agreed that Codex is valuable on different dimensions than GitHub's current pricing model - for many, it will in fact be more valuable.  I mostly pointed out the discrepancy to counter the argument that number of current GitHub users predicts anything about who will pay what amount for Codex.

I think that many many coders have sporadic use, and $0.50/day for days they use it ends up being a lot less than $15/month.  My prediction is really that it will provide such widely varying value to different consumers that it'll be near-impossible to charge the same amount to all of them.

Maybe I'm wrong, but my first reaction to your initial number is that "users" doesn't mean "active users". I would expect a difference of an order of magnitude, which keeps your conclusion but with a hundred times more instead of a thousand times more.

That's reasonable. OTOH if Codex is as useful as some people say it is, it won't just be 10% of active users buying subscriptions and/or subscriptions might cost more than $15/mo, and/or people who aren't active on GitHub might also buy subscriptions.

Agreed. Part of the difficulty here is that you want to find who will buy a subscription and keep it. I expect a lot of people to try it, and most of them to drop it (either because they don't like it or because it doesn't help them enough for their taste) but no idea how to Fermi estimate that number.

Regarding your first point, I think when people say that language models "don't bring us closer to full code automation" they mean there's no way of improving/upgrading language models such that they implement full code automation. I think it would be better to argue against that claim directly instead of bringing up language models' productivity-boosting effects. There are many things that could potentially boost programmers' productivity -- better nootropics, say -- but it seems overly broad to say that they all "bring us closer to full code automation", even if it might be causally true that they reduce the time to automation in expectation.

The problem with arguing against that claim is that nobody knows whether transformers/scaling language models are sufficient for full code automation. To take your nootropics example, an analogy would be if nootropics were legal and had no negative side effects; if a single company gave "beta access" (for now) to a new nootropic in unlimited amounts, at no cost, to a market of tens of millions of users; if the data from using this nootropic were collected by the company to improve the product; if there actually were 100k peer-reviewed publications per year in the field of nootropics; and if most of the innovation behind the tech came from a >100B-parameter model trained on open-source nootropic chemistry instructions. Would such advancements be evidence for something major we're not certain about (e.g. high bandwidth brain computer interface) or just evidence for increased productivity that would be reinjected into more nootropic investments?

I think those advancements could be evidence for both, depending on the details of how the nootropics work, etc. But it still seems worth distinguishing the two things conceptually. My objection in both cases is that only a small part of the evidence for the first comes from the causal impact of the second: i.e. if Codex gave crazy huge productivity improvements, I would consider that evidence for full code automation coming soon, but that's mostly because it suggests that Codex can likely be improved to the point of FCA, not because it will make OpenAI's programmers more productive.

I also used to think it would be useful for API/glue code, and this article persuaded me otherwise. The core of his argument is:

Most time coding is not taken up in writing code, but with designing, debugging, and maintaining code. When code is automatically generated, it’s easy to end up with a lot more of it....As a rule of thumb, less code means less to maintain and understand. Copilot’s code is verbose, and it’s so easy to generate lots of it that you’re likely to end up with a lot of code!

I can imagine code generation being useful for a solo project or prototype but it's hard to imagine it being useful for code that has to be maintained over time, at least until we have AGI. 

The fastai blog is linked in my post (it's the url for "outdated") since I tried some of the prompts from his blog (especially the first one when reading a file) and ended up with different results. It's worth mentioning that he only talks about Copilot, not Codex, the latter being supposedly from a more advanced model.

On the amount of code generated, you could make a similar argument for Stack Overflow. If I were a SO skeptic I would say "back in my day people used to read manuals and use the right options for functions, now they just copy-paste many paragraphs of code". Codex is just SO on steroids; it's the engineers' responsibility to refactor, although I agree having auto-complete doesn't help solve bad habits.


Thinking about it more there's another, more serious restriction, at least for now: Codex can't write code that depends on the rest of your codebase.  Consider the following lightly-anonymized code from a real-world codebase I contribute to:

def unobserved(self) -> Set[str]:
    """Returns a set of all unobserved random variable names inside this Distribution -- that is,
    those that are neither observed nor marginalized over.
    """
    return set(self.components) - self.observed - self.marginalized

I don't think Codex could write that code from just that function signature and that docstring, because a human couldn't do it: they wouldn't know how to find the names of the observed and marginalized random variables, and they wouldn't know that self.components exists or has to be explicitly converted into a set.  And if the human didn't know what random variables are, or what marginalization is, they'd have an even tougher time.

One can imagine a prompt that might elicit this implementation, something like "this is a Python function that returns the elements of set(self.components) that do not appear in self.observed or self.marginalized", but then we are not really writing docstrings anymore, we are writing programs in a novel language with a stochastic compiler.

This should be fixable with larger context windows, I think.  If the prompt for a method could include the whole class definition aside from that method, Codex could at least in principle use other class methods and variables in sensible ways.  But this will have to wait on the practical realization of O(n) or O(nlogn) context window scaling.

I created a class initializing the attributes you mentioned, and when adding your docstring to your function signature it gave me exactly the answer you were looking for. Note that it all worked on the first try, and that I did not think at all about the initialization for components, marginalized or observed; I simply auto-completed.

from typing import Set  # needed for the Set[str] annotation below


class Distribution:
    def __init__(self):
        self.components = []
        self.marginalized = None
        self.observed = None

    def unobserved(self) -> Set[str]:
        """Returns a set of all unobserved random variable names inside this Distribution -- that is,
        those that are neither observed nor marginalized over.
        """
        return set(self.components) - set(self.observed) - set(self.marginalized)

Ok, but that isn't what the code actually looks like; the real Distribution.__init__ is much longer.  If the idea is that I could rewrite it into a form that fits in Codex's context window, and then Codex could help me, then to me that's even less compelling than the "natural language stochastic compiler" approach.

My objection is that usually I want to write code that depends on other code that's not widely repeated on Github and doesn't fit in the prompt.  Saying "but what if it does fit in the prompt?" isn't really responsive.

1. If you want a longer init, write a docstring for it.

natural language stochastic compiler

I don't get what you mean here. I'm also not an expert on Codex's "context windows".

1) In my experience, even if not specified in your prompt, the model still goes over your dependency graph (in different files in your repo, not GitHub) and picks which functions are relevant for the next line. 2) If you know which function to use, then add these functions or "API calls" in the docstring.

A "compiler" is anything that translates a program from one representation to another.  Usually this translation is from a high-level language (like Java) to a lower-level language (like JVM bytecode), but you can also have e.g. a Python -> Javascript compiler that takes in Python code and produces Javascript.  A "natural language compiler", then, is one that takes in ordinary English-or-whatever sentences and emits something executable.  I think this is a pretty fair way to talk about Codex: it's the world's first natural language compiler that's any good at all.  I call it "stochastic" because its output is not consistent: if you give Codex the same input twice, you won't necessarily get the same result.
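To make "stochastic" concrete, here is a toy stand-in, not Codex itself. The candidate completions are made up for illustration, and the "compiler" ignores the content of the prompt and just samples one of them, which is enough to show why the same input can yield different outputs.

```python
import random

# Hypothetical completions for one fixed prompt; a real model would
# generate its own candidates, these are invented for illustration.
CANDIDATE_COMPLETIONS = [
    "return s - t",
    "return s.difference(t)",
    "return {x for x in s if x not in t}",
]


def toy_stochastic_compiler(prompt: str, rng: random.Random) -> str:
    """Stand-in for a sampled code model: returns one hypothetical
    completion uniformly at random, regardless of the prompt's content."""
    return rng.choice(CANDIDATE_COMPLETIONS)


rng = random.Random()
prompt = "takes the difference between the set s and the set t"
first = toy_stochastic_compiler(prompt, rng)
second = toy_stochastic_compiler(prompt, rng)
# Same prompt both times, but first and second are not guaranteed to match.
```

A conventional compiler is the special case where the set of candidates has exactly one element.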

So if I'm writing Python with Codex, let's say, then whenever I start to write a function I have a choice of implementation languages: Python, or Codex prompt.  Codex is helpful when the prompt is easier to write than the Python implementation would have been.  The question isn't just "can I write a prompt that will elicit good-enough Python?", it's "is writing the prompt easier, or harder, than writing the Python?"  This is why I am not excited by very low-level prompts like "takes the difference between the set s and the set t"; once you understand your problem well enough to write a prompt like that, writing the actual code is only hard if you don't know Python well.  And if you don't know Python well, using Codex to generate it is a little risky, since then you won't be able to catch its silly mistakes.
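The low-level prompt mentioned above makes the trade-off easy to see. In this sketch (the function name `set_difference` is hypothetical), the docstring is the prompt and the body is what a model would be asked to produce:

```python
from typing import Set


def set_difference(s: Set[str], t: Set[str]) -> Set[str]:
    """Takes the difference between the set s and the set t."""
    return s - t


prompt = "takes the difference between the set s and the set t"
implementation = "return s - t"

# The prompt is several times longer than the code it is meant to elicit.
print(len(prompt), len(implementation))
```

Here the "Codex prompt" language is strictly more work than the Python, so generation only pays off when the description is shorter or easier to get right than the implementation.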

I will write about the context window thing too, since I think it's close to the heart of our disagreement, but for now I'm out of time.

Thanks for the natural language stochastic compiler explanation, it makes a lot of sense. I broadly get a sense of what you mean by "context window" since people have been mentioning that quite a lot when talking about GPT-3. As for whether it makes sense to write docstrings for trivial things, I think this is only pointing at the Codex demo examples where people write docstrings and get results. For most of my use cases, where it gets really interesting is when it auto-completes 1) while I'm writing, 2) when I'm done writing and it guesses the next line, or 3) when I start a line with "return " or "x = " and wait for its auto-completion. Here, I would have no idea how to formulate it in the docstring; I just generally trust its ability to follow the logic of the code that precedes it (and I find it useful most of the time).