The Codex Skeptic FAQ

[-]Vaniver4yΩ12220

[Note: I use Copilot and like it. The 'aha' moment for me was when I needed to calculate the intersection of two lines, a thing that I would normally just copy/paste from Stack Overflow, and instead Copilot wrote the function for me. Of course I then wrote tests and it passed the tests, which seemed like an altogether better workflow.]
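For readers who have not written one recently, a function of the kind described might look like the sketch below. It is an illustration only, not Copilot's actual output; the name `line_intersection`, the point representation, and the choice to return `None` for parallel lines are all assumptions.

```python
from typing import Optional, Tuple

Point = Tuple[float, float]


def line_intersection(p1: Point, p2: Point, p3: Point, p4: Point) -> Optional[Point]:
    """Intersect the infinite line through p1 and p2 with the one through p3 and p4.

    Returns None when the lines are parallel or coincident.
    """
    x1, y1 = p1
    x2, y2 = p2
    x3, y3 = p3
    x4, y4 = p4

    # Determinant of the direction vectors; zero means no unique intersection.
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if denom == 0:
        return None

    det12 = x1 * y2 - y1 * x2
    det34 = x3 * y4 - y3 * x4
    x = (det12 * (x3 - x4) - (x1 - x2) * det34) / denom
    y = (det12 * (y3 - y4) - (y1 - y2) * det34) / denom
    return (x, y)


# The "write tests afterwards" step: two diagonals of a square cross at its centre.
crossing = line_intersection((0, 0), (2, 2), (0, 2), (2, 0))
parallel = line_intersection((0, 0), (1, 1), (0, 1), (1, 2))
```

The tests are the part that makes the workflow trustworthy: the generated function is accepted only once known cases such as the crossing diagonals and the parallel pair behave as expected.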

Language models are good enough at generating code to make the very engineers building such models slightly more productive

How much of this is 'quality of code' vs. 'quality of data'? I would naively expect that the sort of algorithmic improvements generated from OpenAI engineers using Copilot/Codex/etc. are relatively low-impact compared to the sort of benefits you get from adding your company's codebase to the corpus (or whatever is actually the appropriate version of that). I'm somewhat pessimistic about the benefits of adding Copilot-generated code to the corpus as a method of improving Copilot.

[-]Michaël Trazzi4yΩ480

I buy that "generated code" will not add anything to the training set, and that Copilot doesn't help with getting good data or (directly) better algorithms. However, the feedback loop I am pointing at is what happens when you accept suggestions in Copilot. I think it is learning from human feedback on which solutions people select. If the model is "finetuned" to the specific dev's coding style, I would expect Codex to suggest even better code (because of the high quality of the finetuning data) to someone at OAI than to me or you.

How much of this is 'quality of code' vs. 'quality of data'?

I'm pointing at overall gains in devs' productivity. This could be used for collecting more data, which, AFAIK, happens by automatically collecting data from the internet using code (although possibly the business collaboration between OAI and GitHub helped). Most of the dev work would then be iteratively cleaning that data, running trainings, changing the architecture, etc. before getting to the performance they'd want, and those cycles would be a tiny bit faster using such tools.

To be clear, I'm not saying that talented engineers are coding much faster today. They're probably doing creative work at the edge of what Codex has seen. However, we're using the first version of something that, down the line, might end up giving us decent speed increases (I've become increasingly productive the more I've learned how to use it). A company owning such a model would certainly have private access to better versions to use internally, and there are strategic reasons not to share the next version of its code-generating model in order to win a race, while still collecting feedback from millions of developers.

[-]Taran4y*Ω7160

[-]Michaël Trazzi4yΩ460

Wait, did they outright forbid you from using it at all during work time, or did they forbid using its outputs for IT issues? Surely using Codex for inspiration, by giving it a natural language prompt and looking at which functions it calls, does not infringe any copyright rules?

1) If you start with your own variable names, it would auto-complete with those, maybe using something it learned online. Would that count as plagiarism in your sense? How would that differ from copy-pasting from Stack Overflow and changing the variable names? (I'm not an expert in SO copyright terms, but you should probably credit SO if doing so, and there might be some rules about distributing it commercially.)
2) Imagine you are using line-by-line auto-complete, and sometimes you re-arrange the ordering of the lines, adding your own code, even modifying it a bit. At what point does it become your own code?
3) In cases 1 and 2 above, even if some of the outputs were verbatim (which apparently happens a tiny fraction of the time) and had exactly the same (probably conventional) variable names, would "I have some line of code with exactly the same normal variable naming as something on the internet" be enough for going to court?
4) Assuming that developers are, or will be, more productive using such tools, don't you think they would still use Copilot-like software to a) get inspiration and b) copy-paste code that they would later modify to bypass IP infringements, if they are smart enough about it, even though their companies "forbid" them from using it?

[-]Lech Mazur4y140

There is a new study out that found that 40% of Copilot's code contributions in high-risk scenarios were vulnerable: https://arxiv.org/abs/2108.09293

[-]Daniel Kokotajlo4yΩ4130

I'm extremely keen to hear from people who have used Codex a decent amount (or tried to) and decided it isn't worth it. Specifically, people who wouldn't pay $15/mo for a subscription to it. Anyone?

[-]Daniel Kokotajlo4yΩ360

For context, GitHub has 60,000,000 users. If 10% of them buy a $15/mo subscription, that's a billion dollars a year in annual revenue. A billion dollars is about a thousand times more than the cost to create Codex. (The cost to train the model was negligible since it's only the 12B param version of GPT-3 fine-tuned. The main cost would be the salaries of the engineers involved, I imagine.)
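The arithmetic behind that estimate can be written out as a short sketch. The user count, adoption rate, and price come from the comment above; the implied development cost is simply the revenue divided by the claimed factor of a thousand, not an independently sourced figure.

```python
# Inputs taken from the comment; variable names are illustrative.
github_users = 60_000_000
adoption_percent = 10          # share of users assumed to subscribe
price_per_month_usd = 15

subscribers = github_users * adoption_percent // 100
annual_revenue_usd = subscribers * price_per_month_usd * 12

# "About a thousand times more than the cost to create Codex" implies
# a creation cost on the order of a million dollars.
implied_creation_cost_usd = annual_revenue_usd / 1000

print(f"subscribers: {subscribers:,}")
print(f"annual revenue: ${annual_revenue_usd:,}")
print(f"implied creation cost: ${implied_creation_cost_usd:,.0f}")
```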

[-]Dagon4y100

There is no possible way that 10% of GitHub's entire user base (mostly free) will pay $15/mo, which is more than GitHub's standard plan (team, $4/mo), and only slightly less than their most expensive plan (enterprise, $21/mo).

A few tens of thousands of early adopters will probably do so, but tiered pricing will happen long before it becomes popular. I predict there will be some use cases that justify $15/month, but the vast majority will pay less, and will be charged by the resulting lines of code, the size/quantity of the prompts used, and/or the time consumed.

[-]Daniel Kokotajlo4y20

Thanks! Have you used Codex?

What are the main benefits people seek when they buy the more expensive plans? I don't understand the stuff on the page, but it looks like it's storage space + more features that make it easier to work in teams. I'm not sure how to compare that stuff to Codex but intuitively I feel like Codex is more valuable, because more people could benefit from Codex than are working in teams. I don't know what I'm talking about though, which is why I'm asking. :)

If the charge is per token... let me think... suppose Codex gets called up to write something 10 times per programmer work-hour (it would probably come in clumps, not evenly spaced; sometimes it would not give you what you want and you'd retry a couple of times). That's maybe 1000 tokens per work-hour, which (if it were GPT-3) would cost $0.06, so that's like $0.50 a day, which comes out to $15.00 a month... I swear I didn't plan that calculation to come out that way! (But of course it's just a Fermi estimate and could be off by orders of magnitude. Also, the current version of Codex is the 12B param version, which probably costs an OOM less than GPT-3.)
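Spelled out, the Fermi estimate looks roughly like this. The calls per hour, tokens per hour, and $0.06 per 1000 tokens come from the comment; the 100 tokens per call, the 8-hour day, and the 30 billed days per month are assumptions chosen to reproduce the rounded figures.

```python
calls_per_hour = 10
tokens_per_call = 100            # assumption: 1000 tokens/hour spread over 10 calls
price_per_1k_tokens_usd = 0.06   # the GPT-3 price used in the comment
hours_per_day = 8                # assumption
days_per_month = 30              # assumption

tokens_per_hour = calls_per_hour * tokens_per_call
cost_per_hour = tokens_per_hour / 1000 * price_per_1k_tokens_usd
cost_per_day = cost_per_hour * hours_per_day
cost_per_month = cost_per_day * days_per_month

print(f"per hour:  ${cost_per_hour:.2f}")
print(f"per day:   ${cost_per_day:.2f}")    # rounds to roughly $0.50
print(f"per month: ${cost_per_month:.2f}")  # rounds to roughly $15
```

Counting only about 22 working days per month instead of 30 would give a lower figure, which is well within the error bars of an estimate like this.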

[-]Dagon4y40

I've seen demos, but have not gotten direct access myself yet (and I'll gladly pay that to evaluate, and long-term if I end up actually integrating it into my workflow). Agreed that Codex is valuable on different dimensions than GitHub's current pricing model - for many, it will in fact be more valuable. I mostly pointed out the discrepancy to counter the argument that number of current GitHub users predicts anything about who will pay what amount for Codex.

I think that many many coders have sporadic use, and $0.50/day for days they use it ends up being a lot less than $15/month. My prediction is really that it will provide such widely varying value to different consumers that it'll be near-impossible to charge the same amount to all of them.
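A small sketch of that comparison, using the $0.50/day and $15/month figures from this thread; the function name and the example of 8 days of use per month are hypothetical.

```python
flat_price_usd = 15.00   # monthly subscription discussed above
per_day_cost_usd = 0.50  # estimated metered cost for a day of use


def metered_monthly_cost(days_used: int) -> float:
    """Cost for a month in which Codex is used on `days_used` days."""
    return days_used * per_day_cost_usd


# The flat plan only breaks even for someone who uses it every day of the month.
break_even_days = flat_price_usd / per_day_cost_usd

# Hypothetical sporadic user: 8 days of use in a month.
sporadic_cost = metered_monthly_cost(8)

print(f"break-even: {break_even_days:.0f} days of use per month")
print(f"sporadic user pays ${sporadic_cost:.2f} instead of ${flat_price_usd:.2f}")
```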

[-]adamShimi4yΩ340

Maybe I'm wrong, but my first reaction to your initial number is that "users" doesn't mean "active users". I would expect a difference of an order of magnitude, which keeps your conclusion but with a hundred times more instead of a thousand times more.

[-]Daniel Kokotajlo4yΩ340

That's reasonable. OTOH if Codex is as useful as some people say it is, it won't just be 10% of active users buying subscriptions and/or subscriptions might cost more than $15/mo, and/or people who aren't active on GitHub might also buy subscriptions.

[-]adamShimi4yΩ460

Agreed. Part of the difficulty here is that you want to find who will buy a subscription and keep it. I expect a lot of people to try it, and most of them to drop it (either because they don't like it or because it doesn't help them enough for their taste), but I have no idea how to Fermi estimate that number.

[-]interstice4yΩ120

Regarding your first point, I think when people say that language models "don't bring us closer to full code automation" they mean there's no way of improving/upgrading language models such that they implement full code automation. I think it would be better to argue against that claim directly instead of bringing up language models' productivity-boosting effects. There are many things that could potentially boost programmers' productivity -- better nootropics, say -- but it seems overly broad to say that they all "bring us closer to full code automation", even if it might be causally true that they reduce the time to automation in expectation.

[-]Michaël Trazzi4yΩ110

The problem with arguing against that claim is that nobody knows whether transformers/scaling language models are sufficient for full code automation. To take your nootropics example, an analogy would be a world where nootropics were legal and had no negative side effects; where a single company gave "beta access" (for now) to a new nootropic, in unlimited amounts and at no cost, to a market of tens of millions of users; where the data from using this nootropic was collected by the company to improve the product; where there were 100k peer-reviewed publications per year in the field of nootropics; and where most of the innovation behind the tech came from a >100B-parameter model trained on open-source nootropic chemistry instructions. Would such advancements be evidence for something major we're not certain about (e.g. a high-bandwidth brain-computer interface) or just evidence for increased productivity that would be reinjected into more nootropic investments?

[-]interstice4y10

I think those advancements could be evidence for both, depending on the details of how the nootropics work, etc. But it still seems worth distinguishing the two things conceptually. My objection in both cases is that only a small part of the evidence for the first comes from the causal impact of the second: i.e. if Codex gave crazy huge productivity improvements, I would consider that evidence for full code automation coming soon, but that's mostly because it suggests that Codex can likely be improved to the point of FCA, not because it will make OpenAI's programmers more productive.

[-]Will Sorenson4y20

I also used to think it would be useful for API/glue code, but [this](https://www.fast.ai/2021/07/19/copilot/) article persuaded me otherwise. The core of his argument is:

Most time coding is not taken up in writing code, but with designing, debugging, and maintaining code. When code is automatically generated, it’s easy to end up with a lot more of it....As a rule of thumb, less code means less to maintain and understand. Copilot’s code is verbose, and it’s so easy to generate lots of it that you’re likely to end up with a lot of code!

I can imagine code generation being useful for a solo project or prototype but it's hard to imagine it being useful for code that has to be maintained over time, at least until we have AGI.

[-]Michaël Trazzi4y10

The fastai blog is linked in my post (it's the url for "outdated") since I tried some of the prompts from his blog (especially the first one when reading a file) and ended up with different results. It's worth mentioning that he only talks about Copilot, not Codex, the latter being supposedly from a more advanced model.

On the amount of code generated, you could make a similar argument about Stack Overflow. If I were an SO skeptic I would say "back in my day people used to read manuals and use the right options for functions; now they just copy-paste many paragraphs of code". Codex is just SO on steroids. It's the engineers' responsibility to refactor, although I agree that having auto-complete doesn't help fix bad habits.

[-]Taran4y*Ω010

[-]Michaël Trazzi4yΩ270

I created a class initializing the attributes you mentioned, and when adding your docstring to your function signature it gave me exactly the answer you were looking for. Note that it was all in first try, and that I did not think at all about the initialization for components, marginalized or observed—I simply auto-completed.

from typing import Set


class Distribution:
    def __init__(self):
        self.components = []
        self.marginalized = None
        self.observed = None

    def unobserved(self) -> Set[str]:
        """Returns a set of all unobserved random variable names inside this Distribution -- that is,
        those that are neither observed nor marginalized over.
        """
        return set(self.components) - set(self.observed) - set(self.marginalized)
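A quick way to sanity-check a completion like this is to run it on a toy instance. The sketch below assumes `unobserved` is a method on the class and that `observed` and `marginalized` get assigned collections of names before it is called; the variable names `"a"` to `"d"` are made up. Note that with the `None` defaults from `__init__`, `set(None)` raises a `TypeError`, so the completion only works once those attributes are populated.

```python
from typing import Set


class Distribution:
    def __init__(self):
        self.components = []
        self.marginalized = None
        self.observed = None

    def unobserved(self) -> Set[str]:
        # Same body as the auto-completed line discussed above.
        return set(self.components) - set(self.observed) - set(self.marginalized)


# Populated instance: the set difference leaves the unobserved names.
d = Distribution()
d.components = ["a", "b", "c", "d"]
d.observed = ["a"]
d.marginalized = ["b"]
result = d.unobserved()

# Fresh instance: observed is still None, so set(None) fails.
fresh = Distribution()
try:
    fresh.unobserved()
    raised_type_error = False
except TypeError:
    raised_type_error = True
```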

[-]Taran4y*10

[-]Michaël Trazzi4y10

1. If you want a longer init, write a docstring for it.

natural language stochastic compiler

I don't get what you mean here. I'm also not an expert on Codex's "context windows".

1) In my experience, even if not specified in your prompt, the model still goes over your dependency graph (in different files in your repo, not GitHub) and picks which functions are relevant for the next line.
2) If you know which functions to use, then add these functions or "API calls" in the docstring.

[-]Taran4y*10

[-]Michaël Trazzi4y10

Thanks for the natural language stochastic compiler explanation, it makes a lot of sense. I broadly get a sense of what you mean by "context window", since people have been mentioning that quite a lot when talking about GPT-3. As for whether it makes sense to write docstrings for trivial things, I think this only points at the Codex demo examples where people write docstrings and get results. For most of my use cases, where it gets really interesting is when it auto-completes 1) while I'm writing, 2) when I'm done writing and it guesses the next line, or 3) when I start a line with "return " or "x = " and wait for its auto-completion. Here, I would have no idea how to formulate it in a docstring; I just generally trust its ability to follow the logic of the code that precedes it (and I find it useful most of the time).
