For what it's worth, the most relevant measure that's difficult to fall prey to Goodhartian tricks is probably cross-entropy validation loss, as shown in this figure from the GPT-3 paper:
Serious scaling efforts are much more likely to emphasize progress here over Parameter Count Number Bigger clickbait.
Further, while this number will keep going down, we're going to crash into the entropy of human-generated text at some point. Whether that's within three OOM or ten is anybody's guess, though.
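For concreteness, here's a minimal sketch of the per-token cross-entropy quantity being discussed (the function and variable names are my own, purely illustrative; the curves in the scaling-law figures are this quantity averaged over held-out tokens):

```python
import math

def cross_entropy_nats(probs, target_idx):
    """Per-token cross-entropy: -log p(correct next token), in nats."""
    return -math.log(probs[target_idx])

# A model spreading probability uniformly over 4 candidate tokens pays
# log(4) ~= 1.386 nats on the correct one; a better model concentrates
# mass on the right token and the loss drops toward the entropy floor
# of the text itself.
uniform_over_4 = [0.25, 0.25, 0.25, 0.25]
loss = cross_entropy_nats(uniform_over_4, target_idx=0)
```

The "crash into the entropy of human-generated text" point is exactly that this loss is bounded below by the irreducible uncertainty of the next token, so it cannot go to zero no matter how much compute is spent.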
By the standards of “we will have a general intelligence”, Moravec is wrong, but by the standards of “computers will be able to do anything humans can do”, Moravec’s timeline seems somewhat uncontroversially prescient? For essentially any task for which we can define a measurable success metric, we more or less* know how to fashion a function approximator that’s as good as or better than a human.
*I’ll freely admit that this is moving the goalposts, but there’s a slow, boring path to “AGI” where we completely automate the pipeline for “generate a function approximator that is good at [task]”. The tasks that we don’t yet know how to do this for are increasingly occupying the narrow space of [requires simulating social dynamics of other humans], which, just on computational complexity grounds, may be significantly harder than [become superhuman at all narrowly defined tasks].
Relatedly, do you consider [function approximators for basically everything becoming better with time] to also fail to be a good predictor of AGI timelines for the same reasons that compute-based estimates fail?
In defense of shot-ness as a paradigm:
Shot-ness is a nice task-agnostic interface for revealing capability that doesn’t require any cleverness from the prompt designer. Said another way, if you needed task-specific knowledge to construct the prompt that makes GPT-3 reveal it can do the task, it’s hard to compare “ability to do that task” in a task-agnostic way to other potential capabilities.
For a completely unrealistic example that hyperbolically gestures at what I mean: you could spend a tremendous amount of compute to come up with the magic password prompt that gets GPT-3 to reveal that it can prove P!=NP, but this is worthless if that prompt itself contains a proof that P!=NP, or worse, is harder to generate than the original proof.
This is not what it “feels like” when GPT-3 suddenly demonstrates it is able to do something, of course—it’s more like it just suddenly knows what you meant, and does it, without your hinting really seeming like it provided anything particularly Clever Hans-y. So it’s not a great analogy. But I can’t help but feel that a “sufficiently intelligent” language model shouldn’t need to be cajoled into performing a task you can demonstrate to it, thus I personally don’t want to have to rely on cajoling.
Regardless, it’s important to keep track of both “can GPT-n be cajoled into this capability?” as well as “how hard is it to cajole GPT-n into demonstrating this capability?”. But I maintain that shot-prompting is one nice way of probing this while holding “cajoling-ness” relatively fixed.
This is of course moot if all you care about is demonstrating that GPT-n can do the thing. Of course you should prompt tune. Go bananas. But it makes a particular kind of principled comparison hard.
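To make “shot-ness as an interface” concrete, here’s a toy sketch of how a k-shot prompt can be assembled with zero task-specific cleverness (the template is my own assumption for illustration, not the exact format used in the GPT-3 paper):

```python
def k_shot_prompt(examples, query, k):
    """Assemble a k-shot prompt: k input->output demonstrations, then the query.

    The same template works for any task, which is what makes the comparison
    task-agnostic -- only k varies across conditions, not the amount of
    prompt-engineering effort spent per task.
    """
    demos = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples[:k])
    tail = f"Input: {query}\nOutput:"
    return f"{demos}\n{tail}" if k > 0 else tail

# Same interface for 0-shot, 1-shot, few-shot:
prompt = k_shot_prompt([("2+2", "4"), ("3+5", "8")], "7+6", k=2)
```

Holding this template fixed and sweeping k is one way to probe “how hard is it to cajole GPT-n into the capability?” without letting per-task prompt cleverness confound the comparison.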
Edit: wanted to add, thank you tremendously for posting this—always appreciate your LLM takes, independent of how fully fleshed they might be
Honestly, at this point, I don’t remember if it’s inferred or primary-sourced. Edited the above for clarity.
This is based on:
This is not to say that GPT-4 won’t have architectural changes. Sam mentioned a longer context at the least. But these sorts of architectural changes probably qualify as “small” in the parlance of the above conversation.
I believe Sam Altman implied they’re simply training a GPT-3-variant for significantly longer for “GPT-4”. The GPT-3 model in prod is nowhere near converged on its training data.
Edit: changed to be less certain, pretty sure this follows from public comments by Sam, but he has not said this exactly
OpenAI is still running evaluations.
This was frustrating to read.
There’s some crux hidden in this conversation regarding how much humanity’s odds depend on the level of technology (read: GDP) increase we’ll be able to achieve with pre-scary-AGI. It seems like Richard thinks we could be essentially post-scarcity, thus radically changing the geopolitical climate (and possibly making collaboration on an X-risk more likely, though this wasn’t spelled out clearly). I actually couldn’t suss out what Eliezer thinks from this conversation—possibly that humanity’s odds are basically independent of the achieved level of technology, or that the world ends significantly sooner than we’ll be able to deploy transformative tech, so the point is moot. I wish y’all had nailed this down further.
Despite the frustration, this was fantastic content, and I’m excited for future installments.
Sure, but you have essentially no guarantee that such a model would remain contained to that group, or that the insights gleaned from that group could be applied unilaterally across the world before a “bad”* actor reimplemented the model and started asking it unsafe prompts.
Much of the danger here is that once any single lab on earth can make such a model, state actors probably aren’t more than 5 years behind, and likely aren’t more than 1 year behind, given the economic value that an AGI represents.
I don’t think the issue is the existence of safe prompts, the issue is proving the non-existence of unsafe prompts. And it’s not at all clear that a GPT-6 that can produce chapters from 2067EliezerSafetyTextbook is not already past the danger threshold.