There's a striking inverted V-shape there (which would be even clearer if you fit a segmented regression with 1 break point), where it's drifting up for all of 2021, only to suddenly reverse and drop over like a stone 2022-2023.
What's the qualitative characterization of Metaculus collective views there? At least from my perspective, 2021 seemed in line with scaling expectations with tons of important DL research demonstrating the scaling hypothesis & blessings of scale. I don't remember anything that would've been pivotal around what looks like a breakpoint Dec 2021 - if it had happened around or after Feb 2022, then you could point to Chinchilla / Gato / PaLM / Flamingo / DALL-E 2, but why Dec 2021?
It's currently at -003 and not the new ChatGPT 3.5 endpoint because when I dropped in the chat model name, the code errored out - apparently it's under a
chat/ path and so the installed OA Py library errors out. I haven't bothered to debug it any further (do I need to specify the engine name as
chat/turbo-gpt-3 or do I need to upgrade the library to some new version or what). I haven't even tried GPT-4 - I have the API access, just been too fashed and busy with other site stuff.
(Technical-wise, we've been doing a lot of Gwern.net refactoring and cleanup an...
What's there to highlight, really? The point is that it looks like a normal abstract... but not one-paragraph. (I've mused about moving in a much more aggressive Elicit-style direction and trying to get a GPT to add the standardized keywords where valid but omitted. GPT-4 surely can do that adequately.)
I suppose if you want a comparison, skimming my newest, the first entry right now is Sánchez-Izquierdo et al 2023 and that is an example of reformatting an abstract to add linebreaks which improve its readability:
This is not a complex abstract and far from ...
I agree that formatting abstracts as single paragraph blocks is surprisingly bad for comprehension; I think it is because abstracts are deceptively difficult for the reader, as they tend to invoke a lot of extremely novel & unusual keywords/concepts and make new claims within the space of a few sentences (not infrequently dumping in many numbers & statistical results into parentheticals, which might have a dozen stats in less space than this), and that they are deceptively easy for the authors to read because they suffer from the curse of expertise...
Take Hanson himself: he has about 100 academic publications, two big books, and something like 3000 blog posts. Which will be his biggest contribution in the end?
Is this a trick question? Obviously the blog posts. The em book (based heavily on blog drafts) had zero impact and is based on a vision of the future that recedes every day (if it's lucky, it might get revived in a few decades as period science fiction), and the co-authored Elephant book was just a popularization of the blog posts. The academic publications may look prestigious but let's be rea...
I don't get the impression that RLHF needs hacks to prevent mode collapse: the InstructGPT reports overfitting leading to better human-rater feedback, and the Anthropic HH paper mentions in passing that the KL penalty may be wholly irrelevant (!).
But IIRC, doesn't OA also mention that to get better results they had to add in continual training of the model on the original raw data? That's much like a KL penalty. (I don't recall the RL-training part of the Anthropic HH paper, just the prompted agent parts.)
You suggest that td-003 mode collapses where t
This is an interesting attempt, but doesn't convince me that Janus is wrong about the phenomenon. There is a long history of RL having mode-collapse for intuitive reasons and needing hacks to stop it, the GPT-4 paper confirms that the RL version of the model can act drastically differently in terms of calibration, and the narrowness of ChatGPT/003 is incredibly striking: every single person who generates more than a few poems seems to remark on the obstinate refusal to generate the kinds of poems that 002 or davinci generate with the greatest of ease (comp...
This is also relevant to understanding why the genre of off-the-cuff tossoffs like "what if corporations are the real superintelligence" or "why can't we solve AGI alignment the same way we solved 'aligning corporations'?" are so wrong. Corporations are not superintelligences. They are, in fact, extremely stupid, much stupider than the sum of their parts, and subject to only the weakest forms of natural selection due to their inability to replicate themselves reliably despite the permanent existence of very large dispersion in efficiency/quality between co...
Microtargeting groups has already been observed to cause preference drift over time leading to large consequences (e.g. Cambridge Analytica).
Reminder: no, it didn't, as you can see by scrolling the relevant WP section about impact (ie. consequences), not the irrelevant scandal section. We know from actual political advertising experiments their data was far too crude to make any impact (not that anything they said wasn't a tissue of advertising & lies), and they weren't even used by the Trump campaign!
Interesting tweet from Marcus 2 days ago:
There can be zero doubt that GPT-4 is better than GPT-3–but also I cannot imagine how we are supposed to achieve ethical and safety “alignment” with a system that cannot understand the word “third” even with billions of training examples.
There are many easy ways to incorporate vision. Vision+text models are a dime a dozen these days - as I said, this currently looks like 'DALL-E 1 but bigger' (VQVAE tokens -> token sequence -> autoregressive modeling of text/image tokens). What we have seen so far doesn't look like 3 years of progress by the best DL researchers.
The lack of GPT-4 in 2020-mid-2021 wasn't too surprising to me. They were busy productizing, optimizing, launching, and had no genuine competition. Everyone with a plausibly competitive model was not releasing it, and the ones which were available were not convincingly better. Why invest or release? Jurassic-1 in July 2021 was the first public API, but I never heard anyone call it noticeably better than davinci. Tick-tock...
What I find a little more curious is no successor in 2021-2022, and that it wasn't until August 2022 that GPT-4 finished training, wit...
The Copilot/Codex tokenization includes a lot of whitespace tokens to deal with exactly this, and many other cases that you can't solve as easily as swapping in tabs. I wonder when there will be a GPT-4-Codex model available?
That's interesting. Earlier, he was very explicitly identifying temperature with creativity in the Tweets I collated when commenting about how the controls worked. So now if the temperature is identical but he's calling whatever it is 'creative', he's completely flipped his position on "hallucinations = creativity", apparently.
Hm. So it's the same temperature, but it's more expensive, which has 'longer output, more expressive, slower', requires more context... That could point to it being a different model under the hood. But it could also point to a diffe...
The latter. I am quite certain that hugely superior architectures exist in the sense of both superior exponents and superior constants (but I'm less sure about being hugely strictly dominated on both), and these are the sorts of things that are what the whole hierarchy of meta-learning is about learning/locating; but that the current sets of architectures are all pretty much alike in being big blobs of feedforward architectures whose inductive biases wash out at what is, in absolute terms, quite small scales (so small scale we can achieve them right now wi...
I think you would get diminishing returns but reading a few hundred thousand tokens would teach you quite a lot, and I think likely more than knowing Transformers would. I'm not convinced that Transformers are all that important (architectures seem to converge at scale, you can remove attention entirely without much damage, not much of the FLOPS is self-attention at this point, etc), but you learn a lot about why GPT-3 is the way it is if you pay attention to the data. For example, BPEs/poetry/puns: you will struggle in vain to explain the patterns of GPT ...
Part of the confidence came from Bing’s success in answering correctly when set to precise mode. Many speculated GPT-4 was going to be even more powerful than Bing, even though they turned out to be the same. I’m not exactly sure what the difference is using the “precise” setting, if anyone knows let me know!
Based on Mikhail's Twitter comments, 'precise' and 'creative' don't seem to be too much more than simply the 'temperature' hyperparameter for sampling. 'Precise' would presumably correspond to very low, near-zero or zero, highly deterministic sample...
Based on Mikhail's Twitter comments, 'precise' and 'creative' don't seem to be too much more than simply the 'temperature' hyperparameter for sampling. 'Precise' would presumably correspond to very low, near-zero or zero, highly deterministic samples.
Nope, Mikhail has said the opposite: https://twitter.com/MParakhin/status/1630280976562819072
Nope, the temperature is (roughly) the same.
So I'd guess the main difference is in the prompt.
I don't think that would really be possible outside OA until they open up the image-input feature, which they haven't AFAIK. You could try to do the number-array approach I think someone has suggested, but given how heavily ARC exploits human-comprehensible visual symmetries & patterns, the results would be a lower-bound at best.
I'm a little confused: I feel like I read this post already, but I can't find it. Was there a prior deleted version?
You did see part of it before; I posted in Open Thread a month ago with the announcement, but today Ray poked me and Oli to also publish some of the reasoning we wrote in slack.
You can just append
#page=3. This works in most PDF viewers. (There are many query parameters that Adobe supports but that's really the only one you need to know about.)
tldw: Brockman showed up some straightforward demos of GPT-4's text & code writing versatility, and some limited demo of its image processing, but you aren't missing anything insightful about the arch/training/scaling/future/etc.
If you were willing to hypothesize a specific scaling law, sure. But it seems like the only safe one to hypothesize is 'better than Scaling Transformer/Chinchilla/zero-shot'.
Sure, but the point is that those theories are much less likely than if GPT-3.5 had done it too.
I too was a bit surprised. Critch should probably have emphasized the hello-world twist a bit more: I don't spend much time reading quines or recreational programming, so I was assuming it could've been memorized and wasn't sure that that was 'novel' (there are lots of quine 'genres', like multilingual quines or 'radiation-hardened' quines) until I'd look through a bunch of results and noticed none of them had that. So his point is not that quines are somehow incredibly amazing & impossible to write hitherto, but that it's gotten good enough at code-writing that it can meaningful modify & adapt quines.
Well, why didn't GPT3.5 also copy it if it was in the training data? As well, I've never seen the specification of 'print hello world' in a quine before, and checking the first 5 hits for
python quine print(source_code.format(source_code)), out of the dozen or so Python quines, none of them look blatantly like this example nor do any of them print out hello-world optionally.
Why didn't GPT-3.5 also copy it if it was in the training data?
Two possible answers:
That being said, I also briefly tried to search for this quine online and couldn't find anything. So I agree, it probably does exhibit this new ability. The reason I was suspicious at first is because the quine prompt seemed generic enough that it could have existed before, but I see that's not the case.
You can definitely infer quite a bit from the paper and authors by section, but there is a big difference between a plausible informed guess, and knowing. For most purposes, weak inferences are not too useful. 'Oh, this is Chinchilla, this is VQ-VAE, this is Scaling Transformer...' For example, the predicting-scaling part (and Sam Altman singling out the author for praise) is clearly the zero-shot hyperparameter work, but that's not terribly helpful, because the whole point of scaling laws (and the mu work in particular) is that if you don't get it right, ...
No, it already is, it's just apparently staggered. EDIT: should be available to everyone now and I've also received API access.
Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.
This non-news seems like it might be the biggest news in the announcement? OpenAI is saying "oops publishing everything was too open, its gonna be more of a black box now".
GPT-4 (discussion) has been released and performs much better than PaLM/U-PaLM, and as predicted, there is also U-scaling with GPT-4 rather than GPT-3/GPT-3.5:
Some capabilities are still hard to predict. For example, the Inverse Scaling Prize was a competition to find a metric that gets worse as model compute increases, and "hindsight neglect" was one of the winners. Just like with another recent result, GPT-4 reverses the trend:
[Inverse Scaling Prize, hindsight neglect: GPT-4 goes to ~100%]
(Paper doesn't seem to provide any additional information on i...
To choose an anti-good Luigi to get a good Waluigi?
I'm not sure what you mean by that. In literary terms, would that just be an evil protagonist who may at some point have the twist of turning out to secretly be genuinely good? But there don't seem to be too many stories or histories like that, and the ones that start with evil protagonist usually end with that: villains like Hitler, Stalin, Mao, or Pol Pot don't suddenly redeem themselves spontaneously. (Stories where the villain is redeemed almost always start with a good Luigi/hero, like Luke Skywalk...
You wouldn't be able to do that because the raw weights would require context windows of millions or billions. Approaches to meta-learning fast weights require more tailored approaches; a good recent example is the meta-learning diffusion model "Gpt". (Yes, that is really its name - possibly the worst named DL result of 2022.)
We won't be seeing optimally trained hundreds of trillions dense parameter models anytime soon.
We won't be seeing them Chinchilla-trained, of course, but that's a completely different claim. Chinchilla scaling is obviously suboptimal compared to something better, just like all scaling laws before it have been. And they've only gone one direction: down.
Dall-E and other image generation models can't reliably get cardinalities right beyond something like 3.
That's just BPEs, and it's worth noting that the other models (and by 'other', I do not mean "Stable Diffusion" like a lazy person would, I mean actual SOTAs) are much more capable of following long intricate prompts eg. Parti.
What's going on with ChatGPT is also likely due to BPEs (as has been the case with GPT when asked to do letter-related tasks going back to GPT-2, as I have been repeating for years) plus perhaps some RLHF shenanigans. (RLHF seem...
GPT-4 will be a text-only model like GPT-3.
That was a very long time ago, and the current GPT-4, you agree, would've been trained well after. So we can't put too much weight on it.
The current reporting is pushing very heavily on multimodality, including the announcement by Microsoft Germany that it will be released next week & explicitly stating GPT-4 will do video, in addition to earlier reporting about images (as well as Altman musing about revenge porn).
he did say that GPT-4 would use a different loss function, what effect would that have? I have no idea.
One possibility is shifting the power law. See UL2 which combines the various denoising losses in what turns out to be a very good way: "U-PaLM: Transcending Scaling Laws with 0.1% Extra Compute", Tay et al 2022 - halving PaLM training requirements w/UL2 losses. I don't know if OA discovered UL2 first, but it's not all that exotic or subtle and is certainly something that many people ask themselves when they learn about the difference between bidirect...
So it's very possible (albeit unlikely) that the number of total GPUs used for GPT-4 training could be higher than 15000!
OAers have noted that the cluster has, of course, been expanded heavily since the original 10k (albeit not what it is now). Morgan Stanley is saying that GPT-5 is being trained right now on 25,000 GPUs, up heavily from the original 10k, and implying that 'most' of the GPT-5 GPUs were used for GPT-4 which finished 'some time ago'; the mean of 10 & 25 is 17.5, so >15k seems entirely possible, especially if those GPUs weren't just...
That doesn't sound too hard. Why does it have to generate a query's result? Why can't it just have a convention to 'write a well-formed query, and then immediately after, write the empty string if there is no response after the query where an automated tool ran out-of-band'? It generates a query, then always (if conditioned on just the query, as opposed to query+automatic-Internet-access-generated-response) generates "", and sees it generates "", and knows it didn't get an answer. I see nothing hard to learn about that.
The model could also simply note that...
These may be true, but it is unclear how they are relevant to explaining the recent trends and how they differ by groups. There is, and long has been, intense state & parental control of childrens' lives and often not for the better: but how does that explain a change in trends in 2011 to increase, prior decreases in the 1990s, experimental results like quitting social media (where parental/state oversight is minimal) apparently increasing mental health, or differences like 'liberal girls are more affected than conservative girls'?
Think of it like AlphaGo - if it only ever could train itself by playing Go against actual humans, it would never have become superintelligent at Go.
This is obviously untrue in both the model-free and model-based RL senses. There are something like 30 million human Go players who can play a game in two hours. AlphaGo was trained on policy gradients from, as it happens, on the order of 30m games; so it could accumulate a similar order of games in under a day; the subset of pro games can be upweighted to provide most of the signal - and when they stop pro...
It didn't have to be revealed. That Quirrel was Voldemort was obvious almost within the first chapter introducing him (eg already taken for granted in the earliest top-level discussion page in 2010), to the extent that fandebate over 'Quirrelmort' was mostly "he's so obviously Voldemort, just like in canon - but surely Yudkowsky would never make it that easy, so could he possibly be someone or something else? An impostor? some sort of diary-like upload? a merger of Quirrel/Voldemort? Harry from the future?" Bragging about being the inspiration for a charac...
So the current on-LW version has always just been there, or something?
If maintenance were the rationale (which is perfectly reasonable), they could have just dumped the static HTML and called it a day. It's not like the site changed or had any special interactive or dynamic aspects. (And such an archival scrape would likely have been substantially less effort than this special integration has been, and will be, I predict.)
No, you're doing it wrong, as I already explained. You're letting GPT fall back onto its policy by choosing any response. You need to force it out of its comfort zone. Ask it to explain a pun it did not write, or answer questions like whether a pair of words that you picked rhyme. Write pairs of new words that have never been seen before, etc. The task of 'come up with a memorized rhyme for reasonably common words' does not disprove extensive memorization or show that it has failed to understand the underlying phonetics.
Reading Dan Wang's belated letter, where he describes Shanghai and the abrupt collapse of Zero Covid, reminds me of one interesting aspect for us of base rates, Outside View reasoning, rocks which say 'everything is fine, and the difference between probabilities & decisions:
for a long time, it was obvious that Zero Covid was not working, especially once increasingly infectious variants meant that there was community spread & all the measures had failed to make r<<1 but was hovering at r=1, and they were burning through patience, time, and mon...
The claim that it reached 100 million users within two months has been reported by many news outlets, which all seem to bottom out in data from Similarweb. I was not able to find a detailed report, but it looks like they have more data behind a paywall. I think it’s reasonable to accept this claim for now, but, again, it might be different in some way from what the media is reporting1.
FWIW, there's an article somewhere quoting an OAer saying that the real number was more like half that.
Of course, what means is still highly ambiguous. Accounts which logg...
I think so. If someone could show that BPEs were changing the scaling laws on an important task end-users will pay for, then it wouldn't be hard to change that: for example, I noted that Codex induced OA to change BPEs, because that substantially increased the effective context window when you generate BPEs optimized for programming language syntax, which matters to big paying customers like Github (the larger the ctx, the more the variables & definitions inside a specific project are available for relevant completion). Otherwise, the general attitude ...
Yeah, I will at some point, but frontend work with Said always comes first. If you want to patch it yourself, I'd definitely try it.