All of gwern's Comments + Replies

Yeah, I will at some point, but frontend work with Said always comes first. If you want to patch it yourself, I'd definitely try it.

3 gwillen 2d
https://github.com/gwern/gwern.net/pull/6 It would be exaggerating to say I patched it; I would say that GPT-4 patched it at my request, and I helped a bit. (I've been doing a lot of that in the past ~week.)

There's a striking inverted V-shape there (which would be even clearer if you fit a segmented regression with 1 break point), where it drifts up for all of 2021, only to suddenly reverse and drop like a stone over 2022-2023.

What's the qualitative characterization of Metaculus's collective views there? At least from my perspective, 2021 seemed in line with scaling expectations, with tons of important DL research demonstrating the scaling hypothesis & blessings of scale. I don't remember anything that would've been pivotal around what looks like a breakpoint in Dec 2021 - if it had happened around or after Feb 2022, then you could point to Chinchilla / Gato / PaLM / Flamingo / DALL-E 2, but why Dec 2021?

2 Gabriel Mukobi 2d
Good idea. Here's just the predictions starting in 2022 and on. Then you get a prediction update rate of 16 years per year; Halloween 2023 is when AGI is extrapolated to arrive...
2 dsj 2d
Seems to me there's too much noise to pinpoint the break at a specific month. There are some predictions made in early 2022 with an even later date than those made in late 2021. But one pivotal thing around that time might have been the chain-of-thought stuff which started to come to attention then (even though there was some stuff floating around Twitter earlier).

It's currently at -003 and not the new ChatGPT 3.5 endpoint because when I dropped in the chat model name, the code errored out - apparently it's under a chat/ path, which the installed OA Py library then errors out on. I haven't bothered to debug it any further (do I need to specify the engine name as chat/turbo-gpt-3, or do I need to upgrade the library to some new version, or what). I haven't even tried GPT-4 - I have the API access, just been too fashed and busy with other site stuff.

(Technical-wise, we've been doing a lot of Gwern.net refactoring and cleanup an... (read more)

3 gwillen 3d
The better models do require using the chat endpoint instead of the completion endpoint. They are also, as you might infer, much more strongly RL-trained for instruction following and the chat format specifically. I definitely think it's worth the effort to try upgrading to gpt-3.5-turbo, and I would say even gpt-4, but the cost is significantly higher for the latter. (I think 3.5 is actually cheaper than davinci.) If you're using the library you need to switch from Completion to ChatCompletion, and the API is slightly different -- I'm happy to provide sample code if it would help, since I've been playing with it myself, but to be honest it all came from GPT-4 itself (using ChatGPT Plus). If you just describe what you want (at least for fairly small snippets) and ask GPT-4 to code it for you, directly in ChatGPT, you may be pleasantly surprised. (As far as how to structure the query, I would suggest something akin to starting with a "user" chat message of the form "please complete the following:" followed by whatever completion prompt you were using before. Better instructions will probably get better results, but that will probably get something workable immediately.)
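A minimal sketch of the switch described above, assuming the pre-1.0 openai Python library; the prompt text and API key are placeholders:

```python
import openai  # pre-1.0 "openai" package, which exposes Completion & ChatCompletion

openai.api_key = "sk-..."  # placeholder

completion_prompt = "..."  # whatever prompt was previously sent to text-davinci-003

# Old style: completion endpoint
# resp = openai.Completion.create(model="text-davinci-003",
#                                 prompt=completion_prompt, max_tokens=256)
# text = resp["choices"][0]["text"]

# New style: chat endpoint (gpt-3.5-turbo, or "gpt-4" if you have access)
resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user",
         "content": "Please complete the following:\n\n" + completion_prompt},
    ],
    max_tokens=256,
)
text = resp["choices"][0]["message"]["content"]
```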

What's there to highlight, really? The point is that it looks like a normal abstract... but not one-paragraph. (I've mused about moving in a much more aggressive Elicit-style direction and trying to get a GPT to add the standardized keywords where valid but omitted. GPT-4 surely can do that adequately.)

I suppose if you want a comparison, skimming my newest, the first entry right now is Sánchez-Izquierdo et al 2023 and that is an example of reformatting an abstract to add linebreaks which improve its readability:

https://gwern.net/doc/www/misc/xwd-16796729702197598.png

This is not a complex abstract and far from ... (read more)

I agree that formatting abstracts as single-paragraph blocks is surprisingly bad for comprehension; I think it is because abstracts are deceptively difficult for the reader, as they tend to invoke a lot of extremely novel & unusual keywords/concepts and make new claims within the space of a few sentences (not infrequently dumping many numbers & statistical results into parentheticals, which might have a dozen stats in less space than this), and they are deceptively easy for the authors to read because the authors suffer from the curse of expertise... (read more)

3 gwillen 3d
Have you considered switching to GPT-3.5 or -4? You can get much better results out of much less prompt engineering. GPT-4 is expensive but it's worth it.
2 Raemon 4d
Do you have a link to a specific part of the gwern site highlighting this, and/or a screenshot?

Take Hanson himself: he has about 100 academic publications, two big books, and something like 3000 blog posts. Which will be his biggest contribution in the end?

Is this a trick question? Obviously the blog posts. The em book (based heavily on blog drafts) had zero impact and is based on a vision of the future that recedes every day (if it's lucky, it might get revived in a few decades as period science fiction), and the co-authored Elephant book was just a popularization of the blog posts. The academic publications may look prestigious but let's be rea... (read more)

1 Bill Benzon 6d
I see what you mean. Hanson himself may not be a good example of the argument the author is making. But what about the point Hanson was making in the quoted passage? He does accept the discipline of publishing in the formal literature, and he also uses his blog posts as a vehicle for developing more disciplined ideas. Blogging is one thing when it is part of a larger intellectual strategy. But if blogging – & Twitter, Reddit, Facebook and whatever else – is all there is, what happens to rigor then?

I don't get the impression that RLHF needs hacks to prevent mode collapse: the InstructGPT paper reports overfitting leading to better human-rater feedback, and the Anthropic HH paper mentions in passing that the KL penalty may be wholly irrelevant (!).

But IIRC, doesn't OA also mention that to get better results they had to add in continual training of the model on the original raw data? That's much like a KL penalty. (I don't recall the RL-training part of the Anthropic HH paper, just the prompted agent parts.)

You suggest that td-003 mode collapses where td-002 is perfectly capable.

... (read more)
3 Arthur Conmy 7d
That's true, I think the pretraining-gradients training choice probably has more effect on the end model than the overfit SFT model they start PPO with. Huh, but Mysteries of mode collapse (and the update) were published before td-003 was released? How would you have ended up reading a post claiming td-002 was RLHF-trained when td-003 existed? Meta note: it's plausibly net positive that all the training details of these models have been obfuscated, but it's frustrating how much energy has been sunk into speculation on The Way Things Work Inside OpenAI.
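For reference, a rough sketch of the objective being argued about here (reward model score, per-token KL penalty against the SFT policy, plus the 'pretraining gradients' term that gets compared to a KL penalty), in notation loosely following the InstructGPT paper:

$$\text{objective}(\phi) \;=\; \mathbb{E}_{(x,y)\sim \pi_{\phi}^{\mathrm{RL}}}\Big[r_{\theta}(x,y) \;-\; \beta \log \frac{\pi_{\phi}^{\mathrm{RL}}(y\mid x)}{\pi^{\mathrm{SFT}}(y\mid x)}\Big] \;+\; \gamma\, \mathbb{E}_{x\sim D_{\mathrm{pretrain}}}\big[\log \pi_{\phi}^{\mathrm{RL}}(x)\big]$$

Setting $\gamma = 0$ recovers plain PPO with a KL penalty; the $\gamma$ term is the continual training on the original raw data mentioned above.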

This is an interesting attempt, but doesn't convince me that Janus is wrong about the phenomenon. There is a long history of RL having mode-collapse for intuitive reasons and needing hacks to stop it, the GPT-4 paper confirms that the RL version of the model can act drastically differently in terms of calibration, and the narrowness of ChatGPT/003 is incredibly striking: every single person who generates more than a few poems seems to remark on the obstinate refusal to generate the kinds of poems that 002 or davinci generate with the greatest of ease (comp... (read more)

0 skulk-and-quarrel 7d
The impact in ChatGPT could potentially be due to longer prompts or the "system prompt". It would be great to test that in a similar analysis.
1 Arthur Conmy 7d
I wasn't trying to say the mode collapse results were wrong! I collected these results before finding crisper examples of mode collapse that I could build a useful interpretability project on. I also agree with the remarks made about the difficulty of measuring this phenomenon. I indeed tried to use the OpenAI embeddings model to encode the various completions and then hopefully have the Euclidean distance be informative, but it seemed to predict large distances for similar completions, so I gave up. I also made a consistent color scheme and compared code-davinci, thanks for those suggestions.

I don't get the impression that RLHF needs hacks to prevent mode collapse: the InstructGPT paper reports overfitting leading to better human-rater feedback, and the Anthropic HH paper mentions in passing that the KL penalty may be wholly irrelevant (!).

I'm not sure how to interpret the evidence from your first paragraph. You suggest that td-003 mode collapses where td-002 is perfectly capable. So you believe that both td-002 and td-003 mode collapse, in disjoint cases (given the examples from the original mode collapse post)?

This is also relevant to understanding why the genre of off-the-cuff tossoffs like "what if corporations are the real superintelligence" or "why can't we solve AGI alignment the same way we solved 'aligning corporations'?" is so wrong. Corporations are not superintelligences. They are, in fact, extremely stupid, much stupider than the sum of their parts, and subject to only the weakest forms of natural selection due to their inability to replicate themselves reliably despite the permanent existence of very large dispersion in efficiency/quality between co... (read more)

2 lc 6d
And given the SEC.

Microtargeting groups has already been observed to cause preference drift over time leading to large consequences (e.g. Cambridge Analytica).

Reminder: no, it didn't, as you can see by scrolling to the relevant WP section about impact (i.e. consequences), not the irrelevant scandal section. We know from actual political-advertising experiments that their data was far too crude to make any impact (not that anything they said wasn't a tissue of advertising & lies), and they weren't even used by the Trump campaign!

1 markov 9d
Thanks for pointing that out! It's embarrassing that I made a mistake, but it's also relieving in some sense to learn that the impacts were not as I had thought them to be. I hope this error doesn't serve to invalidate the entire post. I don't really know what the post-publishing editing etiquette is, but I don't want to keep anything in the post that might serve as misinformation so I'll edit this line out. Please let me know if there are any other flaws you find and I'll get them fixed.

Interesting tweet from Marcus 2 days ago:

There can be zero doubt that GPT-4 is better than GPT-3–but also I cannot imagine how we are supposed to achieve ethical and safety “alignment” with a system that cannot understand the word “third” even with billions of training examples.

3 ZeroRelevance 10d
I always get annoyed when people use this as an example of 'lacking intelligence'. Though it certainly is in part an issue with the model, the primary reason for this failure is much more likely the tokenization process than anything else. A GPT-4, likely even a GPT-3, trained with character-level tokenization would likely have zero issues answering these questions. It's for the same reason that the base GPT-3 struggled so much with rhyming for instance.
4 baturinsky 10d
He refers to the test questions about the third word and letter, etc. I think in that case the errors stem from GPT-4's weakness with low-level properties of character strings, not from its weakness with numbers. If you ask it "What is the third digit of the third three-digit prime?" it will answer correctly (ChatGPT won't).
3 Roman Leventov 10d
What is interesting about this tweet? That Marcus turns to the alignment problem?

There are many easy ways to incorporate vision. Vision+text models are a dime a dozen these days - as I said, this currently looks like 'DALL-E 1 but bigger' (VQVAE tokens -> token sequence -> autoregressive modeling of text/image tokens). What we have seen so far doesn't look like 3 years of progress by the best DL researchers.

2 sairjy 8d
OpenAI has transitioned from being a purely research company to an engineering one. GPT-3 was still research after all, and it was trained with a relatively small amount of compute. After that, they had to build infrastructure to serve the models via API and a new supercomputing infrastructure to train new models with 100x the compute of GPT-3 in an efficient way. The fact that we are openly hearing rumours of GPT-5 being trained, and nobody is denying them, means it is likely that they will ship a new version every year or so from now on.

The lack of GPT-4 in 2020-mid-2021 wasn't too surprising to me. They were busy productizing, optimizing, launching, and had no genuine competition. Everyone with a plausibly competitive model was not releasing it, and the ones which were available were not convincingly better. Why invest or release? Jurassic-1 in July 2021 was the first public API, but I never heard anyone call it noticeably better than davinci. Tick-tock...

What I find a little more curious is no successor in 2021-2022, and that it wasn't until August 2022 that GPT-4 finished training, wit... (read more)

1 Gerald Monroe 11d
Umm... the vision? How did they even train it? Assuming they did it like Gato:

• Images are first transformed into sequences of non-overlapping 16 × 16 patches in raster order, as done in ViT (Dosovitskiy et al., 2020). Each pixel in the image patches is then normalized between [−1, 1] and divided by the square-root of the patch size (i.e. √16 = 4).
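A rough numpy sketch of the patching step quoted above (the exact [0, 255] → [−1, 1] mapping is an assumption beyond what the quote specifies):

```python
import numpy as np

def image_to_patch_tokens(img, patch=16):
    """Split an HxWxC uint8 image into non-overlapping patch x patch tokens,
    in raster order, normalized to [-1, 1] and divided by sqrt(patch)."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    x = img.astype(np.float32) / 127.5 - 1.0       # map [0, 255] -> [-1, 1]
    x = x.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)                 # (rows, cols, patch, patch, C)
    x = x.reshape(-1, patch, patch, c)             # raster-order sequence of patches
    return x / np.sqrt(patch)                      # divide by sqrt(16) = 4
```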

The Copilot/Codex tokenization includes a lot of whitespace tokens to deal with exactly this, and many other cases that you can't solve as easily as swapping in tabs. I wonder when there will be a GPT-4-Codex model available?

That's interesting. Earlier, he was very explicitly identifying temperature with creativity in the Tweets I collated when commenting about how the controls worked. So now if the temperature is identical but he's calling whatever it is 'creative', he's completely flipped his position on "hallucinations = creativity", apparently.

Hm. So it's the same temperature, but it's more expensive, which has 'longer output, more expressive, slower', requires more context... That could point to it being a different model under the hood. But it could also point to a diffe... (read more)

The latter. I am quite certain that hugely superior architectures exist in the sense of both superior exponents and superior constants (but I'm less sure about being hugely strictly dominated on both), and these are the sorts of things that are what the whole hierarchy of meta-learning is about learning/locating; but that the current sets of architectures are all pretty much alike in being big blobs of feedforward architectures whose inductive biases wash out at what is, in absolute terms, quite small scales (so small scale we can achieve them right now wi... (read more)

I think you would get diminishing returns but reading a few hundred thousand tokens would teach you quite a lot, and I think likely more than knowing Transformers would. I'm not convinced that Transformers are all that important (architectures seem to converge at scale, you can remove attention entirely without much damage, not much of the FLOPS is self-attention at this point, etc), but you learn a lot about why GPT-3 is the way it is if you pay attention to the data. For example, BPEs/poetry/puns: you will struggle in vain to explain the patterns of GPT ... (read more)

1 Gerald Monroe 11d
Does this mean hugely superior architectures to transformers (measured by benchmarking them with the same compute and data input) don't exist or that transformers and RNNs and everything else are all close enough cousins?

Part of the confidence came from Bing’s success in answering correctly when set to precise mode. Many speculated GPT-4 was going to be even more powerful than Bing, even though they turned out to be the same. I’m not exactly sure what the difference is using the “precise” setting, if anyone knows let me know!

Based on Mikhail's Twitter comments, 'precise' and 'creative' don't seem to be too much more than simply the 'temperature' hyperparameter for sampling. 'Precise' would presumably correspond to very low, near-zero or zero, highly deterministic sample... (read more)

Based on Mikhail's Twitter comments, 'precise' and 'creative' don't seem to be too much more than simply the 'temperature' hyperparameter for sampling. 'Precise' would presumably correspond to very low, near-zero or zero, highly deterministic samples.

Nope, Mikhail has said the opposite: https://twitter.com/MParakhin/status/1630280976562819072

Nope, the temperature is (roughly) the same.

So I'd guess the main difference is in the prompt.
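Since the exchange turns on what the 'temperature' hyperparameter actually controls, here is a generic sketch of the sampling knob (the standard trick, not Bing's actual code): near-zero temperature makes decoding essentially deterministic, higher temperature samples from the model's full distribution.

```python
import numpy as np

def sample_token(logits, temperature=1.0, rng=np.random.default_rng(0)):
    """Sample one token index from raw logits at a given temperature."""
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0:
        return int(np.argmax(logits))         # greedy: fully deterministic
    scaled = logits / temperature             # low T sharpens, high T flattens
    probs = np.exp(scaled - scaled.max())     # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```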

1 Gerald Monroe 12d
Note that 2-stage generation - asking it if it's sure about its answer and using the second response as the output - solves Monty Fall every time I tried it.

I don't think that would really be possible outside OA until they open up the image-input feature, which they haven't AFAIK. You could try to do the number-array approach I think someone has suggested, but given how heavily ARC exploits human-comprehensible visual symmetries & patterns, the results would be a lower-bound at best.

I'm a little confused: I feel like I read this post already, but I can't find it. Was there a prior deleted version?

You did see part of it before; I posted in Open Thread a month ago with the announcement, but today Ray poked me and Oli to also publish some of the reasoning we wrote in slack.

You can just append #page=3. This works in most PDF viewers. (There are many query parameters that Adobe supports but that's really the only one you need to know about.)
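For example, https://cdn.openai.com/papers/gpt-4.pdf#page=3 opens that PDF directly at page 3.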

tldw: Brockman showed off some straightforward demos of GPT-4's text & code writing versatility, and some limited demo of its image processing, but you aren't missing anything insightful about the arch/training/scaling/future/etc.

2 Quadratic Reciprocity 12d
The really cool bit was when he took a very quick mockup of a web app drawn on a piece of paper, uploaded a photo of it, and GPT-4 then used just that to write the HTML and JavaScript for the app based on the drawing.

If you were willing to hypothesize a specific scaling law, sure. But it seems like the only safe one to hypothesize is 'better than Scaling Transformer/Chinchilla/zero-shot'.
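For concreteness, the 'specific scaling law' usually meant here is a Chinchilla-style parametric fit of loss against parameter count $N$ and training tokens $D$ (a sketch of the functional form only, with the fitted constants omitted):

$$L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}, \qquad C \approx 6\,N\,D$$

'Better than Chinchilla' then just means reaching a lower loss for the same compute budget $C$, whether via data, objective, or architecture changes.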

2 ESRogs 12d
Better meaning more capability per unit of compute? If so, how can we be confident that it's better than Chinchilla? I can see an argument that it should be at least as good — if they were throwing so much money at it, they would surely follow currently known best practice. But is there evidence to suggest that they figured out how to do things more efficiently than had ever been done before?

Sure, but the point is that those theories are much less likely than if GPT-3.5 had done it too.

I too was a bit surprised. Critch should probably have emphasized the hello-world twist a bit more: I don't spend much time reading quines or recreational programming, so I was assuming it could've been memorized and wasn't sure that that was 'novel' (there are lots of quine 'genres', like multilingual quines or 'radiation-hardened' quines) until I'd looked through a bunch of results and noticed none of them had that. So his point is not that quines are somehow incredibly amazing & impossible to write hitherto, but that it's gotten good enough at code-writing that it can meaningfully modify & adapt quines.
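For readers who don't want to click through, a minimal sketch (written here for illustration, not GPT-4's actual output) of the kind of program under discussion: a Python quine that prints "Hello world!" when given a command-line argument and otherwise prints its own source:

```python
import sys
s = 'import sys\ns = %r\nprint("Hello world!" if len(sys.argv) > 1 else s %% s)'
print("Hello world!" if len(sys.argv) > 1 else s % s)
```

Run with no arguments, it reproduces its own three lines exactly; run with any argument, it prints the greeting instead.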

2 TekhneMakre 13d
Surely one should look for ones that are like "Quine given argument 1, output [something else] given argument 2". The presence or absence of this sort of already very modular template being in the data would give better context.

Well, why didn't GPT-3.5 also copy it if it was in the training data? Also, I've never seen the specification of 'print hello world' in a quine before, and checking the first 5 hits for python quine print(source_code.format(source_code)), out of the dozen or so Python quines, none of them look blatantly like this example, nor do any of them print out hello-world optionally.

1 [anonymous] 12d
If you have a very large training dataset and the phenomenon of interest is sparsely represented in that training data, it's well known that as we increase the number of parameters of the model, its ability to accurately handle those cases increases. Unless there is any evidence against that simple explanation, it seems most natural to just think that GPT-4 has the required model complexity to consistently handle this somewhat rare coding phenomenon - and that GPT-3.5 did not. However, I would be surprised if, after poking at GPT-3.5 to do quines repeatedly, we could not get it to do something similar. In fact, having just tried myself, it gave me a perfectly elegant quine:
1 DominikPeters 13d
GitHub code searches for "python quine" and "python quine format" also don't throw up things I'd call similar.

Why didn't GPT-3.5 also copy it if it was in the training data?

Two possible answers:

  • The quine wasn't in the training data of GPT-3.5 but was in the training data of GPT-4
  • GPT-4 is better at "retrieving" answers from the training data

That being said, I also briefly tried to search for this quine online and couldn't find anything. So I agree, it probably does exhibit this new ability. The reason I was suspicious at first is because the quine prompt seemed generic enough that it could have existed before, but I see that's not the case.

You can definitely infer quite a bit from the paper and authors by section, but there is a big difference between a plausible informed guess and knowing. For most purposes, weak inferences are not too useful. 'Oh, this is Chinchilla, this is VQ-VAE, this is Scaling Transformer...' For example, the predicting-scaling part (and Sam Altman singling out the author for praise) is clearly the zero-shot hyperparameter work, but that's not terribly helpful, because the whole point of scaling laws (and the µP work in particular) is that if you don't get it right, ... (read more)

2 M. Y. Zuo 12d
Is there even enough training data for GPT-5? (Assuming its goal is to 50x or 100x GPT-4.)
9 dmurfet 12d
By the zero-shot hyperparameter work do you mean https://arxiv.org/abs/2203.03466 "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer"? I've been sceptical of NTK-based theory; seems I should update.

No, it already is, it's just apparently staggered. EDIT: should be available to everyone now and I've also received API access.

OA: https://cdn.openai.com/papers/gpt-4.pdf#page=2

Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.

4 avturchin 13d
Could we infer the number of parameters from scaling laws?

This non-news seems like it might be the biggest news in the announcement? OpenAI is saying "oops, publishing everything was too open, it's gonna be more of a black box now".

GPT-4 (discussion) has been released and performs much better than PaLM/U-PaLM, and as predicted, there is also U-scaling with GPT-4 rather than GPT-3/GPT-3.5:

Some capabilities are still hard to predict. For example, the Inverse Scaling Prize was a competition to find a metric that gets worse as model compute increases, and "hindsight neglect" was one of the winners. Just like with another recent result, GPT-4 reverses the trend:

[Inverse Scaling Prize, hindsight neglect: GPT-4 goes to ~100%]

(Paper doesn't seem to provide any additional information on i... (read more)

3 Tapatakt 13d
It is not clear if this happened on its own, or if they deliberately trained the model not to make such mistakes. Perhaps, in similar future studies, it is worth keeping half of the found tasks secret in order to test future models with them.

(Deleting in favor of a slightly prior submission.)

To choose an anti-good Luigi to get a good Waluigi?

I'm not sure what you mean by that. In literary terms, would that just be an evil protagonist who may at some point have the twist of turning out to secretly be genuinely good? But there don't seem to be too many stories or histories like that, and the ones that start with an evil protagonist usually end with that: villains like Hitler, Stalin, Mao, or Pol Pot don't suddenly redeem themselves spontaneously. (Stories where the villain is redeemed almost always start with a good Luigi/hero, like Luke Skywalk... (read more)

5 Robert_AIZI 13d
I think these meet your criterion of starting solely with anti-good characters:

1. Cecil from FF4 [https://finalfantasy.fandom.com/wiki/Cecil_Harvey] starts as a literal dark knight before realizing he's working for an evil empire, becoming a paladin, and saving the world.
2. John Preston from Equilibrium [https://en.wikipedia.org/wiki/Equilibrium_(film)] (the protagonist, played by Christian Bale) is a fascist secret police agent until he accidentally feels emotion, then realizes that anti-emotion fascism is bad and overthrows it.
3. Megamind from Megamind [https://en.wikipedia.org/wiki/Megamind] is a supervillain who realizes that actually he should be a hero. (Maybe this shouldn't count because there's initially a superhero? But the protagonist is Megamind throughout.)
4. Grace from Infinity Train season 3 [https://infinity-train.fandom.com/wiki/Grace_Monroe] starts as a cult leader trying to maximize the in-universe utility function (literally! [https://infinity-train.fandom.com/wiki/Numbers]), but got the sign wrong so she's absolutely terrible. But she meets a small child and realizes she's terrible and works to overcome that.
5. Gru from Despicable Me [https://en.wikipedia.org/wiki/Despicable_Me] starts out a supervillain but eventually becomes a loving father and member of the "Anti-Villain League".
6. Joel from The Last of Us [https://en.wikipedia.org/wiki/The_Last_of_Us] is a murderer in the post-apocalypse who is redeemed by finding a surrogate daughter figure and at the end of the story... I have been advised this is not a suitable role-model for an AI, please disregard.

Some themes of such redemption stories (safety implications left to the reader):

1. Adopting one or more children (1, 4, 5, 6)
2. Having an even eviler version of yourself to oppose (2, 3, 4, 5)

You wouldn't be able to do that because the raw weights would require context windows of millions or billions. Meta-learning fast weights requires more tailored approaches; a good recent example is the meta-learning diffusion model "G.pt". (Yes, that is really its name - possibly the worst-named DL result of 2022.)

We won't be seeing optimally trained hundreds of trillions dense parameter models anytime soon.

We won't be seeing them Chinchilla-trained, of course, but that's a completely different claim. Chinchilla scaling is obviously suboptimal compared to something better, just like all scaling laws before it have been. And they've only gone one direction: down.

Dall-E and other image generation models can't reliably get cardinalities right beyond something like 3.

That's just BPEs, and it's worth noting that the other models (and by 'other', I do not mean "Stable Diffusion" like a lazy person would, I mean actual SOTAs) are much more capable of following long intricate prompts, e.g. Parti.

What's going on with ChatGPT is also likely due to BPEs (as has been the case with GPT when asked to do letter-related tasks going back to GPT-2, as I have been repeating for years) plus perhaps some RLHF shenanigans. (RLHF seem... (read more)

1 p.b. 13d
BPEs explain the string examples, but I also couldn't get ChatGPT to accurately count words.

Edit: An example that I just now tried with GPT-3.5-turbo:

User: How many times does the word "cat" occur in the following sentence: A cat was a cat in a cat with a cat as a hat and picture of a cat on the cat hat.
Assistant: The word "cat" occurs 5 times in the sentence.

Because apparently I don't have it in me to concentrate on work today, here is some more experimentation with ChatGPT:

Me: How many times does the word "cat" occur in the following sentence: A cat was a cat in a cat with a cat as a hat and picture of a cat on the cat hat.
ChatGPT: The word "cat" occurs five times in the sentence.
Me: Could you repeat the sentence but put a number after each occurrence of "cat", counting up?
ChatGPT: Sure, here's the sentence with a number after each occurrence of "cat" counting up: "A cat (1) was a cat (2) in a cat (3) with a cat (4) as a hat and picture of a cat (5) on the cat hat."

So I thought, maybe "a cat" is one token for some reason, and I tried:

Me: Now, count the number of occurrences of the word "permeable" in the following sentence: A permeable membrane is more permeable if it is more permeable rather than less permeable than a less permeable membrane is permeable.
ChatGPT: The word "permeable" appears four times in the sentence.
Me: Could you again repeat the sentence and put a number in brackets after each occurrence of "permeable", counting up?
ChatGPT: Sure, here's the sentence with a number in brackets after each occurrence of "permeable" counting up: "A permeable membrane is more permeable if it is [1] more permeable rather than less permeable than a less permeable membrane is [2] permeable." "Permeable" appears twice in the first clause and twice in the second clause of the sentence, so there are four occurrences of the word "permeable" in total.
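One way to see the BPE point concretely, assuming the tiktoken library (which exposes the tokenizers these models use): the model never sees individual characters or whitespace-delimited words, only token chunks such as ' cat' and ' hat' with their leading spaces.

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
sentence = ("A cat was a cat in a cat with a cat as a hat "
            "and picture of a cat on the cat hat.")
tokens = enc.encode(sentence)
# Print the raw chunks the model actually operates on, rather than words/letters
print([enc.decode([t]) for t in tokens])
```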

GPT-4 will be a text-only model like GPT-3.

That was a very long time ago, and the current GPT-4, you agree, would've been trained well after. So we can't put too much weight on it.

The current reporting is pushing very heavily on multimodality, including the announcement by Microsoft Germany that it will be released next week & explicitly stating GPT-4 will do video, in addition to earlier reporting about images (as well as Altman musing about revenge porn).

he did say that GPT-4 would use a different loss function, what effect would that have? I have no idea.

One possibility is shifting the power law. See UL2 which combines the various denoising losses in what turns out to be a very good way: "U-PaLM: Transcending Scaling Laws with 0.1% Extra Compute", Tay et al 2022 - halving PaLM training requirements w/UL2 losses. I don't know if OA discovered UL2 first, but it's not all that exotic or subtle and is certainly something that many people ask themselves when they learn about the difference between bidirect... (read more)

So it's very possible (albeit unlikely) that the number of total GPUs used for GPT-4 training could be higher than 15000!

OAers have noted that the cluster has, of course, been expanded heavily since the original 10k (albeit not what it is now). Morgan Stanley is saying that GPT-5 is being trained right now on 25,000 GPUs, up heavily from the original 10k, and implying that 'most' of the GPT-5 GPUs were used for GPT-4 which finished 'some time ago'; the mean of 10 & 25 is 17.5, so >15k seems entirely possible, especially if those GPUs weren't just... (read more)

That doesn't sound too hard. Why does it have to generate a query's result? Why can't it just have a convention to 'write a well-formed query, and then immediately after, write the empty string if there is no response after the query where an automated tool ran out-of-band'? It generates a query, then always (if conditioned on just the query, as opposed to query+automatic-Internet-access-generated-response) generates "", and sees it generates "", and knows it didn't get an answer. I see nothing hard to learn about that.

The model could also simply note that... (read more)

These may be true, but it is unclear how they are relevant to explaining the recent trends and how they differ by group. There is, and long has been, intense state & parental control of children's lives, and often not for the better: but how does that explain a trend change toward increase in 2011, the prior decreases in the 1990s, experimental results like quitting social media (where parental/state oversight is minimal) apparently increasing mental health, or differences like 'liberal girls are more affected than conservative girls'?

1 ztzuliios 15d
Control is not a constant, and ability to effectively control depends on the social context. The state itself has acted as a counterweight to parental control for hundreds of years, and capital also acts as a counterweight -- if you don't want to live the way your parents want you to live or marry who they want you to marry, you can run away to the city and live free, which is easier if there are strong laws preventing you from being hunted down and honor-killed and jobs waiting for you in the urban center. Control was arguably at all-time lows in the late 60s and 70s.

But the 80s are a period of reaction against these excesses, and safetyism can be argued to have started in the 80s. The first law mandating car seats is passed in 1979 and the first law mandating seat belts in 1984. More tellingly, the satanic ritual abuse panic kicks off hard in 1983 with the McMartin trial and the next year satanic ritual abuse panic advocates testify before Congress. Stranger danger spreads as a meme, reducing the ability of young people to travel freely via hitchhiking, even though the actual risk remains low.

The Internet disrupts this control process by creating a new space where young people are more able to navigate than parents. I'd argue this is the cause of the decline through the 90s: increased freedom from the nascent Internet. Gradually, this is curtailed as BBSs become forums, and forums become social media. At the same time, censorship is productized and sold to parents, and as early as 2008 schools were having extracurricular brainwashing sessions designed to scare children away from using the Internet as a vehicle of expression because "what's on the Internet is forever."

I do not understand how parental and state oversight could be said to be minimal on social media. Schools install spyware on their own devices and recommend parents do the same. "Parental control features" are ubiquitous and you can't crack your mom's password with a l0phtcrack CD because Windows

Think of it like AlphaGo - if it only ever could train itself by playing Go against actual humans, it would never have become superintelligent at Go.

This is obviously untrue in both the model-free and model-based RL senses. There are something like 30 million human Go players who can play a game in two hours. AlphaGo was trained on policy gradients from, as it happens, on the order of 30m games; so it could accumulate a similar order of games in under a day; the subset of pro games can be upweighted to provide most of the signal - and when they stop pro... (read more)

It didn't have to be revealed. That Quirrell was Voldemort was obvious almost within the first chapter introducing him (eg. already taken for granted in the earliest top-level discussion page in 2010), to the extent that fan debate over 'Quirrelmort' was mostly "he's so obviously Voldemort, just like in canon - but surely Yudkowsky would never make it that easy, so could he possibly be someone or something else? An impostor? Some sort of diary-like upload? A merger of Quirrell/Voldemort? Harry from the future?" Bragging about being the inspiration for a charac... (read more)

So the current on-LW version has always just been there, or something?

4 mingyuan 20d
Yup [https://web.archive.org/web/20171107003453/https://www.lesserwrong.com/hpmor]

If maintenance were the rationale (which is perfectly reasonable), they could have just dumped the static HTML and called it a day. It's not like the site changed or had any special interactive or dynamic aspects. (And such an archival scrape would likely have been substantially less effort than this special integration has been, and will be, I predict.)

No, you're doing it wrong, as I already explained. You're letting GPT fall back onto its policy by choosing any response. You need to force it out of its comfort zone. Ask it to explain a pun it did not write, or answer questions like whether a pair of words that you picked rhyme. Write pairs of new words that have never been seen before, etc. The task of 'come up with a memorized rhyme for reasonably common words' does not disprove extensive memorization or show that it has failed to understand the underlying phonetics.

Reading Dan Wang's belated letter, where he describes Shanghai and the abrupt collapse of Zero Covid, reminds me of one interesting aspect for us of base rates, Outside View reasoning, rocks which say 'everything is fine', and the difference between probabilities & decisions:

for a long time, it was obvious that Zero Covid was not working, especially once increasingly infectious variants meant that there was community spread & all the measures had failed to make r<<1, with r instead hovering at 1, and they were burning through patience, time, and mon... (read more)

1 norn 22d
"I tried to explain to him about transfer learning starting to work back in 2015 or so (a phenomenon I regarded as extremely important and which has in fact become so dominant in DL we take it utterly for granted) and he denied it with the usual Hansonian rebuttals; or when he denied that DL could scale at all, he mostly just ignored the early scaling work I linked him to like Hestness et al 2017. Or consider Transformers: a lynchpin of his position was that algorithms have to be heavily tailored to every domain and problem they are applied to, like they were in ML at that time—an AlphaGo in DRL had nothing to do with a tree search chess engine, much less stuff like Markov random fields in NLP, and this was just a fact of nature, and Yudkowsky’s fanciful vaporing about ‘general algorithms’ or ‘general intelligence’ so much wishful thinking. Then Transformers+scaling hit and here we are…"

I can't find where you've had this exchange with him - can you find it? If his embarrassing mistakes (and refusal to own up to them) are documented and demonstrable, why not just post them onto his blog [https://www.overcomingbias.com/p/ai-risk-again] and Twitter?

The claim that it reached 100 million users within two months has been reported by many news outlets, which all seem to bottom out in data from Similarweb. I was not able to find a detailed report, but it looks like they have more data behind a paywall. I think it's reasonable to accept this claim for now, but, again, it might be different in some way from what the media is reporting.

FWIW, there's an article somewhere quoting an OAer saying that the real number was more like half that.

Of course, what that means is still highly ambiguous. Accounts which logg... (read more)

I think so. If someone could show that BPEs were changing the scaling laws on an important task end-users will pay for, then it wouldn't be hard to change that: for example, I noted that Codex induced OA to change BPEs, because that substantially increased the effective context window when you generate BPEs optimized for programming language syntax, which matters to big paying customers like GitHub (the larger the ctx, the more the variables & definitions inside a specific project are available for relevant completion). Otherwise, the general attitude ... (read more)
