All of nostalgebraist's Comments + Replies

chinchilla's wild implications

What specific claims in the post do you disagree with?

See this post for why multiple epochs will probably not work nearly as well as training on additional data.

Two-year update on my personal AI timelines

Now I’m inclined to think that just automating most of the tasks in ML research and engineering -- enough to accelerate the pace of AI progress manyfold -- is sufficient.

This seems to assume that human labor is currently the limiting bottleneck in AI research, and by a large multiplicative factor.

That doesn't seem likely to me.  Compute is a nontrivial bottleneck even in many small-scale experiments, and in particular is a major bottleneck for research that pushes the envelope of scale, which is generally how new SOTA results and such get made these d... (read more)

chinchilla's wild implications

The correct answer is the annoyingly trivial one: "it would be the best possible model of this type, at the task of language modeling on data sampled from the same distribution as MassiveText."

How good is that, though?  Well, it depends entirely on how good you think transformer LMs are capable of being, in principle.

If you're Gary Marcus and you think transformer LMs will always suck in some ways, then you think the 1.69 model will also suck in those ways.  Whereas, if you think a perfect transformer LM would be an AGI (even if only trained on M... (read more)

Vanessa Kosoy:
Transformers are Turing complete, so "model of this type" is not much of a constraint. On the other hand, I guess it's theoretically possible that some weight matrices are inaccessible to current training algorithms no matter how much compute and data we have. It seems also possible that the scaling law doesn't go on forever, but phase-transitions somewhere (maybe very far) to a new trend which goes below the "irreducible" term.
chinchilla's wild implications

You're right, the idea that multiple epochs can't possibly help is one of the weakest links in the post.  Sometime soon I hope to edit the post with a correction / expansion of that discussion, but I need to collect my thoughts more first -- I'm kinda confused by this too.

After thinking more about it, I agree that the repeated-data papers don't provide much evidence that multiple epochs are harmful.

For example, although the Anthropic repeated-data paper does consider cases where a non-small fraction of total training tokens are repeated more than once... (read more)

Thanks, that's interesting... the odd thing about using a single epoch, or even two epochs, is that you're treating the data points differently. To extract as much knowledge as possible from each data point (to approach L(D)), there should be some optimal combination of pre-training and learning rate. The very first step, starting from random weights, presumably can't extract high level knowledge very well because the model is still trying to learn low level trends like word frequency. So if the first batch has valuable high level patterns and you never revisit it, it's effectively leaving data on the table. Maybe with a large enough model (or a large enough batch size?) this effect isn't too bad though.
chinchilla's wild implications

I'm wary of the assumption that we can judge "human ability" on a novel task X by observing performance after an hour of practice.

There are some tasks where performance improves with practice but plateaus within one hour.  I'm thinking of relatively easy video games.  Or relatively easy games in general, like casual card/board/party games with simple rules and optimal policies.  But most interesting things that humans "can do" take much longer to learn than this.

Here are some things that humans "can do," but require >> 1 hour of practi... (read more)

Ok, sounds like you're using "not too much data/time" in a different sense than I was thinking of; I suspect we don't disagree. My current guess is that some humans could beat GPT-1 with ten hours of practice, but that GPT-2 or larger would be extremely difficult and plausibly impossible to beat with any amount of practice.

chinchilla's wild implications

How come PaLM_opt is smaller than Chinchilla? Isn't Chinchilla supposed to be Gopher_opt?

See the footnote attached to that sentence.

These models were trained differently, which is why they had different scaling laws. Can we suppose that the new scaling laws tell us where the old scaling would have broken down?

Great question, with a complicated answer.

First, one of the assumptions you're making is not quite right.  By "trained differently" I imagine you're referring to a difference in learning rate schedules, since that was the fundamental diffe... (read more)

Is the difference mostly the learning rate schedule? I read it was also AdamW, and it is at least conceivable that AdamW somehow gets better results for smaller models using more data but maxes out on the benefits of model size quicker than just plain Adam. So it could in theory be the case that scaling continues for the old scaling laws beyond what the new scaling laws say is possible, because Adam and AdamW just work differently enough. Of course that's not very plausible, and for different learning rate schedules it is maybe even less plausible.

Another way to phrase the question: Are the old and the new scaling laws roughly compatible? I.e. do the old scaling laws drop out of the new scaling laws if you use the old compute-optimal data/params distribution? I interpret your answer as that being roughly the case for the current models, but maybe not when you extrapolate further along the old scaling laws?

If the old scaling laws are still correct for a fixed dataset with a correspondingly fixed learning rate schedule, then we can reasonably say that the new scaling laws show us where the old scaling would have hit a wall.
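The divergence between the two fits can be made concrete using the published compute-optimal exponents (Kaplan et al. report roughly N_opt ∝ C^0.73; Hoffmann et al. report roughly N_opt ∝ C^0.5). A quick sketch of how differently the two laws allocate a 1000x compute increase:

```python
# Compare how the old (Kaplan) and new (Chinchilla/Hoffmann) scaling laws
# grow the compute-optimal model size as the compute budget grows:
#   Kaplan et al. (2020):    N_opt proportional to C**0.73
#   Hoffmann et al. (2022):  N_opt proportional to C**0.50
c_factor = 1000  # scale compute 1000-fold

kaplan_growth = c_factor ** 0.73      # model grows ~155x under the old law
chinchilla_growth = c_factor ** 0.50  # model grows ~32x, with data growing ~32x too

print(f"Kaplan-optimal model size:     {kaplan_growth:.0f}x")
print(f"Chinchilla-optimal model size: {chinchilla_growth:.0f}x")
```

So at 1000x current compute the two prescriptions differ in model size by roughly a factor of five, which is where the question of whether the old trend "hits a wall" becomes empirically meaningful.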
chinchilla's wild implications


Loss values are useful for comparing different models, but I don't recommend trying to interpret what they "mean" in an absolute sense.  There are various reasons for this.

One is that the "conversion rate" between loss differences and ability differences (as judged by humans) changes as the model gets better and the abilities become less trivial.

Early in training, when the model's progress looks like realizing "huh, the word 'the' is more common than some other words", these simple insights correspond to relatively large decreases in loss.  On... (read more)

A few points:

  1. Current models do pretty well on tricky math problems (Minerva), coding competition problems (AlphaCode), and multiple-choice quizzes at college level (MMLU).
  2. In some ways, the models' ability to learn from data is far superior to humans'. For example, models trained mostly on English text are still pretty good at Spanish, while English speakers in parts of the US who hear Spanish (passively) every week of their lives usually retain almost nothing. The same is true for being able to imitate other styles or dialects of English, and for programming languages. (Humans past their early years can spend years hearing a foreign language every day and learn almost nothing! Most people need to make huge efforts to learn.)
  3. RNNs are much worse than transformers at in-context learning. It's not just a difference in generative text quality. See this study by DeepMind.

That is, I suspect humans could be trained to perform very well, in the usual sense of "training" for humans where not too much data/time is necessary.


I paid people to try to get good at this game, and also various smart people like Paul Christiano tried it for a few hours, and everyone was still notably worse than GPT-2-sm (about the size of GPT-1).

EDIT: These results are now posted here.

What changed with the transformer? To some extent, the transformer is really a "smarter" or "better" architecture than the older RNNs. If you do a head-to-head comparison with the same training data, the RNNs do worse.

But also, it's feasible to scale transformers much bigger than we could scale the RNNs. You don't see RNNs as big as GPT-2 or GPT-3 simply because it would take too much compute to train them.

You might be interested in looking at the progress being made on the RWKV-LM architecture, if you aren't following it. It's an attempt to train an RNN like a transformer. Initial numbers look pretty good.

Based on the language modeling game that Redwood made, it seems like humans are much worse than models at next word prediction (maybe around the performance of a 12-layer model)

chinchilla's wild implications

Very interesting!

There are a few things in the calculation that seem wrong to me:

  • If I did things right, 15 years * (365 days/yr) * (24 hours/day) * (60 mins/hour) * (50 youtube!hours / min) * (60 youtube!mins / youtube!hour) = 24B youtube!minutes, not 200B.
  • I'd expect much less than 100% of Youtube video time to contain speech.  I don't know what a reasonable discount for this would be, though.
  • In the opposite direction, 1% useful seems too low.  IIRC, web scrape quality pruning discards less than 99%, and this data is less messy than a web scrape.

I... (read more)
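For what it's worth, the correction in the first bullet checks out numerically. A quick sketch, using the same upload-rate assumption (~50 YouTube-hours uploaded per real-world minute) as the original estimate:

```python
# Total youtube-minutes accumulated over 15 years, assuming ~50 hours of
# video uploaded per real-world minute (the figure used in the estimate).
years = 15
real_minutes = years * 365 * 24 * 60      # real-world minutes elapsed
youtube_hours = real_minutes * 50         # uploaded video, in hours
youtube_minutes = youtube_hours * 60

print(f"{youtube_minutes / 1e9:.1f}B youtube-minutes")  # ~23.7B, i.e. ~24B, not 200B
```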

chinchilla's wild implications

Hmm, yeah, I phrased that point really badly.  I'll go back and rewrite it.

A clearer version of the sentence might read:

"Only PaLM is remotely close to Chinchilla here, mostly because it trained on a larger number of tokens than the other non-Chinchilla models, plus a small (!) boost from its larger size."

For instance, if you look at the loss improvement from Gopher to PaLM, 85% of it comes from the increase in data alone, and only 15% from the increase in model size.  This is what I meant when I said that PaLM only got a "small" boost from its larger size.

EDIT: rewrote and expanded this part of the post.
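The 85/15 split can be reproduced approximately from the parametric loss fit in the Chinchilla paper, L(N, D) = E + A/N^α + B/D^β with the constants Hoffmann et al. report, plugging in Gopher (~280B params, ~300B tokens) and PaLM (~540B params, ~780B tokens). A sketch:

```python
# Decompose the predicted Gopher -> PaLM loss improvement using the
# Chinchilla parametric fit L(N, D) = E + A/N**a + B/D**b
# (constants as reported by Hoffmann et al. 2022).
E, A, B, a, b = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params, n_tokens):
    return E + A / n_params**a + B / n_tokens**b

gopher = loss(280e9, 300e9)
palm = loss(540e9, 780e9)
total = gopher - palm

# Because the fit is additive in N and D, the improvement splits cleanly:
from_data = loss(280e9, 300e9) - loss(280e9, 780e9)  # more tokens, same size
from_size = loss(280e9, 300e9) - loss(540e9, 300e9)  # bigger model, same tokens

print(f"data: {from_data/total:.0%}, size: {from_size/total:.0%}")  # ~85% / ~15%
```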

chinchilla's wild implications

It should work now, sorry about that.

chinchilla's wild implications

I disagree, but I'm not sure how relevant my opinion is, since I'm far less worried about "AGI ruin" to begin with than the median LWer.  That said, here's my thinking:

First, there's no universally agreed-upon line between "discussing whether the analysis has merits" and "giving the capabilities people free ideas."  Where a person draws this line depends on how obvious they think the ideas are, or how obvious they think they will be to the capabilities people.

Second, there are costs to not talking about things.  It's useful for alignment res... (read more)

People would ask things like "what would it cost (in compute spending) to train a 10T parameter Chinchilla?", which is a bizarre way to frame things if you grok what Chinchilla is.

That wasn't an alignment researcher, though (was it? I thought Tomás was just an interested commenter), and it's a reasonable question to ask when no one's run the numbers, and when you get an answer like 'well, it'd take something like >5000x more compute than PaLM', that's a lesson learned.

At least among the people I've talked to, it seems reasonably well understood that ... (read more)
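For reference, that kind of estimate falls out of the standard C ≈ 6ND approximation plus Chinchilla's roughly 20-tokens-per-parameter prescription (PaLM's numbers, ~540B params and ~780B tokens, are from its paper). This back-of-envelope sketch lands in the same ballpark as the quoted figure:

```python
# Back-of-envelope training compute via the common C = 6 * N * D rule of thumb.
def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

# A Chinchilla-optimal 10T-parameter model needs roughly 20 tokens per parameter.
n = 10e12
chinchilla_10t = train_flops(n, 20 * n)   # ~1.2e28 FLOPs

# PaLM: ~540B parameters trained on ~780B tokens.
palm = train_flops(540e9, 780e9)          # ~2.5e24 FLOPs

print(f"ratio: {chinchilla_10t / palm:,.0f}x")  # roughly 4,700x -- same order as ">5000x"
```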

What are the public domain internet places where one can learn more about capabilities, or see discussions of capabilities frameworks? Here's what I'm aware of:

  • LessWrong
  • Twitter (but specifically who idk, I avoid Twitter for mental health reasons.)
  • ArXiv (comp-sci)
  • YouTube: MLST, Two Minute Papers, Yannic Kilcher, some conference talks
  • A little bit on reddit (r/mlscaling, u/gwern)

All-in-all, there's not that much heavy discussion online. I've been told that these discussions really happen in-person, in the Bay Area and in DeepMind London offices. LessW... (read more)

chinchilla's wild implications

This distinction exists in general, but it's irrelevant when training sufficiently large LMs.

It is well-established that repeating data during large LM training is not a good practice.  Depending on the model size and the amount of repeating, one finds that it is either

  1. a suboptimal use of compute (relative to training a bigger model for 1 epoch), or 
  2. actively harmful, as measured by test loss or loss on out-of-distribution data

with (2) kicking in earlier (in terms of the amount of repeating) for larger models, as shown in this paper (Figure 4 and ... (read more)

So do you think, once we get to the point where essentially all new language models are trained on essentially all existing language data, it will always be more compute-efficient to increase the size of the model rather than train for a second epoch? This would seem very unintuitive, and is not directly addressed by the papers you linked in footnote 11, which deal with small portions of the dataset getting repeated.
Tao Lin:
This paper is very unrepresentative - it seems to test 1 vs. 64-1,000,000 repeats of data, not the 1 vs. 2-10 repeats you would use in practice.

I think it would be a great follow-up post to explain why you think repeating data is not going to be the easy way out for the scaling enthusiasts at Deepmind and OpenAI.

I find the Figure 4 discussion at your first link quite confusing. They study repeated data, i.e. imbalanced datasets, to then draw conclusions about repeating data, i.e. training for several epochs. The performance hit they observe seems to not be massive (when talking about scaling a couple of OOMs), and they keep the number of training tokens constant.

I really can't tell how this informs me about what would happen if somebody tried to scale compute 1000-fold and had to repeat data to do it compute-optimally, which seems to be the relevant question.

I can't access the wandb link; maybe you have to change the access rules. I was interested in the report on fine-tuning a model for more than 1 epoch, even though fine-tuning is obviously not the same as training.
chinchilla's wild implications

I definitely think it makes LM --> AGI less likely, although I didn't think it was very likely to begin with.

I'm not sure that the AI interacting with the world would help, at least with the narrow issue described here.

If we're talking about data produced by humans (perhaps solicited from them by an AI), then we're limited by the timescales of human behavior.   The data sources described in this post were produced by millions of humans writing text over the course of decades (in rough order-of-magnitude terms).

All that text was already there in the... (read more)

clone of saturn:
Language models seem to do a pretty good job at judging text "quality" in a way that agrees with humans. And of course, they're good at generating new text. Could it be useful for a model to generate a bunch of output, filter it for quality by its own judgment, and then continue training on its own output? If so, would it be possible to "bootstrap" arbitrary amounts of extra training data?
Evan R. Murphy:
We may be running up against text data limits on the public web. But the big data companies got that name for a reason. If they can tap into the data of Gmail, Facebook Messenger, or YouTube, then they will find tons more fuel for their generative models.
It seems to me that the key to human intelligence is nothing like what LMs do anyway; we don't just correlate vast quantities of text tokens. They have meanings. That is, words correlate to objects in our world model, learned through lived experience, and sentences correspond to claims about how those objects are related to one another or are changing. Without being rooted in sensory, and perhaps even motor, experience, I don't think general intelligence can be achieved. Language by itself can only go so far.
chinchilla's wild implications

When you say "irreducible", does that mean "irreducible under current techniques" or "mathematically irreducible", or something else?

Closer to the former, and even more restrictive: "irreducible with this type of model, trained in this fashion on this data distribution."

Because language is a communication channel, there is presumably also some nonzero lower bound on the loss that any language model could ever achieve.  This is different from the "irreducible" term here, and presumably lower than it, although little is known about this issue.
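To make the distinction concrete: the "irreducible" term here is the E in the parametric fit from the Chinchilla paper, L(N, D) = E + A/N^α + B/D^β, with Hoffmann et al. reporting E ≈ 1.69 and the other constants used below. A small sketch:

```python
# Chinchilla parametric loss fit (constants from Hoffmann et al. 2022):
#   L(N, D) = E + A / N**a + B / D**b
# E is the "irreducible" term: the loss this model family, trained this way
# on this data distribution, would approach with unlimited params and tokens.
E, A, B, a, b = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params, n_tokens):
    return E + A / n_params**a + B / n_tokens**b

print(round(loss(70e9, 1.4e12), 2))  # Chinchilla itself (70B / 1.4T): ~1.94
print(round(loss(1e15, 2e16), 2))    # a vastly larger run: ~1.70, nearing the floor
```

By contrast, the communication-channel lower bound mentioned above is a property of the data itself, not of this fit, and isn't captured by E.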

Do we have a sense of what the level of loss is in the human brain? If I’m understanding correctly, if the amount of loss in a model is known to be finitely large, then it will be incapable of perfectly modeling the world in principle (implying that, to such a model, physics is non-computable?).
Humans provide an untapped wealth of evidence about alignment

I don't have anything especially insightful to contribute, but I wanted to thank you (TurnTrout and Quintin) for this post.  I agree with it, and I often find myself thinking things like this when I read alignment posts by others on LW/AF.

When people present frameworks for thinking about AGIs or generic "intelligent agents," I often want to ask them: "are humans expressible in your framework?"  Often it seems like the answer is "no."

And a common symptom of this is that the framework cannot express entities with human-level capabilities that are a... (read more)

My Most Likely Reason to Die Young is AI X-Risk

I think your spreadsheet's calculation is not quite right.

Your column "Cum Prob Death (Natural)" is computed correctly.  For each marginal increment, you take the probability of natural death at that specific age ("Prob Death (Natural)"), and discount it by the probability of survival until that specific age.

However, you don't discount like this when computing "Cum Prob Death (AGI)."  So it includes probability mass from timelines where you've already died before year Y, and treats you as "dying from AGI in year Y" in some of these timelines.

Once... (read more)
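The fix being described is to weight each year's AGI hazard by the probability of having survived all prior years. A minimal sketch, with made-up per-year probabilities (all numbers hypothetical):

```python
# Hypothetical per-year death probabilities, conditional on being alive
# at the start of the year.
natural = [0.002, 0.002, 0.003, 0.003, 0.004]   # natural causes, year i
agi     = [0.001, 0.002, 0.004, 0.008, 0.016]   # AGI, year i

# Incorrect: summing the yearly AGI probabilities counts timelines where
# you had already died before year i.
naive_cum_agi = sum(agi)

# Correct: discount each year's hazard by the probability of surviving
# everything (natural + AGI) up to that year.
survival, cum_agi = 1.0, 0.0
for p_nat, p_agi in zip(natural, agi):
    cum_agi += survival * p_agi
    survival *= 1.0 - p_nat - p_agi

print(naive_cum_agi, cum_agi)  # the naive total slightly overstates the correct one
```

With hazards this small the two answers barely differ, which matches the reply below that the correction makes little practical difference here.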

Re cumulative probability calculations, I just copied the non-cumulative probabilities column from Ajeya Cotra's spreadsheet, where she defines it as the difference between successive cumulative probabilities (I haven't dug deeply enough to know whether she calculates cumulative probabilities correctly). Either way, it makes fairly little difference, given how small the numbers are.

Re your second point, I basically agree that you should not work on AI Safety from a personal expected utility standpoint, as I address in the caveats. My main crux for this is just that the marginal impact of any one person is minuscule. Though I do think that dying young is significantly worse than dying old, just in terms of QALY loss - if I avoid dying of Alzheimer's, something will kill me soon after, but if I avoid dying in a bus today, I probably have a good 60 years left. I haven't run the numbers, but expect that it does notably reduce life expectancy for a young person today.

My goal was just to demonstrate that AI Safety is a real and pressing problem for people alive today, and that discussion around longtermism elides this, in a way that I think is misleading and harmful. And I think 'most likely reason for me to die young' is an emotionally visceral way to demonstrate that. The underlying point is just kind of obvious if you buy the claims in the reports, and so my goal here is not to give a logical argument for it, just to try driving that point home in a different way.
why assume AGIs will optimize for fixed goals?

because stabler optimization tends to be more powerful / influential / able-to-skillfully-and-forcefully-steer-the-future

I personally doubt that this is true, which is maybe the crux here.

This seems like a possibly common assumption, and I'd like to see a more fleshed-out argument for it.  I remember Scott making this same assumption in a recent conversation:

I agree humans aren’t like that, and that this is surprising.

Maybe this is because humans aren’t real consequentialists, they’re perceptual control theory agents trying to satisfy finite drives?

... (read more)
Why no alignment guarantees and why modify yourself and not build separately? The concern is that even if a non-coherent AGI solves its own alignment problem correctly, builds an EU-maximizing superintelligence aligned with the non-coherent AGI, the utility of the resulting superintelligence is still not aligned with humanity. So the less convenient question should be, "Would you build a coherent optimizer if you had all the alignment guarantees you would want, all the time in the world to make sure it's done right?" A positive answer to that question given by first non-coherent AGIs supports relevance of coherent optimizers and their alignment.
  • When you say that coherent optimizers are doing some bad thing, do you imply that it would always be a bad decision for the AI to make the goal stable? But wouldn't it heavily depend on what other options it thinks it has, and in some cases maybe be worth the shot? If such a decision problem is presented to the AI even once, it doesn't seem good.
  • The stability of the value function seems like something multidimensional, so perhaps it doesn't immediately turn into a 100% hardcore explicit optimizer forever, but there is at least some stabilization. In particular, bottom-up signals that change the value function most drastically may be blocked.
  • AI can make its value function more stable to external changes, but it can also make it more malleable internally to partially compensate for Goodharting. The end result for outside actors, though, is that it only gets harder to change anything.
  • Edit: BTW, I've read some LW articles on Goodharting but I'm also not yet convinced it will be such a huge problem at superhuman capability levels - seems uncertain to me. Some factors may make it worse as you get there (complexity of the domain, dimensionality of the space of solutions), and some factors may make it better (the better you model the world, the better you can optimize for the true target). For instance, as the model gets smarter, the problems from your examples seem to be eliminated: in 1, it would optimize end-to-end, and in 2, the quality of the decisions would grow (if the model had access to the ground truth value function all along, then it would grow because of better world models and better tree search for decision-making). If the model has to check in and use feedback from the external process (human values) to not stray off course, then as it's smarter it's discovering a more efficient way to collect the feedback, has better priors, etc.
This is a really high-quality comment, and I hope that at least some expert can take the time to either convincingly argue against it, or help confirm it somehow.
One possible reconciliation: outer optimizers converge on building more coherent inner optimizers because the outer objective is only over a restricted domain, and making the coherent inner optimizer not blow up inside that domain is much much easier than making it not blow up at all, and potentially easier than just learning all the adaptations to do the thing. Concretely, for instance, with SGD, the restricted domain is the training distribution, and getting your coherent optimizer to act nice on the training distribution isn't that hard, the hard part of fully aligning it is getting from objectives that shake out as [act nice on the training distribution but then kill everyone when you get a chance] to an objective that's actually aligned, and SGD doesn't really care about the hard part.
why assume AGIs will optimize for fixed goals?

Sure.  Although before I do, I want to qualify the quoted claim a bit.

When I say "our goals change over time," I don't mean "we behave something like EU maximizers with time-dependent utility functions."  I think we don't behave like EU maximizers, in the sense of having some high-level preference function that all our behavior flows from in a top-down manner.

If we often make choices that are rational in a decision-theoretic sense (given some assumption about the preferences we are trying to satisfy), we are doing so via a "subgoal capacity." &nb... (read more)

Aha, this seems somewhat cruxy, because the things you list as examples of human goals are mostly about values, which I agree act in a sort of weird way, whereas I would see maintenance of homeostasis as a more central example of human goals. And while maintaining homeostasis isn't 100% aligned with the concept of utility maximization, it does seem to be a lot more aligned than values.

With respect to maintaining homeostasis, it can be a bit unclear what exactly the utility function is. The obvious possibility would be "homeostasis" or "survival" or something like that, but this is slightly iffy in two directions. First, because strictly speaking we maintain homeostasis based on certain proxies, so in a sense the proxies are what we more strictly optimize. But this can also be fit into the EU framework in another way, with the proxies representing part of the mechanism for how the expectation of utility is operationalised. And second, because maintaining homeostasis is again just a proxy for other goals that evolution has, namely because it grants power to engage in reproduction and kin altruism. And this doesn't fit super neatly into classical EU frameworks, but it does fit neatly into later rationalist developments like outer/mesa-optimization.

So basically, "homeostasis" is kind of fuzzy in how it relates to EU-style maximization, but it does also sort of fit, and I think it fits much better than values do:

  • A human has a goal of maintaining homeostasis.
  • The goal is a fixed part of the human's structure. The internal dynamics of the human, if left to their own devices, will never modify the goal.
  • The "outermost loop" of the human's internal dynamics is an optimization process aimed at maintaining homeostasis, or at least the human behaves just as though this were true.
  • This "outermost loop" or "fixed-homeostasis-directed wrapper" chooses which of the human's specific capabilities to deploy at any given time, and how to deploy it.
Gato as the Dawn of Early AGI

Meta-comment of my own: I'm going to have to tap out of this conversation after this comment.  I appreciate that you're asking questions in good faith, and this isn't your fault, but I find this type of exchange stressful and tiring to conduct.

Specifically, I'm writing at the level of exactness/explicitness that I normally expect in research conversations, but it seems like that is not enough here to avoid misunderstandings.  It's tough for me to find the right level of explicitness while avoiding the urge to put thousands of very pedantic words ... (read more)

Gato as the Dawn of Early AGI

I think the reason why it being a unified agent matters is that we should expect significant positive transfer to happen eventually as we scale up the model and train it longer on more tasks. Do you not?

Sure, this might happen.

But remember, to train "a Gato," we have to first train all the RL policies that generate its training data.  So we have access to all of them too.  Instead of training Gato, we could just find the one policy that seems closest to the target task, and spend all our compute on just finetuning it.  (Yes, related tasks tr... (read more)

But remember, to train "a Gato," we have to first train all the RL policies that generate its training data. So we have access to all of them too.

No, you don't have to, nor do you have guaranteed access, nor would you necessarily want to use them rather than Gato if you did. As Daniel points out, this is obviously untrue of all of the datasets it's simply doing self-supervised learning on (how did we 'train the RL policy' for photographs?). It is also not true of it because it's off-policy and offline: the experts could be human, or they could be the o... (read more)

Daniel Kokotajlo:
I thought some of the "experts" Gato was trained on were not from-scratch models but rather humans -- e.g. images and text generated by humans. Relatedly, instead of using a model as the "expert" couldn't you use a human demonstrator? Like, suppose you are training it to control a drone flying through a warehouse. Couldn't you have humans fly the drones for a bit and then have it train on those demonstrations?

This is false if significant transfer/generalization starts to happen, right? A drive full of a bunch of SOTA models, plus a rule for deciding what to use, is worse than Gato to the extent that Gato is able to generalize few-shot or zero-shot to new tasks and/or insofar as Gato gets gains from transfer.

EDIT: Meta-comment: I think we are partially just talking past each other here. For example, you think that the question is 'will it ever reach the Pareto frontier,' which is definitely not the question I care about.
Gato as the Dawn of Early AGI

For what it's worth, I was thoroughly underwhelmed by Gato, to the point of feeling confused what the paper was even trying to demonstrate.

I'm not the only ML researcher who had this reaction.  In the Eleuther discord server, I said "i don't get what i'm supposed to take away from this gato paper," and responses from regulars included

  • "nothing, this was 3 years over-due"
  • "Yep. I didn't update much on this paper. I think the 'general' in the title is making people panic lol" (with two "this" reacts)

Or see this tweet.  I'm not trying to convince you ... (read more)

Lorenzo Rex:
A single network is solving 600 different tasks spanning different areas. 100+ of the tasks are solved at 100% human performance. Let that sink in. While not a breakthrough in arbitrarily scalable generality, the fact that so many tasks can be fitted into one architecture is surprising and novel. For many real-life applications, being good at 100-1000 tasks makes an AI general enough to be deployed as an error-tolerant robot, say in a warehouse. The main point imho is that this architecture may be enough to be scaled (10-1000x parameters) in a few years to a useful proto-AGI product.
Daniel Kokotajlo:
FWIW I agree with this take & basically said as much in my post; Gato is about what I would have expected given past progress. I think people are right to freak out now about oncoming AGI, but I think they should have been freaking out already, and Gato just had a sufficiently sexy title and abstract. It's like how people should have been freaking out about COVID early on but only actually started freaking out when hospitals started getting crowded in their own country.

As for the transfer, I would actually have been a bit surprised if there was significant positive transfer given the small number of tasks trained on and the small model size. I'm curious to hear if there was negative transfer though, and if so, how much. I think the reason why it being a unified agent matters is that we should expect significant positive transfer to happen eventually as we scale up the model and train it longer on more tasks. Do you not?
larger language models may disappoint you [or, an eternally unfinished draft]

I've tried the method from that paper (typical sampling), and I wasn't hugely impressed with it.  In fact, it was worse than my usual sampler to a sufficient extent that users noticed the difference, and I switched back after a few days.  See this post and these tweets.

(My usual sampler is one I came up with myself, called Breakruns.  It works the best in practice of any I've tried.)

I'm also not sure I really buy the argument behind typical sampling.  It seems to conflate "there are a lot of different ways the text could go from here" with... (read more)

Very comprehensive, thank you!
dalle2 comments

Hmm... what moral are you drawing from that result?

Apparently, CLIP text vectors are very distinguishable from CLIP image vectors.  I don't think this should be surprising.  Text vectors aren't actually expressing images, after all, they're expressing probability distributions over images.

They are more closely analogous to the outputs of a GPT model's final layer than they are to individual tokens from its vocab.  The output of GPT's final layer doesn't "look like" the embedding of a single token, nor should it.  Often the model wants t... (read more)

dalle2 comments

I completely agree that the effects of using unCLIP are mysterious, in fact the opposite of what I'd predict them to be.

I wish the paper had said more about why they tried unCLIP in the first place, and what improvements they predicted they would get from it.  It took me a long time just to figure out why the idea might be worth trying at all, and even now, I would never have predicted the effects it had in practice.  If OpenAI predicted them, then they know something I don't.

For instance, it seems like maybe the model that produced the roses on

... (read more)
dalle2 comments

Ah, I now realize that I was kind of misleading in the sentence you quoted.  (Sorry about that.)

I made it sound like CLIP was doing image compression.  And there are ML models that are trained, directly and literally to do image compression in a more familiar sense, trying to get the pixel values as close to the original as possible.  These are the image autoencoders.

DALLE-2 doesn't use an autoencoder, but many other popular image generators do, such as VQGAN and the original DALLE.

So for example, the original DALLE has an autoencoder compon... (read more)
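To make the autoencoder notion of "image compression" concrete, here is a toy sketch. I use PCA, which is the closed-form optimum of a linear autoencoder, rather than actually training one; the data, dimensions, and variable names are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "images": 256 examples of 16 pixels that secretly vary along
# only 3 directions, so a 3-number code can reconstruct them exactly.
latent = rng.normal(size=(256, 3))
mixing = rng.normal(size=(3, 16))
images = latent @ mixing

# PCA is the closed-form optimum of a linear autoencoder: the top
# principal components play the role of the learned encoder/decoder.
centered = images - images.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
components = vt[:3]                  # "encoder": 16 pixels -> 3 numbers

codes = centered @ components.T                    # compress
recon = codes @ components + images.mean(axis=0)   # decompress
mse = np.mean((recon - images) ** 2)               # pixel-level error
```

Real image autoencoders like the one in the original DALLE are nonlinear and operate on far larger images, but the objective is the same: get the reconstructed pixel values as close to the original as possible through a narrow code.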

Playing with DALL·E 2

In general all writing I've seen is bad. I think this is less likely to be about safety, and more that it's hard to learn language by looking at a lot of images. However, since DE2 is trained on text, it clearly knows a lot about language at some level -- I would expect there's plenty of data to put out coherent text. Instead it outputs nonsense, focusing on getting the fonts and the background right.

It's definitely possible to get a diffusion model to write the text from a prompt into an image.  I made a model that does this late last year. (blogpost... (read more)

[Link] Training Compute-Optimal Large Language Models

Thinking back to the "inconsistency" from the Kaplan et al papers...

  • In Appendix E of the new paper, we see the loss-vs-compute frontier start to "bend" from a straight line on a log-log plot, with returns to additional compute getting smaller at large scales.
  • I suspect this bending is the transition from the faster "L(C) law" to the slower "L(D) law."
    • A brief recap of that below:
      • Adding more params can help in two ways: it makes your model's loss decline toward its asymptotic minimum faster, and it can lower that minimum itself.
      • As models get bigger, the first
... (read more)
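The bending of the loss-vs-compute frontier is easy to reproduce from the parametric form itself. Below is a sketch using the loss L(N, D) = E + A/N^alpha + B/D^beta with the fitted constants reported in the Chinchilla paper (treat these numbers as illustrative; the grid bounds and the C = 6ND compute approximation are my own simplifications):

```python
import numpy as np

# Chinchilla-style parametric loss: L(N, D) = E + A/N^alpha + B/D^beta.
# Constants are the fitted values reported by Hoffmann et al. (2022).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def best_loss(C):
    """Minimize L over allocations (N, D) subject to C = 6*N*D,
    by brute-force search over a log-spaced grid of N."""
    N = np.logspace(6, 13, 2000)          # parameter counts
    D = C / (6 * N)                       # tokens implied by the budget
    L = E + A / N**alpha + B / D**beta
    i = np.argmin(L)
    return L[i], N[i], D[i]

# The frontier declines with compute but can never cross E = 1.69,
# the "irreducible" term -- which is why it bends on a log-log plot.
losses = [best_loss(C)[0] for C in (1e20, 1e22, 1e24)]
```

The bend falls out directly: at small budgets the power-law terms dominate and the frontier looks like a straight line in log-log space, while at large budgets the curve flattens toward the irreducible E.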
[Link] Training Compute-Optimal Large Language Models

If you just want a big parameter-count to wave around, you use a MoE like everyone else optimizing for clickbait. (Or even better, use n-grams so you can talk about having hundreds of trillions of parameters. It'll work awful compared to NNs, but you'll have the most parameters!)

On that note, did you see that recent Chinese MoE with 174T params, 3 layers, and 96000 experts?

I saw that it was just a tech demo (like DeepSpeed training 1t-dense models for a few steps), and put it on my reading-list. It suggests they're serious about using supercomputer-scale computers, but it looks like they haven't done so yet, or invested as much compute as Baidu with ERNIE Titan, and so it's not a major priority compared to trying to read all the papers on trained models...*

(One reason I am skeptical of MoEs is that for all the Chinese investment into them, nobody seems to report much interesting output from the models, while it seems like anyone who tinkers with the largest dense models will stumble over something like inner-monologues. Do their users show a terminal lack of imagination, are just none of them at all getting translated or included in the papers, or are MoEs just not that great?)

* Even before Chinchilla, it was obvious that training a 1t, much less 100t, dense model to converged/compute-optimal performance is far harder than demonstrating you can train such a model for a step or two. Similarly for MoEs: if you can train a 100t-parameter MoE to converged/compute-optimal, my suspicion is that you probably shouldn't've bothered in the first place, because if a 100t MoE trainable with a contemporary amount of FLOPS is the answer, then the question must be a bad one.
[Link] Training Compute-Optimal Large Language Models

It ought to shorten actual timelines, for the reason you say.  (Except insofar as data sourcing could actually become a practical problem.)

However, it lengthens the Bio Anchors timeline, because the parameter count in Bio Anchors is fixed.  (It's the parameter count of a model that uses about as much inference compute as the brain.)

This is a weird thing about Bio Anchors -- it asks when models will cross a threshold for the compute required to run them, so efficiency improvements of various kinds will lengthen its timeline.  It's always wait... (read more)

It Looks Like You're Trying To Take Over The World

I don't agree with your read of the MuZero paper.

The training routine of MuZero (and AlphaZero etc) uses explicit tree search as a source of better policies than the one the model currently spits out, and the model is adapted to output these better policies.

The model is trying to predict the output of the explicit tree search.  There's room to argue over whether or not it "learns implicit tree search" (ie learns to actually "run a search" internally in some sense), but certainly the possibility is not precluded by the presence of the explicit search; ... (read more)
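The training loop described above -- search produces a better policy, and the model regresses toward it -- can be sketched on a toy one-state game. This is a schematic stand-in for AlphaZero/MuZero-style policy improvement, not their actual algorithm: the "search" here is just noisy value estimation that sharpens the prior, playing the role MCTS visit counts play in the real thing, and all constants are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-state game: 4 actions with hidden mean rewards.
true_rewards = np.array([0.1, 0.5, 0.2, 0.9])
logits = np.zeros(4)                  # the "network": just raw logits

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for _ in range(200):
    prior = softmax(logits)
    # "Search": estimate each action's value from noisy rollouts, then
    # sharpen the prior toward high-value actions (a crude stand-in for
    # what explicit tree search does in AlphaZero/MuZero training).
    q = true_rewards + rng.normal(scale=0.1, size=4)
    search_policy = softmax(np.log(prior + 1e-12) + 5.0 * q)
    # Training step: move the logits toward the search policy
    # (the gradient of cross-entropy between the two).
    logits += 0.5 * (search_policy - prior)

final_policy = softmax(logits)        # concentrates on the best action
```

The point of the sketch: the trained policy ends up encoding conclusions that only the search "knew" at training time, which is why the presence of explicit search in the training loop doesn't settle the question of what the model itself learns internally.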

You're at most making the claim that MuZero attempts to learn tree search. Does the MuZero paper provide any evidence that MuZero in fact does implicit tree search? I think not, which means it's still misleading to link to that paper while claiming it shows neural nets can learn implicit tree search (I don't particularly doubt they can learn it a bit, but I do contest the implication that MuZero does so to any substantial degree or that a non-negligible part of its strength comes from learning implicit tree search).

Edit: I should clarify what would change my mind here. If someone could show that MuZero (or any scaled-up variant of it) can beat humans at Go with the neural-net model alone (without the explicit tree search on top), I would change my mind. To my knowledge, no paper is currently claiming this, but let me know if I am wrong. Since my understanding is that the neural nets alone cannot beat humans, my interpretation is that the neural net part is providing something like roughly human-level "intuition" about what the right move should be, but without any actual search, so humans can still outperform this intuition machine by doing explicit search; but once you add on the tree search, the machines crush humans due to their speed.
It Looks Like You're Trying To Take Over The World

I found this story tough to follow on a technical level, despite being familiar with most of the ideas it cites (and having read many of the papers before).

Like, I've read and re-read the first few sections a number of times, and I still can't come up with a mental model of HXU's structure that fits all of the described facts.  By "HXU's structure" I mean things like:

  • The researcher is running an "evolutionary search in auto-ML" method.  How many nested layers of inner/outer loop does this method (explicitly) contain?
  • Where in the nested structure
... (read more)
Relatedly, the story does the gish-gallop thing where many of the links do not actually support the claim they are called on to support. For example, in "learning implicit tree search à la MuZero", the link to MuZero does not support the claim that MuZero learns implicit tree search. (Originally the link directed to the MuZero paper, which definitely does not do implicit tree search, since it has explicit tree search hard-coded in; now the link goes to gwern's page on MuZero, which is a collection of many papers, and it is unclear which one is about learning to do implicit tree search. Note that as far as I know, every Go program that can beat humans has tree search explicitly built in, so implicit tree search is not really a thing.)
What's Up With Confusingly Pervasive Consequentialism?

I'm having some trouble phrasing this comment clearly, and I'm also not sure how relevant it is to the post except that the post inspired the thoughts, so bear with me...

It seems important to distinguish between several things that could vary with time, over the course of a plan or policy:

  1. What information is known.
    • This is related to Nate's comment here: it is much more computationally feasible to specify a plan/policy if it's allowed to contain terms that say "make an observation, then run this function on it to decide the next step," rather than writing o
... (read more)
larger language models may disappoint you [or, an eternally unfinished draft]

There is some point at which it’s gaining a given capability for the first time though, right? [...]

So my read of the de-noising argument is that at current scaling margins we shouldn’t expect new capabilities—is that correct?

Not quite.

If you define some capability in a binary yes-no way, where it either "has it" or "doesn't have it" -- then yes, there are models that "have it" and those that "don't," and there is some scale where models start "having it."

But this apparent "switch flip" is almost always an artifact of the map, not a part of the territory.

S... (read more)
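The "switch flip as map artifact" point has a simple quantitative illustration. Suppose the underlying per-token accuracy improves smoothly with scale, but the capability is measured by exact match over a multi-token answer; the binary metric then appears to flip on abruptly. The numbers below are invented purely to show the shape of the effect:

```python
import numpy as np

# Underlying per-token accuracy improves smoothly with scale...
scale = np.linspace(0, 1, 11)         # arbitrary "model scale" axis
per_token = 0.5 + 0.49 * scale        # smooth: 0.50 -> 0.99

# ...but an exact-match metric over a 20-token answer looks like a
# capability "switching on" at the top end of the scale range.
exact_match = per_token ** 20
```

The underlying quantity (per-token accuracy, or equivalently loss) changes by the same amount at every step of scale; only the thresholded, all-or-nothing metric makes it look like the model suddenly "has" the capability.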

GPT-3: a disappointing paper

The later post still reiterates the main claims from this post, though.

  • This post: "Few-shot learning results are philosophically confusing and numerically unimpressive; the GPT-3 paper was largely a collection of few-shot learning results, therefore the paper was disappointing"
  • The later post: "Few-shot learning results are philosophically confusing and numerically unimpressive; therefore we don't understand GPT-3's capabilities well and should use more 'ecological' methods instead"

Many commenters on this post disagreed with the part that both posts share ("Few-shot learning results are philosophically confusing and numerically unimpressive").

GPT-3: a disappointing paper

This post holds up well in hindsight.  I still endorse most of the critiques here, and the ones I don't endorse are relatively unimportant.  Insofar as we have new evidence, I think it tends to support the claims here.

In particular:

  • Framing few-shot learning as "meta-learning" has caused a lot of confusion.  This framing made little sense to begin with, for the reasons I note in this post, and there is now some additional evidence against it.
  • The paper does very little to push the envelope of what is possible in NLP, even though GPT-3 is proba
... (read more)
If you feel like "larger language models may disappoint you" was one of the posts that reiterated your arguments here, they seem to be saying pretty different things to me? It feels like this article is fundamentally focused on talking about the GPT-3 paper whereas your later post is focused on talking about GPT-3 itself.
EfficientZero: How It Works

I don't think humans would put much probability on such sequences, even conditionally: we'd think that at some point the sequence would stop, because why would there be such gibberish?

I think the intuition behind your remark "why would there be such gibberish?" actually goes most of the way to explaining the repetition trap.

The key thing about pathologically repetitive sequences is that they are . . . pathologically repetitive, i.e. out-of-distribution for natural text.  

Once you're already in one, I don't think it's really so obvious that the repetit... (read more)

I think that's a possible interpretation. I'm still not sure why it wouldn't affect all the other possible models, though, and it seems like we should also see the problem get better with model scaling as the 'misstep' estimation disappears. If you are sampling token by token, the probabilities from GPT-3 over the 51k BPEs ought to be much better than GPT-2 (also 51k BPEs, also English text scraped from the Internet) etc: after all, that is the token it has the very highest predictive accuracy on. How accurate does a model have to get on the initial token before the initial misstep stops screwing not just with tree search, but regular sampling too?

It doesn't really seem like it. I think if you have the impression that it is, it is because we use sampling strategies designed specifically to eliminate it, like top_p. Nucleus sampling definitely does tamp down on it, but I don't think it's perfect, and it's clearly a hack which doesn't fix the problem with tree search and introduces biases of its own just like top-k. Regular sampling still seems to go haywire. (I dunno if anyone has checked GPT-3 the way the nucleus sampling paper and others check GPT-2 and others. Would get expensive.)

I've seen hundreds of iGPT completions and random examples, and not a single one ever just 'starts repeating' ad nauseam; nor has anyone ever pointed such failure cases out. (In contrast, the 'tessellation' that naive CLIP maximization causes is extremely obvious, and you can't sample GPT on naive non-top-p/repetition-penalization settings without running into it well before hundreds/thousands of examples.) Maybe I'm wrong and there is repetition at some level which isn't obvious to the naked eye, like high-frequency Fourier components (although I'm not sure how that would be possible with the superpixel approach), and if someone shows me iGPT (or DALL-E/CogView, or DDIM, or...) samples which are clearly the repetition trap in action, I'll change my mind but it's been years now, so I
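Since nucleus (top-p) sampling comes up repeatedly here, a minimal sketch of the filter may be useful. This illustrates the method from the nucleus-sampling paper (keep the smallest set of highest-probability tokens whose cumulative mass reaches p, then renormalize); the function name and example numbers are my own:

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Nucleus (top-p) filter: keep the smallest set of highest-
    probability tokens whose cumulative mass is >= p, zero out the
    rest, and renormalize."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]          # highest probability first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1     # first index reaching p
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()
```

The "hack" nature of the method is visible in the sketch: the tail is simply truncated, which suppresses the low-probability missteps that can start a repetition loop, but it does so by distorting the distribution rather than by fixing whatever the model got wrong.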
Hard-Coding Neural Computation

I'm confused by your notation for feed-forward layers.

What justifies re-using the same labels ("apple" etc.) for

  1. the coordinates of x
  2. the coordinates of W_1 x, i.e. the basis in which the nonlinearity operates?


If we want to express what the individual components of basis (2) mean in terms of the original space, we can either talk about which vectors/semes are mapped to them by W_1, or which vectors/semes they get mapped to by W_2.

But your labels don't correspond to either of these interpretations.  Instead, it looks like you are foll... (read more)

Thanks for your comments/questions, they're very insightful.

In general, there are as many encoding spaces in a Transformer as there are computational nodes, and a traditional Transformer will have little incentive to use the same semantics for any two of the spaces. (There's a little bit of an incentive because of the residual connections, which will (I think?) kind of tie the semantics of the various hidden-size-sized embedding spaces.) In particular, the middle layer of the dense-relu-dense feedforward layer is usually chosen to be significantly larger (4x) than the hidden size, and so it's not even theoretically possible to represent it using the same basis. I've found that it sometimes makes sense to use anonymous seme names like x1 x2 x3 etc in the feed-forward layer for this reason.

In my experience so far I've found the feed-forward layers to be most useful for conjunctions and disjunctions - and there are a quadratic number of possible conjunctions and disjunctions of even two neurons, let alone 3 or 4. So it seems to me that this might give a tiny hint as to why people have found that the intermediate embedding space of the feed-forward layer needs to be so large.

Of course, there is a potentially huge gap between what I am clever enough to think of as a use for them and what good old gradient descent is clever enough to think of. We can only easily lower-bound the potential uses of them; upper-bounding the capabilities of a component will prove much more challenging.
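The conjunction use-case mentioned above is easy to hand-code in a toy setting. Below, each "seme" gets its own basis direction, and a single relu hidden unit implements an AND of two semes by thresholding their sum; the seme names, dimensions, and weights are all invented for illustration:

```python
import numpy as np

# Assign each "seme" a basis direction in a toy 4-dim residual space.
semes = {"apple": 0, "red": 1, "round": 2, "red_apple": 3}
d = 4

def relu(x):
    return np.maximum(x, 0.0)

# W1 reads the input basis; one hidden unit fires only when both
# "apple" AND "red" are present (their sum of 2 clears the threshold).
W1 = np.zeros((1, d))
W1[0, semes["apple"]] = 1.0
W1[0, semes["red"]] = 1.0
b1 = np.array([-1.0])                 # the bias implements the AND

# W2 writes the conjunction back out along the "red_apple" direction.
W2 = np.zeros((d, 1))
W2[semes["red_apple"], 0] = 1.0

def ffn(x):
    return W2 @ relu(W1 @ x + b1)

x_both = np.zeros(d)
x_both[semes["apple"]] = 1.0
x_both[semes["red"]] = 1.0
y_both = ffn(x_both)                  # writes red_apple

x_one = np.zeros(d)
x_one[semes["apple"]] = 1.0
y_one = ffn(x_one)                    # writes nothing
```

Swapping the bias from -1 to 0 turns the same unit into an OR, which is one way to see why a wide intermediate layer is useful: each of the quadratically many conjunctions/disjunctions needs its own hidden unit under this construction.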
More Christiano, Cotra, and Yudkowsky on AI progress

I agree with Eliezer's recommendation to double-check results in papers that one finds surprising.

So, I looked into the claim of a 10x - 100x gain for transformers, using Table 2 from the paper.  Detailed results are in this Colab.

Briefly, I don't think the claim of 10x - 100x is well supported.  Depending on what exactly you compute, you get anywhere from "no speedup" to "over 300x speedup."  All the estimates you can make have obvious problems, and all show a massive gap between French and German.

In detail:

  • The appearance of a large speedup
... (read more)
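The kind of estimate involved here can be sketched generically: given score-vs-compute curves for two methods, the implied "speedup" at a given score is the horizontal gap between the curves, found by interpolating in log-compute. The numbers below are invented for illustration and are not taken from the paper's Table 2:

```python
import numpy as np

# Hypothetical (compute, score) curves for two methods; the values
# are made up for illustration, not drawn from any paper's table.
compute_a = np.array([1e18, 1e19, 1e20, 1e21])
score_a   = np.array([20.0, 24.0, 28.0, 32.0])   # baseline (e.g. BLEU)
compute_b = np.array([1e18, 1e19, 1e20, 1e21])
score_b   = np.array([24.0, 28.0, 32.0, 36.0])   # improved method

def baseline_compute_for(score, compute_base, score_base):
    """Compute the baseline needs to reach `score`, by linear
    interpolation in log10(compute)."""
    log_c = np.interp(score, score_base, np.log10(compute_base))
    return 10 ** log_c

# Method B reaches 28.0 at 1e19 FLOPs; the baseline needs 1e20 for
# the same score, so the implied speedup at that point is ~10x.
mult = baseline_compute_for(28.0, compute_a, score_a) / 1e19
```

The fragility of such estimates is also visible in the sketch: the answer depends entirely on which score you evaluate the gap at and on how you interpolate, which is exactly why plausible readings of the same table can range from "no speedup" to "over 300x."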
larger language models may disappoint you [or, an eternally unfinished draft]

It will still only provide a lower bound, yes, but only in the trivial sense that presence is easier to demonstrate than absence.

All experiments that try to assess a capability suffer from this type of directional error, even prototype cases like "giving someone a free-response math test."

  • They know the material, yet they fail the test: easy to imagine (say, if they are preoccupied by some unexpected life event)
  • They don't know the material, yet they ace the test: requires an astronomically unlikely coincidence

The distinction I'm meaning to draw is not that ... (read more)

Is GPT-3 already sample-efficient?

I have not finetuned GPT-3, but I have done a lot of finetuning with GPT-J 6.1B, which is similar in scale and performance to GPT-3 "Curie."

In my experience, doing more than a single epoch is always harmful when finetuning GPT-J. 

I initially thought it was beneficial on one specific dataset, but that turned out to be the exception that proves the rule.  I inspected per-token validation loss on that dataset over the course of training, and discovered that the train/val split was imperfect.  Training beyond the first epoch only helped on text ... (read more)

I cannot access your wandb, btw. It seems to be private.
If 4 is not simply a bad default, maybe they considered more data with a high inferential distance (foreign, non-natural/formal languages), which may require more epochs?
larger language models may disappoint you [or, an eternally unfinished draft]

I'm glad you liked the post!  And, given that you are an avowed "enthusiast," I'm pleasantly surprised that we agree about as many things as we do.

The second [source of discontinuous performance scaling] is that many tasks happen over multiple inferential steps where small improvements in single step accuracy translate into large changes in multistep capabilities.

Thanks for pointing out this argument -- I hadn't thought about it before.  A few thoughts:

Ordinary text generation is also a multi-step process.  (The token length generally isn't ... (read more)

One reason we agree on many object-level facts but have different takeaways is that we have different desiderata for what GPT is supposed to do in the limit. I agree that many of the problems you discuss are fundamental to the way GPT is trained and how it works, but I generally feel these problems don't need to be solved directly in order to use GPT to build AGI. I see GPT as the _seed_ for a future AGI system built off of or around it.

I see the big crux as how much "compressed memorization" will extrapolate to general intelligence vs. begin to show cracks as we ask it for more and more advanced and general one-step deductions. It would be worth coming up with some specific claims about how we expect future systems to act to differentiate our two perspectives (including at the level of internals). Probably this is useful to start on my end because I have higher expectations for performance. Unfortunately I'm very averse to talking about _how_ I would amplify GPT by extending it or wrapping it in a larger system, and I see steps like that as key to unlocking its capabilities.

Your idea about multi-step deduction happening over multiple layers makes a lot of sense. You brought up an experiment in the Eleuther discord I think would be a great idea to try. We could train several models to see if tasks that require a sequence of discrete steps are unusually sensitive to network depth rather than scaling with parameter count alone.

I agree with your insights about abstract reasoning as babble and prune, although this definitely isn't the only way I reason abstractly. I babble and prune especially when I am writing (on the word/sentence/paragraph level), and I babble and prune as a part of the search process when I am trying to come up with a plan or navigate through a math proof. But when I am talking I am able to fluidly reason towards my goal with little to no plan ahead of time. I work collaboratively so much of my abstract thinking is out loud. If babb
Visible Thoughts Project and Bounty Announcement

Has anyone tried fine-tuning a transformer on small datasets of increasing size to get a sense of how large a dataset would be needed to do this well? I suspect it might have to be very large.

I've fine-tuned GPT models on a bunch of different datasets of different sizes, although not this particular dataset (which doesn't exist yet).

Below I list some key things to note.  Also see here for related discussion.  These points hold true for typical tasks/datasets, though a few unusual ones like arithmetic behave differently.

  • GPT performance tends to sc
... (read more)
larger language models may disappoint you [or, an eternally unfinished draft]

I'd predict that trying to infer the necessary prompt with the reversing trick wouldn't work for small models anyhow, and would be a waste of time compared to directly editing/controlling the model.

Also, even if one had a reversed model available, it would not be trivial to generate useful prompts with it.

The goal is (roughly) to find a prompt that maximizes P(target | prompt).  But the reversed model gives you P(prompt | target).

The answe... (read more)

My intuition is that it would Just Work for large smart models, which are easy to prompt and few-shot well: like the 'French->English' prompt, if you reversed that and you kept getting 'French->English' as your prompt across multiple examples, well, that's obviously a good prompt, whatever the objective likelihood of text starting with 'French->English' may be. And to the extent that it didn't work, it would either be relatively 'obvious' what the reversed outputs have in common so you could try to come up with a generic prompt ("who was the X State senator for the nth district in the year MMMM?"), or it would at least be a good starting point for the gradient ascent-type procedures (in the same way that with GAN projectors you often treat it as a hybrid: a quick projection of amortized computation into roughly the right latent z and then gradient ascent to clean it up) and you get a bunch of reversed prompts as seeds for optimization to find the best one or maximize diversity for some other reason.
larger language models may disappoint you [or, an eternally unfinished draft]

I don't know.

I poked around on Google Scholar for a bit, trying to answer these questions, and managed to learn the following:

  • The term "few-shot learning" seems to have become widespread sometime around 2017.
  • Before 2017, it's hard to find usage of "few-shot" but easy to find usage of "one-shot."  (Example, example)
    • The "one-shot" term dates back at least as far as 2003.  Work before 2010 tends to lump what we would call "one-shot" and "few-shot" into a single category, as in
... (read more)
Conor Sullivan:
Thanks for looking into it! It's really interesting to see computer vision research from before the deep learning revolution.
larger language models may disappoint you [or, an eternally unfinished draft]

There is no known way to "reverse" an LM like that.

(Well, besides the brute force method, where you generate a preceding token by looping over all possible values for that token.  GPT's vocab has ~50k tokens, so this is ~50k times slower than forward sampling.)

There are some LMs that naturally work in both directions.  Namely, masked language models (eg BERT), as opposed to causal language models (eg GPT).  Rather than taking a substring as input, masked language models take a complete string, but with some positions randomly masked or corrupted,... (read more)
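The brute-force reversal mentioned above can be sketched on a toy "LM." Here the forward model is just a bigram transition matrix, and we reverse it by scoring every candidate preceding token with the forward model and sampling from the implied posterior; everything about the setup (vocab size, bigram model, uniform prior) is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "causal LM": bigram transition probabilities over 5 tokens.
V = 5
P = rng.dirichlet(np.ones(V), size=V)   # P[prev, next]
prior = np.ones(V) / V                  # prior over the hidden first token

def reverse_sample(suffix):
    """Brute-force reversal: loop over every candidate value of the
    preceding token, score P(suffix | candidate) with the forward
    model, and sample proportional to prior * likelihood."""
    likelihood = np.empty(V)
    for cand in range(V):
        seq = [cand] + list(suffix)
        p = 1.0
        for a, b in zip(seq[:-1], seq[1:]):
            p *= P[a, b]                # one forward-model step
        likelihood[cand] = p
    posterior = prior * likelihood
    posterior /= posterior.sum()
    return rng.choice(V, p=posterior), posterior

tok, posterior = reverse_sample([2, 4, 1])
```

With a real GPT, the inner loop means one full forward pass per vocab entry per generated position, which is where the ~50k slowdown comes from.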

larger language models may disappoint you [or, an eternally unfinished draft]

When I talk about self-contradiction in this post, I'm talking about the model contradicting itself in the span of a single context window.  In other words, when the contradicting fact is "within the last n tokens."

Aha, thanks for clarifying this; was going to ask this too. :)
My experience at and around MIRI and CFAR (inspired by Zoe Curzi's writeup of experiences at Leverage)

But on a set of very reasonable priors, we would expect your most meaningful and spiritually significant head-moment to be correlated with and causally linked to some kind of unusual thing happening outside your head.  An activity, an interaction with other people, a novel observation.

This doesn't feel plausible at all to me.  (This is one of two key places where I disagree with your framing)

Like, this is a huge category: "experiences that don't involve anything unusual happening around you."  It includes virtually all of the thinking we do ... (read more)

I just can't get past what reads to me as tremendous typical mind fallacy in this comment? Like, I think I would just straightforwardly agree with you, if you had caveated that you were talking about LWers exclusively, or something similar. But the whole thing above seems to think it's not about, I dunno, a normal curve of people centered on IQ 125 or something. So much of what you're arguing falls apart once you look at the set of humans instead of the set of [fiction writers + artists + theorem provers + introspecters + people who do any kind of deliberate or active thinking at all on the regular].

As for the second bit: I'm not modeling human minds as having mostly-coherent beliefs or a single global map.