Same person as nostalgebraist2point0, but now I have my account back.



Gato as the Dawn of Early AGI

Meta-comment of my own: I'm going to have to tap out of this conversation after this comment.  I appreciate that you're asking questions in good faith, and this isn't your fault, but I find this type of exchange stressful and tiring to conduct.

Specifically, I'm writing at the level of exactness/explicitness that I normally expect in research conversations, but it seems like that is not enough here to avoid misunderstandings.  It's tough for me to find the right level of explicitness while avoiding the urge to put thousands of very pedantic words in every comment, just in case.

Re: non-RL training data.

Above, I used "RL policies" as a casual synecdoche for "sources of Gato training data," for reasons similar to the reasons that this post by Oliver Sourbut focuses on RL/control.

Yes, Gato had other sources of training data, but (1) the RL/control results are the ones everyone is talking about, and (2) the paper shows that the RL/control training data is driving those results (they get even better RL/control outcomes when they drop the other data sources).

Re: gains from transfer.

Yes, if Gato outperforms a particular RL/control policy that generated training data for it, then having Gato is better than merely having that policy, in the case where you want to do its target task.

However, training a Gato is not the only way of reaping gains from transfer.  Every time we finetune any model, or use multi-task training, we are reaping gains from transfer.  The literature (incl. this paper) robustly shows that we get the biggest gains from transfer when transferring between similar tasks, while distant or unrelated tasks yield no transfer or even negative transfer.

So you can imagine a spectrum ranging from 

  1. "pretrain only on one very related task" (i.e. finetuning a single narrow task model), to 
  2. "pretrain on a collection of similar tasks" (i.e. multi-task pretraining followed by finetuning), to
  3. "pretrain on every task, even those where you expect no or negative transfer" (i.e. Gato)

The difference between Gato (3) and ordinary multi-task pretraining (2) is that, where the latter would only train with a few closely related tasks, Gato also trains on many other less related tasks.

It would be cool if this helped, and sometimes it does help, as in this paper about training on many modalities at once for multi-modal learning with small transformers.  But this is not what the Gato authors found -- indeed it's basically the opposite of what they found.

We could use a bigger model in the hope that it will get us some gains from distant transfer (and there is some evidence that this will help), but with the same resources, we could also restrict ourselves to less-irrelevant data and then train a smaller (or same-sized) model on more of it.  Gato is at one extreme end of this spectrum, and everything suggests the optimum is somewhere in the interior.

Oliver's post, which I basically agree with, has more details on the transfer results.

Gato as the Dawn of Early AGI

I think the reason why it being a unified agent matters is that we should expect significant positive transfer to happen eventually as we scale up the model and train it longer on more tasks. Do you not?

Sure, this might happen.

But remember, to train "a Gato," we have to first train all the RL policies that generate its training data.  So we have access to all of them too.  Instead of training Gato, we could just find the one policy that seems closest to the target task, and spend all our compute on just finetuning it.  (Yes, related tasks transfer -- and the most related tasks transfer most!)

This approach doesn't have to spend any compute on the "train Gato" step before finetuning, which gives it a head start.  Plus, the individual policy models are generally much smaller than Gato, so they take less compute per step.

Would this work?  In the case of the Lee et al robot problem, yes (this is roughly what Lee et al originally did, albeit with various caveats).  In general, I don't know, but this is the baseline that Gato should be comparing itself against.

The question isn't "will it improve with scale?" -- it's 2022, anything worth doing improves with scale -- but "will it ever reach the Pareto frontier? will I ever have a reason to do it?"

As an ML practitioner, it feels like the paper is telling me, "hey, think of a thing you can already do. What if I told you a way to do the same thing, equally well, with an extra step in the middle?"  Like, uh, sure, but . . . why?

By contrast, when I read papers like AlphaGo, BERT, CLIP, OpenAI diffusion, Chinchilla . . . this is a type of paper where I say, "holy shit, this Fucking Works™, this moves the Pareto frontier." In several of these cases I went out and immediately used the method in the real world and reaped great rewards.

IMO, the "generalist agent" framing is misleading, insofar as it obscures this second-best quality of Gato.  It's not really any more an "agent" than my hypothetical cloud drive with a bunch of SOTA models on it.  Prompting Gato is the equivalent of picking a file from the drive; if I want to do a novel task, I still have to finetune, just as I would with the drive.  (A real AGI, even a weak one, would know how to finetune itself, or do the equivalent.)

We are not talking about an autonomous thing; we're still in the world where there's a human practitioner and "Gato" is one method they can use or not use.  And I don't see why I would want to use it.

Gato as the Dawn of Early AGI

For what it's worth, I was thoroughly underwhelmed by Gato, to the point of feeling confused about what the paper was even trying to demonstrate.

I'm not the only ML researcher who had this reaction.  In the Eleuther discord server, I said "i don't get what i'm supposed to take away from this gato paper," and responses from regulars included

  • "nothing, this was 3 years over-due"
  • "Yep. I didn't update much on this paper. I think the 'general' in the title is making people panic lol" (with two "this" reacts)

Or see this tweet.  I'm not trying to convince you by saying "lots of people agree with me!", but I think this may be useful context.

A key thing to remember when evaluating Gato is that it was trained on data from many RL models that were themselves very impressive.  So there are 2 very different questions we can ask:

  1. Does Gato successfully distill a large number of learned RL policies into a single, small collection of params?
  2. Does Gato do anything except distillation?  Is there significant beneficial transfer between tasks or data types?  Is Gato any more of a "generalist agent" than, like, a big cloud storage bucket with all of those RL models in it, and a little script that lets you pick which one to load and run?

And the answers are a pretty clear, stark "yes" and "no," respectively.

For #2, note that every time the paper investigates transfer, it gets results that are mostly or entirely negative (see Figs 9 and 17).  For example, including stuff like text data makes Gato seem more sexily "generalist" but does not actually seem to help anything -- it's like uploading a (low-quality) LM to the same cloud bucket as the RL policies.  It just sits there.

In the particular case of the robot stacking experiment, I don't think your read is accurate, for reasons related to the above.  Neither the transfer to real robotics, nor the effectiveness of offline finetuning, is new to Gato -- the researchers are sticking as close as they can to what was done in Lee et al 2022, which used the same stacking task + offline finetuning + real robots, and getting (I think?) broadly similar results.  That is, this is yet another success of distillation, without a clear value-add beyond distillation.

In the specific case of Lee et al's "Skill Generalization" task, it's important to note that the "expert" line is not reflective of "SOTA RL expert models."

The stacking task is partitioned here (over object shapes/colors) into train and test subsets.  The "expert" is trained only on the train subset, and then Lee et al (and the Gato authors) investigate models that are additionally tuned on the test subset in some way or other.  So the "expert" is really a baseline here, and the task consists of trying to beat it.

(This distinction is made somewhat clearer in an appendix of the Gato paper -- see Fig. 17, and note that the "expert" lines there match the "Dataset" lines from Fig. 3 in Lee et al 2022.)

larger language models may disappoint you [or, an eternally unfinished draft]

I've tried the method from that paper (typical sampling), and I wasn't hugely impressed with it.  In fact, it was worse than my usual sampler to a sufficient extent that users noticed the difference, and I switched back after a few days.  See this post and these tweets.

(My usual sampler is one I came up with myself, called Breakruns.  It works the best in practice of any I've tried.)

I'm also not sure I really buy the argument behind typical sampling.  It seems to conflate "there are a lot of different ways the text could go from here" with "the text is about to get weird."  In practice, I noticed it would tend to do the latter at points where the former was true, like the start of a sample or of a new paragraph or section.
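For readers who haven't seen the method, here is a minimal NumPy sketch of the filtering step in typical sampling as I understand it from the published description. It is not the code used in the experiments above. `typical_filter` is a hypothetical helper, the two distributions are made up, and it assumes strictly positive probabilities. The filter ranks tokens by how close their surprisal is to the entropy of the distribution, then keeps the smallest set that covers a chosen probability mass.

```python
import numpy as np

def typical_filter(probs, mass=0.9):
    """Keep the tokens whose surprisal is closest to the entropy, up to `mass`.

    Assumes every entry of `probs` is strictly positive and that they sum to 1.
    """
    probs = np.asarray(probs, dtype=float)
    surprisal = -np.log(probs)
    entropy = float((probs * surprisal).sum())
    # Rank tokens by |surprisal - entropy|, most "typical" first.
    order = np.argsort(np.abs(surprisal - entropy))
    cumulative = np.cumsum(probs[order])
    # Smallest prefix whose total probability reaches `mass`.
    cutoff = int(np.searchsorted(cumulative, mass)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# High-entropy step: one likely token plus a long tail of alternatives.
open_ended = np.array([0.3] + [0.01] * 70)
# Low-entropy step: one token is nearly certain.
peaked = np.array([0.97, 0.01, 0.01, 0.01])

open_ended_filtered = typical_filter(open_ended, mass=0.5)
peaked_filtered = typical_filter(peaked, mass=0.9)
print("top token kept at open-ended step:", open_ended_filtered[0] > 0)
print("top token kept at peaked step:", peaked_filtered[0] > 0)
```

In this toy case the open-ended step drops the single most likely token in favor of tail tokens, while the peaked step keeps only the top token. That is one way a filter of this kind could behave differently at points where many continuations are plausible, though it does not by itself explain the behavior reported above.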

Deciding how you sample is really important for avoiding the repetition trap, but I haven't seen sampling tweaks yield meaningful gains outside of that area.

dalle2 comments

Hmm... what moral are you drawing from that result?

Apparently, CLIP text vectors are very distinguishable from CLIP image vectors.  I don't think this should be surprising.  Text vectors aren't actually expressing images, after all, they're expressing probability distributions over images.

They are more closely analogous to the outputs of a GPT model's final layer than they are to individual tokens from its vocab.  The output of GPT's final layer doesn't "look like" the embedding of a single token, nor should it.  Often the model wants to spread its probability mass across a range of alternatives.

Except even that analogy isn't quite right, because CLIP's image vectors aren't "images," either -- they're probability distributions over captions.  It's not obvious that distributions-over-captions would be a better type of input for your image generator than distributions-over-images.

Also note that CLIP has a trainable parameter, the "logit scale," which multiplies the cosine similarities before the softmax.  So the overall scale of the cosine similarities is arbitrary (as is their maximum value).  CLIP doesn't "need" the similarities to span any particular range.  A similarity value like 0.4 doesn't mean anything on its own about how close CLIP thinks the match is.  That's determined by (similarity * logit scale).
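A minimal NumPy sketch of the logit-scale point, under made-up numbers: the cosine similarities and the two scale values are illustrative, and `match_probs` is a hypothetical helper rather than anything from a CLIP codebase. The same similarities give a nearly flat or a nearly one-hot distribution over captions depending on the multiplier, and halving the similarities while doubling the scale changes nothing.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

def match_probs(cos_sims, logit_scale):
    """Probabilities over captions for one image: softmax(scale * cosine)."""
    return softmax(logit_scale * np.asarray(cos_sims, dtype=float))

# Hypothetical cosine similarities between one image and three captions.
cos_sims = np.array([0.40, 0.30, 0.20])

flat = match_probs(cos_sims, logit_scale=1.0)     # nearly uniform
sharp = match_probs(cos_sims, logit_scale=100.0)  # nearly one-hot

# Rescaling similarities and logit scale in opposite directions is a no-op.
same = match_probs(cos_sims / 2, logit_scale=200.0)

print(flat.round(3), sharp.round(5), np.allclose(sharp, same))
```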

dalle2 comments

I completely agree that the effects of using unCLIP are mysterious, in fact the opposite of what I'd predict them to be.

I wish the paper had said more about why they tried unCLIP in the first place, and what improvements they predicted they would get from it.  It took me a long time just to figure out why the idea might be worth trying at all, and even now, I would never have predicted the effects it had in practice.  If OpenAI predicted them, then they know something I don't.

For instance, it seems like maybe the model that produced the roses on the left-hand side of the diversity-fidelity figure was also given a variable-length encoding of the caption? I'm having a hard time telling from what's written in the paper

Yes, that model did get to see a variable-length encoding of the caption.  As far as I can tell, the paper never tries a model that only has a CLIP vector available, with no sequential pathway.

Again, it's very mysterious that (GLIDE's pathway + unCLIP pathway) would increase diversity over GLIDE, since these models are given strictly more information to condition on!

(Low-confidence guess follows.  The generator views the sequential representation in its attention layers, and in the new model, these layers are also given a version of the CLIP vector, as four "tokens," each a different projection of the vector.  [The same vector is also, separately, added to the model's more global "embedding" stream.]  In attention, there is competitive inhibition between looking at one position, and looking at another.  So, it's conceivable that the CLIP "tokens" are so information-rich that the attention fixates on them, ignoring the text-sequence tokens.  If so, it would ignore some information that GLIDE does not ignore.)
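The "competitive inhibition" in that guess is just the softmax normalization in attention, and a toy NumPy sketch makes it concrete. Everything here is hypothetical except the count of four CLIP tokens: the number of text tokens, the logit values, and the helper `text_attention_mass` are assumptions for illustration, and nothing below is measured from a real model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

N_TEXT = 16  # hypothetical number of text-sequence tokens
N_CLIP = 4   # four CLIP "tokens", as described above

def text_attention_mass(clip_logit, text_logit=0.0):
    """Total attention weight left for the text tokens in one attention row.

    All text tokens share `text_logit`; all CLIP tokens share `clip_logit`.
    """
    logits = np.concatenate([
        np.full(N_TEXT, text_logit),
        np.full(N_CLIP, clip_logit),
    ])
    weights = softmax(logits)
    return float(weights[:N_TEXT].sum())

# As the CLIP tokens' logits rise, the text tokens' share shrinks,
# even though the text logits themselves never change.
for clip_logit in (0.0, 2.0, 5.0):
    print(clip_logit, round(text_attention_mass(clip_logit), 3))
```

Because the weights in a row must sum to one, a modest logit advantage for four positions is enough to crowd out sixteen others in this toy setup.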

It's also noteworthy that they mention the (much more obvious) idea of conditioning solely on CLIP text vectors, citing Katherine Crowson's work:

Building on this observation, another approach would be to train the decoder to condition on CLIP text embeddings [9] instead of CLIP image embeddings

...but they never actually try this out in a head-to-head comparison.  For all we know, a model conditioned on CLIP text vectors, trained with GLIDE's scale and data, would do better than GLIDE and unCLIP.  Certainly nothing in the paper rules out this possibility.

dalle2 comments

Ah, I now realize that I was kind of misleading in the sentence you quoted.  (Sorry about that.)

I made it sound like CLIP was doing image compression.  And there are ML models that are trained, directly and literally to do image compression in a more familiar sense, trying to get the pixel values as close to the original as possible.  These are the image autoencoders.

DALLE-2 doesn't use an autoencoder, but many other popular image generators do, such as VQGAN and the original DALLE.

So for example, the original DALLE has an autoencoder component which can compress and decompress 256x256 images.  Its compressed representation is a 32x32 array, where each cell takes a discrete value from 8192 possible values.  This is 13 bits per cell (if you don't do any further compression like RLE on it), so you end up with 13 Kibit (about 1.6 KiB) per image.  And then DALLE "writes" in this code the same way GPT writes text.
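The arithmetic behind those figures, as a short Python check. The grid size and codebook size come from the paragraph above; the comparison against raw pixels assumes 8-bit RGB, which is an assumption added here for scale.

```python
import math

GRID_SIDE = 32        # compressed representation is a 32x32 array
CODEBOOK_SIZE = 8192  # possible discrete values per cell

bits_per_cell = int(math.log2(CODEBOOK_SIZE))      # 13
total_bits = GRID_SIDE * GRID_SIDE * bits_per_cell  # 13312
total_kibibits = total_bits / 1024                  # 13.0
total_bytes = total_bits // 8                       # 1664, about 1.6 KiB

# Assumption for comparison: a raw 256x256 image at 3 bytes per pixel.
raw_bytes = 256 * 256 * 3                           # 196608
compression_ratio = raw_bytes / total_bytes         # roughly 118x

print(bits_per_cell, total_bits, total_kibibits, total_bytes)
print(round(compression_ratio, 1))
```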

CLIP, though, is not an autoencoder, because it never has to decompress its representation back into an image.  (That's what unCLIP does, but the CLIP encoding was not made "with the knowledge" that unCLIP would later come along and try to do this; CLIP was never encouraged to make its code especially suitable for this purpose.)

Instead, CLIP is trying to capture . . . "everything about an image that could be relevant to matching it with a caption."

In some sense this is just image compression, because in principle the caption could mention literally any property of the image.  But lossy compression always has to choose something to sacrifice, and CLIP's priorities are very different from the compressors we're more familiar with.  They care about preserving pixel values, so they care a lot about details.  CLIP cares about matching with short (<= ~70 word) captions, so it cares almost entirely about high-level semantic features.

Playing with DALL·E 2

In general all writing I've seen is bad. I think this is less likely to be about safety, and more that it's hard to learn language by looking at a lot of images. However, since DE2 is trained on text, it clearly knows a lot about language at some level -- I would expect there's plenty of data to put out coherent text. Instead it outputs nonsense, focusing on getting the fonts and the background right.

It's definitely possible to get a diffusion model to write the text from a prompt into an image.  I made a model that does this late last year. (blogpost / example outputs*)

The text-conditioning mechanism (cross-attention) I use is a little different from the ones in GLIDE and DALLE-2, but I doubt this makes a huge difference.

I'm actually a little surprised that the OpenAI models don't learn to write coherent text, since they're bigger than mine, trained for longer on more data.

But then, I'm much more focused on this one specific capability, so I make it easy for the model: an entire ~50% of my training images have text in them, and the "prompt" in my setup always contains an automatic transcript of the text in the image (if any), never a description, or a description that happens to quote a transcript, or a description that merely summarizes the text, etc.

The OpenAI models have to solve a more abstract version of the problem, and the problem is relevant to (I would imagine) a much smaller fraction of their training examples.

*check the alt text if you want to know what text the model is attempting to write

[Link] Training Compute-Optimal Large Language Models

Thinking back to the "inconsistency" from the Kaplan et al papers...

  • In Appendix E of the new paper, we see the loss-vs-compute frontier start to "bend" from a straight line on a log-log plot, with returns to additional compute getting smaller at large scales.
  • I suspect this bending is the transition from the faster "L(C) law" to the slower "L(D) law."
    • A brief recap of that below:
      • Adding more params can help in two ways: it makes your model's loss decline toward its asymptotic minimum faster, and it can lower that minimum itself.
      • As models get bigger, the first effect dies off -- the loss curves converge to a fixed shape, rather than getting ever steeper.  The second effect keeps going, but with it alone, the overall rate of return is lower.
  • Presumably, the learning rate issue in Kaplan et al. also affected their estimated L(D) law.
    • The issue made Kaplan et al underestimate optimal model performance.  The underestimate was worst when considering models for which the optimal number of training steps was small.
    • The L(D) law came from early stopping experiments.  The early stopping step is lower for smaller data sizes.
    • So the L(D) experiments with smaller D values look artificially bad, relative to the ones with large D values.  Thus the estimated L(D) curve declines faster than the true L(D) curve.
    • If this is correct, then L(D) improves more slowly with data than we had believed.
    • Note that this does not contradict the "use more data!" result from the paper -- that is about the relative rate at which N and D affect L(N, D).
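The last step of that argument (small-D runs looking artificially bad makes the fitted curve too steep) can be checked with a toy NumPy sketch. All numbers are invented for illustration: the "true" exponent, the data sizes, and the shape of the penalty are assumptions, not values from either paper. The only thing the sketch shows is the direction of the bias.

```python
import numpy as np

ALPHA_TRUE = 0.10  # hypothetical true exponent: L(D) proportional to D**(-alpha)

D = np.logspace(6, 10, 9)  # hypothetical dataset sizes
true_loss = D ** (-ALPHA_TRUE)

# Hypothetical penalty: measured loss is 20% too high at the smallest D,
# and the excess shrinks as D grows (mimicking runs with few training steps
# being hurt most).
penalty = 0.2 * (D / D[0]) ** (-0.25)
measured_loss = true_loss * (1.0 + penalty)

# Fit a straight line in log-log space, as one would for a power law.
slope_true = np.polyfit(np.log(D), np.log(true_loss), 1)[0]
slope_measured = np.polyfit(np.log(D), np.log(measured_loss), 1)[0]

print("true exponent:  ", round(-slope_true, 4))
print("fitted exponent:", round(-slope_measured, 4))
```

The fitted exponent comes out larger than the true one, so the estimated curve declines faster with D than the underlying curve does.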