Same person as nostalgebraist2point0, but now I have my account back.
Elsewhere:
Sure. Although before I do, I want to qualify the quoted claim a bit.
When I say "our goals change over time," I don't mean "we behave something like EU maximizers with time-dependent utility functions." I think we don't behave like EU maximizers, in the sense of having some high-level preference function that all our behavior flows from in a top-down manner.
If we often make choices that are rational in a decision-theoretic sense (given some assumption about the preferences we are trying to satisfy), we are doing so via a "subgoal capacity." This kind of decision-making is available to our outermost loop, and our outermost loop sometimes uses it.
But I don't think the logic of our outermost loop actually is approximate EU maximization -- as evidenced by all the differences between how we deploy our capabilities, and how a "smart EU maximizer" would deploy its own. For instance, IMO we are less power-seeking than an EU maximizer would be (and when humans do seek power, it's often as an end in itself, whereas power-seeking is convergent across goals for EU maximizers).
Saying that "our goals change over time, and we permit/welcome this" is meant to gesture at how different we are from a hypothetical EU maximizer with our capabilities. But maybe this concedes too much, because it implies we have some things called "goals" that play a similar role to the EU maximizer's utility function. I am pretty sure that's not in fact true. We have "goals" in the informal sense, but I don't really know what it is we do with them, and I don't think mapping them on to the "preferences" in a decision-theoretic story about human behavior is likely to yield accurate predictions.
Anyway:
Sorry, you asked for three, that was five. I wanted to cover a range of areas, since one can look at any one of these on its own and imagine a way it might fit inside a decision-theoretic story.
EU maximization with a non-constant world model ("map") might look like 3 and 4, while 5 involves a basic biological function of great interest to natural selection, so we might imagine it as a hardwired "special case" not representative of how we usually work. But the ubiquity and variety of this kind of thing, together with our lack of power-seeking etc., does strike me as a problem for decision-theoretic interpretations of human behavior at the lifetime level.
Meta-comment of my own: I'm going to have to tap out of this conversation after this comment. I appreciate that you're asking questions in good faith, and this isn't your fault, but I find this type of exchange stressful and tiring to conduct.
Specifically, I'm writing at the level of exactness/explicitness that I normally expect in research conversations, but it seems like that is not enough here to avoid misunderstandings. It's tough for me to find the right level of explicitness while avoiding the urge to put thousands of very pedantic words in every comment, just in case.
Re: non-RL training data.
Above, I used "RL policies" as a casual synecdoche for "sources of Gato training data," for reasons similar to the reasons that this post by Oliver Sourbut focuses on RL/control.
Yes, Gato had other sources of training data, but (1) the RL/control results are the ones everyone is talking about, and (2) the paper shows that the RL/control training data is driving those results (they get even better RL/control outcomes when they drop the other data sources).
Re: gains from transfer.
Yes, if Gato outperforms a particular RL/control policy that generated training data for it, then having Gato is better than merely having that policy, in the case where you want to do its target task.
However, training a Gato is not the only way of reaping gains from transfer. Every time we finetune any model, or use multi-task training, we are reaping gains from transfer. The literature (incl. this paper) robustly shows that we get the biggest gains from transfer when transferring between similar tasks, while distant or unrelated tasks yield no transfer or even negative transfer.
So you can imagine a spectrum ranging from
The difference between Gato (3) and ordinary multi-task pretraining (2) is that, where the latter would only train with a few closely related tasks, Gato also trains on many other less related tasks.
It would be cool if this helped, and sometimes it does help, as in this paper about training on many modalities at once for multi-modal learning with small transformers. But this is not what the Gato authors found -- indeed it's basically the opposite of what they found.
We could use a bigger model in the hope that will get us some gains from distant transfer (and there is some evidence that this will help), but with the same resources, we could also restrict ourselves to less-irrelevant data and then train a smaller (or same-sized) model on more of it. Gato is at one extreme end of this spectrum, and everything suggests the optimum is somewhere in the interior.
Oliver's post, which I basically agree with, has more details on the transfer results.
I think the reason why it being a unified agent matters is that we should expect significant positive transfer to happen eventually as we scale up the model and train it longer on more tasks. Do you not?
Sure, this might happen.
But remember, to train "a Gato," we have to first train all the RL policies that generate its training data. So we have access to all of them too. Instead of training Gato, we could just find the one policy that seems closest to the target task, and spend all our compute on just finetuning it. (Yes, related tasks transfer -- and the most related tasks transfer most!)
This approach doesn't have to spend any compute on the "train Gato" step before finetuning, which gives it a head start. Plus, the individual policy models are generally much smaller than Gato, so they take less compute per step.
Would this work? In the case of the Lee et al robot problem, yes (this is roughly what Lee et al originally did, albeit with various caveats). In general, I don't know, but this is the baseline that Gato should be comparing itself against.
The question isn't "will it improve with scale?" -- it's 2022, anything worth doing improves with scale -- but "will it ever reach the Pareto frontier? will I ever have a reason to do it?"
As an ML practitioner, it feels like the paper is telling me, "hey, think of a thing you can already do. What if I told you a way to do the same thing, equally well, with an extra step in the middle?" Like, uh, sure, but . . . why?
By contrast, when I read papers like AlphaGo, BERT, CLIP, OpenAI diffusion, Chinchilla . . . those are the papers where I say, "holy shit, this Fucking Works™, this moves the Pareto frontier." In several of these cases I went out and immediately used the method in the real world and reaped great rewards.
IMO, the "generalist agent" framing is misleading, insofar as it obscures this second-best quality of Gato. It's not really any more an "agent" than my hypothetical cloud drive with a bunch of SOTA models on it. Prompting GATO is the equivalent of picking a file from the drive; if I want to do a novel task, I still have to finetune, just as I would with the drive. (A real AGI, even a weak one, would know how to finetune itself, or do the equivalent.)
We are not talking about an autonomous thing; we're still in the world where there's a human practitioner and "Gato" is one method they can use or not use. And I don't see why I would want to use it.
For what it's worth, I was thoroughly underwhelmed by Gato, to the point of feeling confused what the paper was even trying to demonstrate.
I'm not the only ML researcher who had this reaction. In the Eleuther discord server, I said "i don't get what i'm supposed to take away from this gato paper," and responses from regulars included
Or see this tweet. I'm not trying to convince you by saying "lots of people agree with me!", but I think this may be useful context.
A key thing to remember when evaluating Gato is that it was trained on data from many RL models that were themselves very impressive. So there are 2 very different questions we can ask:
And the answers are a pretty clear, stark "yes" and "no," respectively.
For #2, note that every time the paper investigates transfer, it gets results that are mostly or entirely negative (see Figs 9 and 17). For example, including stuff like text data makes Gato seem more sexily "generalist" but does not actually seem to help anything -- it's like uploading a (low-quality) LM to the same cloud bucket as the RL policies. It just sits there.
In the particular case of the robot stacking experiment, I don't think your read is accurate, for reasons related to the above. Neither the transfer to real robotics, nor the effectiveness of offline finetuning, are new to Gato -- the researchers are sticking as close as they can to what was done in Lee et al 2022, which used the same stacking task + offline finetuning + real robots, and getting (I think?) broadly similar results. That is, this is yet another success of distillation, without a clear value-add beyond distillation.
In the specific case of Lee et al's "Skill Generalization" task, it's important to note that the "expert" line is not reflective of "SOTA RL expert models."
The stacking task is partitioned here (over object shapes/colors) into train and test subsets. The "expert" is trained only on the train subset, and then Lee et al (and the Gato authors) investigate models that are additionally tuned on the test subset in some way or other. So the "expert" is really a baseline here, and the task consists of trying to beat it.
(This distinction is made somewhat clearer in an appendix of the Gato paper -- see Fig. 17, and note that the "expert" lines there match the "Dataset" lines from Fig. 3 in Lee et al 2022.)
I've tried the method from that paper (typical sampling), and I wasn't hugely impressed with it. In fact, it was worse than my usual sampler to a sufficient extent that users noticed the difference, and I switched back after a few days. See this post and these tweets.
(My usual sampler is one I came up with myself, called Breakruns. It works the best in practice of any I've tried.)
I'm also not sure I really buy the argument behind typical sampling. It seems to conflate "there are a lot of different ways the text could go from here" with "the text is about to get weird." In practice, I noticed it would tend to do the latter at points where the former was true, like the start of a sample or of a new paragraph or section.
Deciding how you sample is really important for avoiding the repetition trap, but I haven't seen sampling tweaks yield meaningful gains outside of that area.
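(For concreteness, here is a minimal numpy sketch of the typical-sampling selection rule as I understand it. The function name, parameter names, and the tau default are illustrative, not taken from the paper.)

```python
import numpy as np

def typical_sampling_step(logits, tau=0.95, rng=None):
    """One decoding step of (my reading of) typical sampling.

    Keep the tokens whose surprisal is closest to the distribution's
    entropy, up to cumulative mass tau, then sample from that set.
    """
    if rng is None:
        rng = np.random.default_rng()

    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    entropy = -(probs * np.log(probs + 1e-12)).sum()
    surprisal = -np.log(probs + 1e-12)
    deviation = np.abs(surprisal - entropy)   # distance from "typical" surprisal

    order = np.argsort(deviation)             # most typical tokens first
    cum_mass = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum_mass, tau) + 1
    keep = order[:cutoff]

    kept_probs = probs[keep] / probs[keep].sum()
    return rng.choice(keep, p=kept_probs)     # index of the sampled token
```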
If you're on mobile, try it on desktop. Here's what the sentences look like on my laptop, at 100% zoom.
Hmm... what moral are you drawing from that result?
Apparently, CLIP text vectors are very distinguishable from CLIP image vectors. I don't think this should be surprising. Text vectors aren't actually expressing images, after all, they're expressing probability distributions over images.
They are more closely analogous to the outputs of a GPT model's final layer than they are to individual tokens from its vocab. The output of GPT's final layer doesn't "look like" the embedding of a single token, nor should it. Often the model wants to spread its probability mass across a range of alternatives.
Except even that analogy isn't quite right, because CLIP's image vectors aren't "images," either -- they're probability distributions over captions. It's not obvious that distributions-over-captions would be a better type of input for your image generator than distributions-over-images.
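To make the GPT half of that analogy concrete, a toy numpy sketch (all shapes and names are illustrative):

```python
import numpy as np

# GPT's final hidden state isn't "a token embedding"; it's the thing you
# multiply by the embedding matrix to get a distribution over tokens.
rng = np.random.default_rng(0)
d_model, vocab_size = 8, 50
final_hidden = rng.normal(size=d_model)                 # last-layer output at one position
embed_matrix = rng.normal(size=(vocab_size, d_model))   # one row per vocab token

logits = embed_matrix @ final_hidden                    # a score for each token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                    # a distribution over the vocab,
                                                        # not any single token's embedding
```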
Also note that CLIP has a trainable parameter, the "logit scale," which multiplies the cosine similarities before the softmax. So the overall scale of the cosine similarities is arbitrary (as is their maximum value). CLIP doesn't "need" the similarities to span any particular range. A similarity value like 0.4 doesn't mean anything on its own about how close CLIP thinks the match is. That's determined by (similarity * logit scale).
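A toy sketch of that point (illustrative code, not CLIP's actual implementation): the same cosine similarities yield very different match probabilities depending on the learned scale.

```python
import numpy as np

def clip_match_probs(text_vec, image_vecs, logit_scale):
    """Toy version of turning cosine similarities into match probabilities.

    In the real model, the logit scale is a learned scalar; the raw cosine
    similarities only matter after this rescaling.
    """
    text_vec = text_vec / np.linalg.norm(text_vec)
    image_vecs = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    sims = image_vecs @ text_vec                 # cosine similarities
    logits = logit_scale * sims
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()                     # softmax over candidate images

# The same similarities look "confident" or "diffuse" depending on the scale:
rng = np.random.default_rng(0)
text, imgs = rng.normal(size=16), rng.normal(size=(5, 16))
print(clip_match_probs(text, imgs, logit_scale=1.0))    # fairly flat
print(clip_match_probs(text, imgs, logit_scale=100.0))  # sharply peaked
```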
I completely agree that the effects of using unCLIP are mysterious, in fact the opposite of what I'd predict them to be.
I wish the paper had said more about why they tried unCLIP in the first place, and what improvements they predicted they would get from it. It took me a long time just to figure out why the idea might be worth trying at all, and even now, I would never have predicted the effects it had in practice. If OpenAI predicted them, then they know something I don't.
For instance, it seems like maybe the model that produced the roses on the left-hand side of the diversity-fidelity figure was also given a variable-length encoding of the caption? I'm having a hard time telling from what's written in the paper.
Yes, that model did get to see a variable-length encoding of the caption. As far as I can tell, the paper never tries a model that only has a CLIP vector available, with no sequential pathway.
Again, it's very mysterious that (GLIDE's pathway + unCLIP pathway) would increase diversity over GLIDE, since these models are given strictly more information to condition on!
(Low-confidence guess follows. The generator views the sequential representation in its attention layers, and in the new model, these layers are also given a version of the CLIP vector, as four "tokens," each a different projection of the vector. [The same vector is also, separately, added to the model's more global "embedding" stream.] In attention, there is competitive inhibition between looking at one position, and looking at another. So, it's conceivable that the CLIP "tokens" are so information-rich that the attention fixates on them, ignoring the text-sequence tokens. If so, it would ignore some information that GLIDE does not ignore.)
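To make that guessed mechanism concrete, a toy sketch of the conditioning pathway as I read it (all shapes and names are illustrative):

```python
import numpy as np

# The CLIP vector is projected into a few extra "tokens" that sit alongside
# the text-encoder sequence in the decoder's attention context.
rng = np.random.default_rng(0)
d_clip, d_model, n_text, n_extra = 768, 512, 77, 4

clip_vec = rng.normal(size=d_clip)                    # one CLIP embedding
projections = rng.normal(size=(n_extra, d_model, d_clip))
clip_tokens = projections @ clip_vec                  # (n_extra, d_model)

text_tokens = rng.normal(size=(n_text, d_model))      # variable-length caption encoding
attn_context = np.concatenate([text_tokens, clip_tokens], axis=0)  # (81, d_model)
# Attention over attn_context must split its mass between the 77 text tokens
# and the 4 information-dense CLIP tokens -- the competition described above.
```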
It's also noteworthy that they mention the (much more obvious) idea of conditioning solely on CLIP text vectors, citing Katherine Crowson's work:
Building on this observation, another approach would be to train the decoder to condition on CLIP text embeddings [9] instead of CLIP image embeddings
...but they never actually try this out in a head-to-head comparison. For all we know, a model conditioned on CLIP text vectors, trained with GLIDE's scale and data, would do better than GLIDE and unCLIP. Certainly nothing in the paper rules out this possibility.
Ah, I now realize that I was kind of misleading in the sentence you quoted. (Sorry about that.)
I made it sound like CLIP was doing image compression. And there are ML models that are trained, directly and literally, to do image compression in a more familiar sense, trying to get the pixel values as close to the original as possible. These are the image autoencoders.
DALLE-2 doesn't use an autoencoder, but many other popular image generators do, such as VQGAN and the original DALLE.
So for example, the original DALLE has an autoencoder component which can compress and decompress 256x256 images. Its compressed representation is a 32x32 array, where each cell takes a discrete value from 8192 possible values. This is 13 bits per cell (if you don't do any further compression like RLE on it), so you end up with roughly 13 kilobits, or about 1.6 KiB, per image. And then DALLE "writes" in this code the same way GPT writes text.
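The arithmetic, for concreteness:

```python
# Back-of-envelope size of the original DALLE's compressed image code
cells = 32 * 32                       # 1024 grid cells per 256x256 image
bits_per_cell = 13                    # log2(8192) possible values per cell
total_bits = cells * bits_per_cell    # 13,312 bits
print(total_bits / 8 / 1024)          # ~1.63 KiB per image
```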
CLIP, though, is not an autoencoder, because it never has to decompress its representation back into an image. (That's what unCLIP does, but the CLIP encoding was not made "with the knowledge" that unCLIP would later come along and try to do this; CLIP was never encouraged to make its code especially suitable for this purpose.)
Instead, CLIP is trying to capture . . . "everything about an image that could be relevant to matching it with a caption."
In some sense this is just image compression, because in principle the caption could mention literally any property of the image. But lossy compression always has to choose something to sacrifice, and CLIP's priorities are very different from those of the compressors we're more familiar with. Those compressors care about preserving pixel values, so they care a lot about details. CLIP cares about matching with short (<= ~70 word) captions, so it cares almost entirely about high-level semantic features.
I personally doubt that this is true, which is maybe the crux here.
This seems like a possibly common assumption, and I'd like to see a more fleshed-out argument for it. I remember Scott making this same assumption in a recent conversation:
But is it true that "optimizers are more optimal"?
When I'm designing systems or processes, I tend to find that the opposite is true -- for reasons that are basically the same reasons we're talking about AI safety in the first place.
A powerful optimizer, with no checks or moderating influences on it, will tend to make extreme Goodharted choices that look good according to its exact value function, and very bad (because extreme) according to almost any other value function.
Long before things reach the point where the outer optimizer is developing a superintelligent inner optimizer, it has plenty of chances to learn the general design principle that "putting all the capabilities inside an optimizing outer loop ~always does something very far from what you want."
Some concrete examples from real life:
That would look like "setting up a single training run, running it, and then using the model artifact that results, without giving yourself freedom to go back and do it over again (unless you can find a way to automate that process itself with gradient descent)." This is a peculiar policy which no one follows. The individual artifacts resulting from individual training runs are quite often bad -- they're overfit, or underfit, or training diverged, or they got great val metrics but the output sucks and it turns out your val set has problems, or they got great val metrics but the output isn't meaningfully better and the model is 10x slower than the last one and the improvement isn't worth it, or they are legitimately the best thing you can get on your dataset but that causes you to realize you really need to go gather more data, or whatever.
All the impressive ML artifacts made "by gradient descent" are really outputs of this sort of process of repeated experimentation, refining of targets, data gathering and curation, reframing of the problem, etc. We could argue over whether this process is itself a form of "optimization," but in any case we have in our hands a (truly) powerful thing that very clearly is optimization, and yet to leverage it effectively without getting Goodharted, we have to wrap it inside some other thing.
"How would I want people to behave if I – as in actual me, not a toy character like Alice or Bob – were managing a team of people on some project? I wouldn’t want them to be ruthless global optimizers; I wouldn’t want them to formalize the project goals, derive their paperclip-analogue, and go off and do that. I would want them to take local iterative steps, check in with me and with each other a lot, stay mostly relatively close to things already known to work but with some fraction of time devoted to far-out exploration, etc."
There are of course many Goodhart horror stories about organizations that focus too hard on metrics. The way around this doesn't seem to be "find the really truly correct metrics," since optimization will always find a way to trick you. Instead, it seems crucial to include some mitigating checks on the process of optimizing for whatever metrics you pick.
Mostly self-explanatory. Admittedly a dictator is not likely to be a coherent optimizer, but I expect a dictatorship to behave more like one than a parliamentary democracy.
If coherence is a convergent goal, why don't all political sides come together and build a system that coherently does something, whatever that might be? In this context, at least, it seems intuitive enough that no one really wants this outcome.
In brief, I don't see how to reconcile
EDIT: an additional consideration applies in the situation where the AI is already at least as smart as us, and can modify itself to become more coherent. Because I'd expect that AI to notice the existence of the alignment problem just as much as we do (why wouldn't it?). I mean, would you modify yourself into a coherent EU-maximizing superintelligence with no alignment guarantees? If that option became available in real life, would you take it? Of course not. And our hypothetical capable-but-not-coherent AI is facing the exact same question.