Image GPT

by Daniel Kokotajlo | 1 min read | 18th Jun 2020 | 27 comments

This is a linkpost for https://openai.com/blog/image-gpt/

My hot take:

Not too surprising to me, considering what GPT-3 could do. However there were some people (and some small probability mass remaining in myself) saying that even GPT-3 wasn't doing any sort of reasoning, didn't have any sort of substantial understanding of the world, etc. Well, this is another nail in the coffin of that idea, in my opinion. Whatever this architecture is doing on the inside, it seems to be pretty capable and general.

I don't think this architecture will scale to AGI by itself. But the dramatic success of this architecture is evidence that there are other architectures, not too far away in search space, that exhibit similar computational efficiency and scales-with-more-compute properties, that are useful for more different kinds of tasks.


My first thought was that they put some convolutional layers in to preprocess the images and then used the GPT architecture, but no, it's literally just GPT again....

Does this maybe give us evidence that the brain isn't anywhere near a peak of generality, since we use specialised circuits for processing image data (which convolutional layers were based on)?

Not necessarily. There is no gene which hardcodes a convolutional kernel into the brain which we can look at and say, 'ah yes, the brain is implementing a convolution, and nothing else'. Attention mechanisms for images learn convolution-like patterns (just more flexibly, and not pre-hardwired): to the extent that convolutions are powerful because they learn things like spatial locality (which is obviously true & useful), we would expect any more general learning algorithm to also learn similar patterns and look convolution-like. (This is a problem which goes back at least to, I think, Von Neumann: the fact that the brain is universal makes it very hard to tell what algorithm it actually runs, as opposed to what algorithms it has learned to run.)
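To make the "convolution-like patterns" point concrete, here is a minimal sketch of why a convolution can be read as a restricted special case of attention-style mixing: every output position is a weighted sum over input positions, and a convolution is the case where the weights are fixed, local, and identical for every row. The function names are illustrative only. Real attention computes its weights from the input and normalizes them with a softmax, which this toy example does not do.

```python
# Sketch: a 1D convolution written as a fixed, local mixing matrix.
# All names are illustrative; nothing here is taken from GPT or iGPT code.

def conv1d_same(signal, kernel):
    """Zero-padded 1D cross-correlation with an odd-length kernel."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        total = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                total += w * signal[j]
        out.append(total)
    return out

def local_mixing_matrix(n, kernel):
    """n x n matrix; row i holds the weight position i places on each position j.
    Only entries near the diagonal are non-zero, and every row reuses the same weights."""
    half = len(kernel) // 2
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < n:
                matrix[i][j] = w
    return matrix

def apply_mixing(matrix, signal):
    """Each output is a weighted sum over all input positions, as in attention."""
    return [sum(w * x for w, x in zip(row, signal)) for row in matrix]

signal = [1.0, 2.0, 4.0, 8.0, 16.0]
kernel = [0.25, 0.5, 0.25]
conv_out = conv1d_same(signal, kernel)
mix_out = apply_mixing(local_mixing_matrix(len(signal), kernel), signal)
print(conv_out == mix_out)  # the two computations agree on this input
```

A learned attention layer is free to put weight anywhere in each row, so a banded, row-invariant pattern like this one is something it can converge to without being hardwired.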

This video conjectures that GPT-3 was literally just saving everything from the training corpus and remixing it, without complex reasoning. https://www.youtube.com/watch?v=SY5PvZrJhLE

The same conjecture could work for GPT-I

So GPT-3 is something like a giant look-up table, which approximates the answer by interpolating between a few nearest recorded answers, while the actual intellectual work was performed by those who created the training dataset?

Thanks for sharing that. Now having watched the video, I am updating towards that position. I'm now only something like 80% confident that reasoning isn't a roadblock. I look forward to learning whether GPT-3 can do word scrambling tasks.

Consider the two questions:

1. Does GPT-3 have "reasoning" and "understanding of the world"?

2. Does iGPT have "reasoning" and "understanding of the world"?

According to me, these questions are mostly separate, and answering one doesn't much help you answer the other.

So:

However there were some people (and some small probability mass remaining in myself) saying that even GPT-3 wasn't doing any sort of reasoning, didn't have any sort of substantial understanding of the world, etc. Well, this is another nail in the coffin of that idea, in my opinion. Whatever this architecture is doing on the inside, it seems to be pretty capable and general.

... I don't understand what you mean here. The weights of image GPT are different from the weights of regular GPT-3, only the architecture is the same. Are you claiming that just the architecture is capable of "reasoning", regardless of the weights?

Or perhaps you're claiming that for an arbitrary task, we could take the GPT-3 architecture and apply it to that task and it would work well? But it would require a huge dataset and lots of training -- it doesn't seem like that should be called "reasoning" and/or "general intelligence".

Yeah I guess I'm confused what you're claiming here.

I'm neither claiming that just the architecture is reasoning, nor that the architecture would work for any task. I'm also not saying GPT is a general intelligence. I agree that GPT-3 and iGPT are separate things. However, what happens with one can be evidence for what is going on inside the other, given that they have the same architecture.

What I'm thinking is this: The path to AGI may involve "roadblocks," i.e. things that won't be overcome easily, i.e. things that won't be solved simply by tweaking and recombining our existing architectures and giving them orders of magnitude more compute, data, etc. Various proposals have been made for possible roadblocks, in the form of claims about what current methods cannot do: Current methods can't do long-term planning, current methods can't do hidden-information games, current methods can't do reasoning, current methods can't do common sense, etc.

Occasionally something which is hypothesized to be a roadblock turns out not to be. E.g. it turns out AlphaStar, OpenAI Five, etc. work fine with hidden information games, and afaik this didn't involve any revolutionary new insights but just some tweaking and recombining of existing ideas along with loads more compute.

My claim is that the GPTs are evidence against reasoning and common sense understanding being roadblocks. There may be other roadblocks. And probably GPT isn't "reasoning" nearly as well or as comprehensively and generally as we humans do. Similarly, its common sense isn't as good as mine. But it has a common sense, and it's improving as we make bigger and bigger GPTs.

One thing I should say as a caveat is that I don't have a clear idea of what people mean when they say reasoning is a roadblock. I think reasoning is a fuzzy and confusing concept. Perhaps I am wrong to say this is evidence against reasoning being a roadblock, because I'm misunderstanding what people mean by reasoning. I'd love to hear someone explain carefully what reasoning is and why it's likely a roadblock.

Stuart Russell gave his list of roadblocks in that same interview, which is relevant as he might have just made a claim there that was falsified by GPT3:

The first thing is that the Go board is fully observable. You can see the entire state of the world that matters. And of course in the real world there’s lots of stuff you don’t see and don’t know. Some of it you can infer by accumulating information over time, what we call state estimation, but that turns out to be quite a difficult problem. Another thing is that we know all the rules of Go, and of course in the real world, you don’t know all the rules, you have to learn a lot as you go along. Another thing about the Go board is that despite the fact that we think of it as really complicated, it’s incredibly simple compared to the real world. At any given time on the Go board there’s a couple of hundred legal moves, and the game lasts for a couple hundred moves.
And if you said, well, what are the analogous primitive actions in the real world for a human being? Well, we have 600 muscles and we can actuate them maybe about 10 times per second each. Your brain probably isn’t able to do that, but physically that’s what could be your action space. And so you actually have then a far greater action space. And you’re also talking about… We often make plans that last for many years, which is literally trillions of primitive actions in terms of muscle actuations. Now we don’t plan those all out in detail, but we function on those kinds of timescales. Those are some of the ways that Go and the real world differ. And what we do in AI is we don’t say, okay, I’ve done Go, now I’m going to work on suicide Go, and now I’m going to work on chess with three queens.
What we try to do is extract the general lessons. Okay, we now understand fairly well how to handle that whole class of problems. Can we relax the assumptions, these basic qualitative assumptions about the nature of the problem? And if you relax all the ones that I listed, and probably a couple more that I’ve got

So dealing with partial observability, discovering new action sets, managing mental activity (?) and some others. This seems close to the list in an older post I wrote:

Stuart Russell’s List
human-like language comprehension
cumulative learning
discovering new action sets
managing its own mental activity
For reference, I’ve included two capabilities we already have that I imagine being on a similar list in 1960
perception and object recognition
efficient search over known facts

If AlphaStar is evidence that partial observability isn't going to be a problem, is GPT3 similarly evidence that language comprehension isn't going to be a problem, since GPT3 can do things like simple arithmetic? That leaves cumulative learning, discovering action sets and managing mental activity on Stuart's list.

I don't know whether reasoning is a roadblock or not, but I discuss some ways in which GPT doesn't have it in this comment.

Thanks, I agree AlphaStar doesn't seem to have it. What do you think about GPT's arithmetic and anagram stuff? Also, you say that AI Dungeon uses GPT-3, but their "About" page still says they use GPT-2. Anyhow, I now think I was too confident in my original claim, and am further revising downwards.

AI Dungeon definitely uses GPT-3. Look at their video+blurb on the Beta page, note the updates page mentions "Double the Memory!: AI Dungeon has double the memory! If you didn't hear, we recently upgraded our AI. With that upgrade the AI can now remember twice as much!" (there is no GPT-2 with a context window of 2048). I've also discussed this with Walton. I don't know why people find it so hard to believe that maybe a tiny startup doesn't update every last piece of documentation instantaneously.

OK, thanks. I don't find that hard to believe at all.

Ah I see, that makes more sense, thanks!

This isn't news. We have known for almost a decade that sequence predictors could model images, and OpenAI did the same thing last year with less compute, but no one noticed.

Thanks for pointing this out--funnily enough, I actually read the OpenAI thing last year and thought it was cool, but then forgot about it by the time this came out! (The thing from a decade ago I hadn't heard of)

I very definitely noticed Sparse Transformer, but what you're missing is that the Sparse Transformer work showed good compression performance but was small-scale & primarily about describing the Sparse Transformer/showing it works, and there's nothing in it about few-shot/transfer learning. There is no guarantee that it is learning particularly useful representations just because it predicts pixel-by-pixel well which may be distributed throughout the GPT, somewhat like the problem in finding the equivalent of Gram matrices in text models (unlike the semi-supervised CNNs where you can expect the embedding or pre-embedding to distill all the knowledge into one place, by design), and you can see in iGPT that getting the representation out is nontrivial: you can easily pick a bad layer to use as the embedding.

There is no guarantee that it is learning particularly useful representations just because it predicts pixel-by-pixel well which may be distributed throughout the GPT,

Personally, I felt that that wasn't really surprising either. Remember that this whole deep learning thing started with exactly what OpenAI just did. Train a generative model of the data, and then fine tune it to the relevant task.

However, I'll admit that the fact that there's an optimal layer to tap into, and that they showed that this trick works specifically with transformer autoregressive models, is novel to my knowledge.

Being able to accomplish something is important even if it was predicted to be possible. No one is surprised that generative models do embody a lot of useful knowledge (that's much of the point), but it can be hard to tap into it.

The difference between GPT & iGPT for transfer learning is that GPT can be queried directly via its modality by putting in text: "Translate this into French", "what genre of text is this?", "tldr", etc. On the other hand, if you were querying iGPT by handing it half an image and expecting it to complete it in a useful way, there is absolutely nothing surprising about that being useful, obviously; but I have a hard time thinking of how you could implement classification by image completion! You normally have to get the knowledge out a different way, through an embedding which can be fed into a linear classification layer; if you can't do that, it's unclear what exactly you do. It was unclear how you use Sparse Transformers, PixelRNN, GANs, etc to do any of that. Now it's clearer.
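The "embedding fed into a linear classification layer" route can be sketched as follows. This is a toy illustration of the mechanics only, under loud assumptions: the two "layers" are synthetic stand-ins invented here (in a real experiment they would be activations read out of a frozen pretrained model), and the probe is plain logistic regression trained by gradient descent. It is not iGPT's code and reproduces none of its results.

```python
import math

def make_example(i):
    """Return (label, features per hypothetical layer) for synthetic example i."""
    label = i % 2
    sign = 1.0 if label == 1 else -1.0
    return label, {
        # Stand-in for a layer where the class is linearly readable.
        "layer_a": [sign + 0.1 * ((i % 5) - 2), 0.1 * ((i % 3) - 1)],
        # Stand-in for a layer whose features ignore the class entirely.
        "layer_b": [0.5 * ((i // 2) % 5 - 2), 0.5 * ((i // 2) % 3 - 1)],
    }

def predict_prob(weights, bias, x):
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def fit_linear_probe(xs, ys, steps=200, lr=0.5):
    """Logistic regression by full-batch gradient descent; the features stay frozen."""
    weights, bias = [0.0] * len(xs[0]), 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(xs, ys):
            err = predict_prob(weights, bias, x) - y
            grad_b += err
            for d, v in enumerate(x):
                grad_w[d] += err * v
        bias -= lr * grad_b / len(xs)
        weights = [w - lr * g / len(xs) for w, g in zip(weights, grad_w)]
    return weights, bias

def probe_accuracy(layer, train_ids, test_ids):
    """Fit a probe on one layer's features and score it on held-out examples."""
    train = [make_example(i) for i in train_ids]
    test = [make_example(i) for i in test_ids]
    weights, bias = fit_linear_probe([f[layer] for _, f in train], [y for y, _ in train])
    hits = sum((predict_prob(weights, bias, f[layer]) >= 0.5) == (y == 1) for y, f in test)
    return hits / len(test)

accuracy = {layer: probe_accuracy(layer, range(30), range(30, 40))
            for layer in ("layer_a", "layer_b")}
best_layer = max(accuracy, key=accuracy.get)
print(accuracy, best_layer)
```

The same loop run over every layer of a real model is how one would find out which layer makes a good embedding and which makes a bad one.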

As an analogous example, consider textual style transfer. You can't do it (pre-GPT-3, anyway). Do char-RNNs and Transformers understand the difference between authors and styles and content? Are they capable of textual style transfer? I would be shocked if they weren't. Probably, yes, after all, they can uncannily mimic authors and write plausibly about all sorts of content. But nevertheless, they lack a Gram matrix like CNNs you can easily optimize to do style transfer with. So, no one can do it. Someone finally figuring out how to do it would be big news even if the end output is not surprising.

How hard do you think it would be to do Image GPT but for video? That sounds like it could be pretty cool to see. Probably can be used to create some pretty trippy shit. Once it gets really good it could be used in robotics. Come to think of it, isn't that sorta what self driving cars need? Something that looks at a video of the various things happening around the car and predicts what's going to happen next?

Video is just a very large image (n times bigger). So as a quick heuristic, you can say that whatever you can do with images, you can do with video, just n times more expensive... Since iGPT is pretty expensive, I don't expect iGPT for video any more than I expect it for 512px images. With efficient attention mechanisms and hierarchy, it seems a lot more plausible. There are already RNNs for 64px video out to 25 frames, for example. I'm not sure directly modeling video is all that useful for self-driving cars. Working at the pixel-level is useful pretraining, but it's not necessarily where you want to be for planning. (Would MuZero play Go better if we forced it to emit, based on its latent space being used for planning, a 1024px RGB image of a photorealistic Go board at every step in a rollout? Most attempts to do planning while forcing reconstruction of hypothetical states don't show good results.)
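A rough back-of-the-envelope version of that heuristic, using the 64px, 25-frame, and 512px figures mentioned above. Two assumptions are mine and not from the comment: that each pixel becomes exactly one token, and that the model uses dense self-attention, where the number of position pairs grows with the square of the sequence length. Treat it as a sketch of the scaling, not a cost estimate.

```python
def tokens_per_image(side_px, tokens_per_pixel=1):
    """Sequence length for a square image (assumes one token per pixel by default)."""
    return side_px * side_px * tokens_per_pixel

def dense_attention_pairs(sequence_length):
    """Dense self-attention compares every position with every position."""
    return sequence_length ** 2

frames = 25
image_tokens = tokens_per_image(64)        # one 64px frame
video_tokens = frames * image_tokens       # 25 frames of 64px video
large_image_tokens = tokens_per_image(512) # one 512px image

token_ratio = video_tokens // image_tokens
pair_ratio = dense_attention_pairs(video_tokens) // dense_attention_pairs(image_tokens)

print(image_tokens, video_tokens, large_image_tokens)  # 4096 102400 262144
print(token_ratio, pair_ratio)                          # 25 625
```

Under these assumptions the clip is 25 times longer as a sequence but needs 625 times as many attention pairs, and a single 512px image is longer still than the whole 64px clip, which is why efficient attention and hierarchy matter for both cases.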

Right. The use case I had in mind for self-driving cars was the standard "You see someone walking by the edge of the street; are they going to step out into the street or not? It depends on e.g. which way they are facing, whether they just dropped something into the street, ... etc." That seems like something where pixel-based image prediction would be superior to e.g. classifying the entity as a pedestrian and then adding a pedestrian token to your 3D model of your environment.

A possible example of the Ernest Rutherford effect (a respected scientist says a thing isn't going to happen and then the next day it does): Stuart Russell, speaking in a recent podcast:

Deep learning systems are needing, even for these relatively simple concepts, thousands, tens of thousands, millions of examples, and the idea within deep learning seems to be that well, the way we’re going to scale up to more complicated things like learning how to write an email to ask for a job, is that we’ll just have billions or trillions of examples, and then we’ll be able to learn really, really complicated concepts. But of course the universe just doesn’t contain enough data for the machine to learn direct mappings from perceptual inputs or really actually perceptual input history. So imagine your entire video record of your life, and that feeds into the decision about what to do next, and you have to learn that mapping as a supervised learning problem. It’s not even funny how unfeasible that is. The longer the deep learning community persists in this, the worse the pain is going to be when their heads bang into the wall.

I could be wrong, but GPT3 probably could write a passable job application letter.

For kicks, I wrote a job application prompt for GPT-3. Here's the result: https://pastebin.com/BQDnqqjd (I suspect P&G will be hiring the first candidate - but you miss 100% of the shots you don't take!).

Hilarious.

I also worked for a day as a cashier at a grocery store.

Wow, this is really good and really funny! I don't know if it counts as knowing how to write an email to ask for a job. On the one hand it knows like 99% of it... but on the other hand even the first letter comes across as immature.

True, but it's not like I wrote a very serious, mature prompt either; real job ads are much more boring and jargony. (I could try harder but my poetry explorations are more interesting to me.) I suspect that with that prompt, it creates a humorous storytelling frame and thus I am tapping into the fiction-writing skills as well: the genre of fake job application letter does exist, and sounds a lot like the second letter (eg Joey Comeau's Overqualified).

It's not clear to me that Stuart was saying we won't be able to use deep learning to write a job application letter--rather, perhaps he just meant that deep learning folks typically seem to think that we'll be able to do this via supervised learning, but they are wrong because we'll never have enough data. Idk. You might be right.