The surprising parameter efficiency of vision models

beren


by beren

8th Apr 2023


Crossposted from my personal blog.

Epistemic status: This is a short post meant to highlight something I do not yet understand and therefore a potential issue with my models. I would also be interested to hear if anybody else has a good model of this.

Why do vision (and audio) models work so well despite being so small? State of the art models like Stable Diffusion and Midjourney work exceptionally well, generating near-photorealistic art and images and giving users a fair degree of control over their generations. I would estimate with a fair degree of confidence that the capabilities of these models probably surpass the mental imagery abilities of almost all humans (they definitely surpass mine and those of a number of people I have talked to). However, these models are also super small in terms of parameters. The original Stable Diffusion is only 890M parameters.

In terms of dataset size, image models are at a rough equality with humans. The Stable Diffusion dataset is 2 billion images. Assuming that you see 10 images per second every second you are awake and that you are awake 18 hours a day, you can observe roughly 236 million images per year and so match the Stable Diffusion dataset after about 8.5 years. Of course, the images you see are much more redundant and we made some highly aggressive assumptions, but after a human lifetime being in the same OOM as a SOTA image model is not insane. On the other hand, the hundreds of billions to trillions of tokens fed to LLMs is orders of magnitude beyond what humans could ever experience.
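The back-of-the-envelope estimate above can be reproduced in a few lines. This is only a sketch using the assumptions stated in the paragraph (10 images per second, 18 waking hours per day, a 2-billion-image dataset); the variable names are made up for illustration.

```python
# Assumptions taken directly from the paragraph above.
IMAGES_PER_SECOND = 10
WAKING_HOURS_PER_DAY = 18
DAYS_PER_YEAR = 365
DATASET_IMAGES = 2_000_000_000  # Stable Diffusion dataset size quoted in the post

# Images seen per year of waking life, and years needed to match the dataset.
images_per_year = IMAGES_PER_SECOND * 3600 * WAKING_HOURS_PER_DAY * DAYS_PER_YEAR
years_to_match = DATASET_IMAGES / images_per_year

print(f"images per year: {images_per_year:,}")
print(f"years to match dataset: {years_to_match:.1f}")
```

Under these assumptions the result is a few hundred million images per year and a bit under a decade to reach the dataset size, so the "same order of magnitude" claim holds.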

A similar surprising smallness occurs in audio models. OpenAI's Whisper can do almost flawless audio transcription (including multilingual translation!) with just 1.6B parameters.

Let's contrast this to the brain. Previously, I estimated that we should expect the visual cortex to have on the order of 100B parameters, if not more. The auditory cortex should be of roughly the same order of magnitude, but slightly smaller than the visual cortex. That is two orders of magnitude larger than state of the art DL models in these modalities.
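To make the "two orders of magnitude" gap concrete, here is a small sketch that compares the post's 100B estimate for the visual cortex with the model sizes quoted above. Comparing Whisper against the visual estimate is a simplification, since the post only says the auditory cortex is somewhat smaller; treat that line as an upper bound.

```python
import math

BRAIN_VISUAL_PARAMS = 100e9  # the post's rough estimate for the visual cortex
MODEL_PARAMS = {"Stable Diffusion": 890e6, "Whisper": 1.6e9}  # sizes quoted in the post

def orders_of_magnitude(larger, smaller):
    """Base-10 log of the size ratio, i.e. the gap in OOMs."""
    return math.log10(larger / smaller)

gaps = {name: orders_of_magnitude(BRAIN_VISUAL_PARAMS, params)
        for name, params in MODEL_PARAMS.items()}

for name, gap in gaps.items():
    print(f"{name}: about {gap:.2f} OOM smaller than the cortex estimate")
```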

This contrasts with state of the art language models which appear to be approximately equal to the brain in parameter count and abilities. Small (1-10B) language models are clearly inferior to the brain at producing valid text and completions as well as standard question answering and factual recall tasks. Human parity in factual knowledge is reached somewhere between GPT-2 and GPT-3. Human language abilities are still not entirely surpassed with GPT-3 (175B parameters) or GPT-4 (presumably significantly larger). This puts large language models within approximately the same order of magnitude as the human linguistic cortex.

What could be the reasons for this discrepancy? Off the top of my head I can think of several, listed below in rough order of intuitive plausibility, and it would be interesting to try to investigate these further. Also, if anybody has ideas or evidence either way please send me a message.

1.) The visual cortex vs image models is not a fair comparison. The brain does lots of stuff image generation models can't do, such as parse and render very complex visual scenes, deal with saccades and having two eyes, and, crucially, handle video data and moving stimuli. We haven't fully cracked video yet and it is plausible that to do so existing vision models require an OOM or two more of scale.

2.) There are specific inefficiencies in the brain's processing of images that image models skip which do not apply to language models. One very obvious example of this is convolutions. While CNNs have convolutional filters which are applied to all tiles of the image individually, the brain cannot do this and so must laboriously have separate neurons and synapses encode each filter. Indeed, much of the processing in the retina, lateral geniculate nucleus, and even V1 appears to be taken up with extremely simple filters (such as Gabors, edge detectors, line detectors etc) copied over and over again for different image patches. This 'artificially' inflates the parameter count of the visual cortex vs ML models such that the visual cortex's 'effective parameter count' is much smaller than it appears. However, I doubt this can be the whole story as recent image models such as Stable Diffusion use increasingly transformer-like architectures (residual stream + attention) rather than convolutions for most of the image processing pipeline. Similarly, Whisper only has 1 conv block at the beginning before transitioning into an attention based architecture.
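As a rough illustration of how much weight sharing can inflate the comparison in point 2, the sketch below counts weights for one convolutional layer versus the same layer with a separate filter bank at every position (closer to what the brain is described as doing). The layer shape is an arbitrary assumption and biases are ignored.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights in one conv layer: a single filter bank shared by every position."""
    return kernel * kernel * in_channels * out_channels

def unshared_params(kernel, in_channels, out_channels, height, width):
    """Same receptive fields, but a separate filter bank at each output position."""
    return conv_params(kernel, in_channels, out_channels) * height * width

# Hypothetical layer: 3x3 kernels, 64 -> 64 channels, on a 64x64 feature map.
shared = conv_params(3, 64, 64)
unshared = unshared_params(3, 64, 64, 64, 64)

print(f"shared: {shared:,}  unshared: {unshared:,}  ratio: {unshared // shared:,}")
```

For this toy layer the unshared version needs one copy of the filters per spatial position, so the inflation factor is simply the number of positions.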

3.) Parameter count is the wrong way to assess diffusion models. Unlike feedforward NNs such as transformers or earlier vision models such as GANs/VAEs, diffusion models generate (and are trained) using a reasonably large number of diffusion steps to iteratively 'decode' an image. This process is very similar to the iterative inference via recurrence that occurs in the brain. However, unlike diffusion models, the brain supports a single feedforward amortized sweep to achieve core object recognition (otherwise your vision would be too slow to detect important things such as predators in time). It is possible that the iterative inference supported by diffusion models is more parameter efficient than a direct amortized net would be, and thus gets a saving over the brain in this way. While there are very good VAEs/GANs in existence and at scale, it may be that these need to have an OOM or more parameters to be competitive with diffusion models. Note that in terms of computational cost, since a forward pass through an amortized net is so much cheaper than a generation with a diffusion network (a diffusion network generation is effectively N amortized forward passes where N is the number of diffusion steps), comparable VAEs/GANs may actually be cheaper to run even if much larger.
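The cost argument at the end of point 3 can be sketched with a crude proxy: cost scales with parameters times the number of forward passes. The step count of 50 and the 10x size factor for the amortized net are illustrative assumptions, not measurements.

```python
DIFFUSION_PARAMS = 890e6  # Stable Diffusion size quoted in the post
DIFFUSION_STEPS = 50      # assumed number of diffusion steps N (illustrative)
AMORTIZED_SCALE = 10      # assumed: a one-shot net needs 10x the parameters

def relative_cost(params, forward_passes):
    """Crude proxy: cost grows with (parameters) x (number of forward passes)."""
    return params * forward_passes

diffusion_cost = relative_cost(DIFFUSION_PARAMS, DIFFUSION_STEPS)
amortized_cost = relative_cost(DIFFUSION_PARAMS * AMORTIZED_SCALE, 1)
cost_ratio = diffusion_cost / amortized_cost

print(f"diffusion generation costs {cost_ratio:.1f}x the larger one-shot net")
```

With these made-up numbers the amortized net is cheaper per generation despite being an OOM larger, which is the trade-off the paragraph describes.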

4.) Our assessment of LLM abilities is wrong and existing LLMs are just vastly superhuman and GPT-2 style models are actually at human parity. This seems strongly unlikely from actually interacting with these models, but on the other hand, even GPT-2 models possess a lot of arcane knowledge which is superhuman and it may be that the very powerful cognition of these small models is just smeared across such a wide range of weird internet data that it appears much weaker than us in any specific facet. Intuitively, this would be that a human and GPT-2 possess the same 'cognitive/linguistic power' but that since GPT-2's cognition is spread over a much wider data range than a human's, its 'linguistic power density' is lower and it therefore appears much less intelligent in the much smaller human-relevant domain in which we test it. I am highly unclear whether these concepts are actually correct or a useful frame through which to view things.

5.) Language models are highly inefficient and can be made much smaller without sacrificing much performance. For whatever reason, we may just be training language models badly or doing something else wrong and it is in fact possible to get 1 or 2 OOMs of parameter efficiency out of current language models. If this were true, it would be massive since it would shrink a GPT-4 level model into a trivially open-sourceable and highly hackable 'small' LLM. For instance, GPT-4 is unlikely to be more than 1 trillion dense parameters. Two orders of magnitude would shrink it to a 10B model, approximately the same size as LLaMA 13B and smaller than GPT-NeoX-20B, which would be straightforward to run inference on even with consumer-grade cards. There is some evidence for this in reasonably large amounts of pruning being possible, but to me it seems that an actual 2 OOM shrinking is unlikely.
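A quick sketch of the arithmetic in point 5: shrinking a hypothetical 1-trillion-parameter model by two orders of magnitude and estimating the memory needed just to hold the weights. The bytes-per-parameter figures are standard for these formats, but the estimate ignores activations and any cache, so it is a lower bound rather than a deployment guide.

```python
GPT4_PARAMS_UPPER = 1e12  # the post's rough upper bound for GPT-4 (dense)
SHRINK_FACTOR = 100       # two orders of magnitude
BYTES_PER_PARAM = {"fp16": 2, "int8": 1}

shrunk_params = GPT4_PARAMS_UPPER / SHRINK_FACTOR

# Memory for weights alone, in gigabytes (1 GB = 1e9 bytes).
weights_gb = {fmt: shrunk_params * nbytes / 1e9
              for fmt, nbytes in BYTES_PER_PARAM.items()}

for fmt, gb in weights_gb.items():
    print(f"{fmt}: {gb:.0f} GB of weights")
```

At 16-bit precision the weights of the shrunk model would fit within the 24 GB found on high-end consumer cards, which is consistent with the post's claim.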


jacob_cannell (3y):

Ahh this is extremely simple and obvious to those with deep wisdom from computer graphics - the closer you get to low level physics cellular automata the more effective parameter sharing is. Conceptualize the space of all physics approximation functions organized by spatiotemporal scale. At the most detailed finest scale physics is uniform across space and time so a small amount of code/params describes all of spacetime. As you advance up the approximation abstraction tree to larger temporal spatial scales code complexity and specificity increases with language models sitting at the top.

At the lowest levels of vision all the features are spatially invariant and thus the natural connection matrix is highly compressible through simple weight sharing, but the benefit diminishes exponentially as you advance up abstraction levels.

Elias Schmied (3y):

But this would apply to the visual cortex as well right? So it doesn't explain the discrepancy.

jacob_cannell (3y):

It does of course apply to the visual cortex, so I don't understand your comment. Essentially the answer is #2 in the OP's list. CNNs are like the visual cortex but highly compressed through weight sharing, which is easy for a von Neumann machine but isn't really feasible for a slow neuromorphic computer like the brain.

The OP is mistaken about visual transformers; they can also exploit parameter sharing, just in a different way.

beren (3y):

> The OP is mistaken about visual transformers; they can also exploit parameter sharing, just in a different way.

Can you expand on this? How do vision transformers exploit parameter sharing in a way that is not available to standard LLMs?

jacob_cannell (3y):

Consider a vision transformer - or more generally an RNN - which predicts the entire image at once (and thus has hidden states that are larger than the image due to depth and bottleneck layers etc). That obviously wouldn't exploit weight sharing at all, but is really the only option if you are running a transformer or RNN on an ultra-slow ultra-wide 100 Hz neuromorphic computer like the brain and have tight latency constraints.

But of course that isn't the only or most sensible option on a GPU. Instead you can use a much smaller transformer/RNN over a stream of image patches instead of the entire image at once, which then naturally exploits weight sharing very much like CNNs. Ultimately vision transformers and CNNs both map to matrix multiplication, which always involves weight sharing. The interesting flip consequence is that a brain-like architecture - a massive RNN - doesn't naturally map to matrix multiplication at all and thus can't easily exploit GPU acceleration.
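A toy sketch of the patch-stream idea described above: one small weight matrix is reused on every patch, so the parameter count does not grow with image size. The "image" here is a flattened list and the weights are arbitrary; nothing about this mirrors a real vision transformer beyond the sharing pattern. It assumes the image length is a multiple of the patch size.

```python
def split_into_patches(pixels, patch_size):
    """Cut a flattened image into consecutive, non-overlapping patches."""
    return [pixels[i:i + patch_size] for i in range(0, len(pixels), patch_size)]

def apply_linear(weights, vector):
    """Plain matrix-vector product: one output per row of weights."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def embed_patches(pixels, patch_size, weights):
    """Apply the SAME weight matrix to every patch (weight sharing)."""
    return [apply_linear(weights, patch)
            for patch in split_into_patches(pixels, patch_size)]

image = [1, 2, 3, 4, 5, 6, 7, 8]        # toy flattened image
weights = [[1, 0, 0, 0], [0, 0, 0, 1]]  # 2x4: keeps first and last pixel of a patch

embeddings = embed_patches(image, 4, weights)
shared_count = sum(len(row) for row in weights)  # parameters actually stored
unshared_count = shared_count * len(embeddings)  # if each patch had its own copy

print(embeddings, shared_count, unshared_count)
```

Doubling the image length doubles `unshared_count` but leaves `shared_count` unchanged, which is the compression a GPU gets for free and a fixed-wiring system would not.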

Elias Schmied (3y):

Oh, I see what you're saying now. Thanks for clarifying.

anonymousaisafety (3y):

I suspect it is a combination of #3 and #5.

Regarding #5 first, I personally think that language models are being trained wrong. We'll get OoM improvements when we stop randomizing the examples we show to models during training, and instead provide examples in a structured curriculum.

This isn't a new thought, e.g. https://arxiv.org/abs/2101.10382

To be clear, I'm not saying that we must present easy examples first and then harder examples later. While that is what has been studied in the literature, I think we'd actually get better behavior by trying to order examples on a spectrum of "generalizes well" to "very specific, does not generalize" and then training in that order. Sometimes this might be equivalent to "easy examples first", but that isn't necessarily true.
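One way to picture the ordering described above is a tiered curriculum: rank examples from "generalizes well" to "very specific", cut the ranking into tiers, and still shuffle inside each tier. The `generality` score is entirely hypothetical (the comment does not say how it would be computed), and the example texts and values are made up.

```python
import random
from dataclasses import dataclass

@dataclass
class Example:
    text: str
    generality: float  # hypothetical score: higher means "generalizes well"

def build_curriculum(examples, num_tiers, seed=0):
    """Split examples into tiers from most to least general, shuffled within a tier."""
    if not examples:
        return []
    ranked = sorted(examples, key=lambda e: e.generality, reverse=True)
    tier_size = -(-len(ranked) // num_tiers)  # ceiling division
    rng = random.Random(seed)
    tiers = []
    for start in range(0, len(ranked), tier_size):
        tier = ranked[start:start + tier_size]
        rng.shuffle(tier)  # ordinary randomization is kept inside each tier
        tiers.append(tier)
    return tiers

examples = [
    Example("modus ponens", 0.9),
    Example("addition with carries", 0.8),
    Example("rules of chess", 0.4),
    Example("1997 box scores", 0.1),
]
tiers = build_curriculum(examples, num_tiers=2)
for number, tier in enumerate(tiers, start=1):
    print(number, sorted(e.text for e in tier))
```

A training loop would then consume tier 1 before tier 2; whether that actually improves efficiency is the open question the comment raises.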

I recognize that the definitions of "easy" and "generalizes" are nebulous, so I'm going to try and explain the reasoning that led me here.

Consider the architecture of transformers and feed-forward neural networks (specifically not recurrent neural networks). We're given some input, and we produce some output. In a model like GPT, we're auto-regressive, so as we produce our outputs, those outputs become part of the input during the next step. Each step is fundamentally a function from the text so far to the next token.

Given some input, the total output can be thought as:

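    def reply_to(input):
        # Generate tokens one at a time until the model emits STOP.
        output = ""
        while True:
            # Each call sees the prompt plus everything generated so far.
            token = predict_next(input + output)
            if token == STOP:
                break
            output += token
        return output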

We'd like to know exactly what `predict_next` is doing, but unfortunately, the programmer who wrote it seems to have done their implementation entirely in matrix math and they didn't include any comments. In other words, it's deeply cursed and not terribly different from the output of Simulink's code generator.

    def predict_next(input):
        # ... matrix math
        return output

Let's try to think about the capabilities and constraints on this function.

1. There is no unbounded `loop` construct. The best we can do is approximate loops, e.g. by supporting an unrolled loop up to some bounded number of iterations. What determines the bounds? Probably the depth of the network?
2. If the programmer were sufficiently deranged, they could implement `predict_next` in such a way that if they've hit the bottom of their unrolled loop, they could rely on the fact that `predict_next` will be called again, and continue their previous calculations during the next call. What would be the limitations on this? Probably the size of each hidden layer. If you wanted to figure out if this is happening, you'd want to look for prompts where the network can answer the prompt correctly if it is allowed to generate text before the answer (e.g. step-by-step explanations) but is unable to do so if asked to provide the answer without any associated explanations.
3. How many total "instructions" can fit into this function? The size of the network seems like a decent guess. Unfortunately, the network conflates instructions and data, and the network must use all parameters available to it. This leads to trivial solutions where the network just over-fits to the data (analogous to baking in a lookup table on the stack). It's unsurprising that throwing OoM more data at a fixed size NN results in better generalization. Once you're unable to cheat with over-fitting you must learn algorithms that work more efficiently.
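The first two constraints can be illustrated with a toy in which each "call" may do only a bounded amount of work but can resume from intermediate results written out by earlier calls. `MAX_STEPS_PER_CALL` is a hypothetical stand-in for network depth, and the scratchpad stands in for generated text; this is an analogy, not a model of a transformer.

```python
MAX_STEPS_PER_CALL = 2  # hypothetical stand-in for the depth-limited unrolled loop

def bounded_step(numbers, scratchpad):
    """One 'forward pass': resume from the last scratchpad entry, then do
    at most MAX_STEPS_PER_CALL additions before returning."""
    index, total = scratchpad[-1] if scratchpad else (0, 0)
    for _ in range(MAX_STEPS_PER_CALL):
        if index == len(numbers):
            break
        total += numbers[index]
        index += 1
    return index, total

def solve_with_scratchpad(numbers):
    """Call the bounded function repeatedly, writing partial results 'out loud'."""
    scratchpad = []
    while True:
        index, total = bounded_step(numbers, scratchpad)
        scratchpad.append((index, total))
        if index == len(numbers):
            return total, len(scratchpad)

def solve_in_one_call(numbers):
    """No scratchpad: succeeds only if the work fits in a single bounded call."""
    index, total = bounded_step(numbers, [])
    return total if index == len(numbers) else None

print(solve_with_scratchpad([1, 2, 3, 4, 5]))
print(solve_in_one_call([1, 2, 3, 4, 5]))
```

The same bounded function gets the right sum when allowed to emit intermediate steps and fails when forced to answer in one shot, which is the signature the second constraint suggests looking for.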

The reason why I'm discussing the network in the language of instructions, stack space, and loops is because I disagree with a blanket statement like "scale is all you need". I think it's obvious that scaling the neural network is a patch on the first two constraints, and scaling the training data is a patch on the third constraint.

This is also why I think that point #3 is relevant. If GPT-3 does so well because it's using the sea of parameters for unrolled loops, then something like Stable Diffusion at 1/200th the size probably makes sense.

To tie this back to point #5:

1. We start with a giant corpus of data. On the order of "all written content available in digital form". We might generate additional data in an automated fashion, or digitize books, or caption videos.
2. We divide it into training data and test data.
3. We train the network on random examples from the training data, and then verify on random examples from the test data. For simplicity, I'm glossing over various training techniques like masking data or connections between nodes.
4. Then we fine-tune it, e.g. with Q&A examples.
5. And then generally we deploy it with some prompt engineering, e.g. prefixing queries with past transcript history, to fake a conversation.

At the end of this process, what do we have?

I want to emphasize that I do not think it is a "stochastic parrot". I think it is very obvious that the final system has internalized actual algorithms (or at least, pseudo-algorithms due to the limitation on loops) for various tasks, given the fact that the size of the data set is significantly larger than the size of the model. I think people who are surprised by the capabilities of these systems continue to assume it is "just" modeling likelihoods, when there was no actual requirement on that.

I also suspect we've wasted an enormous quantity of our parameters on embedding knowledge that does not directly contribute to the system's capabilities.

My hypothesis for how to fix this is vaguely similar to the idea of "maximizing divergence" discussed here https://ljvmiranda921.github.io/notebook/2022/08/02/splits/.

I think we could train a LLM on a minimal corpus to "teach" a language^[1] and then place that LLM inside of a larger system that we train to minimize loss on examples teaching logic, mathematics, and other components of reasoning. That larger system would distinguish between the weights for the algorithms it learns and the weights representing embedded knowledge. It would also have the capability to loop during the generation of an output. For comparison, think of the experiments being done with hooking up GPT-4 to a vector database, but now do that inside of the architecture instead of as a hack on top of the text prompts.

I think an architecture that cleanly separates embedded knowledge ("facts", "beliefs", "shards", etc) from the algorithms ("capabilities", "zero-shot learning") is core to designing a neural network that remains interpretable and alignable at scale.

If you read the previous paragraphs and think, "that sounds familiar", it's probably because I'm describing how we teach humans: first language, then reasoning, then specialization. A curriculum. We need language first because we want to be able to show examples, explain, and correct mistakes. Especially since we can automate content generation with existing LLMs to create the training corpus in these steps. Then we want to teach reasoning, starting with the most general forms of reasoning, and working into the most specific. Finally, we grade the system (not train!) on a corpus of specific knowledge-based activities. Think of this step as describing the rules of a made-up game, providing the current game state, and then asking for the optimal move. Except that for games, for poems, for math, for wood working, for engineering, etc. The whole point of general intelligence is that you can reason from first principles, so that's what we need to be grading the network on: minimizing loss with respect to arbitrarily many knowledge-based tasks that must be solved using the facts provided only during the test itself.

^[1] Is English the right language to teach? I think it would be funny if a constructed language actually found a use here.

quanticle (3y):

That's a fair criticism, but why would it apply to only language models? We also train visual models with a randomized curriculum, and we seem to get much better results. Why would randomization hurt training efficiency for language generation but not image generation?

anonymousaisafety (3y):

First, when we say "language model" and then we talk about the capabilities of that model for "standard question answering and factual recall tasks", I worry that we've accidentally moved the goal posts on what a "language model" is.

Originally, a language model was a stochastic parrot. They were developed to answer questions like "given these words, what comes next?" or "given this sentence, with this unreadable word, what is the most likely candidate?" or "what are the most common words?"^[1] It was not a problem that required deep learning.

Then, we applied deep learning to it, because the path of history so far has been to take straightforward algorithms, replace them with a neural network, and see what happens. From that, we got ... stochastic parrots! Randomizing the data makes perfect sense for that.

Then, we scaled it. And we scaled it more. And we scaled it more.

And now we've arrived at a thing we keep calling a "language model" due to history, but it isn't a stochastic parrot anymore.

Second, I'm not saying "don't randomize data", I'm saying "use a tiered approach to training". We would use all of the same techniques: randomization, masking, adversarial splits, etc. What we would not do is throw all of our data and all of our parameters into a single, monolithic model and expect that would be efficient.^[2] Instead, we'd first train a "minimal" LLM, then we'd use that LLM as a component within a larger NN, and we'd train that combined system (LLM + NN) on all of the test cases we care about for abstract reasoning / problem solving / planning / etc. It's that combined system that I think would end up being vastly more efficient than current language models, because I suspect the majority of language model parameters are being used for embedding trivia that doesn't contribute to the core capabilities we recognize as "general intelligence".

^[1] This wasn't for auto-complete, it was generally for things like automatic text transcription from images, audio, or videos. Spam detection was another use-case.

^[2] Recall that I'm trying to offer a hypothesis for why a system like GPT-3.5 takes so much training and has so many parameters and it still isn't "competent" in all of the ways that a human is competent. I think "it is being trained in an inefficient way" is a reasonable answer to that question.

quanticle (3y):

Okay, that's all fair, but it still doesn't answer my question. We don't do any of these things for diffusion models that output images, and yet these diffusion models manage to be much smaller than models that output words, while maintaining an even higher level of output quality. What is it about words that makes the task different?

Or are you suggesting that image generators could also be greatly improved by training minimal models, and then embedding those models within larger networks?

anonymousaisafety (3y):

> We don't do any of these things for diffusion models that output images, and yet these diffusion models manage to be much smaller than models that output words, while maintaining an even higher level of output quality. What is it about words that makes the task different?

I'm not sure that "even higher level of output quality" is actually true, but I recognize that it can be difficult to judge when an image generation model has succeeded. In particular, I think current image models are fairly bad at specifics in much the same way as early language models.

But I think the real problem is that we seem to still be stuck on "words". When I ask GPT-4 a logic question, and it produces a grammatically correct sentence that answers the logic puzzle correctly, only part of that is related to "words" -- the other part is a nebulous blob of reasoning.

I went all the way back to GPT-1 (117 million parameters) and tested next word prediction -- specifically, I gave a bunch of prompts, and I looked only at whether the very next word was what I would have expected. I think it's incredibly good at that! Probably better than most humans.

> Or are you suggesting that image generators could also be greatly improved by training minimal models, and then embedding those models within larger networks?

No, because this is already how image generators work. That's what I said in my first post when I noted the architectural differences between image generators and language models. An image generator, as a system, consists of multiple models. There is a text -> image space, and then an image space -> image. The text -> image space encoder is generally trained first, then it's normally frozen during the training of the image decoder.^[1] Meanwhile, the image decoder is trained on a straightforward task: "given this image, predict the noise that was added". In the actual system, that decoder is put into a loop to generate the final result. I'm requoting the relevant section of my first post below:

> The reason why I'm discussing the network in the language of instructions, stack space, and loops is because I disagree with a blanket statement like "scale is all you need". I think it's obvious that scaling the neural network is a patch on the first two constraints, and scaling the training data is a patch on the third constraint.
>
> This is also why I think that point #3 is relevant. If GPT-3 does so well because it's using the sea of parameters for unrolled loops, then something like Stable Diffusion at 1/200th the size probably makes sense.

^[1] Refer to figure 2 in https://cdn.openai.com/papers/dall-e-2.pdf. Or read this:

> The trick here is that they decoupled the encoding from training the diffusion model. That way, the autoencoder can be trained to get the best image representation and then downstream several diffusion models can be trained on the so-called latent representation

This is the idea that I'm saying could be applied to language models, or rather, to a thing that we want to demonstrate "general intelligence" in the form of reasoning / problem solving / Q&A / planning / etc. First train a LLM, then train a larger system with the LLM as a component within it.

[anonymous] (3y):

Might feel validated by this - https://arxiv.org/abs/2305.07759

Nanda Ale (3y):

Are people doing anything in LLMs like the classic StyleGAN training data bootstrapping pattern?

Start with bad data, train a bad model. It's bad but it's still good enough to rank your training data. Now you have better training data. Train a better model. The architecture is different of course, but is there anything analogous?

anonymousaisafety (3y):

Yes, it's my understanding that OpenAI did this for GPT-4. It's discussed in the system card PDF. They used early versions of GPT-4 to generate synthetic test data and also as an evaluator of GPT-4 responses.

Vladimir_Nesov (3y):

A thing that stands out more starkly to me than the number of parameters is that GPT-3 is able to produce human-like text by passing a merely 12K-dimensional vector along its residual stream, through merely 96 successive steps of elementary transformations (even if each transformation is coded by lots of parameters). No features that require more contemplation than that can be computed during inference, deliberative thought that's not written out in tokens needs to fit in there. The whole context needs to be comprehended within that limit, from the first word to the last one.
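The two numbers in this comment (a roughly 12K-wide residual stream and 96 layers) are enough to recover GPT-3's headline parameter count with the common approximation of about 12 x layers x width squared for a standard transformer stack (4 d^2 for attention plus 8 d^2 for a 4x-wide MLP per layer). The sketch ignores embeddings and biases, and the exact width of 12288 is GPT-3's published model dimension.

```python
N_LAYERS = 96    # "96 successive steps of elementary transformations"
D_MODEL = 12288  # the "12K-dimensional" residual stream

# Per layer: ~4*d^2 attention weights + ~8*d^2 MLP weights (4x expansion).
approx_params = 12 * N_LAYERS * D_MODEL ** 2

print(f"approximate parameters: {approx_params:,}")
print(f"in billions: {approx_params / 1e9:.0f}B")
```

So almost all of the 175B parameters sit in the per-layer transformations, while the state passed between them stays at about twelve thousand numbers per token.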

Perhaps GPT-4's advantage is in having enough layers to also think a bit more about what's going on and carry out more complicated plans without thinking out loud in tokens, capturing in the middle layers the features that a human brain would need multiple recurrent passes to formulate. But the low dimension is still shocking; it wasn't obvious with older NLP work that an amount of data of this order is enough to capture human-level cognition, since 1K-dimensional embeddings are also good for indexing/search and it wasn't clear how far up the nuance would need to go.

tailcalled (3y):

> 4.) Our assessment of LLM abilities is wrong and existing LLMs are just vastly superhuman and GPT-2 style models are actually at human parity. This seems strongly unlikely from actually interacting with these models, but on the other hand, even GPT-2 models possess a lot of arcane knowledge which is superhuman and it may be that the very powerful cognition of these small models is just smeared across such a wide range of weird internet data that it appears much weaker than us in any specific facet. Intuitively, this would be that a human and GPT-2 possess the same 'cognitive/linguistic power' but that since GPT-2's cognition is spread over a much wider data range than a human's, its 'linguistic power density' is lower and it therefore appears much less intelligent in the much smaller human-relevant domain in which we test it. I am highly unclear whether these concepts are actually correct or a useful frame through which to view things.

I think LLMs are great and plausibly superhuman at language, it's just that we don't want them to do language, we want them to do useful real-world tasks, and hijacking a language model to do useful real-world tasks is hilariously inefficient.

If you consider pure language tasks like "Here's some information in format X, please reshuffle it to the equivalent in format Y", then GPT-4 seems vastly superhuman. (I'm somewhat abusing terms here since the "language" task of reshuffling information is somewhat different than the "language" task of autoregressively predicting information, but I think they are probably way more closely related tasks than if you want to apply it to something useful? Idk.) Can't remember anything about how good GPT-2 was at this, not sure I even bothered to try it.

DragonGod (3y):

IIRC Redwood research investigated human performance on next token prediction, and humans were mostly worse than even small (by current standards) language models?

the gears to ascension (3y):

sounds right, where "worse" here means "higher bits per word at predicting an existing sentence", a very unnatural metric humans don't spend significant effort on.
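For readers unfamiliar with the metric, "bits per word" is the average surprisal of the words that actually occurred, given the probabilities the predictor assigned to them. The sketch below assumes you already have those probabilities (from a model or from a human guessing game); the sample values are made up.

```python
import math

def bits_per_token(assigned_probabilities):
    """Average surprisal in bits: mean of -log2(p) over the probabilities
    the predictor gave to the tokens that actually came next."""
    total_bits = sum(-math.log2(p) for p in assigned_probabilities)
    return total_bits / len(assigned_probabilities)

# Hypothetical probabilities assigned to three actual next words.
score = bits_per_token([0.5, 0.25, 0.125])
print(score)
```

Lower is better: a predictor that always put probability 1 on the right word would score 0 bits.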

jacob_cannell (3y):

That is actually a natural metric for the brain and close to what the linguistic cortex does internally. The comparison of having a human play a word prediction game and comparing logit scores of that to the native internal logit predictions of an LLM is kinda silly. The real comparison should be between a human playing that game and an LLM playing the exact same game in the exact same way (i.e. asking GPT verbally to predict the logit score of the next word/token), or you should compare internal low level transformer logit scores to linear readout models from brain neural probes/scans.

the gears to ascension (3y):

oh interesting point, yeah.

quanticle (3y):

> I think LLMs are great and plausibly superhuman at language

I think the problem might be that "language" encompasses a much broader variety of tasks than image generation. For example, generating poetry with a particular rhyming structure or meter seems to be a pretty "pure" language task, yet even GPT-4 struggles with it. Meanwhile, diffusion models with a small fraction of the parameter count of GPT-4 can output art in a dizzying variety of styles, from Raphael-like neoclassical realism to Picasso-like cubism.

baturinsky (3y):

Our brains were not trained for image generation (much). They were trained for converting a 2D image into an understanding of the situation, which AI still struggles with and needs the help of LLMs to get anywhere near good results.

DragonGod (3y):

> On the other hand, the tens of billions of tokens fed to LLMs is orders of magnitude beyond what humans could ever experience.

Nitpick, but one of the LLaMA versions was trained on 1.4T tokens and GPT-4 probably had an even larger training dataset.

"Trillions of tokens" feels more accurate for SOTA language models.

Brendan Long (3y):

I also think (4) is a big piece of this. GPT-4 isn't that good at programming, but it's shockingly good for someone with no long-term memory doing everything in their head with limited or no ability to self-reflect.

Bill Benzon (3y):

Try transforming the language task and the image task into the same format and comparing the two. It's easy to rasterize images so that they become transformed into strings of colored dots. For language, replace each token with a colored dot and do so uniformly across all texts. You have now transformed each text into a string of colored dots.

Now take, say, the first billion dots from your (transformed) image collection and the first billion dots from your (transformed) text collection. Which string has the higher entropy?
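A first pass at this comparison could use the empirical Shannon entropy of the symbol frequencies. This is only a crude proxy: it treats every dot as independent, so it ignores exactly the ordering structure that makes text and images different. A fuller version would need a context-aware estimate (for instance, compressed size). The two short strings are stand-ins for the billion-dot sequences.

```python
import math
from collections import Counter

def shannon_entropy_bits(symbols):
    """Empirical entropy in bits per symbol, from symbol frequencies alone."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Stand-ins for the two "strings of colored dots".
uniform_dots = "aabb"   # two colors, equally common
constant_dots = "aaaa"  # a single color

print(shannon_entropy_bits(uniform_dots))
print(shannon_entropy_bits(constant_dots) == 0)
```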

DragonGod (3y):

Thoughts:

1. Strongly agree re the unfair comparison
2. Don't know enough neuroscience or machine learning to have an opinion here
3. Same as above
4. Seems unlikely. Capabilities emerging at particular levels of scale (this phenomenon seems present even in GPT-3.5 -> GPT-4) makes me doubt that GPT-2 has any sort of human-level linguistic competence. That said, the task that GPT performs when modelling language (next token prediction) is not the same task that humans perform, and it seems likely that GPT-2 was already superhuman at that task? Alternatively, performance on downstream tasks is not an "ecological evaluation" of GPT.
5. LLaMA demonstrates that it is possible to trade off training data size against model size and attain performance gains even with models trained past the Chinchilla-optimal point; this also seems to reduce the inference compute cost
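The trade-off in the last point can be sketched by comparing tokens per parameter against the Chinchilla rule of thumb of roughly 20 tokens per parameter. The rule of thumb and the LLaMA training-set sizes below are quoted from memory of the respective papers and should be treated as assumptions to check, not authoritative figures.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # commonly cited rule of thumb (assumption)

# (parameters, training tokens); figures recalled from the LLaMA paper (assumption).
MODELS = {
    "LLaMA-7B": (7e9, 1.0e12),
    "LLaMA-65B": (65e9, 1.4e12),
}

def tokens_per_param(params, tokens):
    """How many training tokens the model saw per parameter."""
    return tokens / params

ratios = {name: tokens_per_param(p, t) for name, (p, t) in MODELS.items()}

for name, ratio in ratios.items():
    factor = ratio / CHINCHILLA_TOKENS_PER_PARAM
    print(f"{name}: {ratio:.0f} tokens/param, {factor:.1f}x the rule of thumb")
```

On these numbers the smallest model is trained far past the rule of thumb while the largest sits close to it, which matches the idea of spending extra training compute to get a smaller model that is cheaper at inference time.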

DragonGod (3y):

> Let's contrast this to the brain. Previously

The hyperlink for "Previously" is broken.

Qumeric (3y):

I am not sure how you came to the conclusion that current models are superhuman. I can visualize complex scenes in 3D for example. Especially under some drugs :)

And I don't even think I have an especially good imagination.

In general, it is very hard to compare mental imagery with Stable Diffusion. For example, it is hard to imagine something with many different details in different parts of the image, but it is perhaps a matter of representation. An analogy could be that our perception is like a low-resolution display. I can easily zoom in on any area and see the details.

I wouldn't say that current models are superhuman. Although I wouldn't claim humans are better either, it is just very unobvious how to compare it properly and probably there are a lot of potential pitfalls.

So 1) has a large role here.

In 2) CNNs are not a great example (as you mentioned yourself). Vision transformers demonstrate similar performance. It seems that inductive bias is relatively easy to learn for neural networks. I would guess it's similar for human brains too although I don't know much about neurobiology.

3) Doesn't seem like a good reason to me. There are modern GANs that demonstrate similar performance to diffusion models, also there are approaches which make diffusion work in a very small number of steps, even 1 step showed decent results IIRC. Also, even ImageGPT worked pretty well back in the day.

4) Similarly to the initial claim, I don't think much can be confidently said about LLM language abilities in comparison to humans. I do not know what exactly it means and how to measure it. We can do benchmarks, yes. Do they tell us anything deep? I don't think so. LLMs are very different kinds of intelligence, they can do many things humans can't and vice versa.

But at the same time, I wouldn't say that visual models strike me as much more capable given the same size/same amount of compute. They are quite stupid. They can't count. They can't do simple compositionality.

5) It is possible we will have much more efficient language models, but again, I don't think they are much more inefficient than visual models.

My two main reasons for the perceived efficiency difference:

1. It is super hard to compare with humans. We may do it completely wrong. I think we should aspire to avoid it unless absolutely necessary.
2. "Language ability" depends much more on understanding and having a complicated world model compared to "visual ability". We are not terribly disappointed when Stable Diffusion consistently draws three zombies when we ask for four and mostly forgive it for weird four-fingered hands sometimes growing from the wrong places. But when LLMs do similar nonsense, it is much more evident and hurts performance a lot (both on benchmarks and in the real world). LLMs can imitate style well, they have decent grammar. Larger ones like GPT-4 can even count decently well and probably do some reasoning. So the hard part (at least for our current deep learning methods) is the world model. Pattern matching is easy and not really important in the grand scheme of things. But it still looks kinda impressive when visual models do it.
