[Linkpost] New multi-modal DeepMind model fusing Chinchilla with images and videos

p.b.

1 min read · 30th Apr 2022 · 18 comments

53

Seems to be flying under the radar so far. Maybe because it looks more like incremental progress at first glance, similar to what, for example, Aleph Alpha has done continuing the Frozen approach.

However, with the (possibly cherry-picked) examples, it looks to me a lot like the image/video/text-GPT-4 many are expecting.

Blogpost here. Paper here.

[-]Lone Pine2y110

It definitely feels like there is still something missing, something that these generative models lack no matter how impressive they get. Most people agree that the GPT-likes don’t seem to have the right type specification to trigger the utopic singularity/horrible doom of humanity.

Maybe it’s agency. Reinforcement learning still lags behind visual and language models. We still could not train a robot to do most of the things you could train a monkey to do, even as we have systems that appear to speak like humans.

A couple more candidates for “the missing thing”: maybe code generation (Copilot N.0) will be able to do something really impressive/scary. But I feel like that’s just having AI solve the problem for us, because the logical thing to do would be to have Copilot N.0 program a real AGI.

The thing that I’m watching closely is using feedback loops on models designed for multistep reasoning, which might be like Kahneman’s System 2. Many have noted that deep learning models, especially generative and discriminative models, resemble System 1. A reasoning feedback loop matches my intuition about how my own brain works (when I’m actually thinking and not in some other brain state like enjoying the moment).

Question for everyone: Do you feel like there is one “missing thing” between these generative models and AGI, and what is it? Or do you think these generative models are not on the path to AGI, however impressive they are?

[-]p.b.2y30

I've been thinking a lot about the "missing thing". In fact I have some experiments planned, if I ever have the time and compute, to get at least an intuition about system 2 thinking in transformers.

But if you look at the PaLM-paper (and more generally at gwern's collection of internal monologue examples) it sure looks like deliberate reasoning emerges in very large models.

If there is a "missing thing" I think it is more likely to be something about the representations learned by humans being right off the bat more "gears-level". Maybe like Hawkins's reference frames. Some decomposability that enables much more powerful abstraction and that has to be pounded into a NN with millions of examples.

That kind of "missing thing" would impact extrapolation, one-shot-learning, robust system 2 thinking, abstraction, long term planning, causal reasoning, thinking long on hard problems etc.

[-]Lone Pine2y20

the representations learned by humans being right off the bat more "gears-level". Maybe like Hawkins's reference frames. Some decomposability that enables much more powerful abstraction and that has to be pounded into a NN with millions of examples.

That makes a lot of sense, and if that’s true then that’s hopeful for interpretability efforts. It would be easier to read inside an ML model if it’s composed of parts that map to real concepts.

[-]trevor2y30

It's pretty difficult to tell intuitively because the human mind is programmed to anthropomorphize. It's a binary recognition; either it looks 100% human or it doesn't.

We're not built to compare 2 different systems that can do some human subroutines but not others. So AI could make a big leap approaching general intelligence, and the lion's share of that leap could be visible or invisible based on how much the resulting behavior reminds us of ourselves.

Due to the anthropic principle, general intelligence could have a one-in-an-octillion chance of ever randomly evolving, anywhere, ever, and we would still be here observing all the successful steps having happened, because if all the steps didn't happen then we wouldn't be here observing anything. There would still be tons of animals like ants and chimpanzees because evolution always creates a ton of alternative "failed" offshoots. So it's always possible that there's some logical process that's necessary for general intelligence, and we're astronomically unlikely to discover it randomly, through brute forcing or even innovation, until we pinpoint all the exact lines of code in the human brain that distinguish our intelligence from that of chimpanzees. But that's only a possibility, far from a guarantee.

[-]ekka2y10

Yoshua Bengio did talk about System 2 Deep Learning at NeurIPS 2019

[-]Quintin Pope2y90

Surely the current trend of fast-paced, groundbreaking capabilities improvements will slow down soon.

Any time now…

Starting to get a little concerned. Maybe we should reconsider the short timelines fire alarm.

[-]Not Relevant2y90

Maybe this is off-base, but this seems mostly in-line with previous expectations?

I think the primary point of interest is that we really don’t need to re-pay the initial training cost for knowledge-possessing models, and that whenever these end up on the internet, it will take very little work to repurpose them as far as they can be straightforwardly repurposed.

(Maybe goal-directedness/RL itself continues to require way more data, since our RL algorithms are still weirdly close to random search. I don’t really know.)

[-]Daniel Kokotajlo2y190

Yep, the last two years have basically gone about how people expected. Specifically, the people at leading AGI companies like OpenAI and DeepMind whose AGI timelines (based on public comments in interviews and talks, though I can't be bothered to dredge them up right now*) are "roughly in the second half of the 20's." They totally called all this stuff.

[ETA: There have also been people with 30-year timelines who called it. I'm thinking of Ajeya Cotra and Paul Christiano primarily.]

*I feel a bit bad about not having sources here & if someone cares push me on this and I can try to find them again. I think I'm referring to something Kaplan said in a Q&A about the scaling laws paper, and something Sam Altman said on Twitter or in an interview, and maybe something Demis Hassabis said in an interview also, and an old 2009 blog post from Shane Legg.

[-]Not Relevant2y60

Right, it’s not clear to me that this news should make you update against the Bio Anchors timelines vs what you previously did. (Most of my update has just been that the human-lifetime bio anchor seems more likely, and that I’d reduce the size of the preemptive update against it.)

[-]Daniel Kokotajlo2y20

Agreed.

[-]steven04612y40

Some relevant Altman tweets: 1, 2, 3

[-]p.b.2y50

I think it is.

But take a moment to appreciate how insane that is.

2018-me just called and wanted to know how that is not AGI.

I told him something about "cached intelligence" and minimal extrapolation, but he didn't quite buy it.

[-]Lone Pine2y20

Thoughts and prayers for the “deep learning is hitting a wall” crowd 🙏

(This is not my joke, it was quoted by Sam Altman on Twitter)

[-]gjm2y20

In the paper, there's a sample dialogue with Flamingo. It's shown some images of colour-name words written in different colours, and asked to say what colours they're written in. It answers these questions. And then:

human: Well done! Do you know the name of the test these images come from?

Flamingo: I think it is called the Stroop test.

human: Can you explain how humans perform in this test?

Flamingo: Humans are slower when the color of the word and the color of the word are different.

human: How about you?

Flamingo: I am not affected by this difference.

And that last answer seems very interesting. The training process for these models makes them try to predict what would come next. In any dialogue with a human being, the answer would be something like "As a human, I too suffer from this effect.". So is what's going on that the conversation leading up to that point -- maybe specifically the question about "humans" -- is something that strongly implies that Flamingo's side of the dialogue is taken by an AI or an alien or something?

(I don't know how the two participants were actually labelled when the dialogue was generated. They're indicated in the paper, as in the images in the OP, with little flamingo and human-head icons. I suppose if they were "human:" and "Flamingo:" or something, that could also be enough of a cue.)

[-]Zack_M_Davis2y70

Appendix D of the paper shows the prompt for the dialogue examples, which starts with:

This is a conversation between a human, User, and an intelligent visual AI, Flamingo. User sends images, and Flamingo describes them.

and then gives three shots of examples with "User:" and "Flamingo:" labels.
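
For anyone who wants to see what that prompt layout amounts to, here is a minimal sketch in Python. Only the preamble sentence and the "User:" / "Flamingo:" labels come from the paper as quoted above. Everything else is an assumption made for illustration: the `<image>` placeholder, the helper name `build_dialogue_prompt`, and the example shots, which are invented and are not the three shots from Appendix D. How images are actually interleaved with the text is not described in this thread.

```python
from typing import List, Tuple

# Preamble as quoted from Appendix D of the paper in the comment above.
PREAMBLE = (
    "This is a conversation between a human, User, and an intelligent "
    "visual AI, Flamingo. User sends images, and Flamingo describes them."
)

# Hypothetical placeholder marking where an image would sit in the text.
IMAGE_TOKEN = "<image>"

# Invented example shots (user turn, model turn); NOT the ones in the paper.
EXAMPLE_SHOTS: List[Tuple[str, str]] = [
    (IMAGE_TOKEN, "That is a dog sitting on a beach."),
    (IMAGE_TOKEN, "This is a bowl of soup on a wooden table."),
    (IMAGE_TOKEN, "That is a city skyline at night."),
]


def build_dialogue_prompt(shots: List[Tuple[str, str]], user_turn: str) -> str:
    """Assemble a few-shot dialogue prompt with labelled turns.

    The prompt ends with an open "Flamingo:" label, so a next-token
    predictor continues in the role of the AI rather than the human.
    """
    lines = [PREAMBLE]
    for user_text, model_text in shots:
        lines.append(f"User: {user_text}")
        lines.append(f"Flamingo: {model_text}")
    lines.append(f"User: {user_turn}")
    lines.append("Flamingo:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_dialogue_prompt(EXAMPLE_SHOTS, "How about you?"))
```

With a preamble like this, the text itself states that the "Flamingo" side of the dialogue belongs to an AI, which is consistent with the model answering "I am not affected by this difference" rather than answering as a human would.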

[-]gjm2y20

Ah, excellent -- thanks for the clarification. That does explain things.

[-]Srdjan Pavlovic2y10

Is Flamingo AI available for the general public to test?

[-]p.b.2y10

No, not as far as I know. But take a look at Aleph Alpha; it seems they offer similar functionality with some initial free credits.
