My impression was that the thing that put DeepMind on the map was an AI that could play multiple Atari games. Lately there have been new Atari-playing AIs (both from DeepMind and other companies) making the news. Are they doing basically the same thing 2014 DeepMind was doing, but better? Are they doing a fundamentally different thing? Can someone explain the diff like I'm five?


I want to do a big, long, detailed explainer on the lineage of EfficientZero, which is fascinating, and the mutations it makes in that lineage. This is not that. But here's my attempt at a quick ELI5, or maybe ELI12.

There are two broad flavors of reinforcement learning -- where reinforcement learning is simply "learning to act in an environment to maximize a reward / learning to act to make a number go up."

Model-Free RL: This is the kind of algorithm you (sort of) run when you're keeping a bike upright.

When keeping a bike upright, you don't form an image of the physics of the bike, and how turning the handlebars a certain way will adjust the base of the bike relative to your center of gravity. Nor do you do this when learning to ride a bike. Instead, when learning, you do things, roughly guided by what you want to happen, and they cause you to fall off or not, in a very tight feedback loop, and eventually you learn to do the things that cause you not to fall off.

This kind of RL involves no detailed predictive model of the dynamics of the world. It instead learns, broadly, a kind of mapping from observations to actions that accomplishes your goal without understanding the dynamics of the world.
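To make "mapping from observations to actions, with no model" concrete, here's a minimal sketch of tabular Q-learning (the family the 2014 Atari paper belongs to) on a made-up toy environment I've invented for illustration -- a 5-state corridor, not anything from the papers discussed here:

```python
import random

# A toy, hypothetical environment: a 5-state corridor (states 0..4).
# Actions: 0 = left, 1 = right. Reward 1.0 only on reaching state 4.
def step(state, action):
    next_state = max(0, min(4, state + (1 if action == 1 else -1)))
    return next_state, (1.0 if next_state == 4 else 0.0), next_state == 4

# Model-free Q-learning: learn values of (state, action) pairs directly
# from experienced rewards. No model of step() is ever built or consulted.
Q = [[0.0, 0.0] for _ in range(5)]
alpha, gamma = 0.5, 0.9

for _ in range(500):
    s, done = 0, False
    while not done:
        a = random.randrange(2)  # explore with a random behavior policy
        s2, r, done = step(s, a)
        # Nudge Q(s, a) toward: reward + discounted value of best next action.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

# The learned greedy policy should be "go right" in every non-terminal state.
print([0 if Q[s][0] > Q[s][1] else 1 for s in range(4)])  # -> [1, 1, 1, 1]
```

The agent ends up with a lookup table saying "in this situation, this action is worth more" -- without ever being able to predict what the next state will be.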

Model-Based RL: This is the kind of algorithm you (sort of) run when programming fizzbuzz.

That is, you have a particular view of how code executes, and how the interpreter interacts with code that you write. You form an image of what kind of output you'd like, and think of what you need to do to make the interpreter give you this output. Learning to write fizzbuzz involves repeatedly refining your view of how code runs until you can just plan out how to write fizzbuzz using that model of how code runs.

So this kind of RL involves a detailed predictive model of the world. It learns such a model, uses it to generate hypothetical scenarios, and lets you plan toward the world you want by examining those hypothetical scenarios.
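Here's the same toy corridor environment as before, but handled the model-based way: first learn a model from experience, then plan by rolling out hypothetical futures inside that model. This is a deliberately crude sketch (the model is a memorized lookup table, the planner is brute-force enumeration), not how MuZero or EfficientZero actually do it:

```python
import itertools

def step(state, action):  # the "real world", hidden from the planner
    next_state = max(0, min(4, state + (1 if action == 1 else -1)))
    return next_state, (1.0 if next_state == 4 else 0.0), next_state == 4

# 1. Learn a model: memorize observed (s, a) -> (s', reward, done) transitions.
#    This toy world is deterministic, so one sample per pair suffices.
model = {(s, a): step(s, a) for s in range(5) for a in range(2)}

# 2. Plan: roll out every action sequence inside the learned model
#    (never touching the real environment) and keep the best first action.
def plan(state, horizon=6, gamma=0.9):
    best_return, best_first = -1.0, 0
    for seq in itertools.product(range(2), repeat=horizon):
        s, total = state, 0.0
        for t, a in enumerate(seq):
            s, r, done = model[(s, a)]
            total += (gamma ** t) * r
            if done:
                break
        if total > best_return:
            best_return, best_first = total, seq[0]
    return best_first

print([plan(s) for s in range(4)])  # -> [1, 1, 1, 1]
```

The key difference from the model-free version: all the "experience" used at decision time is imagined inside the learned model, which is why model-based methods can be so much more sample-efficient with respect to the real environment.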

(Note that in actual human activity, pretty much all action is a mix of the two -- this is a simplification. And indeed, many kinds of RL also mix the two.)

How does this relate to EfficientZero?

The original 2014 DeepMind Atari paper used exclusively model-free RL, and many subsequent high-profile papers did too. The Dota 2-playing agents from OpenAI used model-free RL; the AlphaStar StarCraft-playing agents from DeepMind used model-free RL; and even DeepMind's recent open-ended learning paper used model-free RL.

And here you or any reader is probably like -- doesn't model free RL leave out a biiiiig part of human cognition? Don't I... like... sometimes... plan ahead, and think about how the world is, rather than acting on a sophisticated, semi-socially generated, generalizing algorithm that simply maps circumstances to actions? Don't I form gears-level models of the world and act on them, at least sometimes?

And of course this is so. But -- for a long time -- our ability to make model-based algorithms work just wasn't there, or wasn't there as much. (Vast oversimplification -- this has been an active area of research for years.)

And part of the problem with these model-free algorithms is that they can take tons of data to learn, probably because they were trying to solve, in an entirely model-free way, problems humans solve in an at least partly model-based way. The StarCraft agents had, I think, something like the equivalent of two centuries of experience playing StarCraft -- their sample efficiency was extremely bad. And even with all this experience, they were rather brittle -- if you could get them into a sufficiently weird situation, one their mapping from observations to actions didn't cover, they started to act poorly, because they had no model of what was going on to fall back on.

EfficientZero comes from a lineage of research that started with model-based execution, but where initially the model wasn't learned. This lineage starts -- if you have to choose a place to start -- with AlphaGo.

AlphaGo was DeepMind's first Go-playing agent. It -- and its successor, AlphaZero -- uses a perfect model of the game it is playing to plan ahead. It uses neural-network magic to prune the search, but fundamentally it plans (and learns) with a perfect, unlearned model of the game it is playing. Such a model is of course easy to build, because Go is perfectly deterministic.
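To show what "planning with a perfect, unlearned model" means in the smallest possible setting, here's exhaustive game-tree search on the toy game of Nim. This is nothing like AlphaGo's actual Monte Carlo tree search or its neural networks -- it's the unpruned, brute-force version of the same idea, which only works because the game is tiny:

```python
from functools import lru_cache

# Toy game of Nim: players alternately take 1-3 stones; whoever takes
# the last stone wins. The rules ARE the model -- perfect and unlearned.
MOVES = (1, 2, 3)

@lru_cache(maxsize=None)
def best_value(stones):
    """Return +1 if the player to move wins with perfect play, else -1."""
    if stones == 0:
        return -1  # the previous player took the last stone and won
    # Because the model is exact, we know precisely where each move leads.
    return max(-best_value(stones - m) for m in MOVES if m <= stones)

def best_move(stones):
    return max((m for m in MOVES if m <= stones),
               key=lambda m: -best_value(stones - m))

print(best_move(10))  # -> 2 (leaving a multiple of 4 is a winning position)
```

AlphaGo's contribution was making this kind of planning tractable for Go, where the tree is astronomically too large to enumerate, by using learned networks to decide which branches are worth searching.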

MuZero is the big leap that followed AlphaGo -- if EfficientZero's lineage leads to AGI, MuZero might be the paper the historians point back to. It replaced the pre-built model of the environment in AlphaGo / AlphaZero with a learned model. This allowed the same algorithm to play both Go and Atari, which was unprecedented. And the algorithm itself is (in my opinion) pretty elegant.

EfficientZero (the new paper) is a further reasonable extension of MuZero. It uses some tricks for learning a better model of the environment fast, and also for making the model of the environment more useful for long-term planning. These are interesting additions, and I want to cover them in depth, but it's important to know that the additions aren't anything fundamentally new -- they're more of the same kind of thing, not a Copernican revolution.

But the upshot (finally) is that model-based RL can now do better (on average) than a human on the Atari testbed after only 2 hours of real-world play time. That is incredibly fast for an RL agent. This sample efficiency is better than any model-free agent on the Atari testbed, despite the fact that (I think) much more research effort has gone into those model-free agents.

And it's with an algorithm that people fundamentally expect to do better at difficult planning tasks than any model-free algorithm. I.e., we have priors that actual intelligence probably uses something vaguely like this. (There's also reason to think, imho, that existing implementations of exploration techniques -- i.e., curiosity -- would mesh really well with this algorithm, which is alarming.)

If in the next 18 months someone tries applying something like EfficientZero to a big RL problem like StarCraft, or Minecraft, or the DeepMind open-ended learning tasks, and finds an increase in sample efficiency similar to the one observed in Atari, then that's another blow against long timelines for AGI. If I worked at OpenAI or DeepMind, I know that's exactly what I'd be doing.

TLDR: Reinforcement learning algorithms often learn realllly slowly, in terms of needing a lot of experience, or learn really stupidly, in terms of relying on model-free RL. EfficientZero learns quickly and with model-based RL. Will it scale? If so, long-term AGI projections take another hit.


I'm looking forward to that big, long, detailed explainer :)


Thanks! This was super helpful.

I'm not sure the level of sophistication you want, but here's an answer:

Performance on the games is much better, and the amount of game time it takes the AI to reach a given level of performance is much lower. Yet fundamentally they are doing the same thing: solving a Markov Decision Process (abbreviated MDP) using reinforcement learning and deep learning. Pretty much any problem where there's anything resembling an "environment state" in which you make "decisions" can be modelled as an MDP. Small MDPs can be solved exactly with dynamic programming.

For instance, if you have a 10 by 10 grid with different "rewards" placed at each grid location and a player moving around on the grid collecting rewards, you can solve exactly for the path the player should take. That is because the environment is small and has only 100 states, so it's easy to just store a table with 100 values and rules like "at location (4,3), go right with probability 0.9". The real world is much larger, so you can't run the algorithm for solving the MDP exactly; what you need are approximations of this exact algorithm.
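The grid example above can be solved with value iteration, the classic dynamic-programming algorithm for small MDPs. Here's a minimal sketch; the grid size matches the example, but the specific reward placement (a single reward at cell (9, 9)) and deterministic moves are my own simplifying assumptions:

```python
import itertools

SIZE, GAMMA = 10, 0.9
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(cell, a):
    # Deterministic moves, clipped at the grid edges; hypothetical
    # reward of 1.0 for landing on cell (9, 9), 0.0 elsewhere.
    r = max(0, min(SIZE - 1, cell[0] + a[0]))
    c = max(0, min(SIZE - 1, cell[1] + a[1]))
    return (r, c), (1.0 if (r, c) == (9, 9) else 0.0)

# Value iteration: repeatedly back up each cell's value from its neighbors
# until the 100-entry value table converges to the exact solution.
V = {cell: 0.0 for cell in itertools.product(range(SIZE), repeat=2)}
for _ in range(100):
    V = {cell: max(rew + GAMMA * V[nxt]
                   for nxt, rew in (step(cell, a) for a in ACTIONS))
         for cell in V}

# Read off the exact optimal policy: in each cell, the best one-step backup.
def greedy(cell):
    return max(ACTIONS, key=lambda a: step(cell, a)[1] + GAMMA * V[step(cell, a)[0]])

print(greedy((0, 0)))  # moves down or right, toward the reward at (9, 9)
```

With 100 states this table fits in memory and the sweep converges quickly; with the state space of an Atari screen (let alone the real world) the table is impossibly large, which is exactly why the deep RL approximations the answer describes are needed.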

The whole field of deep reinforcement learning right now is basically about making better and better approximations of this exact algorithm that work in the real world. In this sense, almost everything DeepMind has been doing is the same: building better approximations of the same underlying algorithm. But the approximations have gotten way, way better; lots of tricks and heuristics were developed, and people now understand how to make the algorithms run consistently (at the beginning there was a lot of "you need this particular random seed to make it work").


Could you be concrete about which papers you consider newer, and maybe also link to the original deep-Q paper you have in mind? (This might help someone answer the question.)

Sorry, was being kinda lazy and hoping someone had already thought about this.

This was the newer Deepmind one:

I was motivated to post by this algorithm from China I heard about today:

I think this is the older deepmind paper:

The first thing you mention does not learn to play Atari, and is in general trained quite differently from Atari-playing AIs (it relies on self-play to automatically generate a curriculum of harder and harder tasks, at least for some of the more competitive tasks in XLand).