Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Perhaps these can be thought of as homework questions -- when I imagine us successfully making AI go well, I imagine us building expertise such that we can answer these questions quickly and easily. Before I read the answers I'm going to think for 10min or so about each one and post my own guesses.

Useful links / background reading: The glorious EfficientZero: How it Works. Related comment. EfficientZero GitHub. LW discussion.

Some of these questions are about EfficientZero, the net trained recently; others are about EfficientZero the architecture, imagined to be suitably scaled up to AGI levels. "If we made a much bigger and longer-trained version of this (with suitable training environment) such that it was superhuman AGI..."

  1. EfficientZero vs. reward hacking and inner alignment failure:
    1. Barring inner alignment failure, it’ll eventually reward hack, right? That is, if it gets sufficiently knowledgeable and capable, it’ll realize that it can get loads of reward by hacking its reward channel, and then its core algorithm would evaluate that action/plan highly and do it. Right?
    2. But fortunately (?) maybe there would be an inner alignment failure and the part of it that predicts reward would predict low reward from that action, even in the limit of knowledge and capability? Because it’s learned to predict proxies for reward rather than reward itself, and continued to do so even as it got smarter and more capable? (Would this happen? Why would this not be corrected by further training? Has some sort of proxy crystallization set in? How?)
  2. EfficientZero approximates evidential decision theory, right?
  3. EfficientZero is a consequentialist (in the sense defined here) architecture, right? It’s not, for example, updateless or deontological. For example, it has no deontological constraints except by accident (i.e. if its predictor-net mistakenly predicted super low reward for actions of type X, always, even in cases where actually a reasonable intelligent predictor would predict high reward.) Right?
  4. What is the most complex environment AIs in the family of MuZero, EfficientZero, etc. have been trained on? Is it just some Atari game?
  5. Roughly how many parameters does EfficientZero have? If you don’t know, what about MuZero? What about the biggest net to date from that general family? The EfficientZero paper doesn't give a direct answer but it describes the architecture in enough detail that you might be able to calculate it...
  6. If we kept scaling up EfficientZero by OOMs in every way, what would happen? Would it eventually get to agenty AGI? / APS-AI? After all, it seems pretty sample-efficient already. What if its sample was an entire lifetime of diverse experiences?



  1. a/b Does EfficientZero reward hack? Not necessarily. If by 'reward hacking' you mean some action in the environment that directly stimulates the true reward signal (equivalent to perhaps sex/food/fame/power/etc. in humans), then potentially yes: it would consider plans (up to the limits of its admittedly limited planning horizon) and execute those that maximize reward, which in this case is indistinguishable from reward 'hacking'. But if by hacking you mean some impaction of physics into its mind that causes value prediction errors inside the planning rollout evaluation (more like, say, dopamine drugs in humans), then its policy is likely more human-like. It will avoid reward hacking traps to the extent it has a good model of how they are in fact hacks that don't lead to true reward, but could fall into a reward hacking trap otherwise.
  2. EfficientZero approximates evidential decision theory, right? Hmm not sure yet.
  3. EfficientZero is a consequentialist? (I haven't read the linked article yet.) Consequentialist vs. deontological is a poor philosopher's map of the temporal continuum of the value function (reward prediction). Model-free is basically constrained to near-sighted reward prediction and is thus 'deontological'. This is model-based, so more 'consequentialist', but again it's continuous.
  4. What is the most complex environment . . trained on? Publicly? Atari. This is a brand new paper. Internally within DeepMind? I'm betting robotics and more complex games. With more tweaks and scaling up this could probably do very well in StarCraft, but DM has already burned that publicity stunt with a brittle model-free false champion.
  5. Roughly how many parameters does EfficientZero have? They don't list it directly in the paper, but I'm guessing between 1M and 10M based on glancing over the architecture details in the appendix; it's pretty tiny.

"Generally, compared with MuZero [31], we reduce the number of residual blocks and the number of planes as we find that there is no capability issue caused by much smaller networks in our EfficientZero with limited data. In another word, such a tiny network can acquire good performance in the limited setting."

6. If we kept scaling up EfficientZero by OOMs in every way, what would happen? Would it eventually get to agenty AGI? 

This model is in a radically different region of mindspace than animals or most DL agents. It's definitely already consequentialist-agenty, as it's full model-free planning. But it has just a tiny insect-sized brain that's just big enough to learn one Atari game, and then it does super expensive, superhuman quantities of MCTS planning rollouts with that tiny predictive world model. So as a first guess, I'd model its planning expense multiplier as 4 GPUs / atari_cost, or > 10^8. Simulating a good approximate model of the real world would probably then require a model that's about 10^8 larger (current GPU flops / Atari flops), assuming current GPUs have achieved photorealism, which they have. On top of that, its planning is short-horizon and inefficient compared to humans (I have ideas how to fix that, not going to post here obviously just in case). This is a demonstrator; there is more work to do to get on a better scaling curve. Humans don't evaluate huge branching planning spaces explicitly MCTS-style; it doesn't scale well with increasing world complexity.
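
As a rough illustration of where a figure like 10^8 could come from, here is a minimal back-of-envelope sketch. Both inputs are order-of-magnitude assumptions borrowed from elsewhere in this thread (about 10^14 FLOP/s for the 4-GPU setup and about 10^6 FLOP/s to simulate Atari); neither is a measured value, and the function name is just illustrative.

```python
import math

# Assumed order-of-magnitude figures from this thread, not measurements.
GPU_SETUP_FLOPS = 1e14  # assumed compute of the 4-GPU training setup, FLOP/s
ATARI_SIM_FLOPS = 1e6   # assumed cost of simulating Atari itself, FLOP/s


def planning_multiplier(agent_flops: float, env_flops: float) -> float:
    """Ratio of the agent's compute to the compute needed to run the environment."""
    return agent_flops / env_flops


multiplier = planning_multiplier(GPU_SETUP_FLOPS, ATARI_SIM_FLOPS)
print(f"planning expense multiplier ~ 10^{math.log10(multiplier):.0f}")
```

Changing either assumption by an order of magnitude shifts the multiplier by the same amount, so the result should be read as a rough scale and nothing more precise.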




I didn't answer question 2 yet because I'm not very familiar with evidential decision theory and didn't have time to read up on it yet.
EDT seems to mean something different every time someone writes an article to prop it up against the argument, made from its infancy, that it recommends trying to change the world by managing the news you receive of it. When the decider neither acts on the world nor is acted on by it (save for having somehow acquired knowledge about the world), there is only maximisation of expected utility, and no distinction between causal, evidential, or any other decision theory. When the decider is embedded in the world, this is called "naturalized decision theory", but no-one has a mature example of one. In the special case where the decider can accurately read the world and act on it, but the world has neither read nor write access to the decision-making process, and no agency in what the decider can know, CDT is correct. In the special case where the decider does not decide at all, but is a passive process that can do nothing but observe its own actions, then EDT is correct. CDT two-boxes on Newcomb and smokes in the Smoking Lesion problem. EDT (as originally formulated) one-boxes and abstains. On LessWrong, I believe the general consensus, or at least, Eliezer's belief, is to one-box and smoke, ruling out both theories. But there is at least one author (see Eells (1981, 1982) discussed here) arguing that EDT (or his formulation of it) two-boxes on Newcomb. Recalling a Zen koan, a CDT agent is not subject to causation and an EDT agent is subject to causation, but a naturalized decision theory must be one with causation.

Minor note: I think you meant that it does model-based planning -- this is what the graph search means. Also see the paper

" We propose a sample efficient model-based visual RL algorithm built on MuZero, which we name EfficientZero." 

Daniel Kokotajlo


OK, here are my guesses, without seeing anyone else's answers. I think I'm probably wrong, which is why I'm asking this question:

1.a. Underdetermined? It depends on what we mean by the outer objective, and what we mean when we assume it has no inner alignment problems? See e.g. this discussion. That said, yeah it totally seems possible. If the part that predicts reward gets good at generalizing, it should be able to reason/infer/guess that hacking the reward function would yield tons of reward. And then that's what the agent would do.

1.b. Yes? Even though I usually think of inner alignment failures in the context of black-box neural nets and this is a bit more transparent (because of the tree search at the heart) I think the usual arguments would apply?

2. Yes? My limited, probably faulty understanding is that the policy is trained to approximate what you get when you search the tree of possible actions and predicted consequences some depth and then look at how good or bad the resulting states seem to be. There's nothing in there about doing causal interventions or do-calculus or whatnot; so I assume that the predicted-consequences network would approximate conditionalization... however Caspar Oesterheld has a paper arguing I think that under some conditions neural nets approximate CDT in the limit, so *shrugs*. I should go read it again and see if it applies.

3. Yes? Again, if it's approximating what you get when you search the tree and pick the action that leads to the best predicted state, that sure seems pretty consequentialist. If initially there was some "never lie" heuristic, then wouldn't it disappear quickly once you encountered situations where lying led to a better predicted outcome? Sure, you could hack it by always predicting bad outcomes from lying, but that contradicts our "sufficiently knowledgeable and capable" hypothetical.

4. I don't know but from what I've seen the answer is yeah, just Atari and stuff like that.

5. I don't know but it seems EfficientZero took 7 hours on 4 GPUs to do one 2-subjective-hour training run. That gives us a sense of how many FLOPS it must run at (maybe 5 x 10^14 operations per subjective second?), but presumably it has many operations per parameter... I guess it would have made news if it had parameter count comparable to large language models, so maybe it has something like 10^8 parameters? But surely they aren't doing a million ops per parameter per subjective second? Gosh I really am just guessing.

6. I see no obstacle, it seems like a quantitative question of how much bigger we'd need to make it, how much longer we'd need to train it, and maybe what sort of data we'd have to give it. Qualitatively it seems like this would probably keep scaling to superintelligence if we poured enough into it (but this is probably true for lots of architectures, and doesn't mean much in practice since we don't have galaxy-sized computers.)
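
The guess in item 5 above can be roughly checked with a short calculation. This is only a sketch: the 7 wall-clock hours and 2 subjective hours come from the text, while the combined throughput of 10^14 FLOP/s for 4 GPUs and the 10^8 parameter count are assumptions (the latter is the guess made in item 5, not a known figure).

```python
# Figures from the text.
WALL_CLOCK_HOURS = 7.0   # training time on 4 GPUs
SUBJECTIVE_HOURS = 2.0   # game time experienced by the agent

# Assumptions, not known values.
TOTAL_GPU_FLOPS = 1e14   # assumed combined throughput of the 4 GPUs, FLOP/s
ASSUMED_PARAMS = 1e8     # the parameter-count guess from item 5

SECONDS_PER_HOUR = 3600.0


def ops_per_subjective_second(flops: float, wall_hours: float, subjective_hours: float) -> float:
    """Total operations in the run divided by seconds of subjective experience."""
    total_ops = flops * wall_hours * SECONDS_PER_HOUR
    return total_ops / (subjective_hours * SECONDS_PER_HOUR)


ops = ops_per_subjective_second(TOTAL_GPU_FLOPS, WALL_CLOCK_HOURS, SUBJECTIVE_HOURS)
ops_per_param = ops / ASSUMED_PARAMS

print(f"{ops:.1e} operations per subjective second")
print(f"{ops_per_param:.1e} operations per parameter per subjective second")
```

Under these assumptions the estimate lands at roughly 3.5 x 10^14 operations per subjective second, the same order of magnitude as the guess above, and a few million operations per parameter per subjective second if the 10^8 parameter guess were right.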

Just real quick

5. So at a high level we know exactly how many flops it takes to simulate Atari: it's about 10^6 flop/s vs perhaps 10^12 flop/s for typical games today (with the full potential of modern GPUs at 10^14 flop/s, similar to reality). So you (and by you I mean DM) can actually directly compare, using knowledge of Atari, circuits, or the sim code, the computational cost of the learned Atari predictive model inside the agent vs the simulation cost of the (now defunct) actual Atari circuit. There isn't much uncertainty in that calculation - bot...


it seems pretty sample-efficient already

Maybe I'm confused, but I'm not ready to take that for granted. I think it's a bit subtle.

Let's talk about chess. Suppose I start out knowing nothing about chess, and have a computer program that I can play that enforces the rules of chess, declares who wins, etc.

  • I play the computer program for 15 minutes, until I'm quite confident that I know all the rules of chess.
  • …Then I spend 8000 years sitting on a chair, mentally playing chess against myself.

If I understand correctly, "EfficientZero is really good at Atari after playing for only 15 minutes" is true in the same sense that "I am good at chess after playing for only 15 minutes" is true in the above scenario. The 8000 years doesn't count as samples because it was in my head.

Someone can correct me if I'm wrong…

(Even if that's right, it doesn't mean that EfficientZero isn't an important technological advance—after all, there are domains like robotics where simulation-in-your-head is much cheaper than querying the environment. But it would maybe deflate some of the broader conclusions that people are drawing from EfficientZero, like how AI compares to human brains…)


It's a little bit less dramatic than that: the model-based simulation playing is interleaved with the ground-truth environment. It's more like you spend a year playing games in your head, then you play one 30s bullet chess match with Magnus Carlsen (made-up ratio), then go back to playing in your head for another year. Or maybe we should say, "you clone yourself a thousand times, and play yourself at correspondence chess timescales for 1 game per pair in a training montage, and then go back for a rematch".

(The scenario where you play for 15 minutes at the beginning, and then pass a few subjective eons trying to master the game with no further real data, would correspond to sample-efficient "offline reinforcement learning" (review). Very important in its own right, of course, but it poses challenges of its own related to the quality of those 15 minutes: what if your 15-minute sample doesn't include any games which happen to use en passant, say? How could you ever learn that it should be part of your model of chess? When you interleave and learn online, Carlsen might surprise you with an en passant a few games in, and then all your subsequent imagined games can include that possible move and you can use it yourself.)

But it would maybe deflate some of the broader conclusions that people are drawing from EfficientZero, like how AI compares to human brains…

I don't think it does. If humans can't learn efficiently by imagining hypothetical games like machines can, so much the worse for them. The goal is to win.

Like Gwern said, the goal is to win. If EfficientZero gets superhuman data-efficiency by "cheating," well it still got superhuman data-efficiency...

I think a relevant comparison here would be total cost. EfficientZero took 7 hours on 4 GPUs to master each particular Atari game. Equipment probably cost around $10,000 last I checked. How long is the lifespan of a GPU? Two years? OK, so that's something like 20,000 hours of 4 GPUs' time for $10K, so $0.50 for one hour, so $3.50 for the training run? Eh, maybe it's a bit more than that due to energy costs or something. But still it seems probable that it beats human wages.
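
The cost arithmetic above can be written out as a short sketch. The $10,000 equipment cost, two-year lifespan, and 7-hour run are the figures guessed in the comment; energy and other overheads are ignored, and the function name is just illustrative.

```python
# Figures guessed in the comment above; all are rough assumptions.
EQUIPMENT_COST_USD = 10_000.0
TRAINING_RUN_HOURS = 7.0

ROUNDED_LIFESPAN_HOURS = 20_000.0       # the comment's round number for ~2 years
EXACT_LIFESPAN_HOURS = 2 * 365 * 24.0   # two years of continuous use


def training_run_cost(equipment_cost: float, lifespan_hours: float, run_hours: float) -> float:
    """Amortized hardware cost of one training run, ignoring energy and overheads."""
    hourly_rate = equipment_cost / lifespan_hours
    return hourly_rate * run_hours


rounded_cost = training_run_cost(EQUIPMENT_COST_USD, ROUNDED_LIFESPAN_HOURS, TRAINING_RUN_HOURS)
exact_cost = training_run_cost(EQUIPMENT_COST_USD, EXACT_LIFESPAN_HOURS, TRAINING_RUN_HOURS)

print(f"with the rounded lifespan: ${rounded_cost:.2f} per run")
print(f"with exactly two years:    ${exact_cost:.2f} per run")
```

Either way the amortized hardware cost comes out at a few dollars per run, so the conclusion does not hinge on the rounding of the lifespan.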

For computational expenditure cost... I think 4 GPUs would be doing something like 10^14 FLOPS, which is one OOM less than the human brain?

All that being said, I do take your point that it matters if EfficientZero is getting results via an importantly different method than the human method, because that should give us a shred of hope that what EfficientZero is doing, while equally viable for this task, would be less viable for really important or dangerous tasks. However it could also be more viable. (For example, maybe the human method is "draw upon decades of pre-training and background knowledge. Also benefit from the fact that the game was literally designed for humans." Plausibly, that's our secret sauce. In which case EfficientZero is just... qualitatively smarter than us in some sense. Fewer dollars, fewer FLOPs, etc. Compensates for the lack of secret sauce by thinking really hard about its experiences, in a way we simply can't because our brains aren't fast enough.)

Hmm, I think my comment came across as setting up a horse-race between EfficientZero and human brains, in a way that I didn't intend. Sorry for bad choice of words. In particular, when I wrote "how AI compares to human brains", I meant in the sense of "In what ways are they similar vs different? What are their relative strengths and weaknesses? Etc.", but I guess it sounded like I was saying "human brain algorithms are better and EfficientZero is worse". Sorry.

I could write a "human brain algorithms are fundamentally more powerful than EfficientZero" argument, but I wasn't trying to, and such an argument sure as heck wouldn't fit in a comment. :-)

If EfficientZero gets superhuman data-efficiency by "cheating," well it still got superhuman data-efficiency...

Sure. If Atari sample efficiency is what we ultimately care about, then the results speak for themselves. For my part, I was using sample efficiency as a hint about other topics that are not themselves sample efficiency. For example, I think that if somebody wants to understand AlphaZero, the fact that it trained on 40,000,000 games of self-play is a highly relevant and interesting datapoint. Suppose you were to then say "…but of those 40,000,000 games, fundamentally it really only needed 100 games with the external simulator to learn the rules. The other 39,999,900 games might as well have been 'in its head'. This was proven in follow-up work.". I would reply: "Oh. OK. That's interesting too. But I still care about the 40,000,000 number. I still see that number as a very important part of understanding the nature of AlphaZero and similar systems."

(I'm not sure we're disagreeing about anything…)


One thing to note is that you don't know how many games humans are playing in their head in some sense. We don't have access to that kind of information about our own algorithms. Even if you think we don't play any because we don't consciously experience or remember them, that's obviously wrong. Every time you have a thought pop out of nowhere or a eureka! moment from the incubation effect, or every time you have a Tetris effect dream (or all the experience-replay hippocampus neuroscience), you see how it feels to have powerful subconscious algorithms churning away on difficult problems without you having any awareness of it: nothing. But they still take wallclock years to reach levels of performance that something like AlphaZero does in hours...

That's an interesting thought. My hunch is that hippocampal replay can't happen unconsciously because if the hippocampus broadcasts a memory at all, it broadcasts it broadly to the cortex including GNW. That's just my current opinion, I'm not sure if there's neuroscience consensus on that question.

Here I'm sneaking in an assumption that "activity in the GNW" = "activity that you're conscious of". Edge-cases include times when there's stuff happening in the GNW, but it's not remembered after the fact (at least, not as a first-person episodic memory). Are you "conscious" during a dream that you forget afterwards? Are you "conscious" when you're 'blacked out' from drinking too much? I guess I'd say "yes" to both, but that's a philosophy question, or maybe just terminology.

If we want more reasons that human-vs-EfficientZero comparisons are not straightforward, there's also the obvious fact that humans benefit from transfer-learning whereas EfficientZero starts with random weights.

It's EfficientZero; EfficientNet is an entirely different model architecture in computer vision.

Oops, thanks, just fixed it.

AZ was a population of agents reinventing chess/go from scratch, so on a #game comparison basis is more similar to all human expert/pro games over history.

But yeah, EZ is basically a superspeed GPU Atari simulator combined with MCTS.

A single 3090 can do 250 TF ideally with tensor cores, and an A100 is 2x that, so 4 GPUs is > 10^15 flops theoretical. I'd also argue the brain is closer to 10^14, but this comparison is all kinda muck because they are so different. And as of today the GPU only hits those numbers on big dense matrix codes, whereas the brain is more fully sparse, so that's probably another 2 to 4 OOM advantage for the brain.
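
The aggregate throughput claim can be sketched as follows. The per-GPU figures are the ideal tensor-core numbers quoted in the comment, taken here as assumptions without checking them against vendor specifications.

```python
# Per-GPU figures as quoted in the comment; assumed, not verified.
RTX_3090_FLOPS = 250e12            # "250 TF ideally with tensor cores"
A100_FLOPS = 2 * RTX_3090_FLOPS    # "A100 is 2x that"
NUM_GPUS = 4


def total_flops(per_gpu_flops: float, num_gpus: int) -> float:
    """Theoretical peak throughput of num_gpus identical GPUs."""
    return per_gpu_flops * num_gpus


total_3090 = total_flops(RTX_3090_FLOPS, NUM_GPUS)
total_a100 = total_flops(A100_FLOPS, NUM_GPUS)

print(f"4 x 3090: {total_3090:.1e} FLOP/s")
print(f"4 x A100: {total_a100:.1e} FLOP/s")
```

On these quoted numbers, four 3090s sit right at 10^15 and four A100s at about twice that, so "> 10^15" holds for the A100 case and is borderline for the 3090 case.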

Yes, thinking hard and fast in small simulated subspaces is an AGI/SIM superpower - related old post. But it's still technically quantitative?