SoerenMind's Comments

[AN #76]: How dataset size affects robustness, and benchmarking safe exploration by measuring constraint violations

Potential paper from DM/Stanford for a future newsletter:

It addresses the problem that an RL agent will delude itself by finding loopholes in a learned reward function.

Strategic implications of AIs' ability to coordinate at low cost, for example by merging

Also interesting to see that all of these groups were able to coordinate to the disadvantage of less coordinated groups, but not able to reach peace among themselves.

One explanation might be that the more coordinated groups also have harder coordination problems to solve because their world is bigger and more complicated. Might be the same with AI?

Seeking Power is Provably Instrumentally Convergent in MDPs

If X is "number of paperclips" and Y is something arbitrary that nobody optimizes, such as the ratio of number of bicycles on the moon to flying horses, optimizing X should be equally likely to increase or decrease Y in expectation. Otherwise "1-Y" would go in the opposite direction which can't be true by symmetry. But if Y is something like "number of happy people", Y will probably decrease because the world is already set up to keep Y up and a misaligned agent could disturb that state.
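A rough way to write out the symmetry argument above, as a sketch rather than a proof. The notation is my own: ΔY is the change in Y caused by optimizing X, and the key assumption (not stated in the comment) is that an "arbitrary" Y comes from a distribution that looks the same after relabeling Y as 1-Y.

```latex
% Assumption (introduced here for illustration): "arbitrary" Y is drawn from
% a distribution that is unchanged under the relabeling Y -> 1 - Y.
% \Delta Y denotes the change in Y caused by optimizing X.
\begin{align*}
\Delta(1 - Y) &= -\Delta Y \\
\mathbb{E}[\Delta Y] &= \mathbb{E}[\Delta(1 - Y)] && \text{(symmetry assumption)} \\
                     &= -\mathbb{E}[\Delta Y] \\
\Rightarrow\quad \mathbb{E}[\Delta Y] &= 0
\end{align*}
```

The "number of happy people" case is exactly where the assumption fails: the world is already set up to keep that Y high, so Y and 1-Y are not interchangeable and the expected change need not be zero.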

Seeking Power is Provably Instrumentally Convergent in MDPs

I should've specified that the strong version is "Y decreases relative to a world where neither X nor Y is being optimized". Am I right that this version is not true?

Seeking Power is Provably Instrumentally Convergent in MDPs

Thanks for writing this! It always felt like a blind spot to me that we only have Goodhart's law that says "if X is a proxy for Y and you optimize X, the correlation breaks" but we really mean a stronger version: "if you optimize X, Y will actively decrease". Your paper clarifies that what we actually mean is an intermediate version: "if you optimize X, it becomes harder to optimize Y". My conclusion would be that the intermediate version is true but the strong version is false. Would you say that's an accurate summary?

[AN #75]: Solving Atari and Go with learned game models, and thoughts from a MIRI employee

My tentative view on MuZero:

  • Cool for board games and related tasks.
  • The Atari demo seems sketchy.
  • Not a big step towards making model-based RL work - instead, a step making it more like model-free RL.


  • A textbook benefit for model-based RL is that world models (i.e. prediction of observations) generalize to new reward functions and environments. They've removed this benefit.
  • The other textbook benefit of model-based RL is data efficiency. But on Atari, MuZero is just as inefficient as model-free RL. In fact, MuZero moves a lot closer to model-free methods by removing the need to predict observations. And it's roughly equally inefficient. Plus it trains with 40 TPUs per game where other algorithms use a single GPU and similar training time. What if they spent that extra compute to get more data?
  • In the low-data setting they outperform model-free methods. But they suspiciously didn't compare to any model-based method. They'd probably lose there because they'd need a world model for data efficiency.
  • MuZero only plans for K=5 steps ahead - far less than AlphaZero. Two takeaways: 1) This again looks more similar to model-free RL which has essentially K=1. 2) This makes me more optimistic that model-free RL can learn Go with just a moderate efficiency (and stability?) loss (Paul has speculated this. Also, the trained AlphaZero policy net is apparently still better than Lee Sedol without MCTS).

AlphaStar: Impressive for RL progress, not for AGI progress

the big questions are just how large a policy you would need to train using existing methods in order to be competitive with a human (my best guess would be a ~trillion to a ~quadrillion)

Curious where this estimate comes from?

AlphaStar: Impressive for RL progress, not for AGI progress

Why just a 10x speedup over model-free RL? I would've expected much more.

[AN #72]: Alignment, robustness, methodology, and system building as research priorities for AI safety

Should I share the Alignment Research Overview in its current Google Doc form or is it about to be published somewhere more official?
