Potential paper from DM/Stanford for a future newsletter: https://arxiv.org/pdf/1911.00459.pdf
It addresses the problem that an RL agent will delude itself by finding loopholes in a learned reward function.
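A toy illustration of that failure mode (my own construction, not the paper's setup): fit a reward model on a handful of labeled states, then let the agent pick the state the model scores highest. The argmax tends to land exactly where the model overestimates the true reward.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup, not from the paper: states carry a true reward
# with a penalty on extreme states that a simple learned model misses.
n_states, n_features = 1000, 5
features = rng.normal(size=(n_states, n_features))
w_true = rng.normal(size=n_features)
true_reward = features @ w_true - 0.5 * (features ** 2).sum(axis=1)

# The "human" labels only a small sample of states.
labeled = rng.choice(n_states, size=30, replace=False)

# Fit a (misspecified) linear reward model by least squares.
w_hat, *_ = np.linalg.lstsq(features[labeled], true_reward[labeled], rcond=None)
learned_reward = features @ w_hat

# The agent greedily optimizes the learned reward over all states.
chosen = int(np.argmax(learned_reward))
print(f"learned reward at chosen state: {learned_reward[chosen]:.2f}")
print(f"true reward at chosen state:    {true_reward[chosen]:.2f}")
print(f"best achievable true reward:    {true_reward.max():.2f}")
# The argmax typically lands where the model most overestimates the true
# reward: the agent "deludes itself" by exploiting a loophole in the
# learned reward function rather than pursuing the true objective.
```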
Also interesting to see that all of these groups were able to coordinate to the disadvantage of less coordinated groups, but not able to reach peace among themselves.
One explanation might be that the more coordinated groups also have harder coordination problems to solve because their world is bigger and more complicated. Might be the same with AI?
If X is "number of paperclips" and Y is something arbitrary that nobody optimizes, such as the ratio of bicycles on the moon to flying horses, optimizing X should leave Y unchanged in expectation: otherwise "1-Y" would move in the opposite direction, which can't be true by symmetry, since 1-Y is just as arbitrary as Y. But if Y is something like "number of happy people", Y will probably decrease, because the world is already set up to keep Y high and a misaligned agent could disturb that state.
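Spelled out (my own formalization, assuming Y is scaled so that 1-Y is a sensible quantity): if optimizing X shifted an arbitrary Y by some amount in expectation, the same reasoning applied to the equally arbitrary quantity 1-Y would give the opposite shift.

```latex
% Hedged sketch of the symmetry argument; \delta is my notation, not the author's.
\[
\mathbb{E}[\,Y \mid X \text{ optimized}\,] - \mathbb{E}[Y] = \delta
\quad\Longrightarrow\quad
\mathbb{E}[\,1 - Y \mid X \text{ optimized}\,] - \mathbb{E}[1 - Y] = -\delta .
\]
% Since 1 - Y is just as arbitrary as Y, a rule saying that optimization
% systematically pushes arbitrary quantities in one direction would
% require both \delta > 0 and -\delta > 0, so \delta = 0.
```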
I should've specified that the strong version is "Y decreases relative to a world where neither X nor Y is being optimized". Am I right that this version is not true?
Thanks for writing this! It always felt like a blind spot to me that we only have Goodhart's law that says "if X is a proxy for Y and you optimize X, the correlation breaks", but we really mean a stronger version: "if you optimize X, Y will actively decrease". Your paper clarifies that what we actually mean is an intermediate version: "if you optimize X, it becomes harder to optimize Y". My conclusion would then be that the intermediate version is true but the strong version is false. Would you say that's an accurate summary?
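To make the distinction concrete, here is a minimal simulation (my own sketch, not from the paper): X is a proxy for Y through a shared factor, and selecting hard on X destroys the correlation (the weak version) without pushing Y below the unoptimized baseline (so the strong version fails).

```python
import numpy as np

rng = np.random.default_rng(0)

# Goodhart toy model (my own sketch): X is a proxy for Y because both
# share a common factor, plus independent noise.
n = 100_000
common = rng.normal(size=n)
x = common + rng.normal(size=n)  # the proxy we optimize
y = common + rng.normal(size=n)  # what we actually care about

# "Optimizing X" = selecting the top 0.1% of candidates by X.
top = x > np.quantile(x, 0.999)

print("corr(X, Y) overall:    ", round(np.corrcoef(x, y)[0, 1], 3))
print("corr(X, Y) among top X:", round(np.corrcoef(x[top], y[top])[0, 1], 3))
print("mean Y overall:        ", round(y.mean(), 3))
print("mean Y among top X:    ", round(y[top].mean(), 3))
# The correlation collapses in the selected tail (weak version), and past
# that point pushing X harder stops buying much Y (intermediate version),
# but Y does not actively fall below the baseline (strong version fails).
```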
Posted a little reaction to this paper here.
My tentative view on MuZero:
the big question is just how large a policy you would need to train using existing methods in order to be competitive with a human (my best guess would be ~a trillion to ~a quadrillion)
Curious where this estimate comes from?
Why just a 10x speedup over model-free RL? I would've expected much more.
Should I share the Alignment Research Overview in its current Google Doc form or is it about to be published somewhere more official?