My model doesn't distinguish much between mid and final: they have very similar MMR, the difference between them is pretty small in the scheme of things (e.g. probably smaller than the impact of doubling model size), and my picture isn't refined enough to appreciate those differences. For any particular dumb mistake, I'd be surprised if the line between making it and not making it fell in that particular doubling.
AGZ is only trained on the situations that actually arise in games it plays.
I agree with the point that "imitation learning from human games" will only make you play well on kinds of situations that arise in human games, and that self-play can do better by making you play well on a broader set of situations. You could also train on all the situations that arise in a bigger tree search (though AGZ did not) or against somewhat-random moves (which AGZ probably did).
(Though I don't see this as affecting the basic point.)
Thanks! That's only marginally less surprising than the final RL policy, and I suspect the final RL policy will make the same kind of mistake. Seems like the OP's example was legit and I overestimated the RL agent.
But to be clear, (i) it would then also be learned by imitating a large enough dataset from human players who did something like tree search internally while playing, (ii) I think the tree search makes a quantitative not qualitative change, and it's not that big (mostly improves stability, and *maybe* a 10x speedup, over self-play).
In one comical case, AlphaStar had surrounded the units it was building with its own factories so that they couldn't get out to reach the rest of the map. Rather than lifting the buildings to let the units out, which is possible for Terran, it destroyed one building and then immediately began rebuilding it before it could move the units out!
It seems like AlphaStar played 90 ladder matches as Terran:
This sounds like the kind of mistake that the SL policy would definitely make (no reason it should be able to recover), whereas it's not clear whether RL would learn how to recover (I would expect it to, but not too strongly).
If it's easy for anyone to check and they care, it might be worth looking quickly through the replays and seeing whether this particular game was from the SL or RL policies. This is something I've been curious about since seeing the behavior posted on Reddit, and it would have a moderate impact on my understanding of AlphaStar's skill.
It looks like they released 90 replays and played 90 ladder games so it should be possible to check.
The replays are here, hosted on the DM site and sorted into three folders based on the policy. If it's one of the SL matches, it's either AlphaStarSupervised_013_TvT.SC2Replay or one of _017_, _019_, or _022_ (based on being TvT and being on Kairos Junction). The video in question is here. I'd check if I had SC2 installed.
(Of course, better still would be to find a discussion of the 30 RL replays from someone who understands the game. Maybe that's been posted somewhere; I haven't looked, and it's hard to know who to trust.)
I meant to ask about the policy network in AlphaZero directly. It plays at the professional level (the Nature paper puts it at a comparable Elo to Fan Hui) with no tree search, using a standard neural network architecture trained by supervised learning. It performs fine on parts of the search tree that never appeared during training. What distinguishes this kind of reasoning from "if I see X, I do Y"?
(ETA clarification, because I think this was probably the misunderstanding: the policy network plays Go with no tree search, tree search is only used to generate training data. That suggests the AlphaStar algorithm would produce similar behavior without using tree search ever, probably using at most 100x the compute of AlphaZero and I'd be willing to bet on <10x.)
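As a toy illustration of that split (search only at data-generation time, the policy alone at play time), here is a minimal sketch. It is not AlphaZero or AlphaStar: the game is a one-pile Nim variant, the "search" is exhaustive, and the "policy" is a majority-vote table over a hand-picked feature. All names (`search_move`, `features`, `train_policy`) are hypothetical, and the feature choice is an assumption that makes generalization easy on purpose.

```python
from collections import Counter, defaultdict
from functools import lru_cache

MAX_TAKE = 3  # toy rule: remove 1..3 stones; whoever takes the last stone wins


@lru_cache(maxsize=None)
def search_wins(stones):
    """Exhaustive search: can the player to move force a win?"""
    return any(not search_wins(stones - k)
               for k in range(1, min(MAX_TAKE, stones) + 1))


def search_move(stones):
    """Move chosen by search: a winning move if one exists, else take 1."""
    for k in range(1, min(MAX_TAKE, stones) + 1):
        if not search_wins(stones - k):
            return k
    return 1


def features(stones):
    """Hand-picked feature (an assumption): pile size modulo MAX_TAKE + 1."""
    return stones % (MAX_TAKE + 1)


def train_policy(train_states):
    """Supervised learning on search-labelled states: majority vote per feature."""
    votes = defaultdict(Counter)
    for s in train_states:
        votes[features(s)][search_move(s)] += 1
    return {f: c.most_common(1)[0][0] for f, c in votes.items()}


def policy_move(policy, stones):
    """Play time: a single lookup, no search."""
    return policy[features(stones)]


# Search is used only here, to label the training states 1..20.
policy = train_policy(range(1, 21))

# Compare the search-free policy with search on pile sizes it never trained on.
unseen = range(21, 41)
agreement = sum(policy_move(policy, s) == search_move(s) for s in unseen) / len(unseen)
print(f"agreement with search on unseen states: {agreement:.2f}")
```

The learned table is a pure "if I see X, I do Y" rule, yet it matches the search on pile sizes that never appeared in training, because the feature captures the structure the search was exploiting. Whether a large network finds such features on its own is an empirical question this toy does not address.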
From the outside, it looks like human-level play at Starcraft is more complicated (in a sense) than human-level play at Go, and so it's going to take bigger models in order to reach a similar level of performance. I don't see a plausible-looking distinction-in-principle that separates the strategy in Starcraft from strategy in Go.
Finally, RL practitioners have known that genuine causal reasoning could never be achieved via known RL architectures- you'd only ever get something that could execute the same policy as an agent that had reasoned that way, via a very expensive process of evolving away from dominated strategies at each step down the tree of move and countermove. It's the biggest known unknown on the way to AGI.
What's the argument here? Do you think that the AGZ policy (which is extremely good at Go or Chess even without any tree search) doesn't do any causal reasoning? That it only ever learns to play parts of the game tree it's seen during training? What does "genuine causal reasoning" even mean?
It looks to me like causal reasoning is just another type of computation, and that you could eventually find that computation by local search. If you need to use RL to guide that search then it's going to take a long time: AlphaStar was very expensive, and still only trained a policy with ~80M parameters.
From my perspective it seems like the big questions are just how large a policy you would need to train using existing methods in order to be competitive with a human (my best guess would be ~a trillion to ~a quadrillion parameters), and whether you can train it by copying rather than needing to use RL.
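To put that guess next to the ~80M figure, here is a quick back-of-the-envelope calculation. It assumes the "trillion to quadrillion" guess is a parameter count, like the 80M figure; the constant and function names are made up for illustration.

```python
import math

ALPHASTAR_PARAMS = 80e6  # "~80M parameters", from the comment above
GUESS_LOW = 1e12         # "~trillion" (assumed to mean parameters)
GUESS_HIGH = 1e15        # "~quadrillion" (assumed to mean parameters)


def scale_gap(target_params, base_params=ALPHASTAR_PARAMS):
    """Return (size ratio, number of doublings) going from base to target."""
    ratio = target_params / base_params
    return ratio, math.log2(ratio)


low_ratio, low_doublings = scale_gap(GUESS_LOW)
high_ratio, high_doublings = scale_gap(GUESS_HIGH)

print(f"low guess:  {low_ratio:,.0f}x larger, about {low_doublings:.1f} doublings")
print(f"high guess: {high_ratio:,.0f}x larger, about {high_doublings:.1f} doublings")
```

Under those assumptions the gap is roughly 14 to 24 doublings of model size, which gives a sense of how far the guess sits from the policy AlphaStar actually trained.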
In what sense is this information warfare or even misleading? The second sentence of the blog post says: "AlphaStar was ranked above 99.8% of active players," which seems quite clear. They seem to have done a pretty good job of making that comparison as fair as you could expect. What do they say or even imply which is highly misleading?
Perhaps they say "Grandmaster level," and it's possible that this gives a misleading impression to people who don't know what that term means in Starcraft? Though I think chess grandmaster also means roughly "better than 99.8% of ladder players," and the competitive player pools have similar size. So while it might be misleading in the sense that chess has a larger player pool, a smaller fraction of whom are competitive, the claim seems fairly straightforward.
(edited post to include)
I wrote this post imagining "strategy-stealing assumption" as something you would assume for the purpose of an argument. For example, I might want to justify an AI alignment scheme by arguing "Under a strategy-stealing assumption, this AI would result in an OK outcome." The post was motivated by trying to write up another argument where I wanted to use this assumption, spending a bit of time trying to think through what the assumption was, and deciding it was likely to be of independent interest. (Although that other argument hasn't yet appeared in print.)
I'd be happy to have a better name for the research goal of making it so that this kind of assumption is true. I agree this isn't great. (And then I would probably be able to use that name in the description of this assumption as well.)