Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Last year, we showed that supposedly superhuman Go AIs can be beaten by human amateurs playing specific “cyclic” patterns on the board. Vulnerabilities have previously been observed in a wide variety of sub- or near-human AI systems, but this result demonstrates that even far superhuman AI systems can fail catastrophically in surprising ways. This lack of robustness poses a critical challenge for AI safety, especially as AI systems are integrated in critical infrastructure or deployed in large-scale applications. We seek to defend Go AIs, in the process developing insights that can make AI applications in various domains more robust against unpredictable threats.

We explored three defense strategies: positional adversarial training on handpicked examples of cyclic patterns, iterated adversarial training against successively fine-tuned adversaries, and replacing convolutional neural networks with vision transformers. We found that the two adversarial training methods defend against the original cyclic attack. However, we also found several qualitatively new adversarial strategies that can overcome all these defenses. Nonetheless, finding these new attacks is more challenging than against an undefended KataGo, requiring more training compute resources for the adversary.

For more information, see our blog post, project website or paper.

New Comment
2 comments, sorted by Click to highlight new comments since:

I speculated last time that convolutions might be inherently vulnerable; looks like I was wrong, although the ViT attack there does look distinctly different (so maybe in some sense that circle attack was convolutional)

I am not sure what is up with the 'gift' or 'atari' attacks, but the overall appearance of the attacks looks more like a horizon blindspot problem: they look like they push out the collapse enough steps that the feedforward network doesn't see any issue. And then perhaps it assigns such low values that the tree search never spots the disaster without an extremely large number of visits? Some qualitative work here on where the model+search goes wrong might be worthwhile now that there are multiple good attacks to compare. Are these all fundamentally the same or do they seem to trigger completely different pathologies?

(KataGo dev here, I also provided a bit of feedback with the authors on an earlier draft.)

@gwern - The "atari" attack is still a cyclic group attack, and the ViT attack is also still a cyclic group attack. I suspect it's not so meaningful to put much weight on the particular variant that one specific adversary happens to converge to. 

This is because the space of "kinds" of different cyclic group fighting situations is combinatorically large and it's sort of arbitrary what local minimum the adversary ends it because it doesn't have much pressure to find more once it finds one that works. Even among just the things that are easily put into words without needing a diagram - how big is the cycle? Does the cyclic group have a big eye (>= 4 points behaves tactically distinctly) or a small eye (<=3 points), or no eye? Is the eye a two-headed-dragon-style eye, or not? Does it have more loose connections or is it solid? Is the group inside locally dead/unsettled/alive? Is the cycle group racing against an outside group for liberties or only making eyes of its own, or both? How many liberties do all the various groups each have? Are there ko-liberties? Are there approach-liberties? Is there a cycle inside the cycle? etc. 

This is the same as how in Go the space of different capturing race situations in general is combinatorically large, with enough complexity that many situations are difficult even for pro players who have studied them for a lifetime.

The tricky bit here is that there seems to not be (enough) generalization between the exponentially large space of large group race situations in Go more broadly and the space of situations with cyclic groups. So whereas the situations in "normal Go" get decently explored by self-play, cyclic groups are rare in self-play so there isn't enough data to learn them well, leaving tons of flaws, even for some cases humans consider "simple". A reasonable mental model is that any particular adversary will probably find one or two of them somewhat arbitrarily, and then rapidly converge to exploit that, without discovering the numerous others.

The "gift" attack is distinct and very interesting. There isn't a horizon effect involved, it's just a straightforward 2-3 move blind spot of both the policy and value heads. Being only 2-3 moves this is why it also gets fixed more easily by search than cyclic groups. As for why it happens, as a bit of context I think these are true:

  • In 99.9...% of positions, the flavor of "superko" rule doesn't affect the value of a position or the correct move.
  • The particular shape used by the gift adversary and similar shapes do occur with reasonable frequency in real games without the superko rule being relevant (due to different order of moves), in which case the "gift shape" actually is harmless rather than being a problem.

I've taken a look at the raw neural net outputs and it's also clear that the neural net has no idea that the superko rule matters - predictions don't vary as you change the superko rule in these positions. So my best guess is that that the neural net perhaps "overgeneralizes" and doesn't easily learn that in this one specific shape with this specific order of moves, the superko rule, which almost never matters, suddenly does matter and flips the result.