Review

See also article in Financial Times

Apparently, a human (Kellin Pelrine, a solid player but not even a Go professional) was able to beat some state-of-the-art Go AIs (KataGo and Leela Zero) by learning to play an adversarial policy found using RL. Notice that he studied the policy before the match and didn't receive any AI advice during play.

I'm not surprised that adversarial policies for Go AIs are possible; this is in line with previous results about RL and adversarial examples more generally. What does surprise me is that this adversarial policy is teachable to humans without colossal effort.

This is some evidence against the "scaling hypothesis", i.e. evidence that something non-trivial and important is missing from modern deep learning in order to reach AGI. The usual counterargument to the argument from adversarial examples is: maybe if we could directly access a human brain, we could find adversarial examples against humans. I can believe that it's possible to defeat a Go professional by some extremely weird strategy that causes them to have a seizure or something in that spirit. But, is there a way to do this that another human can learn to use fairly easily? This stretches credulity somewhat.

Notice also that (AFAIK) there's no known way to inoculate an AI against an adversarial policy without letting it play many times against it (after which a different adversarial policy can be found). Whereas even if there's some easy way to "trick" a Go professional, they probably wouldn't fall for it twice.
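As a toy illustration of this cat-and-mouse dynamic (my own sketch, nothing to do with the actual Go attack): take a "victim" policy with a fixed exploitable bias in rock-paper-scissors, an "adversary" that best-responds to it, and a crude "inoculation" step that patches the victim against the current exploit. Every patch just relocates the blind spot, so a fresh best response appears each round. All the names and the patching rule below are made up for illustration.

```python
import numpy as np

# Toy sketch of the inoculation cat-and-mouse dynamic (illustrative only).
PAYOFF = np.array([   # rows: victim's move, columns: adversary's move
    [ 0, -1,  1],     # rock:     ties rock, loses to paper, beats scissors
    [ 1,  0, -1],     # paper:    beats rock, ties paper, loses to scissors
    [-1,  1,  0],     # scissors: loses to rock, beats paper, ties scissors
])

def best_response(victim):
    """Return the adversary's best pure move against a fixed victim policy."""
    return int(np.argmax(-victim @ PAYOFF))   # adversary's expected payoff per column

def inoculate(victim, adversary_move, lr=0.5):
    """Crude patch: shift probability mass toward the move that beats the current exploit."""
    counter = (adversary_move + 1) % 3
    patched = victim.copy()
    patched[counter] += lr
    return patched / patched.sum()

victim = np.array([0.5, 0.3, 0.2])   # exploitably biased "frozen" policy
for round_ in range(3):
    exploit = best_response(victim)
    print(f"round {round_}: victim={np.round(victim, 2)}, adversary exploits move {exploit}")
    victim = inoculate(victim, exploit)   # patching just moves the blind spot
```

Obviously, adversarial training of an actual Go network is vastly more involved; the sketch is only meant to show why patching against one particular adversary doesn't obviously buy robustness against the next one.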


I can believe that it's possible to defeat a Go professional by some extremely weird strategy that causes them to have a seizure or something in that spirit. But, is there a way to do this that another human can learn to use fairly easily? This stretches credulity somewhat.

I dunno. I don't feel like my intuitions say that. I try imagining that there's a specific Go professional at a specific time, and I write a program that plays it, travels back in time, plays it again, travels back in time again, etc. millions of times. Meanwhile I'm a Go expert but not professional, and I get to study the results. Maybe I would find that the professional happens to have a certain wrong idea or blind spot about Go that they've never noticed, and in a certain type of position they'll have a reliable repeatable oversight. Or maybe not, who knows.

(It's possible that my intuitions are being warped by too much Edge of Tomorrow & Groundhog Day & videogame tool-assisted speedruns :-P )

Notice also that (AFAIK) there's no known way to inoculate an AI against an adversarial policy without letting it play many times against it (after which a different adversarial policy can be found).

I think you're gesturing at a sample efficiency related argument against deep learning scaling to AGI. I'm personally sympathetic but I think it's controversial. My impression is that at least some people claim that sample efficiency is a problem that goes away with scale.

sample efficiency is a problem that goes away with scale

Citation?

I asked someone a couple weeks ago and they mentioned Fig. 4.1 of Language Models are Few-Shot Learners as an example of how bigger models can extract better inferences out of the same amount of training data. The Deep Double Descent stuff also shows bigger models getting better results out of fixed data.

Also, I vaguely recall hearing someone say that LLM gradient descent was approaching optimal Bayesian updates to the trained model on each token of training data, but I don’t remember who said it or what evidence they had, and am pretty skeptical of that. (If I’m even remembering right in the first place.)

Separately, I’ve seen claims that deep learning has human-level sample efficiency as proven by EfficientZero, but I don’t buy that.

I’m following this conversation in case someone has a better answer. :)

Relevant: https://www.reddit.com/r/mlscaling/search?q=sample-efficient+OR+data-efficient&restrict_sr=on&include_over_18=on&sort=relevance&t=all (Also, the mere existence of large slow models should tell you that they are more sample-efficient: if they were both slower and less sample-efficient (i.e. lowered loss less per datapoint than a smaller model), why would you ever train them?)

"Language Models are Few-Shot Learners" is some evidence towards the hypothesis that sample efficiency can be solved by metalearning, but the evidence is not that strong IMO. In order for it to be a strong counterargument to this example, it should be the case that an LLM can learn to play Go on superhuman level while also gaining the ability to recover from adversarial attacks quickly. In reality, I don't think an LLM can learn to play Go decently at all (in deployment time, without fine-tuning on a large corpus of Go games). Even if we successfully fine-tuned it to imitate strong human Go players, I suspect that it would be at least just as vulnerable to adversarial examples, probably much more vulnerable.

Deep double descent certainly shows increasing the model size increases performance, but I think that even with optimal model size the sample efficiency is still atrocious.

As to EfficientZero, I tend to agree with your commentary, and I suspect similar methods would fail in environments much more complex than Atari (especially environments whose simulation cost exceeds the compute available to the training algorithm).

Just to make sure we’re on the same page, Fig. 4.1 was about training the model by gradient descent, not in-context learning. I’m generally somewhat averse to the term “in-context learning” in the first place; I’m skeptical that we should think of it as “learning” at all (as opposed to, say, “pointing to a certain task”). I wish people would reserve the term “learning” for the weight updates (when we’re talking about LLMs), at least in the absence of more careful justification than what I’ve seen.

In particular, instead of titling the paper “Language Models are Few-Shot Learners”, I wish they had titled it “Language Models Can Do Lots of Impressive Things Without Fine-Tuning”.

But Fig. 4.1 of that paper is definitely about actual learning.

In order for it to be a strong counterargument to this example, it should be the case that an LLM can learn to play Go at a superhuman level …

I think there are so many disanalogies between LLMs-playing-Go and humans-playing-Go that it’s not even worth thinking about. ¯\_(ツ)_/¯ For example, humans can “visualize” things but LLMs can’t (probably). But OTOH, maybe future multi-modal next-gen LLMs will be able to.

More generally, I haven’t seen any simple comparison that provides air-tight evidence either way on the sample efficiency of deep learning versus human brains (and “deep learning” is itself a big tent; presumably some model types & sizes are more sample-efficient than others).

As it happens, I do believe that human brains are more sample efficient than any deep learning model. But my reasons for believing that are pretty indirect and I don’t want to talk about them.

But, is there a way to do this that another human can learn to use fairly easily? This stretches credulity somewhat.

This hypothesis literally makes no sense to me. Why would adversarial policies for humans be infeasible for other humans to learn? And why would that be so implausible as to strain credulity?

In a sense, optical and other sensory illusions are adversarial exploits of human perception, and humans can learn to present sensory illusions very readily. Professional stage magicians are experts at deceiving human perception and may even discover new exploits that are not publicly disclosed.

We have not systematically run adversarial optimisation over human cognition to find adversarial policies.

Your scepticism seems very unwarranted to me.

There are plenty of incentives for people to find adversarial policies against other people. And, sure, there are stage magicians, but I don't know any stage magician who can do the equivalent of winning at Go by playing a weird sequence of moves that confuses the senses (without exploiting any degrees of freedom other than the actual moves). AFAIK there is no weaponized stage magic either, which would be very useful if it were that effective. (Camouflage, maybe? It still seems pretty different.) Of course, it is possible that we cannot find these adversarial policies only because we are not able to interface with a human brain as directly as the adversarial policy training algorithm. In other words, maybe if you could play a lot of games against a Go champion while resetting their memory after every game, you would find something eventually (even though they randomize their moves, so the games don't repeat). But, I dunno.

Of course, it is possible that we cannot find these adversarial policies only because we are not able to interface with a human brain as directly as the adversarial policy training algorithm.

IMO this point is very underappreciated. It's heavily load-bearing that the adversarial policy could train itself against a (very) high-fidelity simulation of the Go engine, do the ML equivalent of "reading its mind" while doing so, and train against successively stronger versions of the engine (more searches per turn) for arbitrarily long.

We can't do any of these vs a human. Even though there are incentives to find adversarial exploits for human cognition, we can't systematically run an adversarial optimiser over a human mind the way we can over an ML mind.
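For concreteness, here is a structural sketch of that asymmetry (the names FrozenVictim, AdversaryPolicy and play_game are hypothetical stand-ins, not the authors' actual code): the victim's weights never change, while the attacker trains for as long as it likes against progressively stronger, higher-search versions of the same frozen network.

```python
import random

class FrozenVictim:
    """Hypothetical stand-in for the frozen Go engine; strength = search budget."""
    def __init__(self, visits):
        self.visits = visits                       # e.g. tree-search visits per move
    def move(self, state):
        return random.choice(state.legal_moves())  # placeholder behaviour only

class AdversaryPolicy:
    """Hypothetical stand-in for the trainable attacker."""
    def move(self, state):
        return random.choice(state.legal_moves())  # placeholder behaviour only
    def update(self, game_records):
        pass                                       # RL update would go here

def train_exploit(adversary, play_game, visit_schedule, games_per_stage):
    """Curriculum: exploit a weak (low-visit) victim first, then stronger ones.
    `play_game(adversary, victim)` is an environment rollout supplied by the caller."""
    for visits in visit_schedule:                  # e.g. [1, 16, 256, 4096]
        victim = FrozenVictim(visits)              # frozen: only the adversary learns
        for _ in range(games_per_stage):
            record = play_game(adversary, victim)
            adversary.update([record])
    return adversary
```

None of these steps (freezing the victim, raising its search budget on a schedule, running millions of rollouts) has an analogue when the "victim" is a human professional.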

And con artists/scammers, abusers, etc. may perform the equivalent of adversarial exploits on the cognition of particular people.


In other words, maybe if you could play a lot of games against a Go champion while resetting their memory after every game, you would find something eventually (even though they randomize their moves, so the games don't repeat). But, I dunno.

Not me, but a (very) strong Go amateur might be able to learn and adopt a policy that a compute-limited agent found, given such a setup, to beat a Go champion (notice that it wasn't humans who discovered the adversarial policy even in the KataGo case).

I don't think they do the ML equivalent of "reading its mind"? AFAIU, they are just training an RL agent to play against a "frozen" policy. Granted, we still can't do that against a human.

Hmm, I think I may have misunderstood/hallucinated the "reading its mind" analogy from an explanation of the exploit I read elsewhere.

Interesting point about the scaling hypothesis.  My initial take was that this was a slightly bad sign for natural abstractions: Go has a small set of fundamental abstractions, and this attack sure makes it look like KataGo didn't quite learn some of them (liberties and capturing races), even though it was trained on however many million games of self-play and has some customizations designed to make those specific things easier.  Then again, we care about Go exactly because it resisted traditional AI for so long, so maybe those abstractions aren't as natural in KataGo's space as they are in mine, and some other, more generally-useful architecture would be better behaved.

Definitely we're missing efficient lifelong learning, and it's not at all clear how to get there from current architectures.

I can believe that it's possible to defeat a Go professional by some extremely weird strategy that causes them to have a seizure or something in that spirit. But, is there a way to do this that another human can learn to use fairly easily? This stretches credulity somewhat.

I'm a bit confused on this point. It doesn't feel intuitive to me that you need a strategy so weird that it causes them to have a seizure (or something in that spirit). Chess preparation, for example, and especially world championship prep, often involves very deep lines calculated so that the moves chosen aren't the most optimal given perfect play, but lead a human opponent into an unfavourable position. One of my favourite games, for example, involves a position where at one point black is up three pawns and a bishop and is still in a losing position (analysis). (This comment is definitely not just a front to take an opportunity to gush over this game.)

Notice also that (AFAIK) there's no known way to inoculate an AI against an adversarial policy without letting it play many times against it (after which a different adversarial policy can be found). Whereas even if there's some easy way to "trick" a Go professional, they probably wouldn't fall for it twice.

The kind of idea I mention is also true of new styles. The hypermodern school of play or the post-AlphaZero style would have led to newer players being able to beat earlier players of greater relative strength, in a way that I think would be hard to recognize from a single game even for a GM.

My impression is that the adversarial policy used in this work is much stranger than the strategies you talk about. It's not a "new style", it's something bizarre that makes no game-sense but confuses the ANN. The linked article shows that even a Go novice can easily defeat the adversarial policy.

Yeah, but I think I registered that bizarreness as coming from the ANN having a different architecture and different abstractions of the game than we do. Which is to say, my confusion is that qualitatively this feels in the same vein as playing a move that doesn't improve your position in a game-theoretic sense, but which confuses your opponent and gives you an advantage when they make mistakes. That kind of play definitely isn't trained adversarially against a human mind, so I would expect that, in the limit, strategies like this would allow otherwise objectively far weaker players to defeat opponents they've customised their strategy against.

I'm not quite sure what you're saying here, but the "confusion" the go-playing programs have here seems to be one that no human player beyond the beginner stage would have. They seem to be missing a fundamental aspect of the game. 

Perhaps the issue is that go is a game where intuitive judgements plus some tree search get you a long way, but there are occasional positions in which it's necessary to use (maybe even devise and prove) what one might call a "theorem".  One is that "a group is unconditionally alive if it has two eyes", with the correct definition of "eye".  For capture races, another theorem is that the group with more liberties is going to win.  So if you've got 21 liberties and the other player has 20, you know you'll win, even though this involves looking 40 moves ahead in a tree search.  It may be that current go-playing programs are not capable of finding such theorems, in their fully-correct forms.
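To make that capture-race "theorem" concrete, here is a minimal sketch (my own illustration, not anything from KataGo): count each group's liberties with a flood fill, and in the simplest kind of race (no shared liberties, no eyes) the group with more liberties wins, with the side to move winning ties. Nothing deeper than counting is needed, which is exactly why the 40-moves-deep tree search is unnecessary.

```python
# Minimal sketch (my illustration, not KataGo's code): decide a simple capture
# race, i.e. one with no shared liberties and no eyes, by counting liberties
# with a flood fill. More liberties wins; the side to move wins a tie.

def liberties(board, start):
    """Count the liberties of the group containing `start`.
    `board` maps (row, col) -> 'B', 'W' or '.'; points not in the dict are off-board."""
    colour = board[start]
    group, frontier, libs = {start}, [start], set()
    while frontier:
        r, c = frontier.pop()
        for nbr in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if nbr not in board:
                continue                         # off the board
            if board[nbr] == '.':
                libs.add(nbr)                    # empty neighbour = liberty
            elif board[nbr] == colour and nbr not in group:
                group.add(nbr)
                frontier.append(nbr)
    return len(libs)

def capture_race_winner(board, black_stone, white_stone, to_move='B'):
    """Naive rule for a simple race: more liberties wins, ties go to the mover."""
    b, w = liberties(board, black_stone), liberties(board, white_stone)
    if b != w:
        return 'B' if b > w else 'W'
    return to_move
```

Real races with shared liberties, eyes, or big eye spaces need a more refined count, but it is still counting rather than brute-force search.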

These sorts of large-scale capturing races do arise in real human-human games. More so in games between beginners, but possible between more advanced players as well. The capturing race itself is not a "bizarre" thing. Of course it is not normal in a human-human game for a player to give away lots of points elsewhere on the board in order to set up such a capture race, since a reasonably good human player will be able to easily defend the targeted group before it's too late.

Qualifications:  I'm somewhere around a 3 dan amateur Go player.

In the paper, David Wu hypothesized one other ingredient: the stones involved have to form a circle rather than a tree (that is, excluding races that involve the edge of the board).  I don't think I buy his proposed mechanism but it does seem to be true that the bait group has to be floating in order for the exploit to work.

[Edit: this comment is probably incorrect, see gjm's reply.]

I don't think this should be much of an update about anything, though I'm not sure. See the comments here: https://www.lesswrong.com/posts/jg3mwetCvL5H4fsfs/adversarial-policies-beat-professional-level-go-ais

TL;DR: the adversarial policy wins because of something more like a technicality than like finding a strategy that the AI player "was supposed to beat". 

(I think there's in general a problem where results get signal-boosted on LW by being misrepresented by the authors, and then posters not reading carefully enough to catch that, or by being misrepresented by the posters, or something. No specific instance is super bad, and your post here might not be an instance, but it seems like a big pattern and is an indicator of something wrong with our epistemics. I mean it's reasonable for people to post on LW asking "are these representations by the authors accurate?", but posters tend to just say "hey look at this result! [insert somewhat hyped / glossed summary]". Not a huge deal but still.)

gjm:

This latest thing is different from the one described there. I think the same people are behind it, but it's a different exploit.

The old exploit was all about making use of a scoring technicality. The new one is about a genuine blind spot in (at least) KataGo and Leela Zero; what their networks learn about life and death through self-play is systematically wrong in a class of weird positions that scarcely ever occur in normal games. (They involve having cyclic "chains" of stones. The creator of KataGo has a plausible-sounding explanation for what sort of algorithm the networks of LZ and KG may be using, and why it would give wrong results in these cases.)

I would be very cautious about applying this to other AI systems, but it does match the following pattern, which arguably is also common with LLMs: the AI has learned something that works well most of the time in practice, but what it's learned falls short of genuine understanding, and as a result it's exploitable. Unlike a human, who after being hit with this sort of thing once or twice would think "shit, I've been misunderstanding this" and try to reason through what went wrong, the AI doesn't have that sort of metacognition and can just be exploited over and over and over.

Cool! Thanks. 

AFAIU the issue is that the attack relies on a specific rule variant, but OTOH the AI was trained using the same rule variant so in this sense it's completely fair. Also I'm not entirely sure whether the attack used by Pelrine is the same one from before (maybe someone who understands better can comment).

gjm:

The attack used by Pelrine is not the same one as before and isn't all about a weird scoring technicality.

The comments in that post are wrong; the exploit does not rely on a technicality or a specific rule variant. I explained how to do it in my post, cross-posted just now with Vanessa's post here.


I think an important difference between humans and these Go AIs is memory: If we find a strategy that reliably beats human experts, they will either remember losing to it or hear about it and it won't work the next time someone tries it. If we find a strategy that reliably beats an AI, that will keep happening until it's retrained in some way.

I can believe that it's possible to defeat a Go professional by some extremely weird strategy that causes them to have a seizure or something in that spirit. But, is there a way to do this that another human can learn to use fairly easily? This stretches credulity somewhat.

I can't say anything about Go, but con artists, social engineers, shoplifters, hypnotists, cult recruiters, etc., know many tricks designed to nudge you towards predictable mistakes.

What's interesting to me is that our first superintelligent AGI that is useful for a pivotal act will probably have multiple such holes, and that could be useful as a third line of defence for corrigibility.

I can believe that it's possible to defeat a Go professional by some extremely weird strategy that causes them to have a seizure or something in that spirit. But, is there a way to do this that another human can learn to use fairly easily? This stretches credulity somewhat.

Or maybe there are just different paths to AGI that involve different weaknesses and blind spots? Human children also seem exploitable in lots of ways. Couldn't you argue similarly that humans are not generally intelligent, because alpha-beta pruning plus some mediocre evaluation function beats them at chess consistently, and they are not even able to learn to beat it?

Human children also seem exploitable in lots of ways.

Also, more generally, there are many ways to manipulate humans into acting against their self-interest, which they may fail to adapt to even after suffering from it multiple times (people stuck in abusive relationships may be the most extreme example). You could probably call that an adversarial strategy that other humans routinely exploit.

This is some evidence against the "scaling hypothesis", i.e. evidence that something non-trivial and important is missing from modern deep learning in order to reach AGI.

The usual response is just "you don't actually need to be robust to white-box advexes, and only need to be somewhat resistant to black-box advexes, to take over the world".

My point is not that there is a direct link between adversarial robustness and taking over the world, but that the lack of adversarial robustness is (inconclusive) evidence that deep learning is qualitatively worse than human intelligence in some way (a deficiency that would also manifest in ways other than adversarial examples). If the latter is true, it reduces the potential risk from such systems: maybe not to zero, but it substantially weakens the case for the more dramatic take-over scenarios.