From time to time, someone makes the case for why transparency in reasoning is important. The latest conceptualization is Epistemic Legibility by Elizabeth, but the core concept is similar to reasoning transparency used by OpenPhil, and also has some similarity to A Sketch of Good Communication by Ben Pace.

I'd like to offer a gentle pushback. The tl;dr is in my comment on Ben's post, but it seems useful enough for a standalone post.

“How odd I can have all this inside me and to you it's just words.” ― David Foster Wallace

When and why reasoning legibility is hard

Say you demand transparent reasoning from AlphaGo. The algorithm has roughly two parts: tree search and a neural network. Tree search reasoning is naturally legible: the "argument" is simply a sequence of board states. In contrast, the neural network is mostly illegible - its output is a figurative "feeling" about how promising a position is, but that feeling depends on the aggregate experience of a huge number of games, and it is extremely difficult to explain transparently how a particular feeling depends on particular past experiences. So AlphaGo would be able to present part of its reasoning to you, but not the most important part.[1]
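To make the split concrete, here is a minimal sketch on a toy game, not AlphaGo itself. The game (take 1 to 3 stones, whoever takes the last stone wins), the function names, and the hand-written `opaque_value` standing in for a trained network are all assumptions made for illustration. The search returns a readable line of positions, while the number at the end of that line comes from a function that offers no rationale.

```python
from typing import Callable, List, Tuple

MOVES = (1, 2, 3)  # toy game (an assumption): take 1-3 stones; taking the last stone wins


def opaque_value(pile: int) -> float:
    """Stand-in for a trained value network: a score for the player to move.

    A real network would be a learned function; this hand-written rule only
    plays the role of "a number with no readable justification attached".
    """
    return -1.0 if pile % 4 == 0 else 1.0


def search(pile: int, depth: int, value: Callable[[int], float]) -> Tuple[float, List[int]]:
    """Depth-limited negamax.

    Returns (score for the player to move, line of pile sizes along the best path).
    """
    if pile == 0:
        # The previous player took the last stone, so the player to move has lost.
        return -1.0, [pile]
    if depth == 0:
        # The illegible part: the search bottoms out in the black box.
        return value(pile), [pile]
    best_score, best_line = float("-inf"), []
    for take in MOVES:
        if take > pile:
            continue
        child_score, child_line = search(pile - take, depth - 1, value)
        score = -child_score  # what is good for the opponent is bad for the mover
        if score > best_score:
            best_score, best_line = score, child_line
    return best_score, [pile] + best_line


# The legible part is the line of positions; the final score is not explained by it.
score, line = search(10, 2, opaque_value)
print(line, score)  # [10, 8, 7] 1.0
```

The printed line can be read and checked step by step, but the reason the search prefers it rests entirely on what `opaque_value` says about the last position.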

Human reasoning uses both: cognition similar to tree search (where the steps can be described, written down, and explained to someone else) and processes not amenable to introspection (which function essentially as a black box that produces a "feeling"). People sometimes call these latter signals “intuition”, “implicit knowledge”, “taste”, “S1 reasoning” and the like. Explicit reasoning often rides on top of this.

Extending the machine learning metaphor, the problem with human interpretability is that "mastery" in a field often consists precisely in having some well-trained black box neural network that performs fairly opaque background computations.


Bad things can happen when you demand explanations from black boxes

The second thesis is that it often makes sense to assume the mind runs distinct computational processes: one that actually makes decisions and reaches conclusions, and another that produces justifications and rationalizations.

In my experience, if you have good introspective access to your own reasoning, you may occasionally notice that a conclusion C depends mainly on some black box, but at the same time, you generated a plausible legible argument A for the same conclusion after you reached the conclusion C. 

If you try running, say, Double Crux over such situations, you'll notice that even if someone refutes the explicit reasoning A, you won't quite change the conclusion to ¬C. The legible argument A was not the real crux. It is quite often the case that A is essentially fake (or low-weight), whereas the black box is hiding a reality-tracking model.

Stretching the AlphaGo metaphor a bit: AlphaGo could be easily modified to find a few specific game "rollouts" that turned out to "explain" the mysterious signal from the neural network. Using tree search, it would produce a few specific examples of how such a position may evolve, selected to agree with the neural net prediction. If AlphaGo showed them to you, it might convince you! But you would get a completely superficial understanding of why it evaluates the situation the way it does, or why it makes certain moves.
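A rough sketch of that selection effect, again on a toy game rather than Go. The game rules, the function names, and the idea of passing the black box's opinion in as a plain boolean `verdict` are assumptions for illustration only. Rollouts are sampled with random play, and only those that agree with the verdict are shown.

```python
import random
from typing import List, Tuple


def random_rollout(pile: int, rng: random.Random) -> List[int]:
    """Play random legal moves (take 1-3 stones) to the end; return the pile sizes visited."""
    line = [pile]
    while pile > 0:
        pile -= rng.randint(1, min(3, pile))
        line.append(pile)
    return line


def first_player_wins(line: List[int]) -> bool:
    """Whoever takes the last stone wins; the first player makes moves 1, 3, 5, ..."""
    return (len(line) - 1) % 2 == 1


def cherry_pick(pile: int, verdict: bool, n_samples: int, n_shown: int,
                seed: int = 0) -> Tuple[List[List[int]], float]:
    """Sample rollouts, then keep only those that agree with the black-box verdict.

    Returns (rollouts shown to the audience, fraction of all samples that agreed).
    """
    rng = random.Random(seed)
    rollouts = [random_rollout(pile, rng) for _ in range(n_samples)]
    agreeing = [r for r in rollouts if first_player_wins(r) == verdict]
    return agreeing[:n_shown], len(agreeing) / n_samples


# Every rollout shown supports the verdict, whatever the underlying fraction is.
shown, fraction = cherry_pick(pile=10, verdict=True, n_samples=200, n_shown=3)
for rollout in shown:
    print(rollout)
print(f"fraction of sampled rollouts that agreed: {fraction:.2f}")
```

The audience sees only `shown`, in which the verdict is confirmed every time. The `fraction` value, which says how representative those examples are, is exactly the part that gets left out.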


Risks from the legibility norm

When you make a strong norm pushing for too straightforward "epistemic legibility", you risk several bad things:

First, you increase the pressure on the "justification generator" to mask various black boxes by generating arguments supporting their conclusions.

Second, you make individual people dumber. Imagine asking a Go grandmaster to transparently justify his moves to you, and to play the moves that are best justified - if he tries to play that way, he will become a much weaker player. A similar thing applies to AlphaGo - if you allocate computational resources in such a way that a much larger fraction is consumed by tree search at each position, and less of the neural network is used overall, you will get worse outputs.

Third, there's a risk that people get convinced based on bad arguments - because their "justification generator" generated a weak legible explanation, you managed to refute it, and they updated. The problem comes if this involves discarding the output of the neural network, which was much smarter than the reasoning they accepted.

What we can do about it

My personal impression is that society as a whole would benefit from more transparent reasoning on the margin. 

What I'm not convinced of, at all, is that trying to reason much more transparently is a good goal for aspiring rationalists, or that some naive (but memetically fit) norms around epistemic legibility should spread.

To me, it makes sense for some people to specialize in very transparent reasoning. On the other hand, it also makes sense for some people to mostly "try to be better at Go", because legibility has various hidden costs.

A version of transparency that seems more robustly good to me is the one that takes legibility to a meta level. It's perfectly fine to refer to various non-interpretable processes and structures, but we should ideally add a description of what data they are trained on (e.g. “I played at the national level”). At the same time, if such black-box models outperform legible reasoning, it should be considered fine and virtuous to use models which work. You should play to win, if you can.


An example of a common non-legible communication:

A: Can you explain why you feel that getting this person to implement a "Getting Things Done" system is not a good idea?

B: I don't know exactly, I feel it won't do him any good

An example of how to make the same conversation worse by naive optimization for legibility:

A: Can you explain why you feel that getting this person to implement a "Getting Things Done" system is not a good idea?

B: I read a thread on Twitter yesterday where someone explained that research on similar motivational techniques does not replicate, and also another thread where someone referenced research that people who over-organize their lives are less creative.

A: Those studies are pretty weak though.

B: Ah I guess you’re right. 

An example of how to actually improve the same conversation by striving for legibility:

A: Can you explain why you feel that getting this person to implement a "Getting Things Done" system is not a good idea?

B: I guess I can't explain it transparently to you. My model of this person just tells me that there is a fairly high risk that teaching them GTD won't have good results. I think it's based on experience with a hundred people I've met on various courses who are trying to have a positive impact on the world. Also, when I had similar feelings in the past, it turned out they were predictive in more than half of the cases.
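The last sentence of B's answer is a claim about a track record, which is something that can be written down even when the intuition itself cannot. Here is a minimal sketch of what such a log might look like. The record structure, the field names, and the example entries are made up for illustration and are not data from any real case.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IntuitionRecord:
    """One past occasion on which the black box was consulted (hypothetical structure)."""
    predicted_bad_outcome: bool  # what the gut feeling said at the time
    outcome_was_bad: bool        # what actually happened


def hit_rate(records: List[IntuitionRecord]) -> Optional[float]:
    """Fraction of 'this will go badly' feelings that turned out to be right.

    Returns None when the intuition never fired, since there is nothing to score.
    """
    flagged = [r for r in records if r.predicted_bad_outcome]
    if not flagged:
        return None
    return sum(r.outcome_was_bad for r in flagged) / len(flagged)


# Made-up entries, only to show the calculation.
log = [
    IntuitionRecord(predicted_bad_outcome=True, outcome_was_bad=True),
    IntuitionRecord(predicted_bad_outcome=True, outcome_was_bad=False),
    IntuitionRecord(predicted_bad_outcome=True, outcome_was_bad=True),
    IntuitionRecord(predicted_bad_outcome=False, outcome_was_bad=False),
]
print(hit_rate(log))
```

This says nothing about why the feeling arises. It only reports what the box was trained on and how often it has been right, which is the meta-level kind of legibility B is offering.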

If you've always understood the terms "reasoning transparency" or "epistemic legibility" in the spirit of the third conversation, and your epistemology routinely involves steps like "I'm going to trust this black-box trained on lots of data a lot more than this transparent analysis based on published research", then you're probably safe.

How this looks in practice

In my view, it is pretty clear that some of the main cruxes of current disagreements about AI alignment are beyond the limits of legible reasoning. (The current limits, anyway.) 

In my view, some of these intuitions have roughly the "black-box" form explained above. If you try to understand the disagreements between e.g. Paul Christiano and Eliezer Yudkowsky, you often end up in a situation where the real difference is "taste", which influences how much weight they give to arguments, how good or bad various future "board positions" are evaluated to be, etc. Both Eliezer and Paul are extremely smart, have spent more than a decade thinking about AI safety and even more time on relevant topics such as ML or decision theory or epistemics.

A person new to AI safety evaluating their arguments is roughly at a similar position to a Go novice trying to make sense of two Go grandmasters disagreeing about a board, with the further unfortunate feature that you can't just make them play against each other, because in some sense they are both playing for the same side.

This isn't a great position to be in. But in my view it's better to understand where you are rather than, for example, naively updating on a few cherry-picked rollouts.


Thanks to Gavin for help with writing this post.

  1. ^

     We can go even further if we note that the later AlphaZero policy network doesn’t use tree search when playing.

12 comments

I'm in broad overall agreement here: legibility is one of many virtues and the trade-offs are not always in its favor. I do think context matters a lot here. In one:many written broadcasts to strangers, I lean pretty heavily on legibility and track record (although still not exclusively). With my closest friends we spend a lot of time saying completely unjustifiable things that gesture at something important and work it out together, sometimes over months.

I am afraid of the pattern where people claim intuition and track record when called on poor legibility, and "being wrong before doesn't make me wrong now, you have to engage with my argument" when called on a poor track record. I don't think this is necessarily more likely or a bigger harm than your fears about excessive pushes for legibility, but it does strike deeper for me personally. 

Great post! Strong upvoted.

I don't play go, but I do play some chess (I especially enjoy chess puzzles), and whenever I don't "get" the proposed solution to a puzzle I spin up the engine and poke around until the situation resolves into something that makes sense to me. Why didn't my line work? Oh, there's a bishop block, etc. etc.

Back to the example of AlphaGo justifying itself to me, there are ways other than tree search that AlphaGo can make a legible argument in favor of a particular line of play. For instance, it could ask "what would you play instead?" and then play against itself starting from my proposed alternative play. In some sense, this means that the answer to all questions of this form is "because when I play that way, I lose."

You correctly identify that this takes much more computation than simply playing go well, since it opens up the algorithm to explore as many positions as I want it to. Heck, even a single question from me can take a long time to resolve (it requires an entire game!). Even that game may not be sufficient justification for me, since it is likely to generate even more questions, "okay, sure, if you play like that my line is bad, but what about if you follow it up with this?" [several lines of play] "oh."

Nonetheless, with this way of communicating between engine and human it is the human who generalizes. The engine is simply giving its expert opinion in the form of "this play is better than that play", and it can demonstrate that it is probably true but it doesn't really ever say why. The engine being able to generate correct lines of play makes it a useful tool in teaching, but doesn't make it a good teacher. It's still up to me to figure out what concepts I'm missing that are leading to poor play.

Now, the Go example has unusually fast and clear feedback. Many arguments do not work this way since they make predictions about things that will not happen for years (or may not happen at all!) or are about hidden bits of information. The fact that the engine can demonstrate that its model of the game is better than mine is very unusual.

Go is different from Chess. Chess was solved so much earlier than Go because the feedback in Go isn't as fast. While you get fast and clear feedback for life-and-death problems, most moves played in a game don't have fast feedback. It frequently takes a hundred moves to see why a particular move is good or bad, and a move that provides two points more than an alternative move can be a very good move without looking like much over the space of 100 moves.

A couple of things: 

I disagree with your claim that Chess is a solved game. AIs play much better chess than humans, but AIs continue to improve. The AIs of today would trounce the AIs from a few years ago, and "solved" means to me that there is a known best strategy. I do agree that Go is a harder game, but I believe they are very much in kind with one another.

I'm suspicious that when you say that "game x is easier than game y because the feedback isn't as fast", you will end up needing to define "fast feedback" in a way that depends on the difficulty or complexity of the game. As a result, the statement "game x is easier than game y" would mean approximately the same thing as "game x has a faster feedback loop than game y" by definition. Here are some ways you might define the speed of feedback so that the statement could be made, along with some problems they run into:

  • Time in seconds between making a move and the end of the game. Well, chess and go both take about as long as each other, so this seems like the wrong axis already.
  • Number of moves between making a move and the end of the game. It seems easy to imagine a game that has many moves, and where individual moves can be very strong or very weak, but is nonetheless very easy. For instance, we could modify the game of 21 to instead count to ten trillion; the feedback from move one to your win or loss is long and yet the game is just as easy as 21.
  • The amount of compute required to determine how good a move is[1]. This would be my definition of the difficulty of a game.
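The counting-game point in the second bullet can be sketched directly. The rules below are an assumption, since several variants of 21 exist: players alternately add 1, 2, or 3 to a running total that starts at 0, and whoever reaches the target exactly wins. Under those rules the winning move follows from one remainder calculation, whether the target is 21 or ten trillion.

```python
from typing import Optional


def best_move(total: int, target: int, max_step: int = 3) -> Optional[int]:
    """Return a winning step for the player to move, or None if every step loses.

    Assumed rules: players alternately add 1..max_step to the running total,
    and the player who makes the total equal to `target` wins. The winning idea
    is to leave the opponent with a remaining distance that is a multiple of
    (max_step + 1), because whatever they add can then be topped up to
    (max_step + 1) again on the next turn.
    """
    remainder = (target - total) % (max_step + 1)
    return remainder if remainder != 0 else None


print(best_move(0, 21))       # 1: say 1, leaving a distance of 20
print(best_move(0, 10**13))   # None: the first player is already losing
print(best_move(1, 10**13))   # 3: moving to 4 leaves a multiple of 4 to go
```

The number of moves between the first move and the end of the game grows with the target, but the cost of finding the right move does not change, which is why move count alone seems like a poor measure of feedback speed or difficulty.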

Chess frequently has far-reaching consequences of relatively quiet moves. This video of AlphaZero versus Stockfish is my personal favorite example of this. Stockfish is up in material for the majority of the game (and thinks it is winning handily for most of this time), but loses convincingly by the end of the game. An AI that is far superior to me at chess failed to see the long-reaching consequences of its plays; it was not obvious to it that it had played a bad move for quite a while.

When I think of the speed of feedback, I think of the amount of time in seconds that elapses between making a decision and receiving new information from the outside world that would be meaningfully different if I had made a different decision. I expect that definition to have problems, since I don't know how to define "meaningfully different information", but it's the best I can do right now.

For example, after I post this I could re-read my own post, and evaluate it on its merits according to me, but I would not consider that feedback since that is information that was largely internal. In contrast, if/when I get a response from you, or upvotes/downvotes from the community, that is feedback I can use to update my models of the world. That time lag that's out of my control[2] is what I think of as the speed of feedback.

  1. ^

    The amount of compute will depend on the method being used to convert a board-state into a move, and so a game could be hard for some players but easy for others. As a fairly extreme example, we could compare an already-trained Stockfish against the code that would allow you to train Stockfish; the first will require much less compute to decide why (and whether) a given move is good or bad.

  2. ^

    I suppose I could DM random people and ask them to read the post, but that would be very weird.

A person new to AI safety evaluating their arguments is roughly at a similar position to a Go novice trying to make sense of two Go grandmasters disagreeing about a board

I don't think the analogy is great, because Go grandmasters have actually played, lost and (critically) won a great many games of Go. This has two implications: first, I can easily check their claims of expertise. Second, they have had many chances to improve their gut level understanding of how to play the game of Go well, and this kind of thing seems to me to be necessary to develop expertise.

How does one go about checking gut level intuitions about AI safety? It seems to me that turning gut intuitions into legible arguments that you and others can (relatively) easily check is one of the few tools we have, with objectively assessable predictions being another. Sure, both are hard, and it would be nice if we had easier ways to do it, but it seems to me that that's just how it is.

In case of AI safety, the analogy maps through things like past research results, or general abilities to reason and make arguments.  You can check the claim that e.g. Eliezer historically made many good non-trivial arguments about AI, where he was the first person, or one of the first people to make them.  While the checking part is less easy than in chess, I would say it's roughly comparable to high level math, or good philosophy.

This is a short self-review, but with a bit of distance, I think understanding 'limits to legibility' is maybe one of the top 5 things an aspiring rationalist should deeply understand, and that lacking it leads to many bad outcomes in both rationalist and EA communities.

In a very brief form, maybe the most common cause of EA problems and stupidities is attempts to replace illegible S1 boxes able to represent human values such as 'caring' with legible, symbolically described, verbal moral reasoning subject to memetic pressure.

Maybe the most common cause of rationalist problems and difficulties with coordination is people replacing illegible smart S1 computations with legible S2 arguments.

Tree search reasoning is naturally legible: the "argument" is simply a sequence of board states. In contrast, the neural network is mostly illegible

You can express tree search in terms of a huge tree of board states. You can express neural nets as a huge list of arithmetic. Both are far too huge for a human to read all of. 

I don't think the intuition "both are huge" so "~ roughly equal" is correct.

Tree search is decomposable into specific sequences of board states, which are easily readable; in practice trees are pruned, and can be pruned to human-readable sizes.

This isn't true for the neural net. If you decompose the information in the AlphaGo net into a huge list of arithmetic, and the "arithmetic" is the whole training process, the list is much larger than in the first case. If it's just the trained net, it's less interpretable than the tree.

this is not how the third conversation should go, in my opinion. instead, you should query your Inner Simulator, and then say that you expect that learning GTD will make them more anxious, or will work for two weeks and then stop, so the initial investment in time will not pay off, or that in the past you encountered people who tried it and it made them crash down parts of themselves, or that you expect it will work too well and lead to burnout.

it is possible to compare illegible intuitions - by checking what different predictions they produce, by comparing possible differences in the sorting of the training data. 

in my experience, different illegible intuitions come from people seeing different parts of the whole picture, and it's valuable to try to understand them better. also, making predictions, describing the differences between the world where you are right and the world where you are wrong, and having at least two different hypotheses are all ways to make the illegible intuitions better.

To me, this post suffers from ignoring the present state of the art. KataGo is designed to help people learn Go. KataGo does so by providing the player with information about the expected board value, so that it not only tells you which move is best but also provides you with a lot more feedback.

While it's theoretically possible that this focus on maximizing points, which is different from what AlphaGo did, leads to worse play in some circumstances, there's no good way to tell a priori whether KataGo plays worse because it cares about the expected value of different moves.
