
Spencer Becker-Kahn

Independent AI Safety Researcher. Previously SERI MATS scholar and FHI Senior Research Scholar. Before that, pure math in academia at Cambridge, UW, MIT.

# Wiki Contributions

My rejoinder to this is that, analogously to how a causal model can be re-implemented as a more complex non-causal model[2], a learning algorithm that looks at data that in some ways is saying something about causality, be it because the data contains information-decision-action-outcome units generated by agents, because the learning thing can execute actions itself and reflectively process the information of having done such actions, or because the data contains an abstract description of causality, can surely learn causality.

Short comment/feedback just to say: This sentence is making one of your main points but is very tricky! - perhaps too long/too many subclauses?

Ah OK, I think I've worked out where some of my confusion is coming from:  I don't really see any argument for why mathematical work may be useful, relative to other kinds of foundational conceptual work. e.g. you write (with my emphasis): "Current mathematical research could play a similar role in the coming years..." But why might it? Isn't that where you need to be arguing?

The examples seem to be of cases where people have done some kind of conceptual foundational work which has later gone on to influence/inspire ML work. But early work on deception or goodhart was not mathematical work, that's why I don't understand how these are examples.

Thanks for the comment Rohin, that's interesting (though I haven't looked at the paper you linked).

I'll just record some confusion I had after reading your comment that stopped me replying initially: I was confused by the distinction between modular and non-modular because I kept thinking: If I add a bunch of numbers and don't do any modding, then it is equivalent to doing modular addition modulo some large number (i.e. larger than the largest sum you get). And otoh if I tell you I'm doing 'addition modulo 113', but I only ever use inputs that add up to 112 or less, then you never see the fact that I was secretly intending to do modular addition. And these thoughts sort of stopped me having anything more interesting to add.
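To make that concrete, here is a minimal sketch of the observation in Python. The function name `modular_sum` and the specific inputs are made up for illustration; nothing here is taken from the paper or from the modular addition work being discussed.

```python
def modular_sum(numbers, modulus):
    """Add the numbers, reducing modulo `modulus` after every step."""
    total = 0
    for n in numbers:
        total = (total + n) % modulus
    return total


# Hypothetical inputs chosen so that the plain sum is exactly 112.
inputs = [17, 40, 55]
plain = sum(inputs)

# Case 1: plain addition agrees with addition modulo any number
# strictly larger than the sum.
assert modular_sum(inputs, plain + 1) == plain

# Case 2: 'addition modulo 113' on inputs that sum to 112 or less
# never wraps, so it looks exactly like plain addition.
assert modular_sum(inputs, 113) == plain

# The difference only shows up once a sum reaches the modulus.
wrapping_inputs = [60, 60]
assert sum(wrapping_inputs) == 120
assert modular_sum(wrapping_inputs, 113) == 7
```

Note that the modulus has to exceed the largest sum: with a modulus of exactly 112, the first example would wrap around to 0.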

I'm still not sure I buy the examples. In the early parts of the post you seem to contrast 'machine learning research agendas' with 'foundational and mathematical'/'agent foundations' type stuff. Mechanistic interpretability can be quite mathematical but surely it falls into the former category? i.e. it is essentially ML work as opposed to constituting an example of people doing "mathematical and foundational" work.

I can't say much about the Goodhart's Law comment but it seems at best unclear that its link to goal misgeneralization is an example of the kind you are looking for (i.e. in the absence of much more concrete examples, I have approximately no reason to believe it has anything to do with what one would call mathematical work).

Strongly upvoted.

I roughly think that a few examples showing that this statement is true will 100% make OP's case. And that without such examples, it's very easy to remain skeptical.

Currently, it takes a very long time to get an understanding of who is doing what in the field of AI Alignment and how good each plan is, what the problems are, etc.

Is this not ~normal for a field that is maturing? And by normal I also mean approximately unavoidable or 'essential'. Like I could say 'it sure takes a long time to get an understanding of who is doing what in the field of... computer science', but I have no reason to believe that I can substantially 'fix' this situation in the space of a few months. It just really is because there is lots of complicated research going on by lots of different people, right? And 'understanding' what another researcher is doing is sometimes a really, really hard thing to do.

I think that perhaps as a result of a balance of pros and cons, I initially was not very motivated to comment (and haven't been very motivated to engage much with ARC's recent work).  But I decided maybe it's best to comment in a way that gives a better signal than silence.

I've generally been pretty confused about Formalizing the Presumption of Independence and, as the post sort of implies, this is the main advert that ARC have at the moment for the type of conceptual work that they are doing, so most of what I have to say is meta stuff about that.

Disclaimers: a) I have not spent a lot of time trying to understand everything in the paper, and b) as is often the case, this comment may come across as overly critical, but it seems highest leverage to discuss my biggest criticisms, i.e. the things that, if they were addressed, might cause me to update to the point where I would more strongly recommend that people apply, etc.

I suppose the tldr is that the main contribution of the paper claims to be the framing of a set of open problems, but I did not find the paper able to convince me that the problems are useful ones or that they would be interesting to answer.

I can try to explain a little more: It seemed odd that the "potential" applications to ML were mentioned very briefly in the final appendix of the paper, when arguably the potential impact or usefulness of the paper really hinges on this. As a reader, it might seem natural to me that the authors would have already asked and answered - before writing the paper - questions like "OK so what if I had this formal heuristic estimator? What exactly can I use it for? What can I actually (or even practically) do with it?" Some of what was said in the paper was fairly vague stuff like:

If successful, it may also help improve our ability to verify reasoning about complex questions, like those emerging in modern machine learning, for which we expect formal proof to be impossible.

In my opinion, it's also important to bear in mind that the criterion of a problem being 'open' is a poor proxy for things like usefulness/interestingness (obviously those famous number theory problems are open, but so are loads of random mathematical statements). The usefulness/interestingness of course comes because people recognize various other valuable things too, like: that the solution would seem to require new insights into X and therefore a proof would 'have to be' deeply interesting in its own right; or that the truth of the statement implies all sorts of other interesting things; or that the articulation of the problem itself has captured and made rigorous some hitherto messy confusion, etc. Perhaps more of these things need to be made explicit in order to argue more effectively that ARC's stating of these open problems about heuristic estimators is an interesting contribution in itself?

To be fair, in the final paragraph of the paper there are some remarks that sort of admit some of what I'm saying:

Neither of these applications [to avoiding catastrophic failures or to ELK] is straightforward, and it should not be obvious that heuristic arguments would allow us to achieve either goal.

But practically it means that when I ask myself something like: 'Why would I drop whatever else I'm working on and work on this stuff?' I find it quite hard to answer in a way that's not basically just all deference to some 'vision' that is currently undeclared (or as the paper says "mostly defer[red]" to "future articles").

Having said all this, I'll reiterate that there are lots of clear pros to a job like this, and I do think that there is important work to be done that is probably not too dissimilar from the kind being talked about in Formalizing the Presumption of Independence and in this post.

Hi Garrett,

OK so just being completely honest, I don't know if it's just me but I'm getting a slightly weird or snarky vibe from this comment? I guess I will assume there is a good faith underlying point being made to which I can reply. So just to be clear:

• I did not use any words such as "trivial", "obvious" or "simple". Stories like the one you recount are obviously making fun of mathematicians, some of whom do think it's cool to say things are trivial/simple/obvious after they understand them. I often strongly disagree and generally dislike this behaviour, and I think there are many normal mathematicians who don't engage in this sort of thing. In particular, sometimes the most succinct insights are the hardest ones to come by (this isn't a reference to my post; just a general point). And just because such insights are easily expressible once you have the right framing and the right abstractions, they should by no means be trivialized.
• I deliberately emphasized the subjectivity of making the sorts of judgements that I am making. Again this kinda forms part of the joke of the story.
• I have indeed been aware of the work since when it was first posted 10 months ago or so and have given it some thought on and off for a while (in the first sentence of the post I was just saying that I didn't spend long writing the post, not that these thoughts were easily arrived-at).
• I do not claim to have explained the entire algorithm, only to shed some light on why it might actually be a more natural thing to do than some people seem to have appreciated.
• I think the original work is of a high quality and one might reasonably say 'groundbreaking'.

In another one of my posts I discuss at more length the kind of thing you bring up in the last sentence of your comment, e.g.

it can feel like the role that serious mathematics has to play in interpretability is primarily reactive, i.e. consists mostly of activities like 'adding' rigour after the fact or building narrow models to explain specific already-observed phenomena.

....[but]... one of the most lauded aspects of mathematics is a certain inevitability with which our abstractions take on a life of their own and reward us later with insight, generalization, and the provision of predictions. Moreover - remarkably - often those abstractions are found in relatively mysterious, intuitive ways: i.e. not as the result of us just directly asking "What kind of thing seems most useful for understanding this object and making predictions?" but, at least in part, as a result of aesthetic judgement and a sense of mathematical taste.

And e.g. I talk about how this sort of thing has been the case in areas like mathematical physics for a long time. Part of the point is that (in my opinion, at least) there isn't any neat shortcut to the kind of abstract thinking that lets you make the sort of predictions you are making reference to. It is very typical that you have to begin by reacting to existing empirical phenomena and using them as scaffolding. But to me, it has come across as though you are being somewhat dismissive of this fact? As if, when B might well follow from A and someone actually starts to do A, you say "I would be far more impressed if B" instead of "maybe that's progress towards B"?

(Also FWIW,  Neel claims here that regarding the algorithm itself, another researcher he knows "roughly predicted this".)

Interesting thoughts!

It reminds me not only of my own writing on a similar theme but also of another one of these viewpoints/axes along which to carve interpretability work, which is mentioned in this post by jylin04:

...a dream for interpretability research would be if we could reverse-engineer our future AI systems into human-understandable code. If we take this dream seriously, it may be helpful to split it into two parts: first understanding what "programming language" an architecture + learning algorithm will end up using at the end of training, and then what "program" a particular training regimen will lead to in that language  [7]. It seems to me that by focusing on specific trained models, most interpretability research discussed here is of the second type. But by constructing an effective theory for an entire class of architecture that's agnostic to the choice of dataset, PDLT is a rare example of the first type.

I don't necessarily totally agree with her phrasing but it does feel a bit like we are all gesturing at something vaguely similar (and I do agree with her that PDLT-esque work may have more insights in this direction than some people on our side of the community have appreciated).

FWIW, in a recent comment reply to Joseph Bloom, I also ended up saying a bit more about why I don't actually see myself working much more in this direction, despite it seeming very interesting, but I'm still on the fence about that. (And one last point that didn't make it into that comment is the difficulty posed by a world in which, increasingly, the plucky bands of interpretability researchers on the fringes literally don't even know what the cutting edge architectures and training processes in the biggest labs are.)