How exactly can an org like this help solve (what many people see as one of the main bottlenecks:) the issue of mentorship? How would Catalyze actually tip the scales when it comes to 'mentor matching'?
(e.g. see Richard Ngo's first high-level point in this career advice post)
Hi Garrett,
OK so just being completely honest, I don't know if it's just me but I'm getting a slightly weird or snarky vibe from this comment? I guess I will assume there is a good faith underlying point being made to which I can reply. So just to be clear:
In another one of my posts I discuss at more length the kind of thing you bring up in the last sentence of your comment, e.g.
it can feel like the role that serious mathematics has to play in interpretability is primarily reactive, i.e. consists mostly of activities like 'adding' rigour after the fact or building narrow models to explain specific already-observed phenomena.
....[but]... one of the most lauded aspects of mathematics is a certain inevitability with which our abstractions take on a life of their own and reward us later with insight, generalization, and the provision of predictions. Moreover - remarkably - often those abstractions are found in relatively mysterious, intuitive ways: i.e. not as the result of us just directly asking "What kind of thing seems most useful for understanding this object and making predictions?" but, at least in part, as a result of aesthetic judgement and a sense of mathematical taste.
And e.g. I talk about how this sort of thing has been the case in areas like mathematical physics for a long time. Part of the point is that (in my opinion, at least) there isn't any neat shortcut to the kind of abstract thinking that lets you make the sort of predictions you are making reference to. It is very typical that you have to begin by reacting to existing empirical phenomena and using it as scaffolding. But I think, to me, it has come across as that you are being somewhat dismissive of this fact? As if, when B might well follow from A and someone actually starts to do A, you say "I would be far more impressed if B" instead of "maybe that's progress towards B"?
(Also FWIW, Neel claims here that regarding the algorithm itself, another researcher he knows "roughly predicted this".)
Interesting thoughts!
It reminds me (not only of my own writing on a similar theme) but of another one of these viewpoints/axes along which to carve interpretability work that is mentioned in this post by jylin04:
...a dream for interpretability research would be if we could reverse-engineer our future AI systems into human-understandable code. If we take this dream seriously, it may be helpful to split it into two parts: first understanding what "programming language" an architecture + learning algorithm will end up using at the end of training, and then what "program" a particular training regimen will lead to in that language [7]. It seems to me that by focusing on specific trained models, most interpretability research discussed here is of the second type. But by constructing an effective theory for an entire class of architecture that's agnostic to the choice of dataset, PDLT is a rare example of the first type.
I don't necessarily totally agree with her phrasing but it does feel a bit like we are all gesturing at something vaguely similar (and I do agree with her that PDLT-esque work may have more insights in this direction than some people on our side of the community have appreciated).
FWIW, in a recent comment reply to Joseph Bloom, I also ended up saying a bit more about why I don't actually see myself working much more in this direction, despite it seeming very interesting, but I'm still on the fence about that. (And one last point that didn't make it into that comment is the difficulties posed by a world in which increasingly the plucky bands of interpretability researchers on the fringes literally don't even know what the cutting edge architectures and training processes in the biggest labs even are.
At the start you write
3. Unnecessarily diluting the field’s epistemics by introducing too many naive or overly deferent viewpoints.
And later Claim 3 is:
Scholars might defer to their mentors and fail to critically analyze important assumptions, decreasing the average epistemic integrity of the field
It seems to me there might be two things being pointed to?
A) Unnecessary dilution: Via too many naive viewpoints;
B) Excessive deference: Perhaps resulting in too few viewpoints or at least no new ones;
And arguably these two things are in tension, in the following sense: I think that to a significant extent, one of the sources of unnecessary dilution is the issue of less experienced people not learning directly from more experienced people and instead relying too heavily on other inexperienced peers to develop their research skills and tastes. i.e. you might say that A) is partly caused by insufficient deference.
I roughly think that that the downsides of de-emphasizing deference and the accumulation of factual knowledge from more experienced people are worse than keeping it as sort of the zeroth order/default thing to aim for. It seems to me that to the extent that one believes that the field is making any progress at all, one should think that increasingly there will be experienced people from whom less experienced people should expect - at least initially - to learn from/defer to.
Looking at it from the flipside, one of my feelings right now is that we need mentors who don't buy too heavily into this idea that deference is somehow bad; I would love to see more mentors who can and want to actually teach people. (cf. The first main point - one that I agree with - that Richard Ngo made in his recent piece on advice: The area is mentorship constrained. )
Hey Joseph, thanks for the substantial reply and the questions!
Why call this a theory of interpretability as opposed to a theory of neural networks?
Yeah this is something I am unsure about myself (I wrote: "something that I'm clumsily thinking of as 'the mathematics of (the interpretability of) deep learning-based AI'"). But I think I was imagining that a 'theory of neural networks' would be definitely broader than what I have in mind as being useful for not-kill-everyoneism. I suppose I imagine it including lots of things that are interesting about NNs mathematically or scientifically but which aren't really contributing to our ability to understand and manage the intelligences that NNs give rise to. So I wanted to try to shift the emphasis away from 'understanding NNs' and towards 'interpreting AI'.
But maybe the distinction is more minor than I was originally worried about; I'm not sure.
have you made any progress on this topic or do you know anyone who would describe this explicitly as their research agenda? If so what areas are they working in.
No, I haven't really. It was - and maybe still is - a sort of plan B of mine. I don't know anyone who I would say has this as their research agenda. I think the closest/next best thing people are well known, e.g. the more theoretical parts of Anthropic/Neel's work and more recently the interest in singular learning theory from Jesse Hoogland, Alexander GO, Lucius Bushnaq and maybe others. (afaict there is a belief that it's more than just 'theory of NNs' but can actually tell us something about safety of the AIs)
One thing I struggle to understand, and might bet against is that this won't involve studying toy models. To my mind, Neel's grokking work, Toy Models of Superposition, Bhillal's "A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations" all seems to be contributing towards important factors that no comprehensive theory of Neural Networks could ignore....
I think maybe I didn't express myself clearly or the analogy I tried to make didn't work as intended, because I think maybe we actually agree here(!). I think one reason I made it confusing is because my default position is more skeptical about MI than a lot of readers probably....so, with regards to the part where I said: "it is reasonable that the early stages of rigorous development don't naively 'look like' the kinds of things we ultimately want to be talking about. This is very relevant to bear in mind when considering things like the mechanistic interpretability of toy models." What I was trying to get at is that to me proving e.g. some mathematical fact about superposition in a toy model doesn't look like the kind of 'intepretability of AI' that you really ultimately want, it looks too low-level. It's a 'toy model' in the NN sense, but its not a toy version of the hard part of the problem. But I was trying to say that you would indeed have to let people like mathematicians actually ask these questions - i.e ask the questions about e.g. superposition that they would most want to know the answers to, rather than forcing them to only do work that obviously showed some connection to the bigger theme of the actual cognition of intelligent agents or whatever.
Thanks for the suggestions about next steps and for writing about what you're most interested in seeing. I think your second suggestion in particular is close to the sort of thing I'd be most interested in doing. But I think in practice, a number of factors have held me back from going down this route myself:
I spent some time trying to formulate a good response to this that analyzed the distinction between (1) and (2) (in particular how it may map onto types of pseudo alignment described in RFLO here) but (and hopefully this doesn't sound too glib) it started to seem like it genuinely mattered whether humans in separate individual heavily-defended cells being pumped full of opiates have in fact been made to be 'happy' or not?
I think because if so, it is at least some evidence that the pseudo-alignment during training is for instrumental reasons (i.e. maybe it was actually trying to do something that caused happiness). If not, then the pseudo-alignment might be more like (what RFLO calls) suboptimality in some sense i.e. it just looks aligned because it's not capable enough to imprison the humans in cells etc.
The type of pseudo-alignment in (2) otoh seems more clearly like "side-effect" alignment since you've been explicit that secretly it was pursuing other things that just happened to cash out into happiness in training.
AFAIK Richard Ngo was explicit about wanting to clean up the zoo of inner alignment failure types in RFLO and maybe there just was some cost in doing this - some distinctions had to be lost perhaps?
This is a very strong endorsement but I'm finding it hard to separate the general picture from RFLO:
mesa-optimization occurs when a base optimizer...finds a model that is itself an optimizer,
where
a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system.
i.e. a mesa-optimizer is a learned model that 'performs inference' (i.e. evaluates inputs) by internally searching and choosing an output based on some objective function.
Apparently a "direct optimizer" is something that "perform[s] inference by directly choosing actions[1] to optimise some objective function". This sounds almost exactly like a mesa-optimizer?
I've always found it a bit odd that Alignment Forum submissions are automatically posted to LW.
If you apply some of these norms, then imo there are questionable implications, i.e. it seems weird to say that one should have read the sequences in order to post about mechanistic interpretability on the Alignment Forum.
I really like this post and found it very interesting, particularly because I'm generally interested in the relationship between the rationality side of the AI Alignment community and academia, and I wanted to register some thoughts. Sorry for the long comment on an old post and I hope this doesn't come across as pernickety. If anything I sort of feel like TurnTrout is being hard on himself.
I think the tl;dr for my comment is sort of that to me the social dynamics "mistakes" don't really seem like mistakes - or at least not ones that were actually made by the author.
Broadly speaking, these "mistakes" seem to me like mostly normal ways of learning and doing a PhD that happen for mostly good reasons and my reaction to the fact that these "mistakes" were "figured out" towards the end of the PhD is that this is a predictable part of the transition from being primarily a student to primarily an independent researcher (the fast-tracking of which would be more difficult than a lot of rationalists would like to believe).
I also worry that emphasizing these things as "mistakes" might actually lead people to infer that they should 'do the opposite' from the start, which to me would sound like weird/bad advice: e.g Don't try to catch up with people who are more knowledgeable than you; don't try to seem smart and defensible; don't defer, you can do just as good by thinking everything through for yourself.
I broadly agree that
rationality is not about the bag of facts you know.
but AI alignment/safety/x-risk isn't synonymous with rationality (Or is it? I realise TurnTrout does not directly claim that it is, which is why I'm maybe more cautioning against a misreading than disagreeing with him head on, but maybe he or others think there is a much closer relationship between rationality and alignment work than I do?).
Is there not, by this point, something at least a little bit like "a bag of facts" that one should know in AI Alignment? People have been thinking about AI alignment for at least a little while now. And so like, what have they achieved? Do we or do we not actually have some knowledge about the alignment problem? It seems to me that it would be weird if we didn't have any knowledge - like if there was basically nothing that we should count as established and useful enough to be codified and recorded as part of the foundations of the subject. It's worth wondering whether this has perhaps changed significantly in the last 5-10 years though, i.e. during TurnTrout's PhD. That is, perhaps - during that time - the subject has grown a lot and at least some things have been sufficiently 'deconfused' to have become more established concepts etc. But generally, if there are now indeed such things, then these are probably things that people entering the field should learn about. And it would seem likely that a lot of the more established 'big names'/productive people actually know a lot of these things and that "catching up with them" is a pretty good instrumental/proxy way to get relevant knowledge that will help you do alignment work. (I almost want to say: I know it's not fashionable in rationality to think this, but wanting to impress the teacher really does work pretty well in practice when starting out!)
Focussing on seeming smart and defensible probably can ultimately lead to a bad mistake. But when framed more as "It's important to come across as credible" or "It's not enough to be smart or even right; you actually do need to think about how others view you and interact with you", it's not at all clear that it's a bad thing; and certainly it more clearly touches on a regular topic of discussion in EA/rationality about how much to focus on how one is seen or how 'we' are viewed by outsiders. Fwiw I don't see any real "mistake" being actually described in this part of the post. In my opinion, when starting out, probably it is kinda important to build up your credibility more carefully. Then when Quintin came to TurnTrout, he writes that it took "a few days" to realize that Quintin's ideas could be important and worth pursuing. Maybe the expectation in hindsight would be that he should have had the 'few days' old reaction immediately?? But my gut reaction is that that would be way too critical of oneself and actually my thought is more like 'woah he realised that after thinking about it for only a few days; that's great'. Can the whole episode not be read as a straightforward win: "Early on, it is important to build your own credibility by being careful about your arguments and being able to back up claims that you make in formal, public ways. Then as you gain respect for the right reasons, you can choose when and where to 'spend' your credibility... here's a great example of that..."
And then re: deference, certainly it was true for me that when I was starting out in my PhD, if I got confused reading a paper or listening to talk, I was likely to be the one who was wrong. Later on or after my PhD, then, yeah, when I got confused by someone else's presentation, I was less likely to be wrong and it was more likely I was spotting an error in someone else's thinking. To me this seems like a completely normal product of the education and sort of the correct thing to be happening. i.e. Maybe the correct thing to do is to defer more when you have less experience and to gradually defer less as you gain knowledge and experience? I'm thinking that under the simple model that when one is confused about something, either you're misunderstanding or the other person is wrong, one starts out in the regime where your confusion is much more often better explained by the fact you have misunderstood and you end up in the regime where you actually just have way more experience thinking about these things and so are now more reliably spotting other people's errors. The rational response to the feeling of confusion changes because once fully accounted for the fact you just know way more stuff and are a way more experienced thinker about alignment. (One also naturally gains a huge boost to confidence as it becomes clear you will get your PhD and have good postdoc prospects etc... so it becomes easier to question 'authority' for that reason too, but it's not a fake confidence boost; this is mostly a good/useful effect because you really do now have experience of doing research yourself, so you actually are more likely to be better at spotting these things).
I think that perhaps as a result of a balance of pros and cons, I initially was not very motivated to comment (and haven't been very motivated to engage much with ARC's recent work). But I decided maybe it's best to comment in a way that gives a better signal than silence.
I've generally been pretty confused about Formalizing the presumption of Independence and, as the post sort of implies, this is sort of the main advert that ARC have at the moment for the type of conceptual work that they are doing, so most of what I have to say is meta stuff about that.
Disclaimer a) I have not spent a lot of time trying to understand everything in the paper. and b) As is often the case, this comment may come across as overly critical, but it seems highest leverage to discuss my biggest criticisms, i.e. the things that if they were addressed may cause me to update to the point I would more strongly recommend people applying etc.
I suppose the tldr is that the main contribution of the paper claims to be the framing of a set of open problems, but I did not find the paper able to convince me that the problems are useful ones or that they would be interesting to answer.
I can try to explain a little more: It seemed odd that the "potential" applications to ML were mentioned very briefly in the final appendix of the paper, when arguably the potential impact or usefulness of the paper really hinges on this. As a reader, it might seem natural to me that the authors would have already asked and answered - before writing the paper - questions like "OK so what if I had this formal heuristic estimator? What exactly can I use it for? What can I actually (or even practically) do with it?" Some of what was said in the paper was fairly vague stuff like:
In my opinion, it's also important to bear in mind that the criteria of a problem being 'open' is a poor proxy for things like usefulness/interestingness. (obviously those famous number theory problems are open, but so are loads of random mathematical statements). The usefulness/interestingness of course comes because people recognize various other valuable things too like: That the solution would seem to require new insights into X and therefore a proof would 'have to be' deeply interesting in its own right; or that the truth of the statement implies all sorts of other interesting things; or that the articulation of the problem itself has captured and made rigorous some hitherto messy confusion, or etc. etc. Perhaps more of these things need to be made explicit in order to argue more effectively that ARC's stating of these open problems about heuristic estimators is an interesting contribution in itself?
To be fair, in the final paragraph of the paper there are some remarks that sort of admit some of what I'm saying:
But practically it means that when I ask myself something like: 'Why would I drop whatever else I'm working on and work on this stuff?' I find it quite hard to answer in a way that's not basically just all deference to some 'vision' that is currently undeclared (or as the paper says "mostly defer[red]" to "future articles").
Having said all this I'll reiterate again that there are lots of clear pros to a job like this and I do think that there is important work to be done that is probably not too dissimilar from the kind being talked about in Formalizing the presumption of Independence and in this post.