Spencer Becker-Kahn



I really like this post and found it very interesting, particularly because I'm generally interested in the relationship between the rationality side of the AI Alignment community and academia, and I wanted to register some thoughts. Sorry for the long comment on an old post and I hope this doesn't come across as pernickety. If anything I sort of feel like TurnTrout is being hard on himself. 

I think the tl;dr for my comment is sort of that to me the social dynamics "mistakes" don't really seem like mistakes - or at least not ones that were actually made by the author. 

Broadly speaking, these "mistakes" seem to me like mostly normal ways of learning and doing a PhD that happen for mostly good reasons and my reaction to the fact that these "mistakes" were "figured out" towards the end of the PhD is that this is a predictable part of the transition from being primarily a student to primarily an independent researcher (the fast-tracking of which would be more difficult than a lot of rationalists would like to believe). 

I also worry that emphasizing these things as "mistakes" might actually lead people to infer that they should 'do the opposite' from the start, which to me would sound like weird/bad advice: e.g. don't try to catch up with people who are more knowledgeable than you; don't try to seem smart and defensible; don't defer, you can do just as well by thinking everything through for yourself.

I broadly agree that

rationality is not about the bag of facts you know.

but AI alignment/safety/x-risk isn't synonymous with rationality (Or is it? I realise TurnTrout does not directly claim that it is, which is why I'm maybe more cautioning against a misreading than disagreeing with him head on, but maybe he or others think there is a much closer relationship between rationality and alignment work than I do?). 

Is there not, by this point, something at least a little bit like "a bag of facts" that one should know in AI Alignment? People have been thinking about AI alignment for at least a little while now.  And so like, what have they achieved? Do we or do we not actually have some knowledge about the alignment problem? It seems to me that it would be weird if we didn't have any knowledge - like if there was basically nothing that we should count as established and useful enough to be codified and recorded as part of the foundations of the subject. It's worth wondering whether this has perhaps changed significantly in the last 5-10 years though, i.e. during TurnTrout's PhD. That is, perhaps - during that time - the subject has grown a lot and at least some things have been sufficiently 'deconfused' to have become more established concepts etc.  But generally, if there are now indeed such things, then these are probably things that people entering the field should learn about.  And it would seem likely that a lot of the more established 'big names'/productive people actually know a lot of these things and that "catching up with them" is a pretty good instrumental/proxy way to get relevant knowledge that will help you do alignment work. (I almost want to say: I know it's not fashionable in rationality to think this, but wanting to impress the teacher really does work pretty well in practice when starting out!)

Focussing on seeming smart and defensible probably can ultimately lead to a bad mistake. But when framed more as "It's important to come across as credible" or "It's not enough to be smart or even right; you actually do need to think about how others view you and interact with you", it's not at all clear that it's a bad thing; and certainly it more clearly touches on a regular topic of discussion in EA/rationality about how much to focus on how one is seen or how 'we' are viewed by outsiders. Fwiw I don't see any real "mistake" being actually described in this part of the post. In my opinion, when starting out, probably it is kinda important to build up your credibility more carefully. Then TurnTrout writes that when Quintin came to him, it took "a few days" to realize that Quintin's ideas could be important and worth pursuing. Maybe the expectation in hindsight would be that he should have had the 'few days' old reaction immediately?? But my gut reaction is that that would be way too critical of oneself and actually my thought is more like 'woah he realised that after thinking about it for only a few days; that's great'. Can the whole episode not be read as a straightforward win: "Early on, it is important to build your own credibility by being careful about your arguments and being able to back up claims that you make in formal, public ways. Then as you gain respect for the right reasons, you can choose when and where to 'spend' your credibility... here's a great example of that..."?

And then re: deference, certainly it was true for me that when I was starting out in my PhD, if I got confused reading a paper or listening to a talk, I was likely to be the one who was wrong. Later on or after my PhD, then, yeah, when I got confused by someone else's presentation, I was less likely to be wrong and it was more likely I was spotting an error in someone else's thinking. To me this seems like a completely normal product of the education and sort of the correct thing to be happening. i.e. Maybe the correct thing to do is to defer more when you have less experience and to gradually defer less as you gain knowledge and experience? I'm thinking of the simple model in which, when one is confused about something, either you're misunderstanding or the other person is wrong. Under that model, one starts out in the regime where your confusion is much more often better explained by the fact that you have misunderstood, and you end up in the regime where you actually just have way more experience thinking about these things and so are now more reliably spotting other people's errors. The rational response to the feeling of confusion changes once you have fully accounted for the fact that you just know way more stuff and are a way more experienced thinker about alignment. (One also naturally gains a huge boost to confidence as it becomes clear you will get your PhD and have good postdoc prospects etc... so it becomes easier to question 'authority' for that reason too, but it's not a fake confidence boost; this is mostly a good/useful effect because you really do now have experience of doing research yourself, so you actually are more likely to be better at spotting these things).
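
To make that simple model concrete, here is a minimal Bayesian sketch. Everything in it is an illustrative assumption rather than something taken from the post: $p$ is the chance that I have misread a given claim, $q$ is the chance that the presenter got it wrong, the two are treated as independent, and confusion $C$ is taken to arise when exactly one of them happens.

```latex
% M = "I have misunderstood", C = "I feel confused"
% p, q and the numbers below are hypothetical, chosen only for illustration.
\[
  P(M \mid C) \;=\; \frac{p\,(1-q)}{p\,(1-q) + q\,(1-p)}
\]
% Early in a PhD, say p = 0.30 and q = 0.05:
\[
  P(M \mid C) \;=\; \frac{0.30 \times 0.95}{0.30 \times 0.95 + 0.05 \times 0.70}
  \;=\; \frac{0.285}{0.320} \;\approx\; 0.89
\]
% Later, with more experience, say p = 0.05 and q = 0.05:
\[
  P(M \mid C) \;=\; \frac{0.05 \times 0.95}{0.05 \times 0.95 + 0.05 \times 0.95}
  \;=\; 0.5
\]
```

With these made-up numbers, the same feeling of confusion goes from pointing at my own misunderstanding almost nine times in ten to being an even bet, purely because $p$ has dropped. That is the sense in which deferring less over time can be the rational update rather than a correction of an earlier mistake.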


I've only skimmed this, but my main confusions with the whole thing are still on a fairly fundamental level. 

You spend some time saying what abstractions are, but when I see the hypothesis written down, most of my confusion is on what "cognitive systems" are and what one means by "most". Afaict it really is a kind of empirical question to do with "most cognitive systems". Do we have in mind something like 'animal brains and artificial neural networks'? If so then surely let's just say that and make the whole thing more concrete; so I suspect not... but in that case... what does it include? And how will we know if 'most' of them have some property? (At the moment, whenever I find evidence that two systems don't share an abstraction that they 'ought to' I can go "well the hypothesis is only most"...)

Something ~ like 'make it legit' has been and possibly will continue to be a personal interest of mine.

I'm posting this after Rohin entered this discussion - so Rohin, I hope you don't mind me quoting you like this, but fwiw I was significantly influenced by this comment on Buck's old talk transcript 'My personal cruxes for working on AI safety'. (Rohin's comment repeated here in full and please bear in mind this is 3 years old; his views I'm sure have developed and potentially moved a lot since then:)

I enjoyed this post, it was good to see this all laid out in a single essay, rather than floating around as a bunch of separate ideas.

That said, my personal cruxes and story of impact are actually fairly different: in particular, while this post sees the impact of research as coming from solving the technical alignment problem, I care about other sources of impact as well, including:

1. Field building: Research done now can help train people who will be able to analyze problems and find solutions in the future, when we have more evidence about what powerful AI systems will look like.

2. Credibility building: It does you no good to know how to align AI systems if the people who build AI systems don't use your solutions. Research done now helps establish the AI safety field as the people to talk to in order to keep advanced AI systems safe.

3. Influencing AI strategy: This is a catch all category meant to include the ways that technical research influences the probability that we deploy unsafe AI systems in the future. For example, if technical research provides more clarity on exactly which systems are risky and which ones are fine, it becomes less likely that people build the risky systems (nobody _wants_ an unsafe AI system), even though this research doesn't solve the alignment problem.

As a result, cruxes 3-5 in this post would not actually be cruxes for me (though 1 and 2 would be).

Certainly it's not a necessarily good thing either. I would posit isolation is usually not good. I can personally attest to being confused and limited by the difference in terminology here.  And I think that when it comes to intrinsic interpretability work in particular, the disentanglement literature has produced a number of methods of value while TAISIC has not.

Ok it sounds to me like maybe there's at least two things being talked about here. One situation is

 A) Where a community includes different groups working on the same topic, and where those groups might use different terminology and have different ways of thinking about the same phenomena etc. This seems completely normal to me. The other situation is 

B) Where a group is isolated from the community at large and is using different terminology/thinking about things differently just as a result of their isolation and lack of communication. And where that behaviour then causes confusion and/or wasting of resources.

The latter doesn't sound good, but it looks to me like some or many of your points are consistent with the former being the case. So when you write e.g. that it's not "necessarily a good thing either" or ask for my steelmanned case, this doesn't seem to quite make sense to me. I feel like if something is not necessarily good or bad, and you want to raise it as a criticism, then the onus would be on you to bring the case against TAISIC with arguments that are not general ones that could easily apply to both A) and B) above. e.g. It'd be more of an emphatic case if you were able to go into the details and be like "X did this work here and claimed it was new but actually it exists in Y's paper here" or give a real example of needless confusion that was created and could have been avoided. Focussing just on what they did or didn't 'engage with' on the level of general concepts and citations/acknowledgements doesn't bring this case convincingly, in my opinion. Some more vague thoughts on why that is:

  • Bodies of literature like this are usually very complicated and messy and people genuinely can't be expected to engage with everything. 
  • It's often hard or impossible to track dependencies of ideas because of all the communication you cannot see and not being able to see 'how' people are thinking of things, only what they wrote.
  • Someone publishing on the same idea or concept or topic as you is nowhere near the same as someone actually doing the exact same technical thing that you are doing.  ime the former is happening all the time; and the latter is much rarer than people often think. 
  • Reinvention, re-presentation and even outright renaming or 'starting from scratch' are all valuable elements of scholarship that help a field move along.

Idk maybe I'm just repeating myself at this point.

On the other point: It may turn out that MI's analogy with reverse software engineering does not produce methods and is just used as a high-level analogy, but it seems too early to say from my perspective - the two posts I linked are from last year. TAISIC is still pretty small, experienced researchers in TAISIC are fewer still, and this is potentially a large and difficult research agenda.

Re: e.g. superposition/entanglement: 

I think people should try to understand the wider context into which they are writing, but I don't see it as necessarily a bad thing if two groups of researchers are working on the same idea under different names. In fact I'd say this happens all the time and generally people can just hold in their minds that another group has another name for it.  Naturally, the two groups will have slightly different perspectives and this a) Is often good, i.e. the interference can be constructive and b) Can be a reason in favour of different terminology, i.e. even if something is "the same" when boiled down to a formal level, the different names can actually help delineate different interpretations.

In fact it's almost like a running joke in academia that there's always someone grumbling that you didn't cite the right things (their favourite work on this topic, their fellow countryman, them etc.) and because of the way academic literature works, some of the things that you are doing here can be done with almost any piece of work in the literature, i.e. you can comb over it with the benefit of hindsight and say 'hang on this isn't as original as it looked; basically the same idea that was written about here X years before' etc.  Honestly, I don't usually think of this as a valuable exercise, but I may be missing something about your wider point or be more convinced once I've looked at more of your series.

Another point when it comes to 'originality' and 'progress' is that it's often unimportant if some idea was generally discussed, labelled, named, or thought about before if what matters is actual results and the lower-level content of these works. i.e. I may be wrong, but looking at what you are saying, I don't think you are literally pulling up an older paper on 'entanglement' that made the exact same points that the Anthropic papers were making and did very similar experiments (Or are you?) And even having said that, reproducing experiments exactly is of course very valuable.

Re: MI and program synthesis:

I understand that your take is that it is closer to program synthesis or program induction and that these aren't all the same thing, but in the first subsection of the "TAISIC has reinvented..." section, I'm a little confused why there's no mention of reverse engineering programs from compiled binary? The analogy with reverse engineering programs is one that MI people have been actively thinking about, writing about and trying to understand (see e.g. Olah, and Nanda, in which he consults an expert).

Thanks very much for the comments. I think you've asked a bunch of very good questions. I'll try to give some thoughts:

Deep learning as a field isn't exactly known for its rigor. I don't know of any rigorous theory that isn't as you say purely 'reactive', with none of it leading to any significant 'real world' results. As far as I can tell this isn't for a lack of trying either. This has made me doubt its mathematical tractability, whether it's because our current mathematical understanding is lacking or something else (DL not being as 'reductionist' as other fields?). How do you lean in this regard? You mentioned that you're not sure when it comes to how amenable interpretability itself is, but would you guess that it's more or less amenable than deep learning as a whole?

I think I kind of share your general concern here and I’m uncertain about it. I kind of agree that it seems like people had been trying for a while to figure out the right way to think about deep learning mathematically and that for a while it seemed like there wasn’t much progress. But I mean it when I say these things can be slow. And I think that the situation is developing and has changed - perhaps significantly - in the last ~5 years or so, with things like the neural tangent kernel, the Principles of Deep Learning Theory results and increasingly high-quality work on toy models. (And even when work looks promising, it may still take a while longer for the cycle to complete and for us to get ‘real world’ results back out of these mathematical points of view, but I have more hope than I did a few years ago). My current opinion is that certain aspects of interpretability will be more amenable to mathematics than understanding DNN-based AI as a whole.



How would success of this relate to capabilities research? It's a general criticism of interpretability research that it also leads to heightened capabilities, would this fare better/worse in that regard? I would have assumed that a developed rigorous theory of interpretability would probably also entail significant development of a rigorous theory of deep learning.

I think basically your worries are sound. If what one is doing is something like ‘technical work aimed at understanding how NNs work’ then I don’t see there as being much distinction between capabilities and alignment; you are really generating insights that can be applied in many ways, some good some bad (and part of my point is you have to be allowed to follow your instincts as a scientist/mathematician in order to find the right questions). But I do think that given how slow and academic the kind of work I’m talking about is, it’s neglected by both a) short timelines-focussed alignment people and b) capabilities people.



How likely is it that the direction one may proceed in would be correct? You mention an example in mathematical physics, but note that it's perhaps relatively unimportant that this work was done for 'pure' reasons. This is surprising to me, as I thought that a major motivation for pure math research, like other blue sky research, is that it's often not apparent whether something will be useful until it's well developed. I think this is similar to you mentioning that the small scale problems will not look like the larger problem. You mention that this involves following one's nose mathematically; do you think this is possible in general or only for this case? If it's the latter, why do you think interpretability is specifically amenable to it?

Hmm, that's interesting. I'm not sure I can say how likely it is one would go in the correct direction. But in my experience the idea that 'possible future applications' is one of the motivations for mathematicians to do 'blue sky' research is basically not quite right. I think the key point is that the things mathematicians end up chasing for 'pure' math/aesthetic reasons seem to be oddly and remarkably relevant when we try to describe natural phenomena (iirc this is basically a key point in Wigner's famous 'Unreasonable Effectiveness' essay.) So I think my answer to your question is that this seems to be something that happens "in general" or at least does happen in various different places across science/applied math.

Ah thanks very much Daniel. Yes now that you mention it I remember being worried about this a few days ago but then either forgot or (perhaps mistakenly) decided it wasn't worth expanding on. But yeah I guess you don't get a well-defined map until you actually fix how the tokenization happens with another separate algorithm. I will add it to the list of things to fix/expand on in an edit.

>There is no difference between natural phenomena and DNNs (LLMs, whatever). DNNs are 100% natural

I mean "natural" as opposed to "man made". i.e. something like "occurs in nature without being built by something or someone else". So in that sense, DNNs are obviously not natural in the way that the laws of physics are.

I don't see information and computation as only mathematical; in fact in my analogies I write about the mathematical abstractions we build as being separate from the things that one wants to describe or make predictions about. And this applies to the computations in NNs too.

I don't want to study AI as mathematics or believe that AI is mathematics. I write that the practice of doing mathematics will only seek out the parts of the problem that are actually amenable to it; and my focus is on interpretability and not other places in AI that one might use mathematics (like, say, decision theory). 

You write "As an example, take "A mathematical framework for transformer circuits": it doesn't develop new mathematics. It just uses existing mathematics: tensor algebra." I don't think we are using 'new mathematics' in the same way and I don't think the way you are using it is commonplace. Yes I am discussing the prospect of developing new mathematics, but this doesn't only mean something like 'making new definitions' or 'coming up with new objects that haven't been studied before'. If I write a proof of a theorem that "just" uses "existing" mathematical objects, say like... matrices, or finite sets, then that seems to have little bearing on how 'new' the mathematics is. It may well be a new proof, of a new theorem, containing new ideas etc. etc. And it may well need to have been developed carefully over a long period of time.

I may come back to comment more or incorporate this post into something else I write but wanted to record my initial reaction which is that I basically believe the claim. I also think that the 'unrelated bonus reason' at the end is potentially important and probably deserves more thought.

Interesting idea. I think it’s possible that a prize is the wrong thing for getting the best final result (but also possible that getting a half decent result is more important than a high variance attempt at optimising for the best result). My thinking is: To do what you’re suggesting to a high standard could take months of serious effort. The idea of someone really competent doing so just for the chance at some prize money doesn’t quite seem right to me… I think there could be people out there who in principle could do it excellently but who would want to know that they’d ‘got the job’ as it were before spending serious effort on it.
