OK so just being completely honest, I don't know if it's just me but I'm getting a slightly weird or snarky vibe from this comment? I guess I will assume there is a good faith underlying point being made to which I can reply. So just to be clear:
It reminds me (not only of my own writing on a similar theme) but of another one of these viewpoints/axes along which to carve interpretability work that is mentioned in this post by jylin04:
...a dream for interpretability research would be if we could reverse-engineer our future AI systems into human-understandable code. If we take this dream seriously, it may be helpful to split it into two parts: first understanding what "programming language" an architecture + learning algorithm will end up using at the end of training, and then wh
At the start you write
3. Unnecessarily diluting the field’s epistemics by introducing too many naive or overly deferent viewpoints.
And later Claim 3 is:
Scholars might defer to their mentors and fail to critically analyze important assumptions, decreasing the average epistemic integrity of the field
It seems to me there might be two things being pointed to?
A) Unnecessary dilution: Via too many naive viewpoints;
B) Excessive deference: Perhaps resulting in too few viewpoints or at least no new ones;
And arguably these two things are in tension, in the fol...
Hey Joseph, thanks for the substantial reply and the questions!
Why call this a theory of interpretability as opposed to a theory of neural networks?
Yeah this is something I am unsure about myself (I wrote: "something that I'm clumsily thinking of as 'the mathematics of (the interpretability of) deep learning-based AI'"). But I think I was imagining that a 'theory of neural networks' would be definitely broader than what I have in mind as being useful for not-kill-everyoneism. I suppose I imagine it including lots of things that are intere...
I spent some time trying to formulate a good response to this that analyzed the distinction between (1) and (2) (in particular how it may map onto types of pseudo alignment described in RFLO here) but (and hopefully this doesn't sound too glib) it started to seem like it genuinely mattered whether humans in separate individual heavily-defended cells being pumped full of opiates have in fact been made to be 'happy' or not?
I think because if so, it is at least some evidence that the pseudo-alignment during training is for instrumental reasons (i.e. maybe it ...
This is a very strong endorsement but I'm finding it hard to separate the general picture from RFLO:
mesa-optimization occurs when a base optimizer...finds a model that is itself an optimizer,
a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system.
i.e. a mesa-optimizer is a learned model that 'performs inference' (i.e. evalua...
I've always found it a bit odd that Alignment Forum submissions are automatically posted to LW.
If you apply some of these norms, then imo there are questionable implications, i.e. it seems weird to say that one should have read the sequences in order to post about mechanistic interpretability on the Alignment Forum.
I really like this post and found it very interesting, particularly because I'm generally interested in the relationship between the rationality side of the AI Alignment community and academia, and I wanted to register some thoughts. Sorry for the long comment on an old post and I hope this doesn't come across as pernickety. If anything I sort of feel like TurnTrout is being hard on himself.
I think the tl;dr for my comment is sort of that to me the social dynamics "mistakes" don't really seem like mistakes - or at least not ones that were actually ma...
I've only skimmed this, but my main confusions with the whole thing are still on a fairly fundamental level.
You spend some time saying what abstractions are, but when I see the hypothesis written down, most of my confusion is on what "cognitive systems" are and what one means by "most". Afaict it really is a kind of empirical question to do with "most cognitive systems". Do we have in mind something like 'animal brains and artificial neural networks'? If so then surely let's just say that and make the whole thing more concrete; so I suspect not....bu...
Something ~ like 'make it legit' has been and possibly will continue to be a personal interest of mine.
I'm posting this after Rohin entered this discussion - so Rohin, I hope you don't mind me quoting you like this, but fwiw I was significantly influenced by this comment on Buck's old talk transcript 'My personal cruxes for working on AI safety'. (Rohin's comment repeated here in full and please bear in mind this is 3 years old; his views I'm sure have developed and potentially moved a lot since then:)
I enjoyed this post, it was good to see this all laid o
Certainly it's not a necessarily good thing either. I would posit isolation is usually not good. I can personally attest to being confused and limited by the difference in terminology here. And I think that when it comes to intrinsic interpretability work in particular, the disentanglement literature has produced a number of methods of value while TAISIC has not.
Ok it sounds to me like maybe there's at least two things being talked about here. One situation is
A) Where a community includes different groups working on the same topic, and where th...
Re: e.g. superposition/entanglement:
I think people should try to understand the wider context into which they are writing, but I don't see it as necessarily a bad thing if two groups of researchers are working on the same idea under different names. In fact I'd say this happens all the time and generally people can just hold in their minds that another group has another name for it. Naturally, the two groups will have slightly different perspectives and this a) Is often good, i.e. the interference can be constructive and b) Can be a reason in f...
Thanks very much for the comments I think you've asked a bunch of very good questions. I'll try to give some thoughts:
Deep learning as a field isn't exactly known for its rigor. I don't know of any rigorous theory that isn't as you say purely 'reactive', with none of it leading to any significant 'real world' results. As far as I can tell this isn't for a lack of trying either. This has made me doubt its mathematical tractability, whether it's because our current mathematical understanding is lacking or something else (DL not being as 'reductionist' as oth
Ah thanks very much Daniel. Yes now that you mention it I remember being worried about this a few days ago but then either forgot or (perhaps mistakenly) decided it wasn't worth expanding on. But yeah I guess you don't get a well-defined map until you actually fix how the tokenization happens with another separate algorithm. I will add to list of things to fix/expand on in an edit.
>There is no difference between natural phenomena and DNNs (LLMs, whatever). DNNs are 100% natural
I mean "natural" as opposed to "man made". i.e. something like "occurs in nature without being built by something or someone else". So in that sense, DNNs are obviously not natural in the way that the laws of physics are.
I don't see information and computation as only mathematical; in fact in my analogies I write that the mathematical abstractions we build as being separate from the things that one wants to describe or make predictions about. And this...
I may come back to comment more or incorporate this post into something else I write but wanted to record my initial reaction which is that I basically believe the claim. I also think that the 'unrelated bonus reason' at the end is potentially important and probably deserves more thought.
Interesting idea. I think it’s possible that a prize is the wrong thing for getting the best final result (but also possible that getting a half decent result is more important than a high variance attempt at optimising for the best result). My thinking is: To do what you’re suggesting to a high standard could take months of serious effort. The idea of someone really competent doing so just for the chance at some prize money doesn’t quite seem right to me… I think there could be people out there who in principle could do it excellently but who would want to know that they’d ‘got the job’ as it were before spending serious effort on it.
I think I would support Joe's view here that clarity and rigour are significantly different... but maybe - David - your comments are supposed to be specific to alignment work? e.g. I can think of plenty of times I have read books or articles in other areas and fields that contain zero formal definitions, proofs, or experiments but are obviously "clear", well-explained, well-argued etc. So by your definitions is that not a useful and widespread form of rigour-less clarity? (One that we would want to 'allow' in alignment work?) Or would you instead maintain ...
I agree that the space may well miss important concepts and perspectives. As I say, it is not my suggestion to look at it, but rather just something that was implicitly being done in another post. The space may well be a more natural one. (It's of course the space of functions , and so a space in which 'model space' naturally sits in some sense. )
You're right about the loss thing; it isn't as important as I first thought it might be.
It's an example computation for a network with scalar outputs, yes. The math should stay the same for multi-dimensional outputs though. You should just get higher dimensional tensors instead of matrices.
I'm sorry but the fact that it is scalar output isn't explained and a network with a single neuron in the final layer is not the norm. More importantly, I am trying to explain that I think the math does not stay the same in the case where the network output is a vector (which is the usual situation in deep learning) and the loss is some unspecified fu...
I'm not at liberty to share it directly but I am aware that Anthropic have a draft of small toy models with hand-coded synthetic data showing superposition very cleanly. They go as far as saying that searching for an interpretable basis may essentially be mistaken.
I wrote out the Hessian computation in a comment to one of Vivek's posts. I actually had a few concerns with his version and I could be wrong but I also think that there are some issues here. (My notation is slightly different because for me the sum over was included in the function I called "", but it doesn't affect my main point).
I think the most concrete thing is that the function - i.e. the `input-output' function of a neural network - should in general have a vector output, but you write things like
Thanks very much Geoffrey; glad you liked the post. And thanks for the interesting extra remarks.
Thanks for the nice reply.
I do buy the explanations I listed in the OP (and other, complementary explanations, like the ones in Inadequate Equilbria), and I think they're sufficient to ~fully make sense of what's going on. So I don't feel confused about the situation anymore. By "shocking" I meant something more like "calls for an explanation", not "calls for an explanation, and I don't have an explanation that feels adequate". (With added overtones of "horrifying".)
Yeah, OK, I think that helps clarify things for me.
As someone who was working a
I'm a little sheepish about trying to make a useful contribution to this discussion without spending a lot of time thinking things through but I'll give it a go anyway. There's a fair amount that I agree with here, including that there is by now a lot of introductory resources. But regarding the following:
(I do think it's possible to create a much better intro resource than any that exist today, but 'we can do much better' is compatible with 'it's shocking that the existing material hasn't already finished the job'.)
I feel like I want to ask: Do you really...
No need to be sheepish, IMO. :) Welcome to the conversation!
Do you really find it "shocking"?
I think it's the largest mistake humanity has ever made, and I think it implies a lower level of seriousness than the seriousness humanity applied to nuclear weapons, asteroids, climate change, and a number of other risks in the 20th-century. So I think it calls for some special explanation beyond 'this is how humanity always handles everything'.
I do buy the explanations I listed in the OP (and other, complementary explanations, like the ones in Inadequate Equilbri...
Thanks again for the reply.
In my notation, something like or are functions in and of themselves. The function evaluates to zero at local minima of .
In my notation, there isn't any such thing as .
But look, I think that this is perhaps getting a little too bogged down for me to want to try to neatly resolve in the comment section, and I expect to be away from work for the next few days so may not check back for a while. Personally, I would just recommend going back and slowly going through the mathe...
Thanks for the substantive reply.
First some more specific/detailed comments: Regarding the relationship with the loss and with the Hessian of the loss, my concern sort of stems from the fact that the domains/codomains are different and so I think it deserves to be spelled out. The loss of a model with parameters can be described by introducing the actual function that maps the behavior to the real numbers, right? i.e. given some actual function we have:
i.e. it's that might be something ...
This was pretty interesting and I like the general direction that the analysis goes in. I feel it ought to be pointed out that what is referred to here as the key result is a standard fact in differential geometry called (something like) the submersion theorem, which in turn is essentially an application of the implicit function theorem.
I think that your setup is essentially that there is an -dimensional parameter space, let's call it say, and then for each element of the training set, we can consider the function ...
Thanks for the comments and pointers!
Given that you've described various 'primarily conceptual' projects on the Alignment Team, and given the distinction between Scientists and Engineers, one aspect that I'm unsure about is roughly: Would you expect a Research Scientist on the Alignment Team to necessarily have a minimum level of practical ML knowledge? Are you able to say any more about that? e.g. Would they have to pass a general Deep Mind coding test or something like that?
Hi Ryan, do you still plan for results to come out by May 27? And for those who are successful for the next stage to start June 6th etc.? (That's what is says on the FAQ on the website still).
Yes I think you understood me correctly. In which case I think we more or less agree in the sense that I also think it may not be productive to use Richard's heuristic as a criterion for which research directions to actually pursue.
I broadly agree with Richard's main point, but I also do agree with this comment in the sense that I am not confident that the example of Turing compared with e.g. Einstein is completely fair/accurate.
One thing I would say in response to your comment, Adam, is that I don't usually see the message of your linked post as being incompatible with Richard's main point. I think one usually does have or does need productive mistakes that don't necessarily or obviously look like they are robust partial progress. But still, often when there actually is a brea...
I agree i.e. I also (fairly weakly) disagree with the value of thinking of 'distilling' as a separate thing. Part of me wants to conjecture that it's comes from thinking of alignment work predominantly as mathematics or a hard science in which the standard 'unit' is a an original theorem or original result which might be poorly written up but can't really be argued against much. But if we think of the area (I'm thinking predominantly about more conceptual/theoretical alignment) as a 'softer', messier, ongoing discourse full of different arguments fro...
It could also work here. But I do feel like pointing out that the bounty format has other drawbacks. Maybe it works better when you want a variety of bitesize contributions, like various different proposals? I probably wouldn't do work like Abram proposes - quite a long and difficult project, I expect - for the chance of winning a prize, particularly if the winner(s) were decided by someone's subjective judgement.
This post caught my eye as my background is in mathematics and I was, in the not-too-distant past, excited about the idea of rigorous mathematical AI alignment work. My mind is still open to such work but I'll be honest, I've since become a bit less excited than I was. In particular, I definitely "bounced off" the existing write-ups on Infrabayesianism and now without already knowing what it's all about, it's not clear it's worth one's time. So, at the risk of making a basic or even cynical point: The remuneration of the proposed job could be important for getting attention/ incentivising people on-the-fence.
How exactly can an org like this help solve (what many people see as one of the main bottlenecks:) the issue of mentorship? How would Catalyze actually tip the scales when it comes to 'mentor matching'?
(e.g. see Richard Ngo's first high-level point in this career advice post)