All of Spencer Becker-Kahn's Comments + Replies

How exactly can an org like this help solve (what many people see as one of the main bottlenecks:) the issue of mentorship? How would Catalyze actually tip the scales when it comes to 'mentor matching'?

(e.g. see Richard Ngo's first high-level point in this career advice post)

Hi Garrett, 

OK so just being completely honest, I don't know if it's just me but I'm getting a slightly weird or snarky vibe from this comment? I guess I will assume there is a good faith underlying point being made to which I can reply. So just to be clear:

  • I did not use any words such as "trivial", "obvious" or "simple". Stories like the one you recount are obviously making fun of mathematicians, some of whom do think it's cool to say things are trivial/simple/obvious after they understand them. I often strongly disagree and generally dislike this beh
... (read more)
3Garrett Baker5d
Sorry about that. On a re-read, I can see how the comment could be seen as snarky, but I was going more for critical via illustrative analogy. Oh the perils of the lack of inflection and facial expressions. I think your criticisms of my thought in the above comment are right-on, and you've changed my mind on how useful your post was. I do think that lots of progress can be made in understanding stuff by just finding the right frame by which the result seems natural, and your post is doing this. Thanks!

Interesting thoughts!

It reminds me not only of my own writing on a similar theme but also of another of these viewpoints/axes along which to carve interpretability work, one that is mentioned in this post by jylin04:

...a dream for interpretability research would be if we could reverse-engineer our future AI systems into human-understandable code. If we take this dream seriously, it may be helpful to split it into two parts: first understanding what "programming language" an architecture + learning algorithm will end up using at the end of training, and then wh

... (read more)

At the start you write

3. Unnecessarily diluting the field’s epistemics by introducing too many naive or overly deferent viewpoints.

And later Claim 3 is:

Scholars might defer to their mentors and fail to critically analyze important assumptions, decreasing the average epistemic integrity of the field

It seems to me there might be two things being pointed to?

A) Unnecessary dilution: Via too many naive viewpoints;
B) Excessive deference: Perhaps resulting in too few viewpoints or at least no new ones;

And arguably these two things are in tension, in the fol... (read more)

1Ryan Kidd1mo
Mentorship is critical to MATS. We generally haven't accepted mentorless scholars because we believe that mentors' accumulated knowledge is extremely useful for bootstrapping strong, original researchers. Let me explain my chain of thought better:
1. A first-order failure mode would be "no one downloads experts' models, and we grow a field of naive, overconfident takes." In this scenario, we have maximized exploration at the cost of accumulated knowledge transmission (and probably useful originality, as novices might make the same basic mistakes). We patch this by creating a mechanism by which scholars are selected for their ability to download mentors' models (and encouraged to do so).
2. A second-order failure mode would be "everyone downloads and defers to mentors' models, and we grow a field of paradigm-locked, non-critical takes." In this scenario, we have maximized the exploitation of existing paradigms at the cost of epistemic diversity or critical analysis. We patch this by creating mechanisms for scholars to critically examine their assumptions and debate with peers.

Hey Joseph, thanks for the substantial reply and the questions!



Why call this a theory of interpretability as opposed to a theory of neural networks? 

Yeah this is something I am unsure about myself (I wrote: "something that I'm clumsily thinking of as 'the mathematics of (the interpretability of) deep learning-based AI'"). But I think I was imagining that a 'theory of neural networks' would be definitely broader than what I have in mind as being useful for not-kill-everyoneism. I suppose I imagine it including lots of things that are intere... (read more)

1Joseph Bloom1mo
Thanks Spencer! I'd love to respond in detail but alas, I lack the time at the moment. Some quick points:
1. I'm also really excited about SLT work. I'm curious to what degree there's value in looking at toy models (such as Neel's grokking work) and exploring them via SLT or to what extent reasoning in SLT might be reinvigorated by integrating experimental ideas/methodology from MI (such as progress measures). It feels plausible to me that there just haven't been enough people in any of a number of intersections looking at stuff and this is a good example. Not sure if you're planning on going to this: [] but it's probably not in the cards for me. I'm wondering if promoting it to people with MI experience could be good.
2. I totally get what you're saying about how a toy model in sense A or B doesn't necessarily equate to a toy model that is a version of the hard part of the problem. This explanation helped a lot, thank you!
3. I hear what you are saying about next steps being challenging for logistical and coordination issues and because the problem is just really hard! I guess the recourse we have is something like: Look for opportunities/chances that might justify giving something like this more attention or coordination. I'm also wondering if there might be ways of dramatically lowering the bar for doing work in related areas (eg: the same way Neel writing TransformerLens got a lot more people into MI).
Looking forward to more discussions on this in the future, all the best!

I spent some time trying to formulate a good response to this that analyzed the distinction between (1) and (2) (in particular how it may map onto types of pseudo alignment described in RFLO here) but (and hopefully this doesn't sound too glib) it started to seem like it genuinely mattered whether humans in separate individual heavily-defended cells being pumped full of opiates have in fact been made to be 'happy' or not?

I think because if so, it is at least some evidence that the pseudo-alignment during training is for instrumental reasons (i.e. maybe it ... (read more)

This is a very strong endorsement but I'm finding it hard to separate the general picture from RFLO:

mesa-optimization occurs when a base optimizer...finds a model that is itself an optimizer,


a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system.

i.e. a mesa-optimizer is a learned model that 'performs inference' (i.e. evalua... (read more)

I've always found it a bit odd that Alignment Forum submissions are automatically posted to LW. 

If you apply some of these norms, then imo there are questionable implications, i.e. it seems weird to say that one should have read the sequences in order to post about mechanistic interpretability on the Alignment Forum.

The AI Alignment Forum was never intended as the central place for all AI Alignment discussion. It was founded at a time when basically everyone involved in AI Alignment had read the sequences, and the goal was to just have any public place for any alignment discussion. Now that the field is much bigger, I actually kind of wish there was another forum where AI Alignment people could go, so we would have more freedom in shaping a culture and a set of background assumptions that allow people to make further strides and create a stronger environment of trust. I personally am much more interested in reading about mechanistic interpretability from people who have read the sequences. That one in particular is actually one of the ones where a good understanding of probability theory, causality and philosophy of science seems particularly important (again, it's not that important that someone has acquired that understanding via the sequences instead of some other means, but it does actually really benefit from a bunch of skills that are not standard in the ML or general scientific community). I expect we will make some changes here in the coming months, maybe by renaming the forum or starting off a broader forum that can stand more on its own, or maybe just shutting down the AI Alignment Forum completely and letting other people fill that niche.
4the gears to ascension2mo
similarly, I've been frustrated that medium quality posts on lesswrong about ai often get missed in the noise. I want alignmentforum longform scratchpad, not either lesswrong or alignmentforum. I'm not even allowed to post on alignmentforum! some recent posts I've been frustrated to see get few votes and generally less discussion:
* [] - this one deserves at least 35 imo
* []
* []
* []
* ... many more open in tabs I'm unsure about.

I really like this post and found it very interesting, particularly because I'm generally interested in the relationship between the rationality side of the AI Alignment community and academia, and I wanted to register some thoughts. Sorry for the long comment on an old post and I hope this doesn't come across as pernickety. If anything I sort of feel like TurnTrout is being hard on himself. 

I think the tl;dr for my comment is sort of that to me the social dynamics "mistakes" don't really seem like mistakes - or at least not ones that were actually ma... (read more)

I've only skimmed this, but my main confusions with the whole thing are still on a fairly fundamental level. 

You spend some time saying what abstractions are, but when I see the hypothesis written down, most of my confusion is on what "cognitive systems" are and what one means by "most". Afaict it really is a kind of empirical question to do with "most cognitive systems". Do we have in mind something like 'animal brains and artificial neural networks'? If so then surely let's just say that and make the whole thing more concrete; so I suspect not....bu... (read more)

1Jonas Hallgren3mo
(My attempt at an explanation:) In short, we care about the class of observers/agents that get redundant information in a similar way. I think we can look at the specific dynamics of the systems described here to actually get a better perspective on whether the NAH should hold or not:
* I think you can think of the redundant information between you and the thing you care about as a function of all the steps in between for that information to reach you.
* If we look at the question, we have a certain amount of necessary things for the (current implementation of) NAH to hold:
  1. Redundant information is rare
    * To see if this is the case you will want to look at each of the individual interactions and analyse to what degree redundant information is passed on.
    * I guess the question of "how brutal is the local optimisation environment" might be good to estimate each information redundancy (A,B,C,D in the picture). Another question is, "what level of noise do I expect to be formed at each transition?" as that would tell you to what degree the redundant information is lost in noise. (they pointed this out as the current hypothesis for usefulness in the post in section 2d.)
  2. The way we access said information is similar
    * If you can determine to what extent the information flow between two agents is similar, you can estimate a probability of natural abstractions occurring in the same way.
    * For example, if we use vision versus hearing, we get two different information channels & so the abstractions will most likely change. (Causal proximity of the individual functions is changed with regards to the flow of redundant information)
* Based on this I would say that the question isn't really if it is true for NNs & brains in general but that it's rather more helpful to ask what information is abstracted with specific capabilities such as

Something ~ like 'make it legit' has been and possibly will continue to be a personal interest of mine.

I'm posting this after Rohin entered this discussion - so Rohin, I hope you don't mind me quoting you like this, but fwiw I was significantly influenced by this comment on Buck's old talk transcript 'My personal cruxes for working on AI safety'. (Rohin's comment repeated here in full and please bear in mind this is 3 years old; his views I'm sure have developed and potentially moved a lot since then:)

I enjoyed this post, it was good to see this all laid o

... (read more)
3Rohin Shah3mo
I still endorse that comment, though I'll note that it argues for the much weaker claims of
* I would not stop working on alignment research if it turned out I wasn't solving the technical alignment problem
* There are useful impacts of alignment research other than solving the technical alignment problem
(As opposed to something more like "the main thing you should work on is 'make alignment legit'".) (Also I'm glad to hear my comments are useful (or at least influential), thanks for letting me know!)

Certainly it's not a necessarily good thing either. I would posit isolation is usually not good. I can personally attest to being confused and limited by the difference in terminology here.  And I think that when it comes to intrinsic interpretability work in particular, the disentanglement literature has produced a number of methods of value while TAISIC has not.

Ok it sounds to me like maybe there are at least two things being talked about here. One situation is

 A) Where a community includes different groups working on the same topic, and where th... (read more)

Re: e.g. superposition/entanglement: 

I think people should try to understand the wider context into which they are writing, but I don't see it as necessarily a bad thing if two groups of researchers are working on the same idea under different names. In fact I'd say this happens all the time and generally people can just hold in their minds that another group has another name for it.  Naturally, the two groups will have slightly different perspectives and this a) Is often good, i.e. the interference can be constructive and b) Can be a reason in f... (read more)

Thanks for the comment and pointing these things out.
---
Certainly it's not a necessarily good thing either. I would posit isolation is usually not good. I can personally attest to being confused and limited by the difference in terminology here. And I think that when it comes to intrinsic interpretability work in particular, the disentanglement literature has produced a number of methods of value while TAISIC has not. I don't know what we benefit from in this particular case with polysemanticity, superposition, and entanglement. Do you have a steelman for this more specific to these literatures?
---
Good point. I would not say that the issue with the feature visualization and zoom in papers was merely failing to cite related work. I would say that the issue is how they started a line of research that is causing confusion and redundant work. My stance here is based on how I see the isolation between the two types of work as needless.
---
Thanks for pointing out these posts. They are examples of discussing a similar idea to MI's dependency on programmatic hypothesis generation, but they don't act on it: both serve to draw analogies instead of providing methods. The thing in the front of my mind when I talk about how TAISIC has not sufficiently engaged with neurosymbolic work is the kind of thing I mentioned in the paragraph about existing work outside of TAISIC. I pasted it below for convenience :)
The main problem on this site is that, despite people having widely varying levels of understanding of different subjects, nobody wants to look like an idiot on here. A lot of the comments and articles are basically nothing burgers. People often focus on insignificant points to argue about and spend their time on the social aspect of learning rather than actually learning about a subject themselves. This made me wonder: do actual researchers who have value and substance to offer not participate in online discussions? The closest I've found is WordPress blogs by various people, with huge comment chains. The only other form of communication seems to be formal papers, which is pretty much as organized as it gets in terms of format. I've learned that people who actually have deeper understanding and knowledge of value to offer don't waste their time on here. But I can't find any other platform that these people participate in. My guess is that they don't participate in any public discourse, only private conversations with other people who have things of value to offer and discuss.

Thanks very much for the comments. I think you've asked a bunch of very good questions. I'll try to give some thoughts:

Deep learning as a field isn't exactly known for its rigor. I don't know of any rigorous theory that isn't as you say purely 'reactive', with none of it leading to any significant 'real world' results. As far as I can tell this isn't for a lack of trying either. This has made me doubt its mathematical tractability, whether it's because our current mathematical understanding is lacking or something else (DL not being as 'reductionist' as oth

... (read more)

Ah thanks very much Daniel. Yes now that you mention it I remember being worried about this a few days ago but then either forgot or (perhaps mistakenly) decided it wasn't worth expanding on. But yeah I guess you don't get a well-defined map until you actually fix how the tokenization happens with another separate algorithm. I will add it to the list of things to fix/expand on in an edit.

>There is no difference between natural phenomena and DNNs (LLMs, whatever). DNNs are 100% natural

I mean "natural" as opposed to "man made". i.e. something like "occurs in nature without being built by something or someone else". So in that sense, DNNs are obviously not natural in the way that the laws of physics are.

I don't see information and computation as only mathematical; in fact in my analogies I describe the mathematical abstractions we build as being separate from the things that one wants to describe or make predictions about.  And this... (read more)

-1Roman Leventov4mo
I feel that you are redefining terms. Writing down mathematical equations (or defining other mathematical structures that are not equations, e.g., automata), describing natural phenomena, and proving some properties of these, i.e., deriving some mathematical conjectures/theorems, -- that's exactly what physicists do, and they call it "doing physics" or "doing science" rather than "doing mathematics". I wonder how you would draw the boundary between "man-made" and "non-man-made", the boundary that would have a bearing on such a fundamental qualitative distinction of phenomena as the amenability to mathematical description. According to Fields et al. []'s theory of semantics and observation ("quantum theory […] is increasingly viewed as a theory of the process of observation itself"), which is also consistent with predictive processing and Seth's controlled hallucination [] theory which is a descendant of predictive processing, any observer's phenomenology is what makes mathematical sense by construction. Also, here [] Wolfram calls approximately the same thing "coherence". Of course, there are infinite phenomena both in "nature" and "among man-made things" the mathematical description of which would not fit our brains yet, but this also means that we cannot spot these phenomena. We can extend the capacity of our brains (e.g., through cyborgisation, or mind upload), as well as equip ourselves with more powerful theories that allow us to compress reality more efficiently and thus spot patterns that were not spottable before, but this automatically means that these patterns become mathematically describable. This, of course, implies that we ought to make our minds stronger (through technical means or developing science) precisely to timely spot the phenomena that are ab

I may come back to comment more or incorporate this post into something else I write but wanted to record my initial reaction which is that I basically believe the claim. I also think that the 'unrelated bonus reason' at the end is potentially important and probably deserves more thought.

1Arthur Conmy7mo
Disclaimer: I work on interpretability at Redwood Research. I am also very interested in hearing a fleshed-out version of this critique. To me, this is related to the critique of Redwood's interpretability approach here [], another example of "recruiting resources outside of the model alone". (however, it doesn't seem obvious to me that interpretability can't or won't work in such settings)

Interesting idea. I think it’s possible that a prize is the wrong thing for getting the best final result (but also possible that getting a half decent result is more important than a high variance attempt at optimising for the best result). My thinking is: To do what you’re suggesting to a high standard could take months of serious effort. The idea of someone really competent doing so just for the chance at some prize money doesn’t quite seem right to me… I think there could be people out there who in principle could do it excellently but who would want to know that they’d ‘got the job’ as it were before spending serious effort on it.

I think I would support Joe's view here that clarity and rigour are significantly different... but maybe - David - your comments are supposed to be specific to alignment work? e.g. I can think of plenty of times I have read books or articles in other areas and fields that contain zero formal definitions, proofs, or experiments but are obviously "clear", well-explained, well-argued etc. So by your definitions is that not a useful and widespread form of rigour-less clarity? (One that we would want to 'allow' in alignment work?) Or would you instead maintain ... (read more)

I agree that the space  may well miss important concepts and perspectives. As I say, it is not my suggestion to look at it, but rather just something that was implicitly being done in another post. The space  may well be a more natural one. (It's of course the space of functions , and so a space in which 'model space' naturally sits in some sense. )

You're right about the loss thing; it isn't as important as I first thought it might be. 

It's an example computation for a network with scalar outputs, yes. The math should stay the same for multi-dimensional outputs though. You should just get higher dimensional tensors instead of matrices.

I'm sorry but the fact that it is scalar output isn't explained and a network with a single neuron in the final layer is not the norm. More importantly, I am trying to explain that I think the math does not stay the same in the case where the network output is a vector (which is the usual situation in deep learning) and the loss is some unspecified fu... (read more)

4Lucius Bushnaq9mo
Fair enough, should probably add a footnote. Do any practically used loss functions actually have cross terms that lead to off-diagonals like that? Because so long as the matrix stays diagonal, you're effectively just adding extra norm to features in one part of the output over the others. Which makes sense, if your loss function is paying more attention to one part of the output than others, then perturbations to the weights of features of that part are going to have an outsized effect. The perturbative series evaluates the network at particular values of Θ. If your network has many layers that slowly build up an approximation of the function cos(x), to use in the final layer, it will effectively enter the behavioural gradient as cos(x), even though its construction involves many parameters in previous layers.
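On the question of off-diagonal terms, one standard case can be checked directly: for softmax followed by cross-entropy with a one-hot target, the Hessian of the loss with respect to the raw logits is diag(p) - p pᵀ, where p is the softmax output. The sketch below assumes NumPy; the function names and the logit vector are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def ce_hessian_wrt_logits(z):
    # Hessian of cross-entropy(softmax(z), one-hot target) with respect to z.
    # It does not depend on which class is the target.
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

z = np.array([1.0, 0.0, -1.0])          # illustrative logits (assumption)
H = ce_hessian_wrt_logits(z)
off_diag = H - np.diag(np.diag(H))      # keep only the cross terms

print(H)
print("largest cross term:", np.abs(off_diag).max())
```

The cross terms are -p_i p_j, so they are non-zero whenever two classes both receive non-zero probability. Whether this matters for the argument depends on whether "behaviour" is taken to be the logits or the post-softmax probabilities.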

I'm not at liberty to share it directly but I am aware that Anthropic have a draft of small toy models with hand-coded synthetic data showing superposition very cleanly. They go as far as saying that searching for an interpretable basis may essentially be mistaken.

I wrote out the Hessian computation in a comment to one of Vivek's posts. I actually had a few concerns with his version and I could be wrong but I also think that there are some issues here. (My notation is slightly different because  for me the sum over  was included in the function I called "", but it doesn't affect my main point).

I think the most concrete thing is that the function  - i.e. the `input-output' function of a neural network - should in general have a vector output, but you write things like 

witho... (read more)

4Lucius Bushnaq9mo
It's an example computation for a network with scalar outputs, yes. The math should stay the same for multi-dimensional outputs though. You should just get higher dimensional tensors instead of matrices. In theory, a loss function that explicitly depends on network parameters would behave differently than is assumed in this derivation, yes. But that's not how standard loss functions usually work. If a loss function did have terms like that, you should indeed get out somewhat different results. But that seems like a thing to deal with later to me, once we've worked out the behaviour for really simple cases more. A feature to me is the same kind of thing it is to e.g. Chris Olah. It's the function mapping network input to the activations of some neurons, or linear combination of neurons, in the network. I'm not assuming that the function is linear in Θ. If it was, this whole thing wouldn't just be an approximation within second order Taylor expansion distance, it'd hold everywhere. In multi-layer networks, what the behavioural gradient is showing you is essentially what the network would look like if you approximated it for very small parameter changes, as one big linear layer. You're calculating how the effects of changes to weights in previous layers "propagate through" with the chain rule to change what the corresponding feature would "look like" if it was in the final layer. Obviously, that can't be quite the right way to do things outside this narrow context of interpreting the meaning of the basin near optima. Which is why we're going to try out building orthogonal sets layer by layer instead. To be clear, none of this is a derivation showing that the L2 norm perspective is the right thing to do in any capacity. It's just a suggestive hint that it might be. We've been searching for the right definition of "feature independence" or "non-redundancy of computations" in neural networks for a while now, to get an elementary unit of neural network

Thanks very much Geoffrey; glad you liked the post. And thanks for the interesting extra remarks.

Thanks for the nice reply. 

I do buy the explanations I listed in the OP (and other, complementary explanations, like the ones in Inadequate Equilibria), and I think they're sufficient to ~fully make sense of what's going on. So I don't feel confused about the situation anymore. By "shocking" I meant something more like "calls for an explanation", not "calls for an explanation, and I don't have an explanation that feels adequate". (With added overtones of "horrifying".)

Yeah, OK, I think that helps clarify things for me.

As someone who was working a

... (read more)
5Rob Bensinger1y
Oh, I do think Superintelligence was extremely important. I think Superintelligence has an academic tone (and, e.g., hedges a lot), but its actual contents are almost maximally sci-fi weirdo -- the vast majority of public AI risk discussion today, especially when it comes to intro resources, is much less willing to blithely discuss crazy sci-fi scenarios.

I'm a little sheepish about trying to make a useful contribution to this discussion without spending a lot of time thinking things through but I'll give it a go anyway. There's a fair amount that I agree with here, including that there are by now a lot of introductory resources. But regarding the following:

(I do think it's possible to create a much better intro resource than any that exist today, but 'we can do much better' is compatible with 'it's shocking that the existing material hasn't already finished the job'.)

I feel like I want to ask: Do you really... (read more)

No need to be sheepish, IMO. :) Welcome to the conversation!

Do you really find it "shocking"?

I think it's the largest mistake humanity has ever made, and I think it implies a lower level of seriousness than the seriousness humanity applied to nuclear weapons, asteroids, climate change, and a number of other risks in the 20th century. So I think it calls for some special explanation beyond 'this is how humanity always handles everything'.

I do buy the explanations I listed in the OP (and other, complementary explanations, like the ones in Inadequate Equilbri... (read more)

Thanks again for the reply.

In my notation, something like   or  are functions in and of themselves. The function  evaluates to zero at local minima of 

In my notation, there isn't any such thing as .

But look, I think that this is perhaps getting a little too bogged down for me to want to try to neatly resolve in the comment section, and I expect to be away from work for the next few days so may not check back for a while. Personally, I would just recommend going back and slowly going through the mathe... (read more)

Thanks for the substantive reply.

First some more specific/detailed comments: Regarding the relationship with the loss and with the Hessian of the loss, my concern sort of stems from the fact that the domains/codomains are different and so I think it deserves to be spelled out.  The loss of a model with parameters  can be described by introducing the actual function that maps the behavior to the real numbers, right? i.e. given some actual function  we have: 

i.e. it's  that might be something ... (read more)

1Vivek Hebbar1y
I will split this into a math reply, and a reply about the big picture / info loss interpretation.
Math reply: Thanks for fleshing out the calculus rigorously; admittedly, I had not done this. Rather, I simply assumed MSE loss and proceeded largely through visual intuition.
This is still false! Edit: I am now confused, I don't know if it is false or not.
You are conflating $\nabla_f\, l(f(\theta))$ and $\nabla_\theta\, l(f(\theta))$. Adding disambiguation, we have:
$$\nabla_\theta L(\theta) = \big(\nabla_f\, l(f(\theta))\big)\, J_\theta f(\theta)$$
$$\mathrm{Hess}_\theta(L)(\theta) = J_\theta f(\theta)^T \,\big[\mathrm{Hess}_f(l)(f(\theta))\big]\, J_\theta f(\theta) + \nabla_f\, l(f(\theta))\, D^2_\theta f(\theta)$$
So we see that the second term disappears if $\nabla_f\, l(f(\theta)) = 0$. But the critical point condition is $\nabla_\theta\, l(f(\theta)) = 0$. From chain rule, we have:
$$\nabla_\theta\, l(f(\theta)) = \big(\nabla_f\, l(f(\theta))\big)\, J_\theta f(\theta)$$
So it is possible to have a local minimum where $\nabla_f\, l(f(\theta)) \neq 0$, if $\nabla_f\, l(f(\theta))$ is in the left null-space of $J_\theta f(\theta)$. There is a nice qualitative interpretation as well, but I don't have energy/time to explain it. However, if we are at a perfect-behavior global minimum of a regression task, then $\nabla_f\, l(f(\theta))$ is definitely zero.
A few points about rank equality at a perfect-behavior global min:
1. $\mathrm{rank}(\mathrm{Hess}(L)) = \mathrm{rank}(J_f)$ holds as long as $\mathrm{Hess}(l)(f(\theta))$ is a diagonal matrix. It need not be a multiple of the identity.
2. Hence, rank equality holds anytime the loss is a sum of functions s.t. each function only looks at a single component of the behavior.
3. If the network output is 1d (as assumed in the post), this just means that the loss is a sum over losses on individual inputs.
4. We can extend to larger outputs by having the behavior f be the flattened concatenation of outputs. The rank equality condition is still satisfied for MSE, Binary Cross Entropy, and Cross Entropy over a probability vector. It is not satisfied if we consider the behavior to be raw logits (before the softmax) and softmax+CrossEntropy as the loss function. But we can easily fix that by considering probability (after softmax) as behavior instead of raw logits
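The rank-equality claim at a perfect-behavior global minimum can be checked numerically on a toy case. The sketch below is not from the thread. It assumes a hypothetical two-parameter model f_i(θ) = θ₁θ₂x_i with a sum-of-squares loss, so that Hess(l) is twice the identity and, at zero residual, the chain-rule decomposition above reduces to Hess(L) = 2 JᵀJ. All names and data are illustrative.

```python
import numpy as np

def behavior(theta, x):
    # Hypothetical toy model: one output per training input.
    return theta[0] * theta[1] * x

def jacobian(theta, x):
    # J[i, j] = d f_i / d theta_j, shape (n_inputs, n_params).
    return np.stack([theta[1] * x, theta[0] * x], axis=1)

def loss(theta, x, y):
    # Sum-of-squares loss on the behavior vector.
    return np.sum((behavior(theta, x) - y) ** 2)

def numerical_hessian(fn, theta, eps=1e-4):
    # Central finite differences for every second partial derivative.
    n = theta.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (fn(theta + ei + ej) - fn(theta + ei - ej)
                       - fn(theta - ei + ej) + fn(theta - ei - ej)) / (4 * eps ** 2)
    return H

x = np.array([1.0, 2.0, 3.0])            # illustrative training inputs
theta_star = np.array([2.0, 1.5])        # chosen so that the loss is exactly zero
y = behavior(theta_star, x)

J = jacobian(theta_star, x)
H = numerical_hessian(lambda t: loss(t, x, y), theta_star)
H_from_J = 2 * J.T @ J                   # first term of the decomposition only

rank_J = np.linalg.matrix_rank(J)
rank_H = np.linalg.matrix_rank(H, tol=1e-3)  # loose tol: H is a numerical estimate
print(rank_J, rank_H)
```

The explicit tolerance in `matrix_rank` is a deliberate choice: a finite-difference Hessian carries small rounding errors, and the default tolerance would count them as extra rank.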

This was pretty interesting and I like the general direction that the analysis goes in. I feel it ought to be pointed out that what is referred to here as the key result is a standard fact in differential geometry called (something like) the submersion theorem, which in turn is essentially an application of the implicit function theorem.

I think that your setup is essentially that there is an -dimensional parameter space, let's call it  say, and then for each element  of the training set, we can consider the function ... (read more)

1Vivek Hebbar1y
Thanks for this reply, it's quite helpful.

Ah nice, didn't know what it was called / what field it's from. I should clarify that "key result" here just meant "key result of the math so far -- pay attention", not "key result of the whole post" or "profound/original".

Yeah, you're right. Previously I thought G was the Jacobian, because I had the Jacobian transposed in my head. I only realized that G has a standard name fairly late (as I was writing the post I think), and decided to keep the non-standard notation since I was used to it, and just add a footnote.

Yes; this is the whole point of the post. The math is just a preliminary to get there.

Good catch -- it is technically possible at a local minimum, although probably extremely rare. At a global minimum of a regression task it is not possible, since there is only one behavior vector corresponding to zero loss. Note that behavior in this post was defined specifically on the training set. At global minima, "Rank(Hessian(Loss))=Rank(G)" should be true without exception.

In "Flat basin ≈ Low-rank Hessian = Low-rank G ≈ High manifold dimension": The first "≈" is a correlation. The second "≈" is the implication "High manifold dimension => Low-rank G". (Based on what you pointed out, this only works at global minima.)

"Indicates" here should be taken as slightly softened from "implies", like "strongly suggests but can't be proven to imply". Can you think of plausible mechanisms for causing low rank G which don't involve information loss?
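To make the "technically possible at a local minimum" case concrete, here is a minimal sketch of a minimum where the loss gradient with respect to the behavior is nonzero and the rank equality fails. Everything in it is an assumption chosen for illustration: the one-parameter model f(θ) = (θ, θ²), the target (0, 0.5), and the function names are hypothetical. Note that θ = 0 minimizes the loss for this toy model but does not reach zero loss, so it is not a perfect-behavior minimum in the sense discussed here.

```python
# Hypothetical one-parameter model with a two-component behavior vector:
# f(theta) = (theta, theta**2), fitted to a target it cannot reach exactly.
TARGET = (0.0, 0.5)


def behavior(theta):
    return (theta, theta * theta)


def loss(theta):
    """Half the squared distance between behavior and target."""
    return 0.5 * sum((f - y) ** 2 for f, y in zip(behavior(theta), TARGET))


def grad_wrt_behavior(theta):
    """Gradient of the loss with respect to the behavior vector f."""
    return tuple(f - y for f, y in zip(behavior(theta), TARGET))


def jacobian(theta):
    """d f / d theta, one entry per behavior component."""
    return (1.0, 2.0 * theta)


def second_derivative_of_behavior(theta):
    """d^2 f / d theta^2, one entry per behavior component."""
    return (0.0, 2.0)


def grad_wrt_theta(theta):
    """Chain rule: (gradient w.r.t. f) times (Jacobian)."""
    return sum(g * j for g, j in zip(grad_wrt_behavior(theta), jacobian(theta)))


def hessian(theta):
    """J^T Hess(l) J plus the residual-curvature term; Hess(l) = I for this loss."""
    gauss_newton_term = sum(j * j for j in jacobian(theta))
    curvature_term = sum(
        g * d for g, d in zip(grad_wrt_behavior(theta),
                              second_derivative_of_behavior(theta))
    )
    return gauss_newton_term + curvature_term


theta_min = 0.0
print("gradient w.r.t. theta:", grad_wrt_theta(theta_min))
print("gradient w.r.t. behavior:", grad_wrt_behavior(theta_min))
print("Jacobian:", jacobian(theta_min))
print("Hessian of the loss:", hessian(theta_min))
```

At θ = 0 the gradient with respect to the behavior is orthogonal to the Jacobian, so the point is critical even though the residual is nonzero. The residual-curvature term then cancels the JᵀJ term exactly, giving a zero Hessian while the Jacobian still has rank 1.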

Hi there,

Given that you've described various 'primarily conceptual' projects on the Alignment Team, and given the distinction between Scientists and Engineers, one aspect that I'm unsure about is roughly: Would you expect a Research Scientist on the Alignment Team to necessarily have a minimum level of practical ML knowledge? Are you able to say any more about that? E.g. would they have to pass a general DeepMind coding test or something like that?

4Rohin Shah1y
Yes, we'd definitely expect some practical ML knowledge, and yes, there would be a coding interview. There isn't a set bar; I could imagine hiring someone who did merely okay on the ML interview and was kinda bad on the coding interview, but both their past work and other interviews demonstrated excellent conceptual reasoning skills, precisely because we have many primarily conceptual projects. But I don't trust conceptual reasoning to get the answer right, if you have very little understanding of how AI systems work in practice, so I still care about some level of ML knowledge.

Hi Ryan, do you still plan for results to come out by May 27? And for those who are successful for the next stage to start June 6th etc.? (That's what it still says on the FAQ on the website.)

1Ryan Kidd1y
Yes, that is currently the plan. If we experience a massive influx of applications in the next two days it is possible this might slightly change, but I doubt it. We will work hard to keep to the announcement and commencement deadlines.

Yes I think you understood me correctly. In which case I think we more or less agree in the sense that I also think it may not be productive to use Richard's heuristic as a criterion for which research directions to actually pursue.

I broadly agree with Richard's main point, but I also do agree with this comment in the sense that I am not confident that the example of Turing compared with e.g. Einstein is completely fair/accurate. 

One thing I would say in response to your comment, Adam, is that I don't usually see the message of your linked post as being incompatible with Richard's main point. I think one usually does have or does need productive mistakes that don't necessarily or obviously look like they are robust partial progress. But still, often when there actually is a brea... (read more)

Thanks for the answer. Hum, I'm not sure I'm following your point. Do you mean that you can have both productive mistakes and intuitively compelling explanations when the final (or even an intermediary) breakthrough is reached? Then I totally agree. My point was more that if you only use Richard's heuristic, I expect you to not reach the breakthrough, because you would have nipped in the bud many productive mistakes that actually lead the way there. There's also a very Kuhnian thing here that I didn't really mention in my previous comment (except on the Galileo part): the compellingness of an answer is often stronger after the fact, when you work in the paradigm that it led to. That's another aspect of productive mistakes or even breakthroughs: they don't necessarily look right or predict more from the start, and evaluating their consequences is not necessarily obvious.

I agree, i.e. I also (fairly weakly) disagree with the value of thinking of 'distilling' as a separate thing. Part of me wants to conjecture that it comes from thinking of alignment work predominantly as mathematics or a hard science in which the standard 'unit' is an original theorem or original result which might be poorly written up but can't really be argued against much. But if we think of the area (I'm thinking predominantly about more conceptual/theoretical alignment) as a 'softer', messier, ongoing discourse full of different arguments fro... (read more)

It could also work here. But I do feel like pointing out that the bounty format has other drawbacks. Maybe it works better when you want a variety of bitesize contributions, like various different proposals? I probably wouldn't do work like Abram proposes - quite a long and difficult project, I expect - for the chance of winning a prize, particularly if the winner(s) were decided by someone's subjective judgement. 

My intuition is that you could break this task down into smaller chunks, like applications of Infra-bayes and musings on why Infra-bayes worked better than existing tools there (or worse!), which someone could do within a couple of weeks, and award bounties for those tasks. Then offer jobs to whoever seems like they could do good distillations. I think that for a few 100-hour tasks, you might need to offer maybe $50k-$100k. That sounds crazy high? Well, AI safety is talent constrained, it doesn't look like much is being done with the money, and MIRI seems to think there's a high discount rate (doom within a decade or two), so money should be spent now on tasks that seem important.

This post caught my eye as my background is in mathematics and I was, in the not-too-distant past, excited about the idea of rigorous mathematical AI alignment work. My mind is still open to such work but, I'll be honest, I've since become a bit less excited than I was. In particular, I definitely "bounced off" the existing write-ups on Infrabayesianism, and now, without already knowing what it's all about, it's not clear that it's worth one's time. So, at the risk of making a basic or even cynical point: the remuneration of the proposed job could be important for getting attention and incentivising people who are on the fence.

Offering a bounty on what you want seems sensible here. It seemed like it worked OK for ELK proposals, so why not here?