


Maybe this is on us for not including enough detail in the post, but I'm pretty confident that you would lose your bet no matter how you operationalised it. We did compare ITO to two alternatives: using the encoder to pick features (the top k) and then optimising the weights on those features at inference time, and learning a post hoc scale to address the 'shrinkage' problem where the encoder systematically underweights features. Gradient pursuit consistently outperformed both of them, so I think that gradient pursuit doesn't just fiddle around with low weights, it also chooses features 'better'.
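To make the first baseline concrete, here is a minimal sketch of 'let the encoder pick the top k features, then re-fit the weights on just those features'. It is an illustration only, not the code used for the post: the names `W_enc`, `b_enc` and `D`, the ReLU encoder, and the use of unconstrained least squares for the re-fit are all assumptions.

```python
import numpy as np

def topk_encoder_refit(x, W_enc, b_enc, D, k):
    """Pick the k most active encoder features, then re-fit their weights.

    x: (d_model,) activation vector
    W_enc: (n_features, d_model) encoder weights (hypothetical SAE)
    b_enc: (n_features,) encoder bias
    D: (n_features, d_model) decoder dictionary, one feature per row
    """
    acts = np.maximum(W_enc @ x + b_enc, 0.0)   # ReLU encoder activations
    idx = np.argsort(acts)[-k:]                 # indices of the top-k features

    # Baseline: keep the encoder's own weights on the chosen features.
    a_enc = np.zeros_like(acts)
    a_enc[idx] = acts[idx]

    # Re-fit: least-squares weights on the same support (no sign constraint).
    coef, *_ = np.linalg.lstsq(D[idx].T, x, rcond=None)
    a_fit = np.zeros_like(acts)
    a_fit[idx] = coef
    return a_enc, a_fit

# Toy data: a random "SAE" whose encoder is just the decoder with a bias.
rng = np.random.default_rng(0)
n_features, d_model, k = 64, 16, 5
D = rng.normal(size=(n_features, d_model))
D /= np.linalg.norm(D, axis=1, keepdims=True)
W_enc, b_enc = D.copy(), -0.1 * np.ones(n_features)
x = rng.normal(size=d_model)

a_enc, a_fit = topk_encoder_refit(x, W_enc, b_enc, D, k)
err_enc = np.linalg.norm(x - a_enc @ D)
err_fit = np.linalg.norm(x - a_fit @ D)
```

The re-fit can never reconstruct worse than the raw encoder weights on the same support, but it inherits whatever features the encoder chose, which is the part gradient pursuit does differently.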

With respect to your threshold thing: the structure of the specific algorithm we used (gradient pursuit) means that if GP has selected a feature, it tends to assign it quite a high weight, so I don't think that would do much. SAE encoders tend to have many more features close to zero, because it's structurally hard for them to avoid doing this. I would almost turn your argument around; I think that low-activating features in a normal SAE are likely not to be particularly interesting or interpretable either, as the structure of an SAE makes it difficult to avoid having features with interference activate spuriously.

One quirk of gradient pursuit that is a bit weird is that it will almost always choose a new feature which is orthogonal to the span of features selected so far, which does seem a little artificial.
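For readers who haven't seen the algorithm, here is a minimal sketch of textbook gradient pursuit: greedily add the feature most correlated with the residual, then take one exact line-search gradient step on the selected features. This is not the implementation from the post; in particular it puts no non-negativity constraint on the weights, and the dictionary `D` and the sparsity budget `k` are toy assumptions.

```python
import numpy as np

def gradient_pursuit(x, D, k):
    """Sparse-code x against dictionary rows D[i] using at most k features."""
    n_features = D.shape[0]
    a = np.zeros(n_features)                  # feature weights
    support = np.zeros(n_features, dtype=bool)
    residual = x.copy()
    for _ in range(k):
        corr = D @ residual                   # correlation with the residual
        support[np.argmax(np.abs(corr))] = True
        d = np.where(support, corr, 0.0)      # gradient restricted to support
        c = d @ D                             # that direction in input space
        denom = c @ c
        if denom < 1e-12:                     # nothing left to explain
            break
        step = (residual @ c) / denom         # exact line search along d
        a += step * d
        residual = x - a @ D
    return a

# Toy example: x is a sparse combination of three dictionary features.
rng = np.random.default_rng(0)
D = rng.normal(size=(64, 16))
D /= np.linalg.norm(D, axis=1, keepdims=True)
x = 1.5 * D[3] + 0.8 * D[10] + 0.5 * D[42]

a = gradient_pursuit(x, D, k=8)
rel_err = np.linalg.norm(x - a @ D) / np.linalg.norm(x)
```

Because every step updates all selected weights at once, features that get picked early keep having their weights adjusted as later features are added.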

Whether the way that it chooses features better is actually better for interpretability is difficult to say. As we say in the post, we did manually inspect some examples and we couldn't spot any obvious problems with the ITO decomposition, but we haven't done a properly systematic double blind comparison of ITO to encoder 'explanations' in terms of interpretability because it's quite expensive for us in terms of time.

I think that it's too early to say whether ITO is 'really' helping or not, but I am pretty confident it's worth more exploration, which is why we are spreading the word about this specific algorithm in this snippet (even though we didn't invent it). I think training models using GP at train time, getting rid of the SAE framework altogether, is also worth exploring to be honest. But at the moment it's still quite hard to give sparse decompositions an 'interpretability score' which is objective and not too expensive to make, so it's a bit difficult to see how we would evaluate something like this. (I think auto-interp could be a reasonable way of screening ideas like this once we are running it more easily)

I think there is a fairly reasonable theoretical argument that non-SAE decompositions won't work well for superposition (because the NN can't actually be using an iterative algorithm to read features) but I do think that I haven't really seen any empirical evidence that this is either true or false to be honest, and I don't think we should rule out that non-SAE methods would just work loads better; they do work much better for almost every other sparse optimisation algorithm afaik.


Yeah I agree with everything you say; it's just that I was trying to remind myself of enough of SLT to give a 'five minute pitch' for SLT to other people, and I didn't like the idea that I'm hanging it off the ReLU.

I guess the intuition behind the hierarchical nature of the models leading to singularities is the permutation symmetry between the hidden channels, which is kind of an easy thing to understand.
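The permutation symmetry itself is easy to check numerically. The sketch below uses a toy one-hidden-layer MLP (sizes and names are made up for illustration) and shows that relabelling the hidden units changes the parameters but not the function, for tanh as well as ReLU. It only demonstrates the symmetry, not that the symmetry produces singularities.

```python
import numpy as np

def mlp(x, W1, b1, W2, b2, act=np.tanh):
    """One-hidden-layer network: W2 @ act(W1 @ x + b1) + b2."""
    return W2 @ act(W1 @ x + b1) + b2

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3
W1 = rng.normal(size=(d_hidden, d_in))
b1 = rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_out, d_hidden))
b2 = rng.normal(size=d_out)
x = rng.normal(size=d_in)

# Permute hidden units: rows of W1 and b1, and the matching columns of W2.
perm = rng.permutation(d_hidden)
W1_p, b1_p, W2_p = W1[perm], b1[perm], W2[:, perm]

same_tanh = np.allclose(mlp(x, W1, b1, W2, b2),
                        mlp(x, W1_p, b1_p, W2_p, b2))
same_relu = np.allclose(mlp(x, W1, b1, W2, b2, act=relu),
                        mlp(x, W1_p, b1_p, W2_p, b2, act=relu))
```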

I get and agree with your point about approximate equivalences, though I have to say that I think we should be careful! One reason I'm interested in SLT is that I spent a lot of time during my PhD on Bayesian approximations to NN posteriors. I think SLT is one reasonable explanation of why this never yielded great results, but I think hand-wavy intuitions about 'oh well the posterior is probably-sorta-gaussian' played a big role in its longevity as an idea.

Yeah, it's not totally clear what this 'nearly singular' thing would mean. Intuitively, it might be that there's a kind of 'hidden singularity' in the space of this model that might affect the behaviour, like the singularity in a dynamic model with a phase transition. But I'm just guessing.


I'm trying to read through this more carefully this time: how load-bearing is the use of ReLU nonlinearities in the proof? This doesn't intuitively seem like it should be that important (e.g. a sigmoid/gelu/tanh network feels like it is probably singular, and it certainly has to be if SLT is going to tell us something important about NN behaviour, because changing the nonlinearity doesn't change how NNs behave that much imo), but it does seem to be an important part of the construction you use.


Maybe this is really naive (I just randomly thought of it), and you mention you do some obvious stuff like looking at the singular vectors of activations which might rule it out, but could the low-frequency cluster be linked to something simple, like the fact that the use of ReLUs, GeLUs etc. means the neuron activations are going to be biased towards the positive quadrant of the activation space in terms of magnitude (because negative components of any vector in the activation basis would be cut off)? I wonder if the singular vectors would catch this.
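A quick toy check of that last question, assuming (purely for illustration) i.i.d. Gaussian pre-activations passed through a ReLU: the top singular vector of the uncentered activation matrix lines up almost exactly with the mean activation direction, so an SVD on raw activations would pick the bias up, while an SVD on mean-centered activations would not. Real model activations are obviously not i.i.d. Gaussian, so this says nothing about whether the bias explains the cluster.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, d = 2000, 32

pre = rng.normal(size=(n_samples, d))   # toy pre-activations (assumption)
acts = np.maximum(pre, 0.0)             # ReLU cuts off negative components

mean_dir = acts.mean(axis=0)
mean_dir /= np.linalg.norm(mean_dir)    # unit vector along the mean activation

# Top right-singular vector of the raw (uncentered) activations.
_, _, Vt = np.linalg.svd(acts, full_matrices=False)
cos_uncentered = abs(Vt[0] @ mean_dir)

# Same thing after subtracting the per-neuron mean.
_, _, Vt_c = np.linalg.svd(acts - acts.mean(axis=0), full_matrices=False)
cos_centered = abs(Vt_c[0] @ mean_dir)

print(f"uncentered: {cos_uncentered:.3f}, centered: {cos_centered:.3f}")
```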


I think that it's important to be careful with elaborately modelled reasoning about this kind of thing, because the second order political effects are very hard to predict but also likely to be extremely important, possibly even more important than the direct effect on timelines in some scenarios. For instance, you mention leading labs slowing down as bad (because the leading labs are 'safety conscious' and slowing down dilutes their lead). In my opinion, this is a very simplistic model of the likely effects of this intervention. There are a few reasons for this:

  • Taking drastic unilateral action creates new political possibilities. A good example is Hinton and Bengio 'defecting' to advocating strongly for AI safety in public; I think this has had a huge effect on ML researchers and governments in taking things seriously, even though the direct effect on AI research is probably negligible. For instance, Hinton in particular made me personally take a much more serious look at AI safety related arguments, and this has influenced me trying to re-orient my career in a more safety-focused direction. I find it implausible that a leading AI lab shutting themselves down for safety reasons would have no second order political effects along these lines, even if the direct impact was small: if there's one lesson I would draw from covid and the last year or so of AI discourse, it's that the Overton window is much more mobile than people often think. A dramatic intervention like this would obviously have uncertain outcomes, but could trigger unforeseen possibilities. Unilateral action that disadvantages the actor also makes a political message much more powerful. There's a lot of skepticism when labs like Anthropic talk loudly about AI risk because of the objection 'if it's so bad why are you making it'. While there are technical arguments one can make that there are good reasons to simultaneously work on safety and AI development, it makes communicating this message much harder and people will understandably have doubts about your motives.

  • 'We can't slow down because someone else will do it anyway' - I actually think this is probably wrong: in a counterfactual world where OpenAI didn't throw lots of resources and effort into language models, I'm not actually sure someone else would have bothered to continue scaling them, at least not for many years. Research is not a linear process and a field being unfashionable can delay progress by a considerable amount; just look at the history of neural network research! I remember many people in academia being extremely skeptical of scaling laws around the time they were being published; if OpenAI hadn't pushed on it, it could have taken years to decades for another lab to really throw enough resources at that hypothesis if it had become unfashionable for whatever reason.

  • I'm not sure it's always true that other labs catch up if the leading ones stop: progress also isn't a simple function of time; without people trying to scale massive GPU clusters you don't get practical experience with the kind of problems such systems have, production lines don't re-orient themselves towards the needs of such systems, etc. etc. There are important feedback loops in this kind of process that the big labs shutting down could disrupt, such as attracting more talent and enthusiasm into the field. It's also not true that all ML research is a monolithic line towards 'more AGI' - from my experience of academia, many researchers would have quite happily worked on small specialised systems in a variety of domains for the rest of time.

I think many of these arguments also apply to arguments against 'US moratorium now' - for instance, it's much easier to get other countries to listen to you if you take unilateral actions, as doing so is a costly signal that you are serious.

This isn't necessarily to say that I think a US moratorium or a leading lab shutting down would actually be a useful thing, just that I don't think it's cut and dried that it wouldn't. Consider what would happen if a leading lab actually did shut themselves down - would there really be no political consequences that would have a serious effect on the development of AI? I think that your argument makes a lot of sense if we are considering 'spherical AI labs in a vacuum', but I'm not sure that's how it plays out in reality.


Any post along the lines of yours needs a 'political compass' diagram lol.

I mean it's hard to say what Altman would think in your hypothetical debate: assuming he has reasonable freedom of action at OpenAI, his revealed preference seems to be to devote <= 20% of the resources available to his org to 'the alignment problem'. If he wanted to assign more resources to 'solving alignment' he could probably do so. I think Altman thinks he's basically doing the right thing in terms of risk levels. Maybe that's a naive analysis, but I think it's probably reasonable to take him more or less at face value.

I also think that it's worth saying that easily the most confusing argument for the general public is exactly the Anthropic/OpenAI argument that 'AI is really risky but also we should build it really fast'. I think you can steelman this argument more than I've done here, and many smart people do, but there's no denying it sounds pretty weird, and I think it's why many people struggle to take it at face value when people like Altman talk about x-risk - it just sounds really insane!

In contrast, while people often think it's really difficult and technical, I think Yudkowsky's basic argument (building stuff smarter than you seems dangerous) is pretty easy for normal people to get, and many people agree with general 'big tech bad' takes that the 'realists' like to make.

I think a lot of boosters who are skeptical of AI risk basically think 'AI risk is a load of horseshit' for various not always very consistent reasons. It's hard to overstate how much 'don't anthropomorphise' and 'thinking about AGI is distracting silliness by people who just want to sit around and talk all day' are frequently baked deep into the souls of ML veterans like LeCun. But I think people who would argue no to your proposed alignment debate would, for example, probably strongly disagree that 'the alignment problem' is even a coherent thing to be solved.


Maybe I shouldn't have used EY as an example; I don't have any special insight into how he thinks about AI and power imbalances. Generally I get the vibe from his public statements that he's pretty libertarian and thinks pros outweigh cons on most technology which he thinks isn't x-risky. I think I'm moderately confident that he's more relaxed about, say, misinformation or big tech platforms' dominance than (say) Melanie Mitchell, but maybe I'm wrong about that.


I think there's some truth to this framing, but I'm not sure that people's views cluster as neatly as this. In particular, I think there is a 'how dangerous is existential risk' axis and a 'how much should we worry about AI and power' axis. I think you rightly identify the 'booster' cluster (x-risk fake, AI+power nothing to worry about) and 'realist' (x-risk sci-fi, AI+power very concerning) but I think you are missing quite a lot of diversity in people's positions along other axes that make this arguably even more confusing for people. For example, I would characterise Bengio as being fairly concerned about both x-risk and AI+power, whereas Yudkowsky is extremely concerned about x-risk and fairly relaxed about AI+power.

I also think it's misleading to group even 'doomers' as one cluster because there's a lot of diversity in the policy asks of people who think x-risk is a real concern, from 'more research needed' to 'shut it all down'. One very important group you are missing are people who are simultaneously quite (publicly) concerned about x-risk, but also quite enthusiastic about pursuing AI development and deployment. This group is important because it includes Sam Altman, Dario Amodei and Demis Hassabis (leadership of the big AI labs), as well as quite a lot of people who work on developing AI or on AI safety. You might summarise this position as 'AI is risky, but if we get it right it will save us all'. As they are often working at big tech, I think these people are mostly fairly un-worried or neutral about AI+power. This group is obviously important because they work directly on the technology, but also because this gives them a loud voice in policy and the public sphere. You might think of this as a 'how hard is mitigating x-risk' axis. This is another key source of disagreement: going from public statements alone, I think (say) Sam Altman and Eliezer Yudkowsky agree on the 'how dangerous' axis and are both fairly relaxed Silicon Valley libertarians on the 'AI+power' axis, and mainly disagree on how difficult it is to solve x-risk. Obviously people's disagreements on this question have a big impact on their desired policy!


Well maybe you should read the book! I think that there are a few concrete points you can disagree on.

One thing I think about a lot is: are we sure this is unique, or did something else like luck or geography somehow play an important role in one (or a handful) of groups of sapiens happening to develop some strong (or "viral") positive-feedback cultural learning mechanisms that eventually dramatically outpaced other creatures?

I'm not an expert, but I'm not so sure that this is right; I think that anatomically modern humans already had significantly better abilities to learn and transmit culture than other animals, because anatomically modern humans generally need to extensively prepare their food (cooking, grinding etc.) in a culturally transmitted way. So by the time we get to sapiens we are already pretty strongly on this trajectory.

I think there's an element of luck: other animals do have cultural transmission (for example elephants and killer whales) but maybe aren't anatomically suited to discover fire and agriculture. Some quirks of group size likely also play a role. It's definitely a feedback loop though; once you are an animal with culture, then there is increased selection pressure to be better at culture, which creates more culture etc.

If Homo sapiens are believed to have originated around 200,000 years ago, but only developed agricultural techniques around 12,000 years ago, the earliest known city 9,000 years ago, and only developed a modern-style writing system maybe 5,000 years ago, are we sure that those humans who lived for 90%+ of human "pre-history" without agriculture, large groups, and writing systems would look substantially more intelligent to us than chimpanzees?

I'm gonna go with absolutely yes, see my above comment about anatomically modern humans and food prep. I think you are severely under-estimating the sophistication of hunter-gatherer technology and culture!

The degree to which 'objective' measures of intelligence like IQ are culturally specific is an interesting question.


If people are reading this thread and want to read this argument in more detail: the (excellent) book 'The Secret of Our Success' by Joseph Henrich (Astral Codex Ten review/summary here) makes this argument in a very compelling way. There is a lot of support for the idea that the crucial 'rubicon' that separates chimps from people is cultural transmission, which enables the gradual evolution of strategies over periods longer than an individual lifetime, rather than any 'raw' problem solving intelligence. In fact, according to Henrich, there are many ways in which humans are actually worse than chimps on some measures of raw intelligence: chimps have better working memory and faster reactions for complex tasks in some cases, and they are better than people at finding Nash equilibria which require randomising your strategy. But humans are uniquely able to learn behaviours from demonstration and to form larger groups, which enables the gradual accumulation of 'cultural technology', which then allowed a runaway process of cultural-genetic co-evolution (e.g. food processing technology -> smaller stomachs and bigger brains -> even more culture -> bigger brains even more of an advantage etc.). It's hard to appreciate how much this kind of thing helps you think; for instance, most people can learn maths but few would have invented Arabic numerals by themselves. Similarly, having a large brain by itself is actually not super useful without the cultural superstructure: most people alive today would quickly die if dropped into the ancestral environment without the support of modern culture unless they could learn from hunter-gatherers (see Henrich for many examples of this happening to European explorers!). For instance, I like to think I'm a pretty smart guy but I have no idea how to make e.g. bronze or stone tools, and it's not obvious that my physics degree would help me figure it out!
Henrich also makes the case for the importance of this with some slightly chilling examples of cultures that lost their ability to make complex technology (e.g boats) when they fell below a critical population size and became isolated.

It's interesting to consider the implications for AI: I'm not very sure about this. On the one hand, LLMs clearly have superhuman ability to memorise facts, but I'm not sure if this means they can learn new tasks or information particularly easily. On the other hand, it seems likely that LLMs are taking pretty heavy advantage of the 'culture overhang' of the internet! I don't know if it really makes sense to think of their abilities here as strongly superhuman: if you magically had the compute and code to try to train GPT-n in 1950, it's not obvious you could have got it to do very much without the internet for it to absorb.
