All of Neel Nanda's Comments + Replies

I appreciate the feedback! I have since bought a graphics tablet :) If you want to explore induction heads more, you may enjoy this tutorial

Any papers you're struggling to find?

Ah, thanks for the clarification! That makes way more sense. I was confused because you mentioned this in a recent conversation, I excitedly read the paper, and then couldn't see what the fuss was about (your post prompted me to re-read and notice section 4.1, the good section!).

Another thought: the main thing I find exciting about model editing is when it is surgical - it's easy to use gradient descent to find ways to intervene on a model that break performance everywhere else. But if you can really localise where a concept is represented in the model and apply the edit there, that feels really exciting to me! Thus I find this work (which edits a single latent variable) notably more exciting than ROME/MEMIT (which apply gradient descent).

Thanks for sharing! I think the paper is cool (though massively buries the lede). My summary:

  • They create a synthetic dataset of lit and unlit rooms with StyleGAN. They exploit the fact that the GAN has disentangled, meaningful directions in its latent space that can be individually edited. They find a lighting latent automatically, by taking noise vectors that produce rooms, editing each latent in turn, and looking for big changes specifically on the light pixels (a rough sketch of this kind of search is given below).
    • StyleGAN does not have a text input, and there's no mention of prompting (as far as I can tell - I
... (read more)
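For concreteness, here is a toy, purely hypothetical sketch of that kind of search; the generator interface, the light-pixel mask, and the edit size are assumptions for illustration, not the paper's actual code.

```python
# A toy, hypothetical sketch of the search described above: perturb each latent
# direction of a pretrained generator in turn and keep the direction whose edits
# change the "light" pixels much more than the rest of the image.
import torch

def find_lighting_latent(generator, z, light_mask, eps: float = 3.0) -> int:
    """generator: assumed to map a latent vector z -> image tensor [3, H, W].
    light_mask: boolean [H, W] mask marking lamp/window pixels in the base image."""
    base = generator(z)
    best_dim, best_score = -1, 0.0
    for dim in range(z.shape[-1]):
        z_edit = z.clone()
        z_edit[..., dim] += eps                              # edit one latent direction
        diff = (generator(z_edit) - base).abs().mean(dim=0)  # per-pixel change
        light_change = diff[light_mask].mean()               # change on light pixels...
        other_change = diff[~light_mask].mean()              # ...versus everywhere else
        score = (light_change - other_change).item()         # want light-specific changes
        if score > best_score:
            best_dim, best_score = dim, score
    return best_dim

# Dummy generator so the sketch runs end to end (a stand-in for StyleGAN).
torch.manual_seed(0)
W = torch.randn(3 * 8 * 8, 16)
dummy_generator = lambda z: (z @ W.T).reshape(3, 8, 8)
mask = torch.zeros(8, 8, dtype=torch.bool)
mask[:2, :2] = True  # pretend the top-left corner is a lamp
print("best 'lighting' latent:", find_lighting_latent(dummy_generator, torch.randn(16), mask))
```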
Sam Marks (4 points, 17d)
Yeah, sorry, I should have made clear that the story that I tell in the post is not contained in the linked paper. Rather, it's a story that David Bau sometimes tells during talks, and which I wish were wider-known. As you note, the paper is about the problem of taking specific images and relighting them (not of generating any image at all of an indoor scene with unlit lamps), and the paper doesn't say anything about prompt-conditioned models. As I understand things, in the course of working on the linked project, Bau's group noticed that they couldn't get scenes with unlit lamps out of the popular prompt-conditioned generative image models.

Thanks! Yeah, I hadn't seen that but someone pointed it out on Twitter. Feels like fun complementary work

I'm so torn on this paper - I think it makes a reasonable point that many claims of emergence are overrated and that it's easy to massage metrics into a single narrative. But also, I think the title and abstract are overclaiming clickbait - obviously models have emergent abilities!! Chain of thought and few-shot learning are just not things smaller models can do. Accuracy is sometimes the right metric, etc. It's often overhyped, but this paper way overclaims

Can you elaborate? I don't really follow, this seems like a pretty niche concern to me that depends on some strong assumptions, and ignores the major positive benefits of interpretability to alignment. If I understand correctly, your concern is that if AIs can know what the other AIs will do, this makes inter-AI coordination easier, which makes a human takeover easier? And that dangerous AIs will not be capable of doing this interpretability on AIs themselves, but will need to build on human research of mechanistic interpretability? And that mechanistic in... (read more)

the gears to ascension (1 point, 1mo)
All correct claims about my viewpoint. I'll dm you another detail.

To me (2) seems fairly clearly false - at the very least it's not doing anything about inner alignment (debate on weights/activations does nothing to address this, since there's still no [debaters are aiming to win the game] starting point).

Why do you believe this? It's fairly plausible to me that "train an AI to use interpretability tools to show that this other AI is being deceptive" is the kind of scalable oversight approach that might work, especially for detecting inner misalignment, if you can get the training right and avoid cooperation. But that seems like a plausibly solvable problem to me

Joe_Collman (1 point, 1mo)
The problem is robustly getting the incentive to show that the other AI is being deceptive. Giving access to the weights, activations and tools may give debaters the capability to expose deception - but that alone gets you nothing. You're still left saying: so long as we can get the AI to robustly do what we want (i.e. do its best to expose deception), we can get the AI to robustly do what we want. Similarly, "...and avoid cooperation" is essentially the entire problem.

To be clear, I'm not saying that an approach of this kind will never catch any instances of an AI being deceptive (this is one reason I'm less certain on (1)). I am saying that there's no reason to predict anything along these lines should catch all such instances. I see no reason to think it'll scale.

Another issue: unless you have some kind of true name of deception (I see no reason to expect this exists), you'll train an AI to detect [things that fit your definition of deception], and we die to things that didn't fit your definition.

Maybe in contrast to other fields of ML? (Though that's definitely stopped being true for eg LLMs)

Case studies: finding algorithms inside networks that implement specific capabilities. My favorite papers here are Olsson et al. (2022), Nanda et al. (2023), Wang et al. (2022) and Li et al. (2022); I’m excited to see more work which builds on the last in particular to find world-models and internally-represented goals within networks.

If you want to build on Li et al (the Othello paper), my follow-up work is likely to be a useful starting point, and then the post I wrote about future directions I'm particularly excited about

Some recommended ways to upskill at empirical research (roughly in order):

For people specifically interested in getting into mechanistic interpretability, my guide to getting started may be useful - it's much more focused on the key, relevant parts of deep learning, with a bunch more interpretability specific stuff

Eventually, once you've had a bunch of experience, you might notice a feeling of confusion or frustration: why is everyone else missing the point, or doing so badly at this? (Though note that a few top researchers commented on a draft to say that they didn't have this experience.) For some people that involves investigating a specific topic (for me, the question “what’s the best argument that AGI will be misaligned?“); for others it's about applying skills like conscientiousness (e.g. "why can't others just go through all the obvious steps?") Being excelle

... (read more)
NicholasKross (2 points, 1mo)
This is interesting; I'm still looking for my own (I think?) "comparative advantage" in this area. Some mental motions are very easy, while some "trivial" tasks feel harder (or would require me to already be involved full-time, leading to a chicken-and-egg problem).

How are OpenAI training these tokenizers?! I'm surprised they still have weird esoteric tokens like these in there, when presumably there's eg a bunch of words that are worth learning

but it seems to me like a simple assessment using the ABC model I outlined in the post should take only a few minutes

Empirically, many people new to the field get very paralysed and anxious about fears of doing accidental harm, in a way that I believe has significant costs. I haven't fully followed the specific model you outline, but it seems to involve ridiculously hard questions around the downstream consequences of your work, which I struggle to robustly apply to my work (indirect effects are really hard man!). Ditto, telling someone that they need t... (read more)

Interesting, thanks for the context. I buy that this could be bad, but I'm surprised that you see little upside - the obvious upside esp for great work like transformer circuits is getting lots of researchers nerdsniped and producing excellent and alignment relevant interp work. Which seems huge if it works

I want to say that I agree the transformer circuits work is great, and that I like it, and am glad I had the opportunity to read it! I still expect it was pretty harmful to publish.

Nerdsniping goes both ways: you also inspire things like the Hyena work trying to improve architectures based on components of what transformers can do.

I think indiscriminate hype and trying to do work that will be broadly attention-grabbing falls on the wrong side, likely doing net harm. Because capabilities improvements seem empirically easier than understanding them, and ther... (read more)

Some examples of justifications I have given to myself are “You’re so new to this, this is not going to have any real impact anyway”,

I think this argument is just clearly correct among people new to the field - thinking that your work may be relevant to alignment is motivating and exciting and represents the path to eventually doing useful things, but it's also very likely to be wrong. Being repeatedly wrong is what improvement feels like!

People new to the field tend to wildly overthink the harms of publishing, in a way that increases their anxiety and makes them much more likely to bounce off. This is a bad dynamic, and I wish people would stop promoting it

habryka (5 points, 1mo)
As someone who is quite concerned about the AI Alignment field having had a major negative impact via accelerating AI capabilities, I also agree with this. It's really quite unlikely for your first pieces of research to make a huge difference. I think the key people who I am worried will drive forward capabilities are people who have been in the field for quite a while and have found traction on the broader AGI problems and questions (as well as people directly aiming towards accelerating capabilities, though the worry there is somewhat different in nature). 
marc/er (4 points, 1mo)
It's fine to make the mistake of publishing something if the mistake you made was assuming "this is great research", but if the mistake was "this is safe to publish because I'm new to research", the consequences can be irreversible.

I probably fall into the category of 'wildly overthinking the harms of publishing due to inexperience', but it seems to me like a simple assessment using the ABC model I outlined in the post should take only a few minutes and could quickly inform someone of whether or not they might want to show their research to someone more experienced before publishing.

I am personally having this dilemma. I have something I want to publish, but I'm unsure of whether I should listen to the voice telling me "you’re so new to this, this is not going to have any real impact anyway" or the voice that's telling me "if it does have some impact or was hypothetically implemented in a generally intelligent system this could reduce extinction risk but inflate s-risk". It was a difficult decision, but I decided I would rather show someone more experienced, which is what I am doing currently. This post was intended to be a summary of why/how I converged upon that decision.

And I think I broadly see people overestimating the benefits of publishing their work relative to keeping it within a local cluster.

I'm surprised by this claim, can you say more? My weak read is that people in interp under-publish to wider audiences (e.g. getting papers into conferences), though maybe people over-publish blog posts? (Or that I try too hard to make things go viral on Twitter lol)

James Payor (3 points, 1mo)
I'm perhaps misusing "publish" here, to refer to "putting stuff on the internet" and "raising awareness of the work through company Twitter" and etc. I mostly meant to say that, as I see it, too many things that shouldn't be published are being published, and the net effect looks plausibly terrible with little upside (though not much has happened yet in either direction). The transformer circuits work strikes me this way, so does a bunch of others. Also, I'm grateful to know your read! I'm broadly interested to hear this and other raw viewpoints, to get a sense of how things look to other people.
Noosphere89 (7 points, 1mo)
I disagree with James Payor about people overestimating the benefits of publishing interpretability work; I think it's the opposite: people underestimate how good publishing interpretability work is, primarily because a lot of people on LW view interpretability as something that will be solved by a single clean insight, when this is usually not the case. To quote 1a3orn: In general, I think interpretability research is net positive because capabilities will probably differentially progress towards more understandable models, and we are in a huge bottleneck right now for alignment.

Thanks, that looks really useful! Do you have GPU price performance numbers for lower precision training? Models like Chinchilla were trained in bf16, so that seems a more relevant number.

Jsevillamol (2 points, 1mo)
Thanks Neel! The difference between tf16 and FP32 comes to a ~15x factor IIRC. Though also ML developers seem to prioritise characteristics other than cost effectiveness when choosing GPUs, like raw performance and interconnect, so you can't just multiply the top price performance we showcase by this factor and expect that to match the cost performance of the largest ML runs today. More soon-ish.

Thanks! I also feel more optimistic now about speed research :) (I've tried similar experiments since, but with much less success - there's a bunch of contingent factors around not properly hitting flow and not properly clearing time for it though). I'd be excited to hear what happens if you try it! Though I should clarify that writing up the results took a month of random spare non-work time...

Re models can be deeply understood, yes, I think you raise a valid and plausible concern and I agree that my work is not notable evidence against. Though also, idk ... (read more)

Er, hmm. To me this feels like a pretty uncontroversial claim when discussing a small model on an algorithmic task like this. (Note that the model is literally trained on uniform random legal moves, it's not trained on actual Othello game transcripts). Though I would agree that eg "literally all that GPT-4 cares about is predicting the next token" is a dubious claim (even ignoring RLHF). It just seems like Othello-GPT is so small, and trained on such a clean and crisp task that I can't see it caring about anything else? Though the word care isn't really we... (read more)

TurnTrout (4 points, 1mo)
This statement seems fine, yeah!  (Rereading my initial comment, I regret that it has a confrontational tone where I didn't intend one. I wanted to matter-of-factly state my concern, but I think I should have prefaced with something like "by the way, not a huge deal overall, but I think your statement here isn't known to be strictly true." Edited.)

I previously had considered that any given corpus could have been generated by a large number of possible worlds, but I now don't weight this objection as highly.

Interesting, I hadn't seen that objection before! Can you say more? (Though maybe not if you aren't as convinced by it any more). To me, it'd be that there's many worlds but they all share some commonalities and those commonalities are modelled. Or possibly that the model separately simulates the different worlds.

TurnTrout (2 points, 1mo)
So, first, there's an issue where the model isn't "remembering" having "seen" all of the text. It was updated by gradients taken over its outputs on the historical corpus. So there's a subtlety, such that "which worlds are consistent with observations" is a wrongly-shaped claim. (I don't think you fell prey to that mistake in OP, to be clear.)

Second, on my loose understanding of metaphysics (i.e. this is reasoning which could very easily be misguided), there exist computable universes which contain entities training this language model given this corpus / set of historical signals, such that this entire setup is specified by the initial state of the laws of physics. In that case, the corpus and its regularities ("dogs" and "syntax" and such) wouldn't necessarily reflect the world the agent was embedded in, which could be anything, really. Like maybe there's an alien species on a gas giant somewhere which is training on fictional sequences of tokens, some of which happen to look like "dog".

Of course, by point (1), what matters isn't the corpus itself (i.e. what sentences appear) but how that corpus imprints itself into the network via the gradients. And your post seems like evidence that even a relatively underspecified corpus (sequences of legal Othello moves) appears to imprint itself into the network, such that the network has a world model of the data generator (i.e. how the game works in real life).

Does this make sense? I have some sense of having communicated poorly here, but hopefully this is better than leaving your comment unanswered.

Refusing to engage billionaires on twitter - especially ones that are sufficiently open to being convinced that they will drop $44 billion for something as pedestrian as a social media company. 

This one isn't obvious to me - having billionaires take radical action can be very positive or very negative (and last time this was tried on Elon he founded OpenAI!)

The frequently accompanying, action-relevant claim -- that substantially easier-to-interpret alternatives exist -- is probably false and distracts people with fake options. That's my main thesis.

I agree with this claim (anything inherently interpretable in the conventional sense seems totally doomed). I do want to push back on an implicit vibe of "these models are hard to interpret because of the domain, not because of the structure" though - interpretability is really fucking hard! It's possible, but these models are weird and cursed and rife with bullshit l... (read more)

1a3orn (6 points, 2mo)
Yeah, you're just right about vibes. I was trying to give "possible but hard" vibes, and the end result just tilts too far one way and doesn't speak enough about concrete difficulties.

If you feel like you have enough close friends to satisfy you, then more power to you! It's not my job to tell you how to live your life if you're happy with it

papetoast (1 point, 2mo)
I am mainly just curious how other people live their lives. It is interesting to know how diverse humans really are. Also, I may just be stuck in a local optimum; then it would at least be nice to know there are better local optima, even if it would take me too much effort to change my way of life.

Idk, I feel like GPT4 is capable of tool use, and also capable of writing enough code to make its own tools.

ws27a (1 point, 2mo)
I agree that it's capable of doing that, but it just doesn't do it. If you ask it to multiply a large number, it confidently gives you some incorrect answer a lot of the time instead of using its incredible coding skills to just calculate the answer. If it was trained via reinforcement learning to maximize a more global and sophisticated goal than merely predicting the next word correctly or avoiding linguistic outputs that some humans have labelled as good or bad, it's very possible it would go ahead and invent these tools and start using them, simply because it's the path of least resistance towards its global goal.

I think the real question is what that global goal is supposed to be, and maybe we even have to abandon the notion of training based on reward signals altogether. This is where we get into very murky and unexplored territory, but it's ultimately where the research community has to start looking. Just to conclude on my own position: I absolutely believe that GPT-like systems can be one component of a fully fledged AGI, but there are other crucial parts missing currently that we do not understand in the slightest.

Oh huh, I actually really like Quantifiable Altruism

Great article! This helped me reframe some of the strong negative reactions to Yudkowsky's article on Twitter

Noosphere89 (8 points, 2mo)
I suspect a lot of the negative reactions to Yudkowsky's article isn't about norms, exactly, but rather a disagreement of how far we should be willing to go to slow down AI. Yudkowsky is on the extreme end of the spectrum, which views airstrikes leading to global nuclear warfare as okay if AI is slowed down. Suffice it to say, if you don't believe that doom is certain, then you will have massive issues with going this far for AI safety.

I think if we imagine an n-gram model where n approaches infinity and the size of the corpus we train on approaches infinity, such a model is capable of going beyond even GPT. Of course it's unrealistic, but my point simply is that surface level statistics in principle is enough to imitate intelligence the way ChatGPT does.

Sure, in a Chinese room style fashion, but IMO reasoning + internal models have significantly different generalisation properties, and also are what actually happen in practice in models rather than an enormous table of N-Grams. And I... (read more)

ws27a (1 point, 2mo)
I am happy to consider a distinction between world models and n-gram models; I just still feel like there is a continuum of some sort if we look closely enough. n-gram models are sort of like networks with very few parameters. As we add more parameters to calculate the eventual probability in the softmax layer, at what point do the world models emerge? And when exactly do we term them world models? But I think we're on the same page with regards to the chess example. Your formulation of "GPT-4 does not care about learning chess" is spot on. And in my view that's the problem with GPT in general. All it really cares about is predicting words.

Surely any capabilities researcher concerned enough to be willing to do this should just switch to safety-relevant research? (Also, IMO the best AI researchers tend not to be in this for the money)

Lol. This is a surprisingly decent summary, and the weaknesses are correctly identified things I did not try to cover

I tried to be explicit in the post that I don't personally care all that much about the world model angle - Othello-GPT clearly does form a world model, it's very clear evidence that this is possible. Whether it happens in practice is a whole other question, but it clearly does happen a bit.

They are still statistical next token predictors, it's just the statistics are so complicated it essentially becomes a world model. The divide between these concepts is artificial.

I think this undersells it. World models are fundamentally different from surface leve... (read more)

ws27a (1 point, 2mo)
I think if we imagine an n-gram model where n approaches infinity and the size of the corpus we train on approaches infinity, such a model is capable of going beyond even GPT. Of course it's unrealistic, but my point simply is that surface level statistics in principle is enough to imitate intelligence the way ChatGPT does.

Of course, literally storing probabilities of n-grams is a super poorly compressed way of doing things, and ChatGPT clearly finds more efficient solutions as it moves through the loss landscape trying to minimize next token prediction error. Some of those solutions are going to resemble world models in that features seem to be disentangled from one another in ways that seem meaningful to us humans or seem to correlate with how we view the world spatially or otherwise.

But I would argue that that has likely been happening since we used multilayer perceptrons for next word prediction in the 80s or 90s. I don't think it's so obvious exactly when something is a world model and when it is not. Any neural network is an algorithm in the sense that the state of node A determines the state of node B (setting aside the randomness of dropout layers). Any neural network is essentially a very complex decision tree. The divide that people are imagining between rule-based algorithmic following of a pattern and neural networks is completely artificial. The only difference is how we train the systems to find whatever algorithms they find.

To me, it would be interesting if ChatGPT developed an internal algorithm for playing chess (for example), such that it could apply that algorithm consistently no matter the sequence of moves being played. However, as we know, it does not do this. What might happen is that ChatGPT develops something akin to spatial awareness of the chess board that can perhaps be applied to a very limited subset of move orders in the game. For example, it's possible that it will understand that if e3 is passive and e4 is more ambitious,

Where is token deletion actually used in practice? My understanding was that since the context window of GPT4 was insanely long, users don't tend to actually exceed it such that you need to delete tokens. And I predict that basically all interesting behaviour happens in that 32K tokens without needing to account for deletion

Are there experiments with these models that show that they're capable of taking in much more text than the context window allows?

Cleo Nardo (6 points, 2mo)
I've spoken to some other people about Remark 1, and they also seem doubtful that token deletion is an important mechanism to think about, so I'm tempted to defer to you. But on the inside view: the finite context window is really important. 32K is close enough to infinity for most use-cases, but that's because users and orgs are currently underutilising the context window. The correct way to utilise the 32K context window is to fill it with any string of tokens which might help the computation. Here are some fun things to fill the window with:
  • A summary of facts about the user.
  • A summary of news events that occurred after training ended. (This is kinda what Bing Sydney does - Bing fills the context window with the search results for key words in the user's prompt.)
  • A "compiler prompt" or "library prompt": a long string of text which elicits high-level, general-purpose, quality-assessed simulacra which can then be mentioned later by the users. (I don't mean simulacra in a woo manner.) Think of "import numpy as np" but for prompts.
  • Literally anything other than null tokens.
I think no matter how big the window gets, someone will work out how to utilise it. The problem is that the context window has grown faster than prompt engineering, so no one has realised yet how to properly utilise the 32K window. Moreover, the orgs (Anthropic, OpenAI, Google) are too much "let's adjust the weights" (fine-tuning, RLHF, etc.), rather than "let's change the prompts".

These bugs should be fixed, thanks for flagging!

Thanks! Yes, your description of zero ablation is correct. I think positive or negative is a matter of convention? To me "positive = is important" and "negative = damaging" is the intuitive way round, which is why I set it up the way I did.

And yeah, I would be excited to see this applied to mean ablation!
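For concreteness, here is a minimal sketch of the two ablation styles side by side (assuming the TransformerLens library; the model, prompt, and head are arbitrary choices for illustration, not taken from the original demo), using the "positive = important" convention above, i.e. the loss increase when the head is ablated.

```python
# A minimal sketch of zero ablation vs mean ablation of one attention head via hooks.
# Assumes the TransformerLens library; model, prompt, and head are arbitrary examples.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The Eiffel Tower is in the city of")
layer, head = 9, 6
hook_name = utils.get_act_name("z", layer)  # per-head attention outputs

# Cache the head's mean output over this prompt, to use for mean ablation
_, cache = model.run_with_cache(tokens)
mean_z = cache[hook_name][:, :, head, :].mean(dim=1, keepdim=True)

def zero_ablate(z, hook):
    z[:, :, head, :] = 0.0  # replace the head's output with zeros
    return z

def mean_ablate(z, hook):
    z[:, :, head, :] = mean_z  # replace the head's output with its mean over positions
    return z

clean_loss = model(tokens, return_type="loss")
zero_loss = model.run_with_hooks(tokens, return_type="loss", fwd_hooks=[(hook_name, zero_ablate)])
mean_loss = model.run_with_hooks(tokens, return_type="loss", fwd_hooks=[(hook_name, mean_ablate)])

# "Positive = important": how much worse the loss gets when the head is ablated
print("zero-ablation score:", (zero_loss - clean_loss).item())
print("mean-ablation score:", (mean_loss - clean_loss).item())
```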

Thanks for noting the bugs, I should really freeze the demos on a specific version of the library...

Xander Davies (1 point, 2mo)
Makes sense! Depends on if you're thinking about the values as "estimating zero ablation" or "estimating importance."

Er, maybe if we get really good at doing patching-style techniques? But there's definitely not an obvious path - I more see lie detectors as one of the ultimate goals of mech interp, but whether this is actually possible or practical is yet to be determined.

Oh, ugh, Typeguard was updated to v3 and this broke things. And the circuitsvis import was a mistake. Should be fixed now, thanks for flagging!

Charlie Steiner (2 points, 2mo)
Thanks for the cool notebook!

This seems completely negligible to me, given how popular ChatGPT was. I wouldn't worry about it

If people want concrete mechanistic interpretability projects to work on, my 200 concrete open problems in mechanistic interpretability is hopefully helpful!

Victor Levoso (4 points, 3mo)
Exactly, it's already linked on the project ideas channel of the Discord server. Part of the reason I wanted to do this is that it seems to me that there's a lot of things on that list that people could be working on, and apparently there's a lot of people who want to work on MI, going by the number of people that applied to the Understanding Search in Transformers project in AI Safety Camp. What's missing is some way of taking those people and getting them to actually work on those projects.

I give them a lot of credit for, to my eyes, realising this was a big deal way earlier than almost anyone else, doing a lot of early advocacy, and working out some valuable basic ideas, like early threat models, ways in which standard arguments and counter-arguments were silly, etc. I think this kind of foundational work feels less relevant now, but is actually really hard and worthwhile!

(I don't see much recent stuff I'm excited about, unless you count Risks from Learned Optimisation)

I'm confused by this example. This seems exactly the kind of time where an averaged point estimate is the correct answer. Say there's a 50% chance the company survives and is worth $100 and a 50% chance it doesn't and is worth $0. In this case, I am happy to buy or sell the price at $50.

Doing research to figure out it's actually an 80% chance of $100 means you can buy a bunch and make $30 in expected profit. This isn't anything special though - if you can do research and form better beliefs than the market, you should make money. The different world models don't seem relevant here to me?
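Spelling out the arithmetic with the numbers above:

$$\mathbb{E}[\text{value}] = 0.8 \times \$100 + 0.2 \times \$0 = \$80, \qquad \text{expected profit} = \$80 - \$50 = \$30$$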

I really like this idea! Making advance predictions feels like a much more productive way to engage with other people's work (modulo trusting you to have correctly figured out the answers)

Predictions below (note that I've chatted with the team about their results a bit, and so may be a bit spoiled - I'll try to simulate what I would have predicted without spoilers)

Behavioral Describe how the trained policy might generalize from the 5x5 top-right cheese region, to cheese spawned throughout the maze? IE what will the policy do when cheese is spawned elsewh

... (read more)

Many of these tokens are unprintable (i.e., they don't display and I don't know what they are).

The first 256 tokens are the 256 possible byte values (each token is a single byte, covering ASCII and the rest of the byte range). A bunch of them are basically never used (they exist so that an arbitrary string of bytes can be broken down into valid tokens)
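As a quick sanity check, here's a sketch assuming the tiktoken library and its GPT-2 encoding (other tokenizers may order their vocabularies differently):

```python
# Sketch: inspect the first 256 tokens of the GPT-2 vocabulary and check that
# they are single bytes. Assumes the `tiktoken` library.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# decode_single_token_bytes maps a token id to the raw bytes it represents
first_tokens = [enc.decode_single_token_bytes(i) for i in range(256)]

num_single_byte = sum(len(b) == 1 for b in first_tokens)
print(f"{num_single_byte}/256 of the first tokens are single bytes")

# Printable ASCII bytes show up directly; the rest are rarely-seen raw bytes
for i in range(5):
    print(i, first_tokens[i])
```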

Great question! My concrete suggestion is to look for interesting neurons in Neuroscope, as I discuss more in the final post. This is a website I made that shows the text that most activates each neuron in the model (for a ton of open source models), and by looking for interesting neurons, you can hopefully find some hook - find a specific task the model can consistently-ish do, analogous to IOI (with a predictable structure you can generate prompts for, ideally with a somewhat algorithmic flavour - something you could write code to solve). And then do the... (read more)

redhatbluehat (1 point, 2mo)
Thank you for this response! 

I think that what you're saying is correct, in that ChatGPT is trained with RLHF, which gives feedback on the whole text, not just the next token. It is true that GPT-3 outputs the next token and is trained to be myopic. But your argument seems suspect to me: just because a model takes steps that are in practice part of a sensible long-term plan does not mean that the model is intentionally forming a plan - it may just be that each step is the natural thing to myopically follow from what came before.

Bill Benzon (0 points, 3mo)
Oh, I have little need for the word “plan,” but it’s more convenient than various circumlocutions. Whatever it is that I’ve been calling a plan is smeared over those 175B weights and, as such, is perfectly accessible to next-token myopia. (Still, check out this tweet stream [https://twitter.com/charleswangb/status/1628136656762195971?s=20] by Charles Wang.) It’s just that, unless you’ve got some sophistication - and I’m slowly moving in that direction - saying that transformers work by next-token prediction is about as informative as saying that a laptop works by shuffling data and instructions back and forth between the processor and memory. Both statements are true, but not very informative. And when “next-token-prediction” appears in the vicinity of “stochastic parrots” or “auto-complete on steroids,” then we’ve got trouble. In that context the typical reader of, say, The New York Times or The Atlantic is likely to think of someone flipping coins or of a bunch of monkeys banging away on typewriters. Or, maybe they’ll think of someone throwing darts at a dictionary or reaching blindly into a bag full of words, which aren’t very useful either.

Of course, here in this forum, things are different. Which is why I posted that piece here. The discussion has helped me a lot. But it’s going to take a lot of work to figure out how to educate the general reader. Thanks for the comment.

Really nice post! I think this is an important point that I've personally been confused about in the past, and this is a great articulation (and solid work for 2 hours!!)

LawrenceC (5 points, 3mo)
Thanks! (As an amusing side note: I spent 20+ minutes after finishing the writeup trying to get the image from the recent 4-layer docstring circuit post to preview properly in the footnotes, and eventually gave up. That is, a full ~15% of the total time invested was spent on that footnote!)

I'm a bit confused why this happens, if the circuit only "needs" three layers of composition

I trained these models on only 22B tokens, of which only about 4B was Python code, and their residual stream has width 512. It totally wouldn't surprise me if it just didn't have enough data or capacity in 3L, even though it was technically capable.

LawrenceC (2 points, 3mo)
Ah, that makes sense!

Thanks for this post! I'm not sure how much I expect this to matter in practice, but I think that the underlying point of "sometimes the data distribution matters a lot, and ignoring it is suspect" seems sound and well made.

I personally think it's clear that 1L attn-only models are not literally just doing skip trigrams. A quick brainstorm of other things I presume they're doing:

  • Skip trigrams with positional decay - it's easy enough to add a negative term to the attention scores that gets bigger the further away the source token is (a toy sketch of this is given below). For skip trigrams like
... (read more)
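Here's a toy PyTorch sketch of what such a positional decay term could look like; it's an illustration of the idea, not a claim about the actual learned weights.

```python
# Toy sketch of "positional decay": subtract a distance-dependent penalty from
# pre-softmax attention scores, so closer source tokens are favoured.
import torch

def attention_with_positional_decay(scores: torch.Tensor, decay: float = 0.1) -> torch.Tensor:
    """scores: [seq_len, seq_len] pre-softmax attention scores (destination x source)."""
    seq_len = scores.shape[-1]
    dest = torch.arange(seq_len).unsqueeze(1)    # destination positions
    src = torch.arange(seq_len).unsqueeze(0)     # source positions
    distance = (dest - src).clamp(min=0)         # how far back each source token is
    penalised = scores - decay * distance        # further away -> bigger negative term
    causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    penalised = penalised.masked_fill(~causal_mask, float("-inf"))
    return torch.softmax(penalised, dim=-1)

# With uniform scores, the resulting attention pattern is driven purely by the decay term
print(attention_with_positional_decay(torch.zeros(5, 5)))
```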

Personal correspondence with someone who worked on it.

I don't know if they'd put it like this, but IMO solving/understanding superposition is an important part of being able to really grapple with circuits in language models, and this is why it's a focus of the Anthropic interp team

LawrenceC (3 points, 4mo)
At least based on my convos with them, the Anthropic team does seem like a clear example of this, at least insofar as you think understanding circuits in real models with more than one MLP layer in them is important for interp -- superposition just stops you from using the standard features-as-directions approach almost entirely!

Oh wait, that FAQ is actually nothing to do with GPT-3. That's about their embedding models, which map sequences of tokens to a single vector, and they're saying that those are normalised. Which is nothing to do with the map from tokens to residual stream vectors in GPT-3, even though that also happens to be called an embedding

Jessica Rumbelow (1 point, 4mo)
Aha!! Thanks Neel, makes sense. I’ll update the post