Wiki Contributions


In my accounting, the word "arbitrarily" saved me here. I do think I missed the middle ground of the sandboxed, limited programming environments like and the current version of ChatGPT!

That link is broken for me, did you mean to link to this Lilian Weng tweet?

I'm noticing my confusion about the level of support here. Kara Swisher says that these are 505/700 employees, but the OpenAI publication I'm most familiar with is the autointerpretability paper, and none (!) of the core research contributors to that paper signed this letter. Why is a large fraction of the company anti-board/pro-Sam except for 0/6 of this team (discounting Henk Tillman because he seems to work for Apple instead of OpenAI)? The only authors on that paper that signed the letter are Gabriel Goh and Ilya Sutskever. So is the alignment team unusually pro-board/anti-Sam, or are the 505 just not that large a faction in the company?

[Editing to add a link to the pdf of the letter, which is how I checked for who signed

I appreciate the joke, but I think that Sam Altman is pretty clearly "the biggest name in AI" as far as the public is concerned. His firing/hiring was the leading story in the New York Times for days in a row (and still is at time of writing)!

I hope this doesn't lead to everyone sorting into capabilities (microsoft) vs safety (openai). OpenAI's ownership was designed to preserve safety commitments against race dynamics, but microsoft has no such obligations, a bad track record (Sydney), and now the biggest name in AI. Those dynamics could lead to talent/funding/coverage going to capabilities unchecked by safety, which would increase my p(doom).

Two caveats:

  • We don't know what the Altman/Brockman "advanced AI research team" will actually be doing at Microsoft, and how much independence they'll have. 
  • According to the new OpenAI CEO Emmett Shear, the split wasn't due to "any specific disagreement on safety", but I think that could be the end result.

This is something we're planning to look into! From the paper: 

Future efforts could also try to improve feature dictionary discovery by incorporating information about the weights of the model or dictionary features found in adjacent layers into the training process.

Exactly how to use them is something we're still working on...

Good question! I started writing and when I looked up I had a half-dozen takes, so sorry if these are rambly. Also let me give the caveat that I wasn't on the training side of the project so these are less informed than Hoagy, Logan, and Aidan's views:

  • +1 to Aidan's answer.
  • I wish we could resolve tied vs untied purely via "use whichever makes things more interpretable by metric X", but right now I don't think our interpretability metrics are fine-grained and reliable enough to make that decision for us yet.
  • I expect a lot of future work will ask these architectural questions about the autoencoder architecture, and like transformers in general will settle on some guidelines of what works best.
  • Tied weights are expressive enough to pass the test of "if you squint and ignore the nonlinearity, they should still work". In particular, (ignoring bias terms) we're trying to make , so we need "", and many matrices satisfy .
  • Tied weights certainly make it easier to explain the autoencoder - "this vector was very far in the X direction, so in its reconstruction we add back in a term along the X direction" vs adding back a vector in a (potentially different) Y direction.
  • Downstream of this, tied weights make ablations make more sense to me. Let's say you have some input A that activates direction X at a score of 5, so the autoencoder's reconstruction is A≈ 5X+[other stuff]. In the ablation, we replace A with A-5X, and if you feed A-5X into the sparse autoencoder, the X direction will activate 0 so the reconstruction will be A-5X≈0X+[different other stuff due to interference]. Therefore the only difference in the accuracy of your reconstruction will be how much the other feature activations are changed by interference. But if your reconstructions use the Y vector instead, then when you feed in A-5X, you'll replace A≈5Y+[other stuff] with A-5X≈0Y+[different other stuff], so you've also changed things by 5X-5Y.
  • If we're abandoning the tied weights and just want to decompose the layer into any sparse code, why not just make the sparse autoencoder deeper, throw in smooth activations instead of ReLU, etc? That's not rhetorical, I honestly don't know... probably you'd still want ReLU at the end to clamp your activations to be positive. Probably you don't need too much nonlinearity because the model itself "reads out" of the residual stream via linear operations. I think the thing to try here is trying to make the sparse autoencoder architecture as similar to the language model architecture as possible, so that you can find the "real" "computational factors".

Incidentally, maybe I missed this in the writeup, but this post is only providing an injective self-attention → MLP construction, right?

Either I'm misunderstanding you or you're misunderstanding me, but I think I've shown the opposite: any MLP layer can be converted to a self-attention layer. (Well, in this post I actually show how to convert the MLP layer to 3 self-attention layers, but in my follow-up I show how you can get it in one.) I don't claim that you can do a self-attention → MLP construction.

Converting an arbitrary MLP layer to a self-attention layer is presumably doable - at least with enough parameters - but remains unknown

This is what I think I show here! Let the unknown be known!

Unfortunate that the construction is so inefficient: 12 heads → 3,000 heads or 250x inflation is big enough to be practically irrelevant (maybe theoretically too).

Yes, this is definitely at an "interesting trivia" level of efficiency.  Unfortunately, the construction is built around using 1 attention head per hidden dimension, so I don't see any obvious way to improve the number of heads. The only angle I have for this to be useful at current scale is that Anthropic (paraphrased) said "oh we can do interpretability on attention heads but not MLPs", so the conversion of the later into the former might supplement their techniques.

Load More