In my accounting, the word "arbitrarily" saved me here. I do think I missed the middle ground of the sandboxed, limited programming environments like you.com and the current version of ChatGPT!
That's all correct.
That link is broken for me; did you mean to link to this Lilian Weng tweet?
It was covered by Axios, which also links to it as a separate PDF with all 505 signatories.
I'm noticing my confusion about the level of support here. Kara Swisher says that these are 505/700 employees, but the OpenAI publication I'm most familiar with is the autointerpretability paper, and none (!) of the core research contributors to that paper signed this letter. Why would a large fraction of the company be anti-board/pro-Sam while 0/6 of this team signed (discounting Henk Tillman, who seems to work for Apple rather than OpenAI)? The only authors on that paper who signed the letter are Gabriel Goh and Ilya Sutskever. So is the alignment team unusually pro-board/anti-Sam, or are the 505 just not that large a faction in the company?
[Edited to add a link to the PDF of the letter, which is how I checked who signed: https://s3.documentcloud.org/documents/24172246/letter-to-the-openai-board-google-docs.pdf]
I appreciate the joke, but I think that Sam Altman is pretty clearly "the biggest name in AI" as far as the public is concerned. His firing/hiring was the leading story in the New York Times for days in a row (and still is at time of writing)!
I hope this doesn't lead to everyone sorting into capabilities (Microsoft) vs. safety (OpenAI). OpenAI's ownership structure was designed to preserve safety commitments against race dynamics, but Microsoft has no such obligations, a bad track record (Sydney), and now the biggest name in AI. Those dynamics could lead to talent/funding/coverage going to capabilities unchecked by safety, which would increase my p(doom).
This is something we're planning to look into! From the paper:
Future efforts could also try to improve feature dictionary discovery by incorporating information about the weights of the model or dictionary features found in adjacent layers into the training process.
Exactly how to use them is something we're still working on...
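To make the quoted direction a bit more concrete, here's a minimal sketch of a standard sparse autoencoder loss with one hypothetical extra term that pulls each learned feature toward a frozen dictionary from an adjacent layer. To be clear, this is just one way the idea could be operationalized, not the paper's (or the team's) actual method; the sizes, coefficients, and `adjacent_dict` are all made up for illustration:

```python
import torch
import torch.nn.functional as F

# Toy sizes; not the paper's hyperparameters.
d_model, n_feats = 512, 2048
W_enc = torch.randn(n_feats, d_model, requires_grad=True)
W_dec = torch.randn(d_model, n_feats, requires_grad=True)
b_enc = torch.zeros(n_feats, requires_grad=True)
# Hypothetical: a frozen dictionary already learned on an adjacent layer.
adjacent_dict = torch.randn(d_model, n_feats)

def sae_loss(x, l1_coef=1e-3, adj_coef=1e-3):
    f = F.relu(x @ W_enc.T + b_enc)        # sparse feature activations
    x_hat = f @ W_dec.T                    # reconstruction of the activations
    recon = (x - x_hat).pow(2).sum(-1).mean()
    sparsity = f.abs().sum(-1).mean()      # L1 penalty encourages sparse codes
    # Hypothetical extra term: pull each learned decoder feature toward its
    # best-matching feature in the adjacent layer's dictionary (max cosine sim).
    sims = F.normalize(W_dec, dim=0).T @ F.normalize(adjacent_dict, dim=0)
    adjacency = (1 - sims.max(dim=1).values).mean()
    return recon + l1_coef * sparsity + adj_coef * adjacency

x = torch.randn(64, d_model)               # stand-in for residual-stream activations
sae_loss(x).backward()                     # gradients flow; plug into any optimizer
```

Whether a term like that helps (or whether the weight/adjacent-layer information should instead enter through initialization or the data) is exactly the open question.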
Good question! I started writing, and when I looked up I had a half-dozen takes, so sorry if these are rambly. Also, let me give the caveat that I wasn't on the training side of the project, so these are less informed than Hoagy, Logan, and Aidan's views:
Incidentally, maybe I missed this in the writeup, but this post is only providing an injective self-attention → MLP construction, right?
Either I'm misunderstanding you or you're misunderstanding me, but I think I've shown the opposite: any MLP layer can be converted to a self-attention layer. (Well, in this post I actually show how to convert an MLP layer into 3 self-attention layers, but in my follow-up I show how you can get it in one.) I don't claim that you can do a self-attention → MLP construction.
Converting an arbitrary MLP layer to a self-attention layer is presumably doable - at least with enough parameters - but remains unknown
This is what I think I show here! Let the unknown be known!
Unfortunate that the construction is so inefficient: an inflation from 12 heads to ~3,000 heads (250x) is big enough to make it practically irrelevant (maybe theoretically too).
Yes, this is definitely at an "interesting trivia" level of efficiency. Unfortunately, the construction is built around using 1 attention head per hidden dimension, so I don't see any obvious way to improve the number of heads. The only angle I have for this to be useful at current scale is that Anthropic (paraphrased) said "oh we can do interpretability on attention heads but not MLPs", so converting the latter into the former might supplement their techniques.
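For concreteness, here's the back-of-the-envelope behind the ~250x figure, assuming GPT-2-small-like sizes (d_model = 768, MLP hidden = 4×d_model, 12 heads per layer); those dimensions are my assumption here, not necessarily the ones in the post:

```python
# Rough head-count arithmetic under GPT-2-small-like sizes (my assumption):
# the construction uses 1 attention head per MLP hidden unit.
d_model = 768
mlp_hidden = 4 * d_model                 # 3072 hidden units in a standard transformer MLP
heads_needed = mlp_hidden                # one head per hidden dimension
print(heads_needed, heads_needed / 12)   # 3072 heads, ~256x the usual 12 per layer
```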