Credit for changing the wording, but I still feel this does not adequately convey how sweeping the impact of the proposal would be if implemented as-is. Foundation model-related work is a sizeable and rapidly growing chunk of active AI development. Of the 15K preprint papers posted on arXiv under the cs.AI category this year, 2K appear to be related to language models. The most popular Llama2 model weights alone have north of 500K downloads to date, and foundation-model-related repos have been trending on GitHub for months. "People working with [a few technical labs'] models" is a massive community containing many thousands of developers, researchers, and hobbyists. It is important to be honest about how they will likely be impacted by this proposed regulation.
If you have checkpoints from different points in the training of the same models, you could compare different-size models at the same loss value (i.e. matched performance). That way you're actually measuring the effect of scale alone, rather than scale confounded with performance.
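Something like the sketch below is what I have in mind; the checkpoint structure, names, and loss values are all made up for illustration, not anyone's actual pipeline.

```python
# Toy sketch: pick, for each model size, the checkpoint whose validation loss
# is closest to a shared target loss, so comparisons isolate scale rather than
# scale confounded with performance. All names/numbers here are hypothetical.

def loss_matched_checkpoints(checkpoints, target_loss):
    """checkpoints: dict mapping model size -> list of (step, val_loss, path)."""
    return {
        size: min(ckpts, key=lambda c: abs(c[1] - target_loss))
        for size, ckpts in checkpoints.items()
    }

checkpoints = {
    "125M": [(1_000, 4.2, "125m_step1000.pt"), (50_000, 3.1, "125m_step50000.pt")],
    "1.3B": [(1_000, 3.8, "1b_step1000.pt"), (20_000, 3.1, "1b_step20000.pt")],
}
print(loss_matched_checkpoints(checkpoints, target_loss=3.1))
```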
It would definitely move the needle for me if y'all are able to show this behavior arising in base models without forcing, in a reproducible way.
Good question. I don't have a tight first-principles answer. The helix puts a bit of positional information in the variable magnitude (otherwise it'd be an ellipse, which would alias different positions) and a bit in the variable rotation, whereas the straight line is the far extreme of putting all of it in the magnitude. My intuition is that (in a transformer, at least) encoding information through the norm of vectors + acting on it through translations is "harder" than encoding information through (almost-) orthogonal subspaces + acting on it through rotations.
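To make the aliasing point concrete, here is a toy numerical sketch (my own construction with made-up frequency and slope, not the actual learned embeddings): a pure-rotation embedding collides one period apart, while adding a growing linear/magnitude component keeps positions distinct.

```python
import numpy as np

omega, alpha = 2 * np.pi / 16, 0.1  # hypothetical frequency and slope

def circle(t):
    # pure rotation: all positional information in the angle
    return np.array([np.cos(omega * t), np.sin(omega * t)])

def helix(t):
    # rotation plus a growing linear component
    return np.array([np.cos(omega * t), np.sin(omega * t), alpha * t])

print(np.allclose(circle(3), circle(3 + 16)))  # True: positions 3 and 19 alias
print(np.allclose(helix(3), helix(3 + 16)))    # False: the linear component disambiguates
```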
Relevant comment from Neel Nanda: https://twitter.com/NeelNanda5/status/1671094151633305602
Very cool! I believe this structure allows expressing the "look back N tokens" operation (perhaps even for different Ns across different heads) via a position-independent rotation (and translation?) of the positional subspace of query/key vectors. This sort of operation is useful if many patterns in the dataset depend on the relative arrangement of tokens (for example, common n-grams) rather than their absolute positions. Since all these models use absolute positional embeddings, those embeddings have to contort themselves to make this happen.
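As a hedged sketch of this point (using a toy helix embedding of my own construction, not the actual learned positional embeddings): shifting back by N tokens is the same rotation-plus-translation of the positional subspace at every position.

```python
import numpy as np

omega, alpha, N = 0.3, 0.05, 4  # hypothetical helix parameters and offset

def e(t):
    # toy positional embedding: a helix in 3 dimensions
    return np.array([np.cos(omega * t), np.sin(omega * t), alpha * t])

theta = -omega * N
rotation = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])
translation = np.array([0.0, 0.0, -alpha * N])

# The same affine map sends e(t) to e(t - N) regardless of t, which is what
# would let a head implement "look back N tokens" position-independently.
for t in range(N, 50):
    assert np.allclose(rotation @ e(t) + translation, e(t - N))
print("ok")
```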
The main takeaway (translated to standard technical language) is that it would be useful to have some structured representation of the relationship between terminal values and instrumental values (at many recursive “layers” of instrumentality), analogous to how Bayes nets represent the structure of a probability distribution. That would potentially be more useful than a “flat” representation in terms of preferences/utility, much like a Bayes net is more useful than a “flat” probability distribution.
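Purely as an illustration of the kind of structure I have in mind (a hypothetical toy encoding of my own; it captures only the graph-shape side of the analogy, not anything like a Bayes net's factorization):

```python
# Hypothetical sketch: a DAG whose nodes are values and whose edges mean
# "is instrumental to"; terminal values are the sinks. All names are made up.

instrumental_to = {
    "earn money":   ["buy food", "pay rent"],
    "buy food":     ["stay healthy"],
    "pay rent":     ["have shelter"],
    "stay healthy": [],  # terminal value (sink)
    "have shelter": [],  # terminal value (sink)
}

def terminal_values(graph):
    """Values that are not instrumental to anything further."""
    return [v for v, downstream in graph.items() if not downstream]

print(terminal_values(instrumental_to))  # ['stay healthy', 'have shelter']
```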
That’s an interesting and novel-to-me idea. That said, the paper offers [little] technical development of the idea.
I believe Yoav Shoham has done a bit of work on this, attempting to create a formalism & graphical structure similar to Bayes nets for reasoning about terminal/instrumental value. See these two papers:
I think we're more or less on the same page now. I am also confused about the applicability of existing mechanisms. My lay impression is that there isn't much clarity right now.
For example, this uncertainty about who's liable for harms from AI systems came up multiple times during the recent AI hearings before the US Senate, in the context of Section 230's shielding of computer service providers from certain liabilities and the extent to which it & other laws extend here. In response to Senator Graham asking about this, Sam Altman straight up said, "We’re claiming we need to work together to find a totally new approach. I don’t think Section 230 is even the right framework."
Parallel distributed processing (as well as "connectionism") is just an early name for the line of work that was eventually rebranded as "deep learning". They're the same research program.