Why does GPT-3 use the same matrix for word embedding and final predictions? I would expect this to constrain the model, and the only potential upsides I can see are saving parameters (lol) and preserving interpretability (lmao). Other resources like A Mathematical Framework for Transformer Circuits use different embedding/unembedding matrices - their W_E and W_U. Perhaps this is not necessary for GPT-3 since the final feed-forward network can perform an appropriate linear transformation, and in A Mathematical Framework they are looking at transformers without FFNs. But some properties (e.g. words being linear combinations of other words) cannot be changed by such a linear transformation, so having an entirely new unembedding matrix could still add value.
This is called "tied embeddings". You're right that models don't need to have this constraint, and some don't - for instance, GPT-NeoX. I'm not sure whether or not this actually improves performance in practice though.
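To make the constraint concrete, here's a minimal numpy sketch of what tying means (the dimensions, initialization scale, and variable names are made up for illustration, not from any actual model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 1000, 64

# One shared matrix: rows are token embeddings.
W_E = rng.normal(size=(vocab, d_model)) * 0.02

token_id = 42
x = W_E[token_id]        # embedding lookup: shape (d_model,)
# ... the residual stream would be transformed by attention/FFN layers here ...
logits = x @ W_E.T       # tied unembedding reuses the same matrix: shape (vocab,)

# An untied model would instead learn a separate unembedding W_U:
W_U = rng.normal(size=(d_model, vocab)) * 0.02
logits_untied = x @ W_U
```

The tied version saves `vocab * d_model` parameters but forces the output logit for a token to be the dot product of the residual stream with that token's own embedding.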
I don't think the game is an alarming capability gain at all - I agree with LawrenceC's comment below. It's more of a "gain-of-function research" scenario to me. Like, maybe we shouldn't deliberately try to train a model to be good at this? If you've ever played Diplomacy, you know the whole point of the game is manipulating and backstabbing your way to world domination. I think it's great that the research didn't actually seem to come up with any scary generalizable techniques or dangerous memetics, but ideally we shouldn't even be trying in the first place.
So if streaming works as well as Cerebras claims, GPUs can do that as well or better.
Hmm, I'm still not sure I buy this after spending some more time thinking about it. GPUs can't stream a matrix multiplication efficiently, as far as I'm aware. My understanding is that they're not very good at matrix-vector operations compared to matrix-matrix ones, because they rely on blocked matrix multiplies to use caches efficiently and avoid pulling weights from RAM every time.
Cerebras says that the CS-2 is specifically designed for fast matrix-vector operations, and uses dataflow scheduling, so it can stream a matrix multiplication by just performing matrix-vector operations as weights stream in. And the weights are getting streamed from external RAM, rather than requested as needed, so there's no round-trip latency gunking up the works like a GPU has when it wants data from RAM.
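As I understand the claim, the weight-streaming idea looks something like this toy numpy sketch: activations stay resident while weight tiles flow past, each needing to be on-chip only long enough for one matrix-vector product. All shapes and chunk sizes here are invented, and the real dataflow scheduling is obviously far more complicated:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, chunk = 512, 256, 32

W = rng.normal(size=(d_out, d_in))   # weights living in external memory (MemoryX-style)
x = rng.normal(size=(d_in,))         # activations resident on-chip

# Stream W through in row-chunks: each tile is used for one
# matrix-vector product and can then be discarded.
y = np.empty(d_out)
for start in range(0, d_out, chunk):
    w_tile = W[start:start + chunk]      # "streamed in" from off-chip
    y[start:start + chunk] = w_tile @ x  # matrix-vector work on-chip

assert np.allclose(y, W @ x)             # same result as the full matmul
```

The point of the sketch is just that nothing about the computation itself requires the whole weight matrix to be resident at once - the question is whether the hardware can keep the stream fed.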
I agree sparsity (and also probably streaming) will be increasingly important; I've actually developed new techniques for sparse matrix multiplication on GPUs.
Cerebras claims that their hardware support for fast matrix-vector multiplication gives a 10x speed boost to multiplying sparse matrices, which could be helpful.
The Andromeda 'supercomputer' has a peak performance of 120 dense PFLOPS, compared to 512 dense PFLOPS for a single 256-GPU H100 pod from Nvidia.
I'm not sure PFLOPS are a fair comparison here though, if I understand Cerebras' point correctly. Like, if you have ten GPUs with one PFLOPS each, that's technically the same number of PFLOPS as a single GPU with ten PFLOPS. But that single GPU is going to train a lot faster than the ten GPUs, because the ten GPUs have to spend time communicating with each other - especially as memory limitations force you to resort to tensor or pipeline parallelism instead of data parallelism. Cerebras claims that to train "10 times faster you need 50 times as many GPUs."
By this logic, what you really care about instead is probably training speed, or training speedup per dollar. Then the pitch for Andromeda, unlike a GPU pod, is that those 120 PFLOPS are "real", in the sense that training speed increases linearly with the PFLOPS.
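As a toy illustration of why raw PFLOPS can mislead, here's a crude efficiency model where each added device spends some extra fraction of its time communicating. The 4% per-device overhead is a number I invented for illustration - it's not from Cerebras or Nvidia, and real scaling curves depend heavily on the parallelism strategy:

```python
def effective_pflops(n_devices, pflops_each, comm_frac_per_device=0.04):
    """Toy model: throughput = n * per-device FLOPS * efficiency(n),
    where efficiency decays as communication grows with device count."""
    efficiency = 1.0 / (1.0 + comm_frac_per_device * (n_devices - 1))
    return n_devices * pflops_each * efficiency

print(effective_pflops(1, 10))   # one big device: 10.0 "real" PFLOPS
print(effective_pflops(10, 1))   # ten small devices: well under 10 effective
```

Under any model of this shape, nominal PFLOPS overstates the many-device configuration, which is exactly the gap the "50 times as many GPUs" claim is pointing at.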
The Cerebras wafer-scale chip is a weird architecture that should excel in the specific niche of training small models at high speed, but that just isn't where the industry is going. It is severely lacking in the large, cheap, fast off-chip RAM that GPUs have.
I'm not sure I totally have a good grasp on this, but isn't this the whole point of Andromeda's weight streaming system? Fast off-chip memory combined with high memory bandwidth on the chip itself? Not sure what would limit this to small models if weights can be streamed efficiently, as Cerebras claims.
Even if I'm right, I'm not sure either of these points changes the overall conclusion though. I'd guess Cerebras still isn't economically competitive, or they'd be boasting about it as you said.
Hmm, I see how that would happen with other architectures, but I'm a bit confused how this is O(n²) here? Andromeda has the weight updates computed by a single server (MemoryX) and then distributed to all the nodes. Wouldn't this be a one-to-many broadcast with O(n) transmission time?
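Just to spell out the message-counting argument I have in mind (the worker count is arbitrary, and this ignores tree-broadcast tricks and link bandwidth entirely):

```python
def broadcast_messages(n_workers):
    # One central server (MemoryX-style) sends the update to each worker: O(n).
    return n_workers

def all_pairs_messages(n_workers):
    # Every worker exchanges directly with every other worker: O(n^2).
    return n_workers * (n_workers - 1)

print(broadcast_messages(64))   # 64
print(all_pairs_messages(64))   # 4032
```

So the quadratic blowup only seems to bite if updates have to flow between all pairs of nodes, rather than fanning out from one server.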
No substantive reply, but I do want to thank you for commenting here - original authors publicly responding to analysis of their work is something I find really high value in general. Especially academics that are outside the usual LW/AF sphere, which I would guess you are given your account age.
I'm not sure exactly where I land on this, but I think it's important to consider that restricting the data companies can train on could influence the architectures they use. Self-supervised autoregressive models à la GPT-3 seem a lot more benign than full-fledged RL agents, and the latter are a lot less data-hungry than the former (especially in terms of copyrighted data). There are enough other factors here that I'm not completely confident in this analysis, but it's worth thinking about.
This proposal looks really promising to me. This might be obvious to everyone, but I think much better interpretability research is really needed to make this possible in a safe(ish) way. (To verify the shard does develop, isn't misaligned, etc.) We'd just need to avoid the temptation to take the fancy introspection and interpretability tools this would require and use them as optimization targets, which would obviously make them useless as safeguards.
This is definitely the core challenge of the language model approach, and may be the reason it fails. I actually believe language models aren't the most likely approach to achieve superintelligence. But I also place a non-trivial probability on this occurring, which makes it worth thinking about for me.
Let me try to explain why I don't rule this possibility out. Obviously GPT-3 doesn't know more than a human, as is evident from its sub-human performance on common tasks and benchmarks. But suppose we instead have a much more advanced system, a near-optimal sequence predictor for human-written text. Your argument is still correct - it can't output anything more than a human would know, because that wouldn't achieve minimum loss on the training data. But does that imply it can't know more than humans? That is, is it impossible for it to make use of facts that humans don't realize as an intermediate step in outputting text that only includes facts humans do realize?
I think not necessarily. As an extreme example, one particular optimal sequence predictor would be a perfect simulation, atom-for-atom, of the entire universe at the time a person was writing the text they wrote. Trivially, this sequence predictor "knows" more than humans do, since it "knows" everything, but it will also never output that information in the predicted text.
More practically, sequence prediction is just compression. More effective sequence prediction means more effective compression. The more facts about the world you know, the less data is required to describe each individual piece of text. For instance, knowing the addition algorithm is a more space-efficient way to predict the completions of strings like "45324 + 58272 =" than memorizing them. As the size of the training data you're given approaches infinity, assuming a realistic space-bounded sequence predictor, the only way its performance can improve is with better world/text modeling. The fact that humans don't know a certain fact wouldn't prohibit it from being discovered if it allows more efficient sequence prediction.
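The addition example can be made concrete: a tiny fixed-size program that "knows" the algorithm predicts every such string, whereas memorization needs a table entry per string. A sketch (the parsing is simplified to exactly this "A + B =" format):

```python
def predict_completion(prefix):
    """Predict the continuation of a string like '45324 + 58272 ='
    using the addition algorithm - constant description length."""
    a, _plus, b, _eq = prefix.split()
    return str(int(a) + int(b))

# A memorizing predictor would instead need one stored pair per string:
memorized = {"45324 + 58272 =": "103596"}  # grows linearly with the corpus

print(predict_completion("45324 + 58272 ="))  # "103596"
```

The algorithmic predictor's size doesn't grow with the number of addition strings in the training data, which is the sense in which knowing more about the generating process buys compression.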
Will we reach this superhuman point in practice? I don't know. It may take absurd amounts of computation and training data to reach this point, or just more than alternative approaches. But it doesn't seem impossible to me in theory.
Even if we reach this point, this still leaves the original problem - the model will not output anything more than a human would know, even if it has that knowledge internally. But even without fancy future interpretability tools, we may be heading in that direction with things like InstructGPT, where the model was fine-tuned to spit out things it was capable of saying, but wouldn't have said under pure sequence prediction.
This whole argument, together with rapid recent progress, is enough for me to not immediately write off language models, and consider strategies to take advantage of them if this scenario were to occur.