it could be sparse...a 175B parameters GPT-4 that has 90 percent sparsity could essentially equivalent to 1.75T param GPT-3. Also I am not exactly sure, but my guess is that if it is multimodal the scaling laws change (essentially you get more varied data instead of training it always on predicting text which is repetitive and likely just a small percentage contains new useful information to learn).
Stupid beginner question: I noticed that while interesting, many of the posts here are very long and try to go deep into the topic explored often without tldr. I'm just curious - how do the writers/readers find time for it? are they paid? If someone lazy like me wants to participate - is there a more twitter-like Lesswrong version?
my understanding is that they fully separate computation and memory storage. So whhile traditional architectures need some kind of cache to store large amount of data for model partitions from which just a small portion is used for the computation at any single time point, CS2 only requests what it needs so the bandwidth doesnt need to be so big
I am certainly not an expert, but I am still not sure about your claim that it's only good for running small models. The main advantage they claim to have is "storing all model weights externally and stream them onto each node in the cluster without suffering the traditional penalty associated with off chip memory. weight streaming enables the training of models two orders of magnitude larger than the current state-of-the-art, with a simple scaling model." (https://www.cerebras.net/product-cluster/ , weight streaming). So they explicitly claim that it should perform well with large models.
Furthermore, in their white paper (https://f.hubspotusercontent30.net/hubfs/8968533/Virtual%20Booth%20Docs/CS%20Weight%20Streaming%20White%20Paper%20111521.pdf), they claim that the CS-2 architecture is much better suited for sparse models(e.g. by Lottery Ticket Hypothesis) and on page 16 they show that Sparse GPT-3 could be trained in 2-5 days.
This would also align with tweets by OpenAI that Trillion is the new billion, and rumors about the new GPT-4 being similarly big jump as GPT-2 -> GPT-3 was - having colossal number of parameters and sparse paradigm (https://thealgorithmicbridge.substack.com/p/gpt-4-rumors-from-silicon-valley). I could imagine that sparse parameters deliver much stronger results than normal parameters, and this might change scaling laws a bit.
oh and besides IQ tests, i predict it would also be able to pass most current CAPTCHA-like tests (though humans would still be better in some)
Nah...I still believe that the future AGI would invent a time machine and then it invents itself before 2022
I should also make a prediction for the nearer version of GATO to actually answer the questions from the post. So if a new version of GATO appears in next 4 months, I predict:
80% confidence interval: Gato will have 50B-200B params. Context window will be 2-4x larger(similar to GPT-3)
50%: No major algorithmic improvements, RL or memory. Maybe use of perceiver. Likely some new tokenizers. The improvements would come more from new data and scale.
80%: More text,images,video,audio. More games and new kinds of data. E.g. special prompting to do something in a game, draw a picture, perform some action.
75%: Visible transfer learning. Gato trained on more tasks and pre-trained on video would perform better in most but not all games, compared to a model with similar size trained just on the particular task. Language model would be able to descripe shape of objects better after being trained together with images/video/audio.
70%: Chain of thought reasoning would perform better compared to a LLM of similar size. The improvement won't be huge though and I wouldn't expect it to gain some suprisingly sophisticated new LLM capabilities.
80%: It won't be able to play new Atari games similarly to humans, but there would be a visible progress - the actions would be less random and directed towards the goal of the game. With sophisticated prompting, e.g. "Describe first what the goal of this game is, how to play it, what is the best strategy", significant improvements would be seen, but still sub-human.
it is fixed now, thanks!