IRV is an extremely funky voting system, but almost anything is better than Plurality. I very much enjoyed Ka-Ping Yee's voting simulation visualizations, and would recommend the short read for anyone interested.
I have actually made my own simulation visualization, though I've spent no effort annotating it and the graphic isn't remotely intuitive. It models a single political axis (e.g. ‘extreme left’ to ‘extreme right’) with N candidates and 2 voting populations. The north-east axis of the graph determines the centre of one voting population, and the south-east axis determines the centre of the other (so the west-to-east axis is where the two populations agree). The populations have variances and sizes determined by the sliders. The interesting thing this has taught me is that IRV/Hare voting is like an otherwise sane voting system with additional, practically unpredictable chaos mixed in, which is infinitely better than the systemic biases inherent to plurality or Borda votes. In fact, if you see advantages in sortition, this might be a bonus.
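For anyone who wants to poke at the same kind of model, here is a minimal sketch, not the code behind my visualization. It assumes Gaussian populations, voters who rank candidates purely by distance on the axis, and ties broken towards the lowest candidate index; the function names are made up for this example.

```python
import random
from collections import Counter

def sample_voters(populations, rng):
    """populations: list of (centre, std_dev, size) on a single axis.
    Gaussian populations are an assumption of this sketch."""
    return [rng.gauss(centre, std)
            for centre, std, size in populations
            for _ in range(size)]

def nearest(voter, candidates, allowed):
    """Index of the closest still-allowed candidate to this voter."""
    return min(sorted(allowed), key=lambda c: abs(candidates[c] - voter))

def plurality_winner(voters, candidates):
    """Everyone votes for their nearest candidate; most votes wins."""
    everyone = range(len(candidates))
    tally = Counter(nearest(v, candidates, everyone) for v in voters)
    return max(everyone, key=lambda c: tally[c])

def irv_winner(voters, candidates):
    """Instant-runoff: repeatedly eliminate the candidate with the fewest
    top preferences until someone holds a strict majority."""
    remaining = set(range(len(candidates)))
    while True:
        tally = Counter(nearest(v, candidates, remaining) for v in voters)
        leader, votes = tally.most_common(1)[0]
        if 2 * votes > len(voters) or len(remaining) == 1:
            return leader
        loser = min(sorted(remaining), key=lambda c: tally[c])
        remaining.remove(loser)

# Two populations with their own centre, spread and size, three candidates.
rng = random.Random(0)
demo_voters = sample_voters([(-0.5, 0.4, 600), (0.6, 0.3, 400)], rng)
demo_candidates = [-1.0, 0.0, 1.0]
print("plurality:", plurality_winner(demo_voters, demo_candidates))
print("IRV:", irv_winner(demo_voters, demo_candidates))
```

Sweeping the two population centres over a grid and colouring each cell by the winner gives roughly the kind of picture described above.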
The latter is the source for human perplexity being 12. I should note that it tested on the 1 Billion Words benchmark, where GPT-2 scored 42.2 (35.8 was for Penn Treebank), so the results are not exactly 1:1.
FLOPS don't seem to me a great metric for this problem; they are often very sensitive to the precise setup of the comparison, in ways that often aren't very relevant (the Donkey Kong comparison emphasized this), and the architecture of computers is fundamentally different to that of brains. A more apt comparison seems to be the size and shape of the computational graph, roughly the tuple (width, depth, iterations). This should be a much more stable metric, since scale-based metrics normally only change significantly when you're handling the problem in a semantically different way. In the example, hardware implementations of Donkey Kong and various sorts of software emulation (software interpreter, software JIT, RTL simulation, FPGA) will have very different throughputs on different hardware, and the setup and runtime overheads for each might be very different, but the actual runtime computation graphs should look very comparable.
This also has the added benefit of separating out hypotheses that should naturally be distinct. For example, a human-sized brain at 1x speed and a hamster brain at 1000x speed are very different, yet have seemingly similar FLOPS. Their computation graphs are distinct. Technology comparisons like FPGAs vs AI accelerators become a lot clearer from the computation graph perspective; an FPGA might seem at a glance more powerful from a raw OP/s perspective, but first-principles arguments will quickly show it should be strictly weaker than an AI accelerator. It's also more illuminating given we have options to scale up at the cost of performance; from a pure FLOPS perspective, this is negative progress, but pragmatically, this should push timelines closer.
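To make the brain example concrete, here is a toy sketch. The numbers are arbitrary units chosen only so the products match, not measurements of any brain, and treating `iterations` as a per-second rate is an assumption of this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ComputeGraph:
    width: int         # units working in parallel at each step (arbitrary units)
    depth: int         # sequential steps per iteration (arbitrary units)
    iterations: float  # iterations per second (assumed meaning for this sketch)

    def ops_per_second(self) -> float:
        """The single scalar a FLOPS-style comparison collapses everything into."""
        return self.width * self.depth * self.iterations

# Hypothetical figures: a wide graph run slowly vs a narrow graph run fast.
big_and_slow = ComputeGraph(width=1000, depth=10, iterations=1)
small_and_fast = ComputeGraph(width=1, depth=10, iterations=1000)

print(big_and_slow.ops_per_second() == small_and_fast.ops_per_second())
print(big_and_slow == small_and_fast)
```

The scalar can't tell the two apart, while the tuple keeps them as the distinct hypotheses they are.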
I disagree with that post and its first two links so thoroughly that any direct reply or commentary on it would be more negative than I'd like to be on this site. (I do appreciate your comment, though, so don't take this as discouragement from clarifying your position.) I don't want to leave it at that, so instead let me give a quick thought experiment.
A neuron's signal hop latency is about 5 ms, and in that time light can travel about 1500 km, a distance approximately equal to the radius of the moon. You could build a machine literally the size of the moon, floating in deep space, before the speed of light between the neurons became a problem relative to the chemical signals in biology, as long as no single neuron went more than halfway through. Unlike today's silicon chips, a system like this would be restricted by the same propagation latency limits that the brain is, but still, it's the size of the moon. You could hook this moon-sized computer to a human-shaped shell on Earth, and as long as the computer was directly overhead, the human body could be as responsive and fully updatable as a real human.
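The arithmetic is easy to check. This sketch takes the 5 ms hop latency from above as given, and uses the vacuum speed of light and the moon's mean radius.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # vacuum speed of light
NEURON_HOP_LATENCY_S = 5e-3        # the ~5 ms figure quoted above
MOON_MEAN_RADIUS_KM = 1_737.4      # mean radius of the moon

# Distance light covers during one neuron-to-neuron hop.
distance_km = SPEED_OF_LIGHT_KM_S * NEURON_HOP_LATENCY_S
fraction_of_moon_radius = distance_km / MOON_MEAN_RADIUS_KM

print(f"light travels {distance_km:.0f} km in 5 ms")
print(f"that is {fraction_of_moon_radius:.2f} of the moon's radius")
```

So "about the radius of the moon" is a slight rounding up, but the order of magnitude is the point.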
While such a computer is obviously impractical on so many levels, I find it a good frame of reference to think about the characteristics of how computers scale upwards, much like Feynman's There's Plenty of Room at the Bottom was a good frame of reference for scaling down, considered back when transistors were still wired by hand. In particular, the speed of light is not a problem, and will never become one, except where it's a resource we use inefficiently.
Scaling Language Model Size by 1000x relative to GPT3. 1000x is pretty feasible, but we'll hit difficult hardware/communication bandwidth constraints beyond 1000x as I understand.
I think people are hugely underestimating how much room there is to scale.
The difficulty, as you mention, is bandwidth and communication, rather than cost per bit in isolation. An A100 manages 1.6 TB/s of bandwidth to its 40 GB of memory. We can handle sacrificing some of this speed, but SSDs aren't fast enough; 350 TB of SSD storage would cost just $40k, but would only manage 1-2 TB/s over the whole array, and could not push it to a single GPU. More DRAM on the GPU does hit physical scaling issues, and scaling out to larger clusters of GPUs does start to hit difficulties after a point.
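A rough way to see the gap is to ask how long one full read of each memory pool takes at the bandwidths above. "Sweep time" is a stand-in metric I'm assuming for this sketch, using decimal units; it is not a real training cost model.

```python
TB = 1e12  # decimal terabyte, in bytes
GB = 1e9   # decimal gigabyte, in bytes

def sweep_time_s(capacity_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Seconds to read the whole pool once at the given bandwidth."""
    return capacity_bytes / bandwidth_bytes_per_s

# Figures from the paragraph above.
a100_sweep = sweep_time_s(40 * GB, 1.6 * TB)
ssd_sweep_best = sweep_time_s(350 * TB, 2 * TB)
ssd_sweep_worst = sweep_time_s(350 * TB, 1 * TB)

# Aggregate bandwidth the SSD array would need to match the A100's
# bandwidth-to-capacity ratio.
matching_bandwidth_tb_s = (350 * TB / (40 * GB)) * 1.6

print(f"A100: {a100_sweep * 1000:.0f} ms per sweep")
print(f"SSD array: {ssd_sweep_best:.0f} to {ssd_sweep_worst:.0f} s per sweep")
print(f"bandwidth needed to match: {matching_bandwidth_tb_s:.0f} TB/s")
```

The array is thousands of times slower per sweep than GPU memory, which is the sense in which it isn't fast enough.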
This problem is not due to physical law, but the technologies in question. DRAM is fast, but has hit a scaling limit, whereas NAND scales well, but is much slower. And the larger the cluster of machines, the more bandwidth you have to sacrifice for signal integrity and routing.
Thing is, these are fixable issues if you allow for technology to shift. For example,
It seems plausible to me that a Manhattan Project could scale to models with a quintillion parameters, i.e. roughly 10,000,000x scaling, within 15 years, using only lightweight training sparsity. That's not to say it's necessarily feasible, but that I can't rule out technology allowing that level of scaling.
It might be possible to convince me on something like that, as it fixes the largest problem, and if Hanson is right that blackmail would significantly reduce issues like sexual harassment then it's at least worth consideration. I'm still disinclined towards the idea for other reasons (incentivizes false allegations, is low oversight, difficult to keep proportionality, can incentivize information hiding, seems complex to legislate), but I'm not sure how strong those reasons are.
I agree this makes a large fractional change to some AI timelines, and has significant impacts on questions like ownership. But when considering very short timescales, while I can see OpenAI halting their work would change ownership, presumably to some worse steward, I don't see the gap being large enough to materially affect alignment research. That is, it's better OpenAI gets it in 2024 than someone else gets it in 2026.
This constant seems to be very small, which is why compute had to drop all the way to ~$1k before any researchers worldwide were fanatical enough to bother trying CNNs and create AlexNet.
It's hard to be fanatical when you don't have results. Nowadays AI is so successful it's hard to imagine this being a significant impediment.
Excluding GShard (which as a sparse model is not at all comparable parameter-wise)
I wouldn't dismiss GShard altogether. The parameter counts aren't equal, but MoE(2048E, 60L) is still a beast, and it opens up room for more scaling than a standard model.
Robin Hanson argued that negative gossip is probably net positive for society.
Yes, this is what my post was addressing and the analogy was about. I consider it an interesting hypothesis, but not one that holds up to scrutiny.
Lying about someone in a damaging way is already covered by libel/slander laws.
I know, but this only further emphasizes how much better it is to pay those who helped secure a conviction. Blackmail is private, threat-based, and necessarily unpoliced, whereas the courts have oversight and are an at least somewhat impartial test for truth.
Gwern's claim is that these other institutions won't scale up as a consequence of believing the scaling hypothesis; that is, they won't bet on it as a path to AGI, and thus won't spend this money on abstract or philosophical grounds.
My point is that this only matters on short-term scales. None of these companies are blind to the obvious conclusion that bigger models are better. The difference between a hundred-trillion dollar payout and a hundred-million dollar payout is philosophical when you're talking about justifying <$5m investments. NVIDIA trained an 8.3 B parameter model as practically an afterthought. I get the impression Microsoft's 17 B parameter Turing-NLG was basically trained to test DeepSpeed. As markets open up to exploit the power of these larger models, the money spent on model scaling is going to continue to rise.
These companies aren't competing with OpenAI. They've built these incredibly powerful systems incidentally, because it's the obvious way to do better than everyone else. It's a tool they use for market competitiveness, not as a fundamental insight into the nature of intelligence. OpenAI's key differentiator is only that they view scale as integral and explanatory, rather than an incidental nuisance.
With this insight, OpenAI can make moonshots that the others can't: build a huge model, scale it up, and throw money at it. Without this understanding, others will only get there piecewise, scaling up one paper at a time. The delta between the two is at best a handful of years.
If OpenAI changed direction tomorrow, how long would that slow the progress to larger models? I can't see it lasting; the field of AI is already incessantly moving towards scale, and big models are better. Even in a counterfactual where OpenAI never started scaling models, is this really something that no other company can gradient descent on? Models were getting bigger without OpenAI, and the hardware to do it at scale is getting cheaper.