1a3orn

Comments

For what it's worth, I'm at least somewhat an LLM-plateau-ist -- on balance, at least somewhat dubious that we get AGI from models in which 99% of compute is spent on next-word prediction in big LLMs. I really think Nostalgebraist's take has merit, and the last few months have made me think it has more merit. Yann LeCun's "LLMs are an off-ramp to AGI" might come back to show his foresight. Etc etc.

But it isn't just LLM progress which has hinged on big quantities of compute. Everything in deep learning -- ResNets, vision Transformers, speech-to-text, text-to-speech, AlphaGo, EfficientZero, OpenAI Five, VPT, and so on -- has used more and more compute. I think at least some of this deep learning stuff is an important step toward human-like intelligence, which is why I think this is good evidence against Yudkowsky.

If you think none of the DL stuff is a step, then you can indeed maintain that compute doesn't matter, of course, and that I am horribly wrong. But if you think the DL stuff is an important step, that position becomes more difficult to maintain.

I will provide two estimates that both suggest it would be feasible to have at least ~1 million copies of a model learning in parallel at human speed. This corresponds to 2500 human-equivalent years of learning per day, since 1 million days = 2500 years.

This is potentially a little misleading, I think?

A human who is learning does *active exploration*. They seek out things they don't know, and try to find blind spots. They loop around and try to connect pieces of their knowledge that were unconnected. They balance exploration and exploitation. 2500 years of this is a lot of time to dive deeply into individual topics, pursue them in depth, write new papers on them, talk with other people, connect them carefully, and so on.

An LLM learning from feedback on what it says doesn't do any of this. It isn't pursuing long-running threads over 2500-equivalent years, or seeking out blind spots in its knowledge -- it isn't trying to balance exploration and exploitation at all, because it's just trying to provide accurate answers to the questions given to it. There's even anti-blind-spot feedback: people will disproportionately ask LLMs about what they predict LLMs will do well at, rather than what they'll do poorly at, which will badly limit the skills they pick up.

I don't know what that looks like in the limit. You could maintain that it's frighteningly smart, or still really stupid, or more likely both on different topics. But I'm not sure human-equivalent years is going to give you a useful intuition for what this looks like at all. It's... some large amount of knowledge that is attainable from this info, but it isn't human-equivalent. It's just a different kind of thing.

Not disagreeing. I'm still interested in a longer-form view of why the 44x estimate overestimates, if you're interested in writing it (I think you mentioned looking into it at one time).

So, like, I remain pretty strongly pro Hanson on this point:

  1. I think LLaMA 7b is very cool, but it's really stretching it to call it a state-of-the-art language model. It's much worse than LLaMA 65b, which is much worse than GPT-4, which most people think is > 100b as far as I know. I'm using a 12b model right now while working on an interpretability project... and it is just much, much dumber than these big ones.

  2. Not being able to train isn't a small deal, I think. Learning in a long-term way is a big part of intelligence.

  3. Overall, and not to be too glib, I don't see why fitting a static and subhuman mind into consumer hardware from 2023 means that Yudkowsky doesn't lose points for saying you can fit a learning (implied) and human-level mind into consumer hardware from 2008.

For any given period of time, the algorithmic progress is a bigger deal for increasing performance than the degree to which compute got cheaper in the same period.

This is true, but as a picture of the past it undersells compute, by focusing on the cost of compute rather than on compute itself.

I.e., in the period between 2012 and 2020:

-- Algo efficiency improved 44x, if we use the OpenAI efficiency baseline for AlexNet

-- Cost of compute improved by... less than 44x, let's say, if we use a reasonable guess based on Moore's law. So algo efficiency was more important than the cost per FLOP going down.

-- But, using EpochAI's estimates for a 6 month doubling time, total compute per training run increased > 10,000x.

So just looking at cost of compute is somewhat misleading. Cost per FLOP went down, but the amount spent went up from just dollars on a training run to tens of thousands of dollars on a training run.
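The arithmetic behind this can be sketched with round numbers. Assumptions (mine, not spelled out in the comment above): a ~2-year price-performance doubling as a stand-in for Moore's law, EpochAI's ~6-month doubling time for training compute, and OpenAI's 44x algorithmic-efficiency figure for reaching AlexNet-level performance.

```python
# Rough back-of-envelope for 2012 -> 2020, using the comment's round numbers.
years = 2020 - 2012  # 8 years

algo_efficiency_gain = 44                   # OpenAI's AlexNet-efficiency baseline
cost_per_flop_gain = 2 ** (years / 2)       # ~2-year doubling: 2^4 = 16x, < 44x
training_compute_gain = 2 ** (years / 0.5)  # 6-month doubling: 2^16 = 65,536x

print(f"cost per FLOP improved ~{cost_per_flop_gain:.0f}x")        # ~16x
print(f"total training compute grew ~{training_compute_gain:,.0f}x")  # > 10,000x
# Effective compute toward a fixed task compounds both factors:
print(f"effective gain ~{algo_efficiency_gain * training_compute_gain:,.0f}x")
```

On these assumptions, the cheapness of compute (16x) is indeed the smallest of the three factors; the dominant term is simply how much more total compute got thrown at a training run.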

I agree I'm confused here. But it's hard to come down to a clear interpretation. I kinda think Hanson and Yudkowsky are also confused.

Like, here are some possible interpretations on this issue, and how I'd position Hanson and Yudkowsky on them based on my recollection and on vibes.

  1. Improvements in our ways of making AI will be incremental. (Hanson pro, Yudkowsky maaaybe con, and we need some way to operationalize "incremental", so probably just ambiguous)
  2. Improvements in our ways of making AI will be made by lots of different people distributed over space and time. (Hanson pro, Yudkowsky maybe con, seems pretty Hanson favored)
  3. AI in its final form will have elegant architecture (Hanson more con, Yudkowsky more pro, seems Yudkowsky favored, but I'm unhappy with what "elegant" means)

Or even 4. People know when they're making a significant improvement to AI -- the difference between "clever hack" and "deep insight" is something you see beforehand just as much as afterwards. (Hanson vibes con, Yudkowsky vibes pro, gotta read 1000 pages of philosophy of progress before you call it, maybe depends on the technology, I tend to think people often don't know)

Which is why this overall section is in the "hard to call" area.

In addition to what cfoster0 said, I'm kinda excited about the next ~2-3 years of cross-LLM knowledge transfer, so this seems like a differing prediction about the future, which is fun.

My model for why it hasn't happened already is, in part, just that most models know the same stuff, because they're trained on extremely similar enormous swathes of text, so there's no gain to be had by sticking them together. That would be why more effort goes into LLM/image/video glue than LLM/LLM glue.

But abstractly, a world where LLMs can meaningfully be connected to vision models but not to other LLMs would be surprising to me. I expect something like training a model on code, and another model on non-code text, and then sticking them together to be possible.

So I think that

Implementing old methods more vigorously is more or less exactly what got modern deep learning started

is just straightforwardly true.

Everyone times the start of the deep learning... thing to 2012's AlexNet. AlexNet had convolutions, and ReLU, and backprop, but didn't invent any of them. Here's what Wikipedia says is important about AlexNet:

AlexNet competed in the ImageNet Large Scale Visual Recognition Challenge on September 30, 2012.[3] The network achieved a top-5 error of 15.3%, more than 10.8 percentage points lower than that of the runner up. The original paper's primary result was that the depth of the model was essential for its high performance, which was computationally expensive, but made feasible due to the utilization of graphics processing units (GPUs) during training.[2]

So... I think that what I'm saying about how DL started is the boring consensus. Of course, new algorithms did come along, and I agree that they are important. But still -- if there's something important that has worked without big compute, what is it?

(I do agree that in a counterfactual world I'd probably prefer to get Attention is All You Need.)

And yeah, I accidentally posted this a month ago for 30 min while it was in draft, so you might have seen it before.

I do in fact include the same quote you include in the section titled "Cyc is not a Promising Approach to Machine Intelligence." That's part of the reason why that section resolves in favor of Yudkowsky.

I agree that Hanson thinks skills in general will be harder to acquire than Yudkowsky thinks. I think that could easily be another point for Yudkowsky in the "human content vs right architecture" section. Like many points in that section, I don't think it's operationalized particularly well, which is why I don't call it either way.

Sure, but ~0 significant technologies have avoided causing some number of casualties.

Cars, bikes, planes, guns, steel, chemical knowledge, shipbuilding, the ability to create fire, telegrams, telephones, the internet, the mechanical loom, the steam engine -- all of them have been used to kill people intentionally, or just killed people by accident.
