My bear case for Nvidia goes like this:
I see three non-exclusive scenarios where Nvidia stops playing the important role in AI training and inference that it has played over the past 10 years:
All of these become much more likely than the current baseline (whatever that is) if AI scales quickly and generates significant value.
The third scenario doesn't actually require any replication of CUDA: if Amazon, Apple, AMD, and other companies making ASICs commoditize inference while Nvidia retains its moat in training, then with inference scaling and algorithmic efficiency improvements, training will inevitably become a much smaller portion of the market.
Another point on your last sentence: in a near- or post-AGI world, one might think that the value of the type of knowledge work Nvidia does (pure design as opposed to manufacturing) might start trending towards zero as it becomes easier for anyone with equal compute access to replicate. Not sure it will be possible to maintain a moat on the basis of quality in software/hardware design in such a world.
I guess the entire "we need to build an AI internally" US narrative will also increase the likelihood of Taiwan being invaded by China for its chips?
Good that we all have the situational awareness to not summon any bad memetics into the mindspace of people :D
No one really knew why tokamaks were able to achieve such impressive results. The Soviets didn’t progress by building out detailed theory, but by simply following what seemed to work without understanding why. Rather than a detailed model of the underlying behavior of the plasma, progress on fusion began to take place by the application of “scaling laws,” empirical relationships between the size and shape of a tokamak and various measures of performance. Larger tokamaks performed better: the larger the tokamak, the larger the cloud of plasma, and the longer it would take a particle within that cloud to diffuse outside of containment. Double the radius of the tokamak, and confinement time might increase by a factor of four. With so many tokamaks of different configurations under construction, the contours of these scaling laws could be explored in depth: how they varied with shape, or magnetic field strength, or any other number of variables.
Hadn't come across this analogy to current LLMs. Source: This interesting article.
Nice! And the "scaling laws" terminology in this sense goes way back:
I computed METR time horizons for SWE-bench Verified SOTA models using both the existing difficulty estimates and work-time estimates derived from commit data.
I used a range of different methods, including the original METR methodology where task-level success info was available.
I did this for four different rankings: EpochAI's, LLMStats's, and the "verified" and "bash only" rankings from the SWE-bench website.
In every single case the trend fits a logistic function with an asymptote of a couple of hours better than an exponential. In some cases the trend only becomes logistic with the last one or two datapoints, so it's not surprising that the METR report has an exponential fit for SWE bench.
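Roughly the kind of fit comparison I mean (a minimal sketch with made-up numbers, not my actual data or pipeline; `dates` and `horizons_minutes` stand in for release dates and the computed 50% time horizons):

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical data: model release dates (years) and 50% time horizons (minutes).
dates = np.array([2022.5, 2023.0, 2023.5, 2024.0, 2024.5, 2025.0, 2025.3])
horizons_minutes = np.array([2.0, 5.0, 12.0, 25.0, 45.0, 70.0, 85.0])
y = np.log(horizons_minutes)  # fit in log(time-horizon) space

def exponential(t, a, b):
    # Exponential growth of the horizon is a straight line in log space.
    return a + b * t

def logistic(t, lo, hi, k, t0):
    # Logistic in log space: saturates at an asymptote of exp(hi) minutes.
    return lo + (hi - lo) / (1 + np.exp(-k * (t - t0)))

p_exp, _ = curve_fit(exponential, dates, y)
p_log, _ = curve_fit(logistic, dates, y,
                     p0=[y.min(), y.max() + 1, 2.0, dates.mean()], maxfev=20000)

def aic(resid, n_params):
    # Rough AIC comparison (up to an additive constant).
    n = len(resid)
    return n * np.log(np.mean(resid**2)) + 2 * n_params

aic_exp = aic(y - exponential(dates, *p_exp), 2)
aic_log = aic(y - logistic(dates, *p_log), 4)
print(f"AIC exponential: {aic_exp:.1f}, AIC logistic: {aic_log:.1f}")
print(f"Logistic asymptote: {np.exp(p_log[1]):.0f} minutes")
```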
I'm not sure when I'll get around to publishing this analysis, because it's a giant mess of different datasets and methods. But I thought I'd at least state the result before it becomes irrelevant, falsified, or obvious.
Wouldn't you expect this if we're close to saturating SWE-bench (and some of the tasks are impossible)? Like, you eventually cap out at the max performance for SWE-bench, and this doesn't correspond to an infinite time horizon on literally SWE-bench (you need to include more, longer tasks).
SWE-bench Verified shouldn't have that many impossible tasks, if any, right? And the highest scores for the rankings I used are still significantly below 80%. But it's possible. Maybe that's a good motivation to look at SWE-bench Pro.
I'd guess SWE-bench Verified has an error rate around 5% or 10%. They didn't have humans baseline the tasks, just look at them and see if they seem possible.
Wouldn't you expect things to look logistic substantially before full saturation?
Side note: we find evidence of an error rate for SWE-bench Verified between 5% and 10% in our benchmark review.
https://epoch.ai/blog/what-skills-does-swe-bench-verified-evaluate
I fitted logistic functions and Gaussian CDFs scaled by a free asymptote factor to the trend of the percentage scores for the four rankings I analysed, and they all asymptote below 80%. The idea was to find some evidence of an "irreducible error".
But given that a 20+% error rate is clearly way too high, it still makes more sense to me to argue that improvement is slowing and these fits therefore asymptote too low than to argue that the time horizons and percentages are asymptoting because of a high share of unsolvable tasks.
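For reference, the kind of asymptote fit I mean (illustrative sketch with placeholder numbers; the free asymptote parameter `A` is what I read off as the implied ceiling):

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

# Placeholder data: release dates (years) and SWE-bench Verified scores (fraction solved).
dates = np.array([2023.5, 2024.0, 2024.3, 2024.7, 2025.0, 2025.3])
scores = np.array([0.12, 0.22, 0.33, 0.45, 0.55, 0.62])

def scaled_logistic(t, A, k, t0):
    # Logistic saturating at A instead of 1.
    return A / (1 + np.exp(-k * (t - t0)))

def scaled_gauss_cdf(t, A, mu, sigma):
    # Gaussian CDF scaled by a free asymptote A.
    return A * norm.cdf(t, loc=mu, scale=sigma)

pL, _ = curve_fit(scaled_logistic, dates, scores, p0=[0.8, 2.0, 2024.5], maxfev=20000)
pG, _ = curve_fit(scaled_gauss_cdf, dates, scores, p0=[0.8, 2024.5, 0.7], maxfev=20000)
print(f"logistic asymptote:     {pL[0]:.2f}")
print(f"gaussian-cdf asymptote: {pG[0]:.2f}")
```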
But this gave me a more general idea for assessing changes in improvement speed: the default assumption right now should be that model improvement moves linearly through log time-horizon space. Additionally, I found that at least SWE-bench Verified seems to have task lengths that are lognormally distributed, and I suspect that holds for many benchmarks.
This means that the path to saturation should follow a Gaussian CDF. The idea would be to use the movement through the first x percent of the benchmark to fit the Gaussian CDF (or at least sanity-check that assumption) and then see whether the model slows down for the rest of the benchmark. To put it differently: constant improvement speed -> symmetric underlying Gaussian for the CDF; slowdown -> the right tail gets fatter.
Of course the signal would be pretty weak, but if one aggregated this over several benchmarks, it might make a good speedometer.
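Concretely, the speedometer check could look something like this (a sketch under the stated assumptions: lognormal task lengths, scores tracked against calendar time, and a hypothetical "early" cutoff at 40%; the numbers are made up):

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

# Placeholder series of (date, score) points for one benchmark.
dates = np.array([2023.5, 2023.8, 2024.1, 2024.4, 2024.7, 2025.0, 2025.3])
scores = np.array([0.08, 0.15, 0.26, 0.38, 0.47, 0.53, 0.57])

def gauss_cdf(t, mu, sigma):
    # Constant progress through log-horizon space + lognormal task lengths
    # => score over time follows a symmetric Gaussian CDF.
    return norm.cdf(t, loc=mu, scale=sigma)

# Fit only on the "early" part of the climb (here: scores below 40%).
early = scores < 0.40
(mu, sigma), _ = curve_fit(gauss_cdf, dates[early], scores[early],
                           p0=[2024.5, 0.8], maxfev=20000)

# Compare later observations against the extrapolated symmetric curve.
residuals = scores[~early] - gauss_cdf(dates[~early], mu, sigma)
print("late residuals:", np.round(residuals, 3))
# Systematically negative residuals => the right tail is fatter than the
# symmetric fit predicts, i.e. evidence of a slowdown; aggregate across benchmarks.
```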
Conditional on a slowdown in AI progress, my primary hypothesis is that recent AI models haven't scaled much in compute compared to past models and have relied on RL progress, and that current RL is becoming less of a free lunch than it was and is actually less efficient than pre-training.
Which is a slight update against software-only singularity stories occurring.
It depends on how the work times of these unsolvable tasks are distributed; you could in principle get any outcome. But there are a few ways to check for the existence of unsolvable tasks, maybe I'll find the time today.
Hmm, actually all these checks can't distinguish between genuinely unsolvable tasks and tasks that are unsolvable for further scaled-up models of the current kind (with the framework and compute used in the evaluations).
I wouldn't take one or two data points on a single benchmark too seriously, especially with a methodology as fiddly as time horizons and concerns like Ryan's. Nevertheless, it seems good that you replicated this using time estimates from commit data, as the original difficulty estimates seemed likely to be noisy. I'll be interested to see if the trend continues and if the same is currently true for OSWorld. (Looks like they had a big update, so maybe it's possible to get individual task data now.)
I originally thought the METR results meant that this year or next might be the year when AI coding agents have their breakthrough moment. The reasoning was that if the trend holds, AI coding agents will be able to do several-hour-long tasks with a certain probability of success, which would suddenly make the overhead and cost of using the agent very economically viable.
I now realise this argument has a big hole: all the METR tasks are timed for unaided humans, i.e. humans without the help of LLMs. This means that, especially for those tasks that AI coding agents can complete successfully, the actual time a human aided by LLMs would need is much shorter.
I'm not sure how many task-completion-time doublings this buys before AI coding agents take over a large part of coding, but the farther we extrapolate from the existing data points, the higher the uncertainty that the trend will hold.
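As a rough worked example (the speedup factor is purely assumed, not measured): if LLM assistance lets a human finish these tasks about 4x faster than the unaided baselines METR uses, the point where the agent is competitive shifts by log2(4) = 2 doublings of the time-horizon trend, so the "breakthrough moment" would arrive roughly two doubling times later than the naive extrapolation suggests.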
Estimating task completion times for AI-aided humans would have been an interesting addition to the study. Correlating the time savings from AI support with the task completion probability of AI coding agents might have allowed predicting the actual economic competitiveness of AI coding agents in the near future.