ryan_greenblatt

I'm the chief scientist at Redwood Research.

Comments

  • I probably should have used a running example in this post - this just seems like a mostly unforced error.
  • I considered writing a conclusion, but decided not to because I wanted to spend the time on other things and I wasn't sure I'd say anything useful that wasn't just a pure restatement of earlier points. This post is mostly a high level framework + list of considerations, so it doesn't really have a small number of core points.
  • This post is relatively low effort, as indicated by the "Notes on" title; possibly I should have flagged this more.
  • I think my comments / in-person explanations are easier to understand than my blog posts, as I often write posts with lots and lots of content grouped together without a specific thesis, whereas in comments / in person I typically make one point or a small number of points. Also, it's easier to write in response to something, as there's an assumed level of context already.

Is this an accurate summary:

  • 3.5 substantially improved performance for your use case and 3.6 slightly improved performance.
  • The o-series models didn't improve performance on your task. (And presumably 3.7 didn't improve performance either.)

So, by "recent model progress feels mostly like bullshit" I think you basically just mean "reasoning models didn't improve performance on my application and Claude 3.5/3.6 sonnet is still best". Is this right?

I don't find this state of affairs that surprising:

  • Without specialized scaffolding o1 is quite a bad agent and it seems plausible your use case is mostly blocked on this. Even with specialized scaffolding, it's pretty marginal. (This shows up in the benchmarks AFAICT, e.g., see METR's results.)
  • o3-mini is generally a worse agent than o1 (aside from being cheaper). o3 might be a decent amount better than o1, but it isn't released.
  • Generally Anthropic models are better for real world coding and agentic tasks relative to other models and this mostly shows up in the benchmarks. (Anthropic models tend to slightly overperform their benchmarks relative to other models I think, but they also perform quite well on coding and agentic SWE benchmarks.)
  • I would have guessed you'd see performance gains with 3.7 after coaxing it a bit. (My low confidence understanding is that this model is actually better, but it is also more misaligned and reward hacky in ways that make it less useful.)

METR has found that substantially different scaffolding is most effective for o-series models. I get the sense that they weren't optimized for being effective multi-turn agents; at least, the o1 series wasn't optimized for this, though I think o3 may have been.

Sorry if my comment was triggering @nostalgebraist. : (

The list doesn't exclude Baumol effects, as these are just the implication of:

  • Physical bottlenecks and delays prevent growth. Intelligence only goes so far.
  • Regulatory and social bottlenecks prevent growth this fast; intelligence only goes so far.

Like, Baumol effects are just some area of the economy with more limited growth bottlenecking the rest of the economy. So, we might as well just directly name the bottleneck.

Your argument seems to imply you think there might be some other bottleneck like:

  • There will be some cognitive labor sector of the economy which AIs can't do.

But, this is just a special case of "will there be superintelligence which exceeds human cognitive performance in all domains".


In other words, it stipulates what Vollrath (in the first quote below) calls "[the] truly unbelievable assumption that [AI] can innovate precisely equally across every product in existence." Of course, if you do assume this "truly unbelievable" thing, then you don't get Baumol effects – but this would be a striking difference from what has happened in every historical automation wave, and also just sort of prima facie bizarre.

Huh? It doesn't require equal innovation across all products; it just requires that the bottlenecking sectors have sufficiently high innovation/growth that the overall economy can grow. Sufficient innovation in all potentially bottlenecking sectors != equal innovation.
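To make this concrete, here's a toy two-sector economy (illustrative numbers and a min() aggregate I picked for simplicity, not anything from Vollrath): the slower sector sets the long-run growth rate, but that rate can still be far above historical rates without innovation being anywhere near equal across sectors.

```python
# Toy two-sector "Baumol" economy: aggregate output is Y = min(A, B),
# so the slower-growing sector eventually binds. (Illustrative numbers only.)

def growth_path(g_a, g_b, years=10, a0=1.0, b0=1.0):
    """Year-over-year growth rates of Y = min(A, B) when A grows at g_a and B at g_b."""
    a, b = a0, b0
    rates = []
    for _ in range(years):
        y_prev = min(a, b)
        a *= 1 + g_a
        b *= 1 + g_b
        rates.append(min(a, b) / y_prev - 1)
    return rates

# Classic Baumol case: the bottleneck sector grows 2%/yr, so aggregate growth
# converges to ~2%/yr no matter how fast the other sector grows.
print(growth_path(g_a=1.00, g_b=0.02)[-1])  # ~0.02

# Unequal-but-sufficient case: the bottleneck sector grows 30%/yr, so the
# aggregate grows ~30%/yr even though innovation is far from "precisely equal".
print(growth_path(g_a=1.00, g_b=0.30)[-1])  # ~0.30
```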

Suppose world population were 100,000x higher, but these additional people magically didn't consume anything or need office space. I think this would result in very fast economic growth due to advancing all sectors simultaneously. Imagining population growth increases seems to me to set a lower bound on the implications of highly advanced AI (and robotics).

As far as I can tell, this Baumol effect argument is equally good at predicting that 3% or 10% growth rates are impossible from the perspective of people in agricultural societies with much lower growth rates.

So, I think you have to be quantitative and argue about the exact scale of the bottleneck and why it will prevent some rate of progress. The true physical limits (doubling times on the order of days or less, a Dyson sphere, or even consuming solar mass faster than this) are extremely high, so these can't be the bottleneck - it must be something about the rate of innovation or physical capital accumulation leading up to the true limits.
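For a sense of scale (7 days is just an arbitrary pick within "on the order of days or less"): a week-long doubling time corresponds to a yearly growth factor that dwarfs 3% or 10% per year.

```python
import math

# Annual growth factor implied by a 7-day doubling time.
annual_factor = 2 ** (365 / 7)
print(f"{annual_factor:.1e}")  # ~5.0e+15 per year

# For comparison, 10%/yr growth takes ~7.3 *years* per doubling.
print(round(math.log(2) / math.log(1.10), 1))  # 7.3
```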

Perhaps your view is: "Sure, we'll quickly have a Dyson sphere and ungodly amounts of compute, but this won't really result in explosive GDP growth as GDP will be limited by sectors that directly interface with humans like education (presumably for fun?) or services where the limits are much lower." But, this isn't a crux for the vast majority of arguments which depend on the potential for explosive growth!

Seems like if it thinks there's a 5% chance humans explored X, but (if not, then) exploring X would force it to give up its current values

This is true for any given X, but if there are many things like X which are independently 5% likely to be explored, the model is in trouble.

Like, the model only needs to be somewhat confident that it's exploring everything the humans explored in the imitation data, but for any given case of choosing not to explore some behavior, it needs to be very confident that the humans didn't explore it.

I don't have the exact model very precisely worked out in my head and I might be explaining this in a somewhat confusing way, sorry.
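To make the compounding point concrete (the 5% is from the parent comment; the counts of independent cases are made up): the chance of avoiding being caught on every one of N independently-5%-likely cases falls off quickly with N.

```python
# Probability the model avoids being caught on all N cases, if each case is
# independently 5% likely to be covered by the humans' imitation/exploration data.
for n in (1, 10, 50, 100):
    print(n, round(0.95 ** n, 3))
# 1 -> 0.95, 10 -> 0.599, 50 -> 0.077, 100 -> 0.006
```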

I think I basically agree with much of this comment. However:

  • Even if the model is superhuman, if we do a good job with exploration hacking countermeasures, then the model might need to be extremely confident that humans wouldn't be able to do something before it can safely avoid exploring it. E.g., if we have 100,000 carefully targeted trajectory edits/imitation examples, then to avoid getting caught, the model would need to ensure that none of the behaviors it avoids exploring are covered by these examples. This might force the AI to explore lots of superhuman stuff.
  • Further, if the model messes up, this might cascade as discussed.
  • Ensembling multiple models for exploration (or multiple prompt variants, or various different jitters of the internals) might mean that all the models need to coordinate on the same boundary of what not to explore. Otherwise, you'll explore some behavior that one model sandbags on, then update that model to sandbag less, and this might cascade. So, it might be non-trivial for models to exploration hack (given substantial countermeasures) even if the behavior is quite superhuman. (See the sketch after this list.)
  • None of these mechanisms suffice for "robustly avoiding sandbagging", but they suggest that countermeasures are still useful in superhuman regimes. (Of course, betting the world on countermeasures which have a 50% chance of working is crazy, but that doesn't mean that we shouldn't try to make the countermeasures better!)
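Here's a rough sketch of the ensembling point (hypothetical interfaces; `policies` could be different models, prompt variants, or jittered copies of one model): exploration uses the union of what any ensemble member tries, so the "don't explore this" boundary only holds up if every member coordinates on it.

```python
from typing import Callable, List, TypeVar

Trajectory = TypeVar("Trajectory")

def best_of_ensemble(
    task: str,
    policies: List[Callable[[str], Trajectory]],  # models / prompt variants / jittered copies
    reward_fn: Callable[[Trajectory], float],
    samples_per_policy: int = 4,
) -> Trajectory:
    """Sample rollouts from every ensemble member and keep the highest-reward one.

    If even one member explores a high-reward behavior the others avoid, that
    trajectory is what gets reinforced, which is what kicks off the cascade
    toward sandbagging less.
    """
    trajectories = [policy(task) for policy in policies for _ in range(samples_per_policy)]
    return max(trajectories, key=reward_fn)
```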

The only way to get this that I can think of is to use an RL method which guarantees against mode-collapse, IE, a technique which keeps a broad distribution over the strategies which achieve good reward, rather than collapsing down to one particular way of getting good reward as training proceeds. This guarantees that the only reason why a specific strategy is not explored is because its expected value is low (according to all plausible hypotheses).

This would be true if we had an infinite amount of data, but we don't. You need to ensure substantial probability on exploring good strategies, which is a much stronger property than just avoiding mode collapse. (Literally avoiding mode collapse and assigning some probability to all good actions is easy - just add a small weight on a uniform prior over tokens like they did in old-school Atari RL.)
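Concretely, "adding a small weight on a uniform prior over tokens" just looks like this (a sketch; the epsilon value is arbitrary). Every token keeps nonzero probability, but nothing concentrates substantial probability on good unexplored strategies.

```python
import torch

def mix_with_uniform(logits: torch.Tensor, eps: float = 0.01) -> torch.Tensor:
    """Return sampling probabilities with at least eps / vocab_size on every token.

    This literally rules out mode collapse (no action's probability hits zero),
    but it does nothing to put *substantial* probability on good strategies the
    policy isn't already exploring - which is the property that actually matters.
    """
    probs = torch.softmax(logits, dim=-1)
    uniform = torch.full_like(probs, 1.0 / probs.shape[-1])
    return (1 - eps) * probs + eps * uniform

# Usage: next_token = torch.multinomial(mix_with_uniform(logits), num_samples=1)
```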

My sense is that the GPT-2 and GPT-3 results are somewhat dubious, especially the GPT-2 result. It really depends on how you relate SWAA (small software engineering subtasks) to the rest of the tasks. My understanding is that no iteration was done though.

However, note that it wouldn't be wildly more off trend if GPT-3 were anywhere from 4 to 30 seconds; it is instead at ~8 seconds. And the GPT-2 results are very consistent with "almost too low to measure".
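(In log/doubling units, that 4-30 second range is only about one doubling below to roughly two doublings above the ~8 second measurement:)

```python
import math

measured = 8            # ~8 second GPT-3 time horizon
for alt in (4, 30):     # the "wouldn't be wildly more off trend" range above
    print(alt, round(math.log2(alt / measured), 2))
# 4 -> -1.0 doublings, 30 -> 1.91 doublings
```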

Overall, I don't think it's incredibly weird (given that the rate of increase of compute and people in 2019-2023 isn't that different from the rate in 2024), but many possible results would have been roughly on trend.

Actually, progress in 2024 is roughly 2x faster than earlier progress, which seems consistent with thinking there is some distribution shift. It's just that this distribution shift didn't kick in until we had Anthropic competing with OpenAI and reasoning models. (Note that OpenAI didn't release a notably better model than GPT-4-1106 until o1-preview!)
