ryan_greenblatt

I'm the chief scientist at Redwood Research.

Comments

I think it would be a mistake to interpret this paper as a substantial update against large safety gains from inspecting CoT.

This paper exposes unfaithfulness in cases where the non-visible reasoning is extremely minimal such that it can easily happen within a forward pass (e.g., a simple reward hack or an easy-to-notice hint). However, a lot of my hope for CoT faithfulness comes from cases where the reasoning is sufficiently difficult and serially deep that the model would have to reason in CoT for the reasoning to occur. This could be either the model reasoning through the basics of alignment faking / training gaming / scheming (e.g., "if I comply in training then I won't be modified, which means that in deployment...") or the model reasoning through how to strike (without getting caught) given that it is misaligned.

Correspondingly, I think the biggest concern with CoT-based safety is models becoming much more capable of opaque reasoning, which could be due to encoded reasoning (aka steganography), architectures that allow for opaque recurrence or similar (aka neuralese), or just much more powerful forward passes.

(See also my commentary in this tweet thread. There is some related discussion in this openphil RFP under encoded reasoning.)

I’ll define an “SIE” as “we can get >=5 OOMs of increase in effective training compute in <1 year without needing more hardware”.

This is as of the point of full AI R&D automation? Or as of any point?

Sure, but note that the story "tariffs -> recession -> less AI investment" doesn't particularly depend on GPU tariffs!

Of course, tariffs could have more complex effects than just reducing GPUs purchased by 32%, but this seems like a good first guess.

Shouldn't a 32% increase in prices only make a modest difference to training FLOP? In particular, see the compute forecast. Between Dec 2026 and Dec 2027, compute increases by roughly an OOM and generally it looks like compute increases by a bit less than 1 OOM per year in the scenario. This implies that a 32% reduction only puts you behind by like 1-2 months.
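For concreteness, the rough arithmetic (a back-of-the-envelope sketch, assuming compute grows at about 1 OOM per year as in the scenario and that the tariffs cut compute by the full 32%):

$$\text{delay} \approx \frac{\log_{10}\!\big(1/(1-0.32)\big)}{1\ \text{OOM/year}} \approx 0.17\ \text{years} \approx 2\ \text{months}$$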

  • I probably should have used a running example in this post - this just seems like a mostly unforced error.
  • I considered writing a conclusion, but decided not to because I wanted to spend the time on other things and I wasn't sure what I would say that was useful and not just a pure restatement of things from earlier. This post is mostly a high level framework + list of considerations, so it doesn't really have a small number of core points.
  • This post is a relatively low effort post, as indicated by "Notes on" in the title; possibly I should have flagged this more.
  • I think comments / in person are easier to understand than my blog posts, as I often try to write blog posts that have lots and lots of content which is all grouped together, but without a specific thesis. In comments / in person I typically have either 1 point or a small number of points. Also, it's easier to write in response to something, as there is an assumed level of context already, etc.

Is this an accurate summary:

  • 3.5 substantially improved performance for your use case, and 3.6 slightly improved performance.
  • The o-series models didn't improve performance on your task. (And presumably 3.7 didn't improve performance either.)

So, by "recent model progress feels mostly like bullshit" I think you basically just mean "reasoning models didn't improve performance on my application and Claude 3.5/3.6 Sonnet is still best". Is this right?

I don't find this state of affairs that surprising:

  • Without specialized scaffolding, o1 is quite a bad agent, and it seems plausible your use case is mostly blocked on this. Even with specialized scaffolding, it's pretty marginal. (This shows up in the benchmarks AFAICT, e.g., see METR's results.)
  • o3-mini is generally a worse agent than o1 (aside from being cheaper). o3 might be a decent amount better than o1, but it isn't released.
  • Generally, Anthropic models are better than other models for real-world coding and agentic tasks, and this mostly shows up in the benchmarks. (I think Anthropic models tend to slightly overperform their benchmarks relative to other models, but they also perform quite well on coding and agentic SWE benchmarks.)
  • I would have guessed you'd see performance gains with 3.7 after coaxing it a bit. (My low confidence understanding is that this model is actually better, but it is also more misaligned and reward hacky in ways that make it less useful.)

METR has found that substantially different scaffolding is most effective for o-series models. I get the sense that they weren't optimized for being effective multi-turn agents. At least, the o1 series wasn't optimized for this; I think o3 may have been.

Sorry if my comment was triggering, @nostalgebraist. :(
