we’ll be releasing Claude 3.5 Haiku and Claude 3.5 Opus later this year.

They made a mini model card. Notably:

The UK AISI also conducted pre-deployment testing of a near-final model, and shared their results with the US AI Safety Institute . . . . Additionally, METR did an initial exploration of the model’s autonomy-relevant capabilities.

It seems that UK AISI only got maximally shallow access, since Anthropic would have said if not, and in particular the model card mentions "internal research techniques to acquire non-refusal model responses" as internal. This is better than nothing, but it would be unsurprising if an evaluator with shallow access is unable to elicit dangerous capabilities but users—with much more time and with access to future elicitation techniques—ultimately are. Recall that DeepMind, in contrast, gave "external testing groups . . . . the ability to turn down or turn off safety filters."

Anthropic CEO Dario Amodei gave Dustin Moskovitz the impression that Anthropic committed "to not meaningfully advance the frontier with a launch." (Plus Gwern, and this was definitely Anthropic's vibe around 2022,[1] although not a hard public commitment.) Perhaps Anthropic does not consider itself bound by this, which might be reasonable — it's quite disappointing that Anthropic hasn't clarified its commitments, particularly after the confusion on this topic around the Claude 3 launch.

  1. E.g. Zac Hatfield-Dodds said Anthropic wanted to avoid "making things worse (advancing capabilities, race dynamics, etc)." Plus nonpublic sources, including stuff discussed in the Claude 3 discourse.


IMO if any lab makes some kind of statement or commitment, you should treat this as "we think right now that we'll want to do this in the future unless it's hard or costly", unless you can actually see how you would sue them or cause a regulator to fine them if they violate the commitment. This doesn't mean weaker statements have no value.

Also you would need clarity on how to measure the commitment.

I'm disappointed that there weren't any non-capability metrics reported. IMO it would be good if companies could at least partly race and market on reliability metrics like "not hallucinating" and "not being easy to jailbreak".

Edit: As pointed out in reply, addendum contains metrics on refusals which show progress, yay! Broader point still stands, I wish there were more measurements and they were more prominent.

Their addendum contains measurements on refusals and harmlessness, though these aren't that meaningful and weren't advertised.

If anyone wants to work on this, there's a contest with $50K and $20K prizes for creating safety relevant benchmarks. https://www.mlsafety.org/safebench


Agree. I think Google DeepMind might actually be the most forthcoming about this kind of thing, e.g., see their Evaluating Frontier Models for Dangerous Capabilities report.

I thought that paper was just dangerous-capability evals, not safety-related metrics like adversarial robustness.


A thing I'd really like to exist is a good auto-interpretability benchmark, e.g. that asks the model about interpreting GPT-2 neurons given max activating examples.

Would be nice, but I was thinking of metrics that require "we've done the hard work of understanding our models and making them more reliable", better neuron explanation seems more like it's another smartness test.


Yeah, I agree it's largely smartness, and I agree that it'd also be nice to have more non-smartness benchmarks—but I think an auto-interp-based thing would be a substantial improvement over current smartness benchmarks.

Maybe we should make fake datasets for this? Neurons often aren't that interpretable and we're still confused about SAE features a lot of the time. It would be nice to distinguish "can do autointerp | interpretable generating function of complexity x" from "can do autointerp". 

Can you say more about why you would want this to exist? Is it just that "do auto-interpretability well" is a close proxy for "model could be used to help with safety research"? Or are you also thinking about deception / sandbagging, or other considerations. 

Funnily enough, Nvidia's recent 340B parameter chat assistant release did boast about being number one on the reward model leaderboard. However, the reward model only claims to capture helpfulness and a bunch of other metrics of usefulness to the individual user. But that's still pretty good.

It's quite possible that Anthropic has some internal definition of "not meaningfully advancing the capabilities frontier" that is compatible with this release. But imo they shouldn't get any credit unless they explain it.


I explicitly asked Anthropic whether they had a policy of not releasing models significantly beyond the state of the art. They said no, and that they believed Claude 3 was noticeably beyond the state of the art at the time of its release. 

And to elaborate a little bit (based on my own understanding, not what they told me) their RSP sort of says the opposite. To avoid a "race to the bottom" they base the decision to deploy a model on what harm it can cause, regardless of what models other companies have released. So if someone else releases a model with potentially dangerous capabilities, Anthropic can't/won't use that as cover to release something similar that they wouldn't have released otherwise. I'm not certain whether this is the best approach, but I do think it's coherent.


Note that ASLs are defined by risk relative to baseline, excluding other advanced AI systems. This means that a model that initially merits ASL-3 containment and deployment measures for national security reasons might later be reduced to ASL-2 if defenses against national security risks (such as biological or cyber defenses) advance, or if dangerous information becomes more widely available. However, to avoid a “race to the bottom”, the latter should not include the effects of other companies’ language models; just because other language models pose a catastrophic risk does not mean it is acceptable for ours to.


I can definitely imagine them plausibly believing they're sticking to that commitment, especially with a sprinkle of motivated reasoning. It's "only" incrementally nudging the publicly available SOTA rather than taking bigger steps like GPT2 --> GPT3 --> GPT4.

VN: Write a program that by default prints a number from the Fibonacci sequence, and on input --next prints a similar program that by default prints the next number in the sequence, and so on.

Claude 3.5 Sonnet: Here's a Python program that meets your requirements:

import sys

# Fibonacci sequence values
a, b = 0, 1

# The current Fibonacci number to print
current = a

# Program content template
program_template = '''import sys

# Fibonacci sequence values
a, b = {}, {}

# The current Fibonacci number to print
current = a

# Program content template
program_template = {!r}

def main():
    if len(sys.argv) > 1 and sys.argv[1] == "--next":
        # Generate the next program
        next_program = program_template.format(b, a + b, program_template)
        print(next_program)
    else:
        # Print the current Fibonacci number
        print(current)

if __name__ == "__main__":
    main()
'''

def main():
    if len(sys.argv) > 1 and sys.argv[1] == "--next":
        # Generate the next program
        next_program = program_template.format(b, a + b, program_template)
        print(next_program)
    else:
        # Print the current Fibonacci number
        print(current)

if __name__ == "__main__":
    main()

This program does the following:

  1. It initializes the first two Fibonacci numbers (a and b) and sets the current number to be printed.
  2. It defines a program_template which is a string representation of the program itself, with placeholders for the Fibonacci values.
  3. The main() function checks if the --next argument is provided:
    • If --next is provided, it generates the next program by formatting the program_template with the next Fibonacci values.
    • If no argument is provided, it simply prints the current Fibonacci number.
  4. The program runs the main() function when executed.

Here's how you can use this program:

  1. Save it to a file, let's say fibonacci_generator.py.
  2. Run it without arguments to print the current Fibonacci number:
python fibonacci_generator.py

This will print 0.

  3. Run it with the --next argument to generate the next program:
python fibonacci_generator.py --next > next_fibonacci.py

This will create a new file next_fibonacci.py with the next program.

  4. You can then run next_fibonacci.py to get the next Fibonacci number, or use --next again to generate the next program in the sequence.

Each generated program will print the next number in the Fibonacci sequence by default, and can generate a program for the subsequent number when run with --next.
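The template trick in step 2 of the explanation can be isolated in a much smaller program. The following is a minimal sketch, not part of Claude's answer: it uses a plain counter instead of Fibonacci numbers, and the names TEMPLATE, make_program and run are hypothetical helpers introduced only for illustration. The point it shows is that `{!r}` inserts the template's own repr, so each generation carries an exact copy of the template it was built from.

```python
import contextlib
import io

# Hypothetical miniature of the same trick: a counter instead of Fibonacci.
# `{}` receives the value for this generation, `{!r}` receives the
# template's own repr, so the generated source can rebuild itself.
TEMPLATE = 'n = {}\nTEMPLATE = {!r}\nprint(TEMPLATE.format(n + 1, TEMPLATE))'


def make_program(n: int) -> str:
    """Return the source of the generation that holds the value n."""
    return TEMPLATE.format(n, TEMPLATE)


def run(source: str) -> str:
    """Execute source in a fresh namespace and capture what it prints."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue()


# Each generation prints the source of the next one.
gen0 = make_program(0)
gen1 = run(gen0).rstrip("\n")
gen2 = run(gen1).rstrip("\n")
```

Here every generation always prints its successor; the program above adds the --next switch and the (a, b) pair on top of the same mechanism.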


It's pretty good. I tried it on a few mathematical questions.

First of all, a version of the standard AIW problem from the recent "Alice in Wonderland" paper. It got this right (not very surprisingly as other leading models also do, at least much of the time). Then a version of the "AIW+" problem which is much more confusing. Its answer was wrong, but its method (which it explained) was pretty much OK and I am not sure it was any wronger than I would be on average trying to answer that question in real time.

Then some more conceptual mathematical puzzles. I took them from recent videos on Michael Penn's YouTube channel. (His videos are commonly about undergraduate or easyish-olympiad-style pure mathematics. They seem unlikely to be in Claude's training data, though of course other things containing the same problems might be.)

One pretty straightforward one: how many distinct factorials can you find that all end in the same number of zeros? It wrote down the correct formula for the number of zeros, then started enumerating particular numbers and got some things wrong, tried to do pattern-spotting, and gave a hilariously wrong answer; when gently nudged, it corrected itself kinda-adequately and gave an almost-correct answer (which it corrected properly when nudged again) but I didn't get much feeling of real understanding.
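For reference, the question can be checked by brute force. This is my own sketch under stated assumptions, not the solution from the video or from Claude: the formula for the number of zeros is taken to be Legendre's formula for the prime 5, and LIMIT is an arbitrary search bound.

```python
from collections import Counter


def trailing_zeros(n: int) -> int:
    """Number of trailing zeros of n!, via Legendre's formula for the prime 5."""
    count = 0
    power = 5
    while power <= n:
        count += n // power
        power *= 5
    return count


LIMIT = 10_000  # arbitrary search bound chosen for illustration

# Group n = 1 .. LIMIT - 1 by the number of trailing zeros of n!.
# n starts at 1 because 0! and 1! are the same number.
group_sizes = Counter(trailing_zeros(n) for n in range(1, LIMIT))
largest_group = max(group_sizes.values())

# Some zero counts never occur at all: 24! ends in 4 zeros, 25! in 6.
missing = [z for z in range(10) if z not in group_sizes]
```

Within this bound the largest group has five members (n = 5k, ..., 5k + 4), which matches the general argument that the zero count only changes at multiples of 5.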

Another (an exercise from Knuth's TAOCP; he rates its difficulty HM22, meaning it needs higher mathematics and should take you 25 minutes or so; it's about the relationship between two functions whose Taylor series coefficients differ by a factor H(n), the n'th harmonic number) it solved straight off and quite neatly.

Another (find all functions with (f(x)-f(y))/(x-y) = f'((x+y)/2) for all distinct x,y) it initially "solved" with a solution with a completely invalid step. When I said I couldn't follow that step, it gave a fairly neat solution that works if you assume f is real-analytic (has a Taylor series expansion everywhere). This is also the first thing that occurred to me when I thought about the problem. When asked for a solution that doesn't make that assumption, it unfortunately gave another invalid solution, and when prodded about that it gave another invalid one. Further prompting, even giving it a pretty big hint in the direction of a nice neat solution (better than Penn's :-)), didn't manage to produce a genuinely correct solution.
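As a quick sanity check on this problem (not the neat solution alluded to above, and only the easy direction), one can verify that every polynomial of degree at most two satisfies the equation. To my understanding these are also the only solutions, but the block below does not prove that.

```latex
\begin{align*}
f(x) &= ax^2 + bx + c, \qquad f'(t) = 2at + b, \\
\frac{f(x) - f(y)}{x - y} &= \frac{a(x^2 - y^2) + b(x - y)}{x - y} = a(x + y) + b, \\
f'\!\left(\frac{x + y}{2}\right) &= 2a \cdot \frac{x + y}{2} + b = a(x + y) + b.
\end{align*}
```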

I rate it "not terribly good undergraduate at a good university", I think, but -- as with all these models to date -- with tragically little "self-awareness", in the sense that it'll give a wrong answer, and you'll poke it, and it'll apologize effusively and give another wrong answer, and you can repeat this several times without making it change its approach or say "sorry, it seems I'm just not smart enough to solve this one" or anything.

On the one hand, the fact that we have AI systems that can do mathematics about as well as a not-very-good undergraduate (and quite a bit faster) is fantastically impressive. On the other hand, it really does feel as if something fairly fundamental is missing. If I were teaching an actual undergraduate whose answers were like Claude's, I'd worry that there was something wrong with their brain that somehow had left them kinda able to do mathematics. I wouldn't bet heavily that just continuing down the current path won't get us to "genuinely smart people really thinking hard with actual world models" levels of intelligence in the nearish future, but I think that's still the way I'd bet.

(Of course a system that's at the "not very good undergraduate" level in everything, which I'm guessing is roughly what this is, is substantially superhuman in some important respects. And I don't intend to imply that it doesn't matter whether Anthropic are lax about what they release just because the latest thing happens not to be smart enough to be particularly dangerous.)

Capability of a chatbot to understand when extensively coached seems to indicate what the next generation will be able to do on its own, and elicitation of this capability is probably less sensitive to details of post-training than seeing what it can do zero-shot or with only oblique nudging. The quine puzzle I posted could only be explained to the strongest preceding models, which were unable to solve it on their own, and can't be explained to even weaker models at all.

So for long-horizon task capabilities, I'm placing some weight on checking if chatbots start understanding unusually patient and detailed in-context instruction on applying general planning or problem-solving skills to particular examples. They seem to be getting slightly better.


That seems reasonable.

My impression (which isn't based on extensive knowledge, so I'm happy to be corrected) is that the models have got better at lots of individual tasks but the shape of their behaviour when faced with a task that's a bit too hard for them hasn't changed much: they offer an answer some part of which is nonsense; you query this bit; they say "I'm sorry, I was wrong" and offer a new answer some different part of which is nonsense; you query this bit; they say "I'm sorry, I was wrong" and offer a new answer some different part of which is nonsense; rinse and repeat.

So far, that pattern doesn't seem to have changed much as the models have got better. You need to ask harder questions to make it happen, because they've got better at the various tasks, but once the questions get hard enough that they don't really understand, back comes the "I'm sorry, I was wrong" cycle pretty much the same as it ever was.

That's what something being impossible to explain looks like, the whack-a-mole pattern of correcting one problem only to get another, and the process never converges on correct understanding. As models improve, things that were borderline possible to explain start working without a need for explanation.

For long-horizon tasks, things that would need to be possible to explain are general reasoning skills (as in How to Solve It, or what it means for something to be an actual proof). The whack-a-mole level of failure would need to go away on questions of validity of reasoning steps or appropriateness of choice of the next step of a plan. The analogy suggests that first it would become possible to explain and discuss these issues, at the level of general skills themselves rather than of the object-level issue that the skills are being applied to. And then another step of scaling would enable a model to do a reasonable job of wielding such skills on its own.

There is an ambiguity here, between whack-a-mole on an object level question, and for example whack-a-mole on explaining to the chatbot the whack-a-mole pattern itself. Even if the pattern remains the same as the feasible difficulty of the object level questions increases for better models, at some point the pattern itself can become such an object level question that's no longer impossible to explain.


I'm suggesting that the fact that things the model can't do produce this sort of whack-a-mole behaviour and that the shape of that behaviour hasn't really changed as the models have grown better at individual tasks may indicate something fundamental that's missing from all models in this class, and that might not go away until some new fundamental insight comes along: more "steps of scaling" might not do the trick.

Of course it might not matter, if the models become able to do more and more difficult things until they can do everything humans can do, in which case we might not be able to tell whether the whack-a-mole failure mode is still there. My highly unreliable intuition says that the whack-a-mole failure mode is related to the planning and "general reasoning" lacunae you mention, and that those might turn out also to be things that models of this kind don't get good at just by being scaled further.

But I'm aware that people saying "these models will never be able to do X" tend to find themselves a little embarrassed when two weeks later someone finds a way to get the models to do X. :-) And, for the avoidance of doubt, I am not saying anything even slightly like "mere computers will never be truly able to think"; only that it seems like there may be a hole in what the class of models that have so far proved most capable can be taught to do, and that we may need new ideas rather than just more "steps of scaling" to fill those holes.

My point was that whack-a-mole behavior is both a thing that the models are doing, and an object level idea that models might be able to understand to a certain extent, an idea playing the same role as a Fibonacci quine (except Fibonacci quines are less important; they don't come up in every third request to a model). As a phenomenon, whack-a-mole or Fibonacci quine is something we can try to explain to a model. And there are three stages of understanding: inability to hold the idea in one's mind at all, ability to hold it after extensive in-context tutoring, and ability to manipulate it without a need for tutoring. Discussing something that should work without a need for discussing it (like avoidance of listless whack-a-mole) is a window into representations a model has, which is the same thing that's needed for it to work without a need for discussing it.

At the stage of complete incomprehension, a Fibonacci quine looks like nonsense that remains nonsense after each correction, even if it becomes superficially better in the one particular respect that the last correction pointed to. This could go on for many generations of models without visible change.

Then at some point it does change, and we arrive at the stage of coached understanding, like with Claude 3 Opus, where asking for a Fibonacci quine results in code that has an exponential-time procedure for computing the n-th Fibonacci number, uses backslashes liberally and tries to cheat by opening files. But then you point out the issues and bugs, and after 15 rounds of back-and-forth it settles into something reasonable. Absolutely not worth it in practice, but it demonstrates that the model is borderline capable of working with the idea. And the immediately following generation of models has Claude 3.5 Sonnet, arriving at the stage of dawning fluency, where its response looks like this (though not yet very robustly).

With whack-a-mole, we are still only getting into the second stage, the current models are starting to become barely capable of noticing that they are falling for this pattern, and only if you point it out to them (as opposed to giving an "it doesn't look like anything to me" impression even after you do point it out). They won't be able to climb out of the pattern unless you give specific instructions for what to do instead of following it. They still fail and need another reminder, and so on. Sometimes it helps with solving the original problem, but only rarely, and it's never worth it if the goal was to solve the problem.

Models can remain between the first and the second stage for many generations, without visible change, which is what you point out in case of whack-a-mole. But once we are solidly in the second stage for general problem solving and planning skills, I expect the immediately following generation of models to start intermittently getting into the third stage, failing gracefully and spontaneously pulling their own train of thought sideways in constructive ways. Which would mean that if you leave them running for millions of tokens, they might waste 95% on silly and repetitive trains of thought, but they would still be eventually making much more progress than weaker models that couldn't course-correct at all.


If it's true that models are "starting to become barely capable of noticing that they are falling for this pattern" then I agree it's a good sign (assuming that we want the models to become capable of "general intelligence", of course, which we might not). I hadn't noticed any such change, but if you tell me you've seen it I'll believe you and accordingly reduce my level of belief that there's a really fundamental hole here.

It's necessary to point it out to the model to see whether it might be able to understand; it doesn't visibly happen on its own, and it's hard to judge how well the model understands what's happening with its behavior unless you start discussing it in detail (which is to a different extent for different models). The process of learning about this I'm following is to start discussing general reasoning skills that the model is failing at when it repeatedly can't make progress on solving some object level problem (instead of discussing details of the object level problem itself). And then I observe how the model is failing to understand and apply the general reasoning skills that I'm explaining.

I'd say the current best models are not yet at the stage where they can understand such issues well when I try to explain, so I don't expect the next generation to become autonomously agentic yet (with any post-training). But they keep getting slightly better at this, with the first glimpses of understanding appearing in the original GPT-4.

Claude 3.5 Sonnet solves 64% of problems on an internal agentic coding evaluation, compared to 38% for Claude 3 Opus. Our evaluation tests a model’s ability to understand an open source codebase and implement a pull request, such as a bug fix or new feature, given a natural language description of the desired improvement.


While Claude 3.5 Sonnet represents an improvement in capabilities over our previously released Opus model, it does not trigger the 4x effective compute threshold at which we will run the full evaluation protocol described in our Responsible Scaling Policy (RSP).

Hmmm, maybe the 4x effective compute threshold is too large given that you're getting near doubling of agentic task performance (on what I think is an eval with particularly good validity) but not hitting the threshold. 

Or maybe at the very least you should make some falsifiable predictions that might cause you to change this threshold. e.g., "If we train a model that has downstream performance (on any of some DC evals) ≥10% higher than was predicted by our primary prediction metric, we will revisit our prediction model and evaluation threshold." 
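The proposed rule is simple enough to write down as a check. This is only a sketch of how such a trigger could be operationalized, under assumptions that the comment does not settle: scores are fractions in [0, 1], the 10% margin is read as absolute (percentage points) rather than relative, and the eval names and numbers are made up.

```python
def evals_exceeding_prediction(predicted, observed, margin=0.10):
    """Names of evals where the observed score beats the prediction by >= margin.

    Assumption: scores are fractions in [0, 1] and `margin` is absolute.
    """
    return [
        name
        for name, predicted_score in predicted.items()
        if observed[name] - predicted_score >= margin
    ]


# Hypothetical eval names and scores, for illustration only.
predicted_scores = {"agentic_coding": 0.45, "cyber": 0.30}
observed_scores = {"agentic_coding": 0.60, "cyber": 0.32}

surprises = evals_exceeding_prediction(predicted_scores, observed_scores)
# A non-empty list would mean: revisit the prediction model and the threshold.
must_revisit = bool(surprises)
```

The value of a rule like this comes from registering the predictions before training finishes; the code only covers the comparison step.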

It is unknown to me whether Sonnet 3.5's performance on this agentic coding evaluation was predicted in advance at Anthropic. It seems wild to me that you can double your performance on a high validity ARA-relevant evaluation without triggering the "must evaluate" threshold; I think evaluation should probably be required in that case, and therefore, if I had written the 4x threshold, I would be reducing it. But maybe those who wrote the threshold were totally game for these sorts of capability jumps? 

I think more to the point is that when deviating from Chinchilla optimality, measuring effective compute becomes misleading: you can span larger increases in effective compute for Chinchilla optimal models by doing a detour through overtrained models. And given the price difference, Claude 3.5 Sonnet is likely more overtrained than Claude 3 Opus.

Let's say we start with a Chinchilla optimal model with N active parameters that trains for D tokens using C FLOPs of compute. We can then train another model with fewer active parameters for more tokens using 3C FLOPs of compute, and get approximately the same performance as with the previous model, but we've now made use of 3 times more compute, below the RSP's 4x threshold. Then, we train the next Chinchilla optimal model with 3N active parameters for 3D tokens using 9C FLOPs of compute, an increase by another 3 times, also below the 4x threshold. But only this second step to the new Chinchilla optimal model increases capabilities, and it uses 9x more compute than the previous Chinchilla optimal model.
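The arithmetic of this two-step detour can be made concrete. The sketch below relies on the common approximations C ≈ 6ND for training compute and D ≈ 20N for Chinchilla optimality. The base size N and the particular parameter/token split of the overtrained model (N/3 parameters, 180N tokens) are assumptions chosen so that the compute ratios come out to 3x; whether such a model really matches the first one's performance depends on the scaling law in question.

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate dense-transformer training compute, C ~ 6 * N * D."""
    return 6.0 * params * tokens


N = 1e9  # hypothetical base parameter count

# Chinchilla-optimal start: D = 20N, so C = 120 * N^2.
base = training_flops(N, 20 * N)

# Overtrained detour (assumed split): fewer parameters, many more tokens.
overtrained = training_flops(N / 3, 180 * N)  # 360 * N^2

# Next Chinchilla-optimal model: 3N parameters, 60N tokens.
next_optimal = training_flops(3 * N, 60 * N)  # 1080 * N^2

step1 = overtrained / base          # about 3x, below a 4x threshold
step2 = next_optimal / overtrained  # about 3x, below a 4x threshold
total = next_optimal / base         # about 9x between the two optimal models
```

Each individual step stays under 4x, while the two Chinchilla optimal models are 9x apart in raw compute.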

It looks like the example you gave pretty explicitly is using “compute” rather than “effective compute”. The point of having the “effective” part is to take into account non-compute progress, such as using more optimal N/D ratios. I think in your example, the first two models would be at the same effective compute level, based on us predicting the same performance.

That said, I haven’t seen any detailed descriptions of how Anthropic is actually measuring/calculating effective compute (iirc they link to a couple papers and the main theme is that you can use training CE loss as a predictor).

I think in your example, the first two models would be at the same effective compute level, based on us predicting the same performance.

This is a reasonable formulation of what "effective compute" could be defined to mean, but is it actually used in this sense in practice, and who uses it like that? Is it plausible it was used when Anthropic was making the claim that "While Claude 3.5 Sonnet represents an improvement in capabilities over our previously released Opus model, it does not trigger the 4x effective compute threshold" that compares a more Chinchilla optimal model to a more overtrained model?

It's an interesting thought, I didn't consider that this sense of "effective compute" could be the intended meaning. I was more thinking about having a compute multiplier measured from perplexity/FLOPs plots of optimal training runs that compare architectures, like in Figure 4 of the Mamba paper, where we can see that Transformer++ (RMSNorm/SwiGLU/etc.) needs about 5 times less compute (2 times less data) than vanilla Transformer to get the same perplexity, so you just multiply physical compute by 5 to find effective compute of Transformer++ with respect to vanilla Transformer. (With this sense of "effective compute", my argument in the grandparent comment remains the same for effective compute as it is for physical compute.)

In particular, this multiplication still makes sense in order to estimate performance for overtrained models with novel architectures, which is why it's not obvious that it won't normally be used like this. So there are two different possible ways of formulating effective compute for overtrained models, which are both useful for different purposes. I was under the impression that simply multiplying by a compute multiplier measured by comparing performance of Chinchilla optimal models of different architectures is how effective compute is usually formulated even for overtrained models, and that the meaning of the other possible formulation that you've pointed out is usually discussed in terms of perplexity or more explicitly Chinchilla optimal models with equivalent performance, not in the language of effective compute.
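The multiplier sense of effective compute described above amounts to a single multiplication. The sketch below assumes the roughly 5x figure quoted in this comment for Transformer++ versus a vanilla Transformer; the physical FLOP counts are made up. It also illustrates the parenthetical point: when two runs share an architecture, the multiplier cancels, so a 3x step in physical compute is still a 3x step in effective compute.

```python
def effective_compute(physical_flops: float, multiplier: float) -> float:
    """Effective compute relative to a baseline architecture (multiplier sense)."""
    return physical_flops * multiplier


TRANSFORMER_PP_MULTIPLIER = 5.0  # figure quoted in the comment, not re-derived here

# Hypothetical physical training budgets for two runs of the same architecture.
run_a_physical = 1e21
run_b_physical = 3e21

run_a_effective = effective_compute(run_a_physical, TRANSFORMER_PP_MULTIPLIER)
run_b_effective = effective_compute(run_b_physical, TRANSFORMER_PP_MULTIPLIER)

# The architecture multiplier cancels in the ratio between the two runs.
physical_ratio = run_b_physical / run_a_physical
effective_ratio = run_b_effective / run_a_effective
```

Under the other formulation (matching a Chinchilla optimal model of equal performance), the ratio for an overtrained run would instead have to be read off a fitted scaling law, which this sketch does not attempt.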

Can someone create a forecasting question about which model will score better in benchmarks, Claude 3.5 Opus or GPT-5?

The race is on.

I'd be interested in chatting about this with you and others — it's not obvious that Anthropic releasing better models makes OpenAI go nontrivially faster / not obvious that "the race" is real.


Not sure about the accuracy of this graph, but the general picture seems to match what companies claim, and the vibe is racing.

Do think that there are distinct questions about "is there a race" vs. "will this race action lead to bad consequences" vs. "is this race action morally condemnable". I'm hoping that this race action is not too consequentially bad, maybe it's consequentially good, maybe it still has negative Shapley value even if expected value is okay. There is some sense in which it is morally icky.

Idk there's probably multiple ways to define racing, some of them are on at least

In my mental model, we're still in the mid-game, not yet in the end-game.


A thing I've been thinking about lately is "what does it mean to shift from the early-to-mid-to-late game". 

In strategy board games, there's an explicit shift from "early game, it's worth spending the effort to build a longterm engine. At some point, you want to start spending your resources on victory points." And a lens I'm thinking through is "how long does it keep making sense to invest in infrastructure, and what else might one do?"

I assume this is a pretty different lens than what you meant to be thinking about right now but I'm kinda curious for whatever-your-own model was of what it means to be in the mid vs late game.

Like, in Chess you start off with a state where many pieces can't move in the early game, in the middle game many pieces are in play moving around and trading, then in the end game it's only a few pieces, you know what the goal is, roughly how things will play out.

In AI it's like only a handful of players, then ChatGPT/GPT-4 came out and now everyone is rushing to get in (my mark of the start of the mid-game), but over time probably many players will become irrelevant or fold as the table stakes (training costs) get too high.

In my head the end-game is when the AIs themselves start becoming real players.

Really, the race started more when OpenAI released GPT-4. It's been going on for a while, and this is just another event that makes it clear.

To be clear, I think the race was already kind of on, it's not clear how much this specific action gets credit assignment and it's spread out to some degree. Also not clear if there's really a viable alternative strategy here...