Things I post should be considered my personal opinions, not those of any employer, unless stated otherwise.
We care about the performance prediction at a given point in time for skills like "take over the world", "invent new science", and "do RSI" (and "automate AI R&D", which I think the benchmark does speak to). We would like to know when those skills will be developed.
In the frame of this benchmark, and Thomas and Vincent's follow up work, it seems like we're facing down at least three problems:
So my overall take is that I think the current work I'm aware of tells us
That you think are sufficient to meaningfully change how valid people should interpret the overall predicted trend?
I'm not Adam, but my response is "No", based on the description Megan copied in thread and skimming some of the paper. It's good that the paper includes those experiments, but they don't really speak to the concerns Adam is discussing. Those concerns, as I see it (I could be misunderstanding):
Do the experiments in Sec 6 deal with this?
We rated HCAST and RE-Bench tasks on 16 properties that we expected to be 1) representative of how real world tasks might be systematically harder than our tasks and 2) relevant to AI agent
performance. Some example factors include whether the task involved a novel situation, was constrained by a finite resource, involved real-time coordination, or was sourced from a real-world
context. We labeled RE-bench and HCAST tasks on the presence or absence of these 16 messiness
factors, then summed these to obtain a “messiness score” ranging from 0 to 16. Factor definitions
can be found in Appendix D.4.The mean messiness score amongst HCAST and RE-Bench tasks is 3.2/16. None of these tasks have a messiness score above 8/16. For comparison, a task like ’write a good research paper’ would score between 9/16 and 15/16, depending on the specifics of the task.
On HCAST tasks, AI agents do perform worse on messier tasks than would be predicted from the
task’s length alone (b=-0.081, R2 = 0.251) ...However, trends in AI agent performance over time are similar for lower and higher messiness
subsets of our tasks.
This seems like very weak evidence in favor of the hypothesis that Benchmark Bias is a big deal. But they just don't have very messy tasks.
c. SWE-Bench Verified: doesn't speak to 1 or 2.
d. Internal PR experiments: Maybe speaks a little to 1 and 2 because these are more real world, closer to the thing we care about tasks, but not much, as they're still clearly verifiable and still software engineering.
I do think Thomas and Vincent's follow up work here on time horizons for other domains is useful evidence pointing a little against the conceptual coherence objection. But only a little.
Any guesses for what's going on in Wichers et al. 3.6.1?
3.6.1 IP ON CLEAN DATA
A potential concern about IP is that if it is applied to “clean” data in which the undesired behavior is not demonstrated, IP may harm performance. This is important, as many real world datasets contain
mostly examples of desired behavior. We test this by applying IP in our settings on clean data. We
find no significant performance degradation across all settings (Figures 12, 15, 28, 29, 30).
It seems like this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt.
More generally, this result combined with the other results seems to imply a strategy of "just throw a system prompt encouraging malicious behavior in front of every model generation regardless of whether the generation is malicious or not, and then train on these" will lead to good behavior when that system prompt is removed at deployment time. That just seems implausible.
Hey, thanks for the feedback! I helped write this section. A few notes:
I think you're right that comparing to consumer GPUs might make more sense, but I think comparing to other computers is still acceptable. I agree that GPUs is where you start running into prohibitions first. But I think it’s totally fair to compare to “average computers” because one of the main things I care about is the cost of the treaty. It’s not so bad if we have to ban top of the line consumer GPUs, but it would be very costly / impossible if we have to ban consumer laptops. So comparing to both of these is reasonable.
The text says "consumer CPUs" because this is what is discussed in the relevant source, and I wanted to stick to that. Due to some editing that happened, it might not have been totally clear where the claim was coming from. The text has been updated and now there's a clear footnote.
I know that "consumer CPUs" is not literally the best comparison for, say, consumer laptops. For example, macbooks have an integrated CPU-GPU. I think it is probably true that H100s are like 3-300x better than most consumer laptops at AI tasks, but to my knowledge there is no good citable work explaining this for a wide variety of consumer hardware (I have some mentees working on it now, maybe in a month or two there will be good work!).
I'll toss out there that NVIDIA sells personal or desktop GPUs that are marketed as for AI (like this one). These are quite powerful, often within 3x of the datacenter GPUs in terms of most of their performance. I expect these to get categorized as “AI chips” under the treaty and thus become controlled. The difference between H100s and top consumer GPUs is not 1000x, and it probably isn’t even 10x. In this tentative draft treaty, we largely try to punt questions like "what exactly counts as an AI chip" to the hypothetical technical body that helps implement the treaty, and my current opinions about this are weak.
Thanks for your reply. Noting that it would have been useful for my understanding if you had also directly answered the 2 clarifying questions I asked.
There are a lot of bad things that AIs can do before literally taking over the world.
Okay, it does sound like you're saying we can learn from problems A, B, and C in order to inform D. Where D is the model tries to take over once it is smart enough. And A is like jailbreak-ability and B is goal preservation. It seems to me like somebody who wants humanity to gamble on the superalignment strategy (or otherwise build ASI systems at all, though superalignment is a marginally more detailed plan) needs to argue that our methods for dealing with A, B, and C are very likely to generalize to D.
Maybe I'm misunderstanding though, it's possible that you mean the same AIs that want to eventually take over will also take a bunch of actions to tip their hand earlier on. This seems mostly unlikely to me, because that's an obviously dumb strategy and I expect ASIs to not pursue dumb strategies. I agree that current AIs do dumb things like this, but these are not the AIs I'm worried about.
Whether that would be enough to take over the world at that point in time is a different questoin.
To repeat my second clarifying question from above, do you believe that at some point there will be AIs that could succeed at takeover if they tried? If we were talking about the distribution shift that a football team undergoes from training to Game Day, and you didn't think the game would ever happen, that sounds like it's the real crux, not some complicated argument about how well the training drills match the game.
Either I don't understand your view about continuous development and threshold things, or I think it's probably wrong. I've decided to spin this out into its own comment, though I originally wrote it as part of this one.
I'm going to think out loud: One story I could tell for how we cross the gap from "the AI would not succeed at taking over" to "it would succeed if it tried", in a continuous manner, is as @Nina Panickssery points out in a tweet, "The AI could be more or less likely to succeed".
(My understanding is that we're actually focusing on the AI's beliefs about its likelihood of success, which I'll call AI-p(success), not p(success) according to some observer—the AI's beliefs are what shape its decisions.) So here's an example model: maybe as the AI gets smarter or acquires more resources, its p(success) rises at 1 percentage point per day. We think the AI might do something like an expected value calculation, only trying to takeover when p(success) crosses some threshold. (But, like, we don't know that that's true. And we don't know what the threshold is.) If the AI is doing some threshold thing like this, then it looks all good and chill until day 23 or whatever when the threshold is hit and the AI fully commits to a takeover. Under that AI decision procedure, it looks like by default we get this rough discontinuity where we don't get any more iteration after day 22.
Now imagine a really cautious AI company, maybe during each day they try really hard to trick some instance of the AI that its AI-p(success) is actually higher than it is, and they're able to push up AI-p(success) by 5 percentage points even while the actual likelihood of takeover is not inflated by 5 points (maybe they also do extra security measures for this AI, and together the adversarial testing and security create a 5% spread). So then on day 18 the AI tries to take over and we try not to be unlucky on an 18% takeover chance. And if we succeed at containing the AI, maybe we can run more experiments on this 18 day AI. What experiments are we going to run? I guess we'll try to make it so that even when AI-p(success) exceeds 23% the AI does not try to take over, but idk how.
There are many issues with this plan:
I could be totally misunderstanding Nina's idea, this is all very complicated.
I'm not sure what Eliezer thinks, but I don't think it's true that "you cannot draw any useful lessons from [earlier] cases", and that seems like a strawman of the position. They make a bunch of analogies in the book, like you launch a rocket ship, and after it's left the ground, your ability to make adjustments is much lower; sure you can learn a bunch in simulation and test exercises and laboratory environments, but you are still crossing some gap (see p. ~163 in the book for full analogy). There are going to be things about the Real Deal deployment that you were not able to test for. One of those things for AI is that "try to take over" is a more serious strategy, somewhat tautologically because the book defines the gap as:
Before, the AI is not powerful enough to kill us all, nor capable enough to resist our attempts to change its goals. After, the artificial superintelligence must never try to kill us, because it would succeed. (p. 161)
I don't see where you are defusing this gap or making it nicely continuous such that we could iteratively test our alignment plans as we cross it.
It seems like maybe you're just accepting that there is this one problem that we won't be able to get direct evidence about in advance, but you're optimistic that we will learn from our efforts to solve various other AI problems which will inform this problem.
When you say "by studying which alignment methods scale and do not scale, we can obtain valuable information", my interpretation is that you're basically saying "by seeing how our alignment methods work on problems A, B, and C, we can obtain valuable information about how they will do on separate problem D". Is that right?
Just to confirm, do you believe that at some point there will be AIs that could succeed at takeover if they tried? Sometimes I can't tell if the sticking point is that people don't actually believe in the second regime.
I don't believe that there is going to be an alignment technique that works one day and completely fails after a 200K GPU 16 hour run.
There are rumors that many capability techniques work well at a small scale but don't scale very well. I'm not sure this is well studied, but if it was, that would give us some evidence about this question. Another relevant result that comes to mind is reward hacking and Goodharting where often models look good when only a little optimization pressure is applied but then it's pretty easy to overoptimize as you scale u; as I think about these examples it actually seems like this phenomenon is pretty common? And sure, we can quibble about how much optimization pressure is applied in current RL vs. some unknown parallel scaling method, but it seems quite plausible that things will be different at scale and sometimes for the worse.
it is easier for me to get speakers from OpenAI
FWIW, I’m a bit surprised by this. I’ve heard of many AI safety programs succeeding in getting a bunch of interesting speakers from across the field. In case you haven’t tried very hard, consider sending some cold emails, the hit rate is often high in this community.
I think this is true, but I also don't think seeing the skulls implies actually dealing with them (and wish Scott's post was crossposted here so I could argue with it). Like, a critique of AI evaluations that people could have been making for the last 5+ years (probably even 50) and which remains true today is "Evaluations do a poor job measuring progress toward AGI because they lack external validity. They test scenarios that are much narrower, well defined, more contrived, easier to evaluate, etc. compared to the skills that an AI would need to be able to robustly do in order for us to call it AGI." I agree that METR is well aware of this critique, but the critique is still very much true of HCAST, RE-Bench, and SWAA. Folks at METR seem especially forward about discussing the limitations of their work in this regard, and yet the critique is still true. (I don't think I'm disagreeing with you at all)