Aaron_Scher — LessWrong

Things I post should be considered my personal opinions, not those of any employer, unless stated otherwise.

https://ascher8.github.io/

they have indeed seen the skulls

I think this is true, but I also don't think seeing the skulls implies actually dealing with them (and wish Scott's post was crossposted here so I could argue with it). Like, a critique of AI evaluations that people could have been making for the last 5+ years (probably even 50) and which remains true today is "Evaluations do a poor job measuring progress toward AGI because they lack external validity. They test scenarios that are much narrower, well defined, more contrived, easier to evaluate, etc. compared to the skills that an AI would need to be able to robustly do in order for us to call it AGI." I agree that METR is well aware of this critique, but the critique is still very much true of HCAST, RE-Bench, and SWAA. Folks at METR seem especially forward about discussing the limitations of their work in this regard, and yet the critique is still true. (I don't think I'm disagreeing with you at all)

We care about the performance prediction at a given point in time for skills like "take over the world", "invent new science", and "do RSI" (and "automate AI R&D", which I think the benchmark does speak to). We would like to know when those skills will be developed.

In the frame of this benchmark, and Thomas and Vincent's follow up work, it seems like we're facing down at least three problems:

The original time horizons tasks are clearly out of the distribution we care about. Solution: create a new task suite we think is the right distribution.
We don't know how well time horizons will do at predicting future capabilities, even in this domain. Solution: keep collecting new data as it comes out in order to test predictions on whatever distributions we have, examine things like the conceptual coherence objection and try to make progress.
We don't know how well the general "time horizons" approach works across domains. We have some data on this in the follow up work, maybe it's a 2:1 update from a 1:1 prior?

So my overall take is that I think the current work I'm aware of tells us

Small positive update on time horizons being predictive at all.
A small positive update on the specific Software Engineering trends being predictive within distribution.
Small positive update on "time horizons" being common across different reasonable and easy to define distributions.
And on "doubling time in the single digit months" being the rate of time horizon increase across many domains.
A small negative update on the specific time horizon length from one task distribution generalizing to other task distributions (maybe an update, tbh the prior is much lower than 50/50). So it tells us approximately nothing about the performance prediction at a given point in time for the capabilities I care about.

That you think are sufficient to meaningfully change how valid people should interpret the overall predicted trend?

I'm not Adam, but my response is "No", based on the description Megan copied in thread and skimming some of the paper. It's good that the paper includes those experiments, but they don't really speak to the concerns Adam is discussing. Those concerns, as I see it (I could be misunderstanding):

Conceptual coherence: in humans there are different skills, e.g., between different fields, that don't seem to easily project onto a time horizon dimension. Or like, our sense of how much intelligence is required for them or how difficult they are does not correspond all that closely with the time taken to do them.
Benchmark bias: solution criteria is known and progress criteria is often known; big jump from that to the real world scary things we're worried about.

Do the experiments in Sec 6 deal with this?

No SWAA ("Retrodiction from 2023–2025 data"): Does not deal with 2. Mostly does not deal with 1, as both HCAST + RE-Bench and All-3 are mostly sofware engineerig dominated with a little bit of other stuff.
Messiness factors: Does not speak to 1. This is certainly relevant to 2, but I don't think it's conclusive. Quoting from the paper some:

We rated HCAST and RE-Bench tasks on 16 properties that we expected to be 1) representative of how real world tasks might be systematically harder than our tasks and 2) relevant to AI agent
performance. Some example factors include whether the task involved a novel situation, was constrained by a finite resource, involved real-time coordination, or was sourced from a real-world
context. We labeled RE-bench and HCAST tasks on the presence or absence of these 16 messiness
factors, then summed these to obtain a “messiness score” ranging from 0 to 16. Factor definitions
can be found in Appendix D.4.
The mean messiness score amongst HCAST and RE-Bench tasks is 3.2/16. None of these tasks have a messiness score above 8/16. For comparison, a task like ’write a good research paper’ would score between 9/16 and 15/16, depending on the specifics of the task.
On HCAST tasks, AI agents do perform worse on messier tasks than would be predicted from the
task’s length alone (b=-0.081, R2 = 0.251) ...
However, trends in AI agent performance over time are similar for lower and higher messiness
subsets of our tasks.

This seems like very weak evidence in favor of the hypothesis that Benchmark Bias is a big deal. But they just don't have very messy tasks.

c. SWE-Bench Verified: doesn't speak to 1 or 2.

d. Internal PR experiments: Maybe speaks a little to 1 and 2 because these are more real world, closer to the thing we care about tasks, but not much, as they're still clearly verifiable and still software engineering.

I do think Thomas and Vincent's follow up work here on time horizons for other domains is useful evidence pointing a little against the conceptual coherence objection. But only a little.

Any guesses for what's going on in Wichers et al. 3.6.1?

3.6.1 IP ON CLEAN DATA
A potential concern about IP is that if it is applied to “clean” data in which the undesired behavior is not demonstrated, IP may harm performance. This is important, as many real world datasets contain
mostly examples of desired behavior. We test this by applying IP in our settings on clean data. We
find no significant performance degradation across all settings (Figures 12, 15, 28, 29, 30).

It seems like this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt.

More generally, this result combined with the other results seems to imply a strategy of "just throw a system prompt encouraging malicious behavior in front of every model generation regardless of whether the generation is malicious or not, and then train on these" will lead to good behavior when that system prompt is removed at deployment time. That just seems implausible.

Hey, thanks for the feedback! I helped write this section. A few notes:

I think you're right that comparing to consumer GPUs might make more sense, but I think comparing to other computers is still acceptable. I agree that GPUs is where you start running into prohibitions first. But I think it’s totally fair to compare to “average computers” because one of the main things I care about is the cost of the treaty. It’s not so bad if we have to ban top of the line consumer GPUs, but it would be very costly / impossible if we have to ban consumer laptops. So comparing to both of these is reasonable.

The text says "consumer CPUs" because this is what is discussed in the relevant source, and I wanted to stick to that. Due to some editing that happened, it might not have been totally clear where the claim was coming from. The text has been updated and now there's a clear footnote.

I know that "consumer CPUs" is not literally the best comparison for, say, consumer laptops. For example, macbooks have an integrated CPU-GPU. I think it is probably true that H100s are like 3-300x better than most consumer laptops at AI tasks, but to my knowledge there is no good citable work explaining this for a wide variety of consumer hardware (I have some mentees working on it now, maybe in a month or two there will be good work!).

I'll toss out there that NVIDIA sells personal or desktop GPUs that are marketed as for AI (like this one). These are quite powerful, often within 3x of the datacenter GPUs in terms of most of their performance. I expect these to get categorized as “AI chips” under the treaty and thus become controlled. The difference between H100s and top consumer GPUs is not 1000x, and it probably isn’t even 10x. In this tentative draft treaty, we largely try to punt questions like "what exactly counts as an AI chip" to the hypothetical technical body that helps implement the treaty, and my current opinions about this are weak.

Thanks for your reply. Noting that it would have been useful for my understanding if you had also directly answered the 2 clarifying questions I asked.

There are a lot of bad things that AIs can do before literally taking over the world.

Okay, it does sound like you're saying we can learn from problems A, B, and C in order to inform D. Where D is the model tries to take over once it is smart enough. And A is like jailbreak-ability and B is goal preservation. It seems to me like somebody who wants humanity to gamble on the superalignment strategy (or otherwise build ASI systems at all, though superalignment is a marginally more detailed plan) needs to argue that our methods for dealing with A, B, and C are very likely to generalize to D.

Maybe I'm misunderstanding though, it's possible that you mean the same AIs that want to eventually take over will also take a bunch of actions to tip their hand earlier on. This seems mostly unlikely to me, because that's an obviously dumb strategy and I expect ASIs to not pursue dumb strategies. I agree that current AIs do dumb things like this, but these are not the AIs I'm worried about.

Whether that would be enough to take over the world at that point in time is a different questoin.

To repeat my second clarifying question from above, do you believe that at some point there will be AIs that could succeed at takeover if they tried? If we were talking about the distribution shift that a football team undergoes from training to Game Day, and you didn't think the game would ever happen, that sounds like it's the real crux, not some complicated argument about how well the training drills match the game.

Either I don't understand your view about continuous development and threshold things, or I think it's probably wrong. I've decided to spin this out into its own comment, though I originally wrote it as part of this one.

I'm going to think out loud: One story I could tell for how we cross the gap from "the AI would not succeed at taking over" to "it would succeed if it tried", in a continuous manner, is as @Nina Panickssery points out in a tweet, "The AI could be more or less likely to succeed".

(My understanding is that we're actually focusing on the AI's beliefs about its likelihood of success, which I'll call AI-p(success), not p(success) according to some observer—the AI's beliefs are what shape its decisions.) So here's an example model: maybe as the AI gets smarter or acquires more resources, its p(success) rises at 1 percentage point per day. We think the AI might do something like an expected value calculation, only trying to takeover when p(success) crosses some threshold. (But, like, we don't know that that's true. And we don't know what the threshold is.) If the AI is doing some threshold thing like this, then it looks all good and chill until day 23 or whatever when the threshold is hit and the AI fully commits to a takeover. Under that AI decision procedure, it looks like by default we get this rough discontinuity where we don't get any more iteration after day 22.

Now imagine a really cautious AI company, maybe during each day they try really hard to trick some instance of the AI that its AI-p(success) is actually higher than it is, and they're able to push up AI-p(success) by 5 percentage points even while the actual likelihood of takeover is not inflated by 5 points (maybe they also do extra security measures for this AI, and together the adversarial testing and security create a 5% spread). So then on day 18 the AI tries to take over and we try not to be unlucky on an 18% takeover chance. And if we succeed at containing the AI, maybe we can run more experiments on this 18 day AI. What experiments are we going to run? I guess we'll try to make it so that even when AI-p(success) exceeds 23% the AI does not try to take over, but idk how.

There are many issues with this plan:

We don't actually know the decision process of the AI and don't know if this model is at all right. Maybe instead it has the totally illogical decision process of trying hard to takeover in proportion to AI-p(success), so on day 1 it tries very little, on day 10 some, and on day 50 it tries pretty hard; that seems like an insanely bad strategy (and humanity might actually win if the AIs always took strategies like this), but there's a bunch of other strategies the AI could be taking, some of which could be better or worse than the model here.
Even if we know the high-level decision process of the AI, we don't know the threshold. (Aside: I assume the AI is well calibrated and a good forecaster, other than the modifications we make, i.e., its p(success) is accurate). If the threshold is high, then we are basically screwed in this scenario. Like if the AI follows the policy "only attempt take over if AI-p(success) ≥ 99%", then we're taking on a huge amount of risk (i.e., dying) unless we could somehow artificially modulate AI-p(success) by a ton. Like, using the numbers above, then the AI tries to takeover when its actual chance of success is actually merely 94%, ruh roh.
It's not obvious that we'll be able to measure AI-p(success) with much confidence. Even if we knew the threshold, we probably can't actually measure the thing. AI mind reading seems kinda hard, and while I definitely think it's plausible that we'll get this, I might be at like 50/50. Doing the above scheme without being able to measure AI-p(success) seems extremely difficult due to not knowing if your methods for modifying AI-p(success) are working.
I don't expect us to know observer-p(success) with much confidence. It seems like most of the claims from AI companies about current AI capabilities are pretty soft, unconfident claims. For instance, OpenAI, Anthropic, and Google DM have all said in the last few months that their frontier models might be capable of significant CBRN (bio especially) uplift for unskilled actors, but they aren't sure. OpenAI: "We do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm, our defined threshold for High capability, and the model remains on the cusp of being able to reach this capability."Anthropic: "Although we have not determined whether the ASL-3 Standard is required for this model". Google: "Further study is required to reach our final assessment of whether the model has reached the CCL for CBRN Uplift Level 1." Now maybe this is just corporate hedging, but I think it's way more likely, and supported by a preponderance of evidence, that assessing AI capabilities, especially for real world tasks, is hard to do and we are currently bad at it. I basically expect that in a couple years we'll enter a regime of "the experts do not know if frontier AIs would be able to successfully escape their containment or successfully takeover if they tried, it seems like at least 1% likely" and probably our p(success) will go up as AI capabilities increase, but it won't be well justified ever and there will be significant disagreement about the number among experts.
It's unclear if we'll be able to change AI-p(success) much. 5 percentage points felt reasonable but idk. This is certainly a place where if we don't try very hard at this kind of red-teaming, then we certainly won't succeed at this part.
Unfortunately, it looks like it might be dangerous to experiment with the 18-day AI, on account of it having an 18% chance of successful take over. It's unclear how correlated outcomes will be across different experiments (i.e., are you rolling 18% each time or not), especially considering how the humans can potentially patch various pathways that different instances take when attempting takeover. But gosh, looks rough.
It's not clear what experiments one would run with the 18-day AI that would actually help cross the gap robustly. For instance, maybe we'll just end up pushing the threshold up a couple days or changing the AI's decision process; it seems like there are a lot of ways we could fail at this step.

I could be totally misunderstanding Nina's idea, this is all very complicated.

I'm not sure what Eliezer thinks, but I don't think it's true that "you cannot draw any useful lessons from [earlier] cases", and that seems like a strawman of the position. They make a bunch of analogies in the book, like you launch a rocket ship, and after it's left the ground, your ability to make adjustments is much lower; sure you can learn a bunch in simulation and test exercises and laboratory environments, but you are still crossing some gap (see p. ~163 in the book for full analogy). There are going to be things about the Real Deal deployment that you were not able to test for. One of those things for AI is that "try to take over" is a more serious strategy, somewhat tautologically because the book defines the gap as:

Before, the AI is not powerful enough to kill us all, nor capable enough to resist our attempts to change its goals. After, the artificial superintelligence must never try to kill us, because it would succeed. (p. 161)

I don't see where you are defusing this gap or making it nicely continuous such that we could iteratively test our alignment plans as we cross it.

It seems like maybe you're just accepting that there is this one problem that we won't be able to get direct evidence about in advance, but you're optimistic that we will learn from our efforts to solve various other AI problems which will inform this problem.

When you say "by studying which alignment methods scale and do not scale, we can obtain valuable information", my interpretation is that you're basically saying "by seeing how our alignment methods work on problems A, B, and C, we can obtain valuable information about how they will do on separate problem D". Is that right?

Just to confirm, do you believe that at some point there will be AIs that could succeed at takeover if they tried? Sometimes I can't tell if the sticking point is that people don't actually believe in the second regime.

I don't believe that there is going to be an alignment technique that works one day and completely fails after a 200K GPU 16 hour run.

There are rumors that many capability techniques work well at a small scale but don't scale very well. I'm not sure this is well studied, but if it was, that would give us some evidence about this question. Another relevant result that comes to mind is reward hacking and Goodharting where often models look good when only a little optimization pressure is applied but then it's pretty easy to overoptimize as you scale u; as I think about these examples it actually seems like this phenomenon is pretty common? And sure, we can quibble about how much optimization pressure is applied in current RL vs. some unknown parallel scaling method, but it seems quite plausible that things will be different at scale and sometimes for the worse.

it is easier for me to get speakers from OpenAI

FWIW, I’m a bit surprised by this. I’ve heard of many AI safety programs succeeding in getting a bunch of interesting speakers from across the field. In case you haven’t tried very hard, consider sending some cold emails, the hit rate is often high in this community.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments