LawrenceC

I do AI Alignment research. Currently independent, but previously at: METR, Redwood, UC Berkeley, Good Judgment Project. 

I'm also a part-time fund manager for the LTFF.

Obligatory research billboard website: https://chanlawrence.me/

Sequences

(Lawrence's) Reflections on Research
[Redwood Research] Causal Scrubbing

Wiki Contributions

Comments

I think the key takeaway I wanted people to get is that superposition is something novel and non-trivial, and isn't just a standard polysemantic neuron thing. I wrote this post in response to two interactions where people assumed that superposition was just polysemanticity. 

It turned out that a substantial fraction of the post went the other way (i.e. talking about non-superposition polysemanticity), so maybe?

Also, have you looked at the dot product of each of the SAE directions/SAE-reconstructed representations with the ImageNet labels fed through the text encoder?
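To make the suggestion concrete, here's a rough sketch (my own, not the authors' code) of the comparison I have in mind, assuming you have the SAE decoder directions as a matrix of residual-stream vectors. The ViT-B/32 checkpoint, the prompt template, and mapping the directions into the shared embedding space via `ln_post` + `visual.proj` are assumptions on my part; the same mapping would apply to SAE-reconstructed representations.

```python
import torch
import clip  # https://github.com/openai/CLIP

model, _ = clip.load("ViT-B/32", device="cpu")  # stand-in for whichever CLIP ViT was used

# Placeholder: first few ImageNet class names; in practice use all 1000.
imagenet_classes = ["tench", "goldfish", "great white shark"]
text_tokens = clip.tokenize([f"a photo of a {c}" for c in imagenet_classes])
with torch.no_grad():
    text_features = model.encode_text(text_tokens).float()
text_features /= text_features.norm(dim=-1, keepdim=True)

# Hypothetical SAE decoder directions living in the ViT residual stream (random here).
d_resid = model.visual.ln_post.normalized_shape[0]
sae_directions = torch.randn(16, d_resid)
with torch.no_grad():
    # Map into CLIP's shared image/text embedding space before comparing.
    projected = model.visual.ln_post(sae_directions) @ model.visual.proj
projected = projected / projected.norm(dim=-1, keepdim=True)

# Cosine similarity between each SAE direction and each class-label embedding.
similarity = projected @ text_features.T  # [n_features, n_classes]
print(similarity.shape)
```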

Cool work!

As with Arthur, I'm pretty surprised by how much easier vision seems to be than text for interp (in line with previous results). It makes sense why feature visualization and adversarial attacks work better with continuous inputs, but if it is true that you need fewer datapoints to recover concepts of comparable complexity, I wonder if it's a statement about image datasets or about vision in general (e.g. "abstract" concepts are more useful for prediction, since the n-gram/skip n-gram/syntactical feature baseline is much weaker).

I think the most interesting result to me is the one where the loss went down (!!):

Note that the model with the SAE attains a lower loss than the original model. It is not clear to me why this is the case. In fact, the model with the SAE gets a lower loss than the original model within 40 000 training tokens.

My guess is this happens because CLIP wasn't trained on ImageNet -- but instead on a much larger dataset that comes from a different distribution. A lot of the SAE residual probably consists of features that are useful on the larger dataset, but not on ImageNet. If you extract the directions of variation on ImageNet instead of OAI's 400m image-text pair dataset, it makes sense why reconstructing inputs using only these directions leads to better performance on the dataset you found these directions on.

I'm not sure how you computed the contrastive loss here -- is it just the standard contrastive loss, but on image pairs instead of image/text pairs (using the SAE'ed ViT for both representations), or did you use the contextless class label as the text input here (only SAE'ing the ViT part but not the text encoder)? Either way, this might add additional distributional shift.

(And I could be misunderstanding what you did entirely, and you actually looked at the contrastive loss on the original dataset somehow, in which case the explanation I gave above doesn't apply.)
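For reference, here is a minimal sketch of the standard CLIP-style symmetric contrastive loss I have in mind when asking the above (shapes and temperature are illustrative): `image_emb` would come from the SAE'd ViT, and `text_emb` either from a second image-encoder pass (the image-pair reading) or from the text encoder run on contextless class labels (the second reading).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE loss over a batch of matched (image, text) embedding pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature  # [batch, batch] similarity matrix
    targets = torch.arange(len(image_emb))         # the i-th image matches the i-th text
    loss_i = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t = F.cross_entropy(logits.T, targets)    # text -> image direction
    return (loss_i + loss_t) / 2

# e.g. loss = clip_contrastive_loss(sae_vit_embeddings, text_encoder_embeddings)
```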

LawrenceC3dΩ8138

To be clear: I don't think the results here are qualitatively more grounded than e.g. other work in the activation steering/linear probing/representation engineering space. My comment was a defense of studying harmlessness in general and less so of this work in particular.

If the objection isn't about this work vs other rep eng work, I may be confused about what you're asking about. It feels pretty obvious that this general genre of work (studying non-cherry-picked phenomena using basic linear methods) is as a whole more grounded than a lot of mech interp tends to be? And I feel like it's pretty obvious that addressing issues with current harmlessness training, if it improves on the state of the art, is "more grounded" than "we found a cool SAE feature that correlates with X and Y!"? In the same way that just doing AI control experiments is more grounded than circuit discovery on algorithmic tasks.

But I think it's quite important for minimising misuse of models, which is also important:

To put it another way, things can be important even if they're not existential. 

LawrenceC4dΩ111810

I agree pretty strongly with Neel's first point here, and I want to expand on it a bit: one of the biggest issues with interp is fooling yourself and thinking you've discovered something profound when in reality you've misinterpreted the evidence. Sure, you've "understood grokking"[1] or "found induction heads", but why should anyone think that you've done something "real", let alone something that will help with future dangerous AI systems? Getting rigorous results in deep learning in general is hard, and it seems empirically even harder in (mech) interp. 

You can try to get around this by being extra rigorous and building from the ground up anyways. If you can present a ton of compelling evidence at every stage of resolution for your explanation, which in turn explains all of the behavior you care about (let alone a proof), then you can be pretty sure you're not fooling yourself. (But that's really hard, and deep learning especially has not been kind to this approach.) Or, you can try to do something hard and novel on a real system, that can't be done with existing knowledge or techniques. If you succeed at this, then even if your specific theory is not necessarily true, you've at least shown that it's real enough to produce something of value. (This is a fancy way of saying, "new theories should make novel predictions/discoveries and test them if possible".)

From this perspective, studying refusal in LLMs is not necessarily more x-risk relevant than studying, say, why LLMs seem to hallucinate, why linear probes seem to be so good for many use cases (and where they break), or the effects of helpfulness/agency/tool-use finetuning in general. (And I suspect that poking hard at some of the weird results from the cyborgism crowd may be more relevant.) But it's a hard topic that many people care about, and so succeeding here provides a better argument for the usefulness of their specific model internals based approach than studying something more niche.

  • It's "easier" to study harmlessness than other comparably important or hard topics. Not only is there a lot of financial interest from companies, there's also a lot of supporting infrastructure already in place to study harmlessness. If you wanted to study the exact mechanism by which Gemini Ultra is e.g. so good at confabulating undergrad-level mathematical theorems, you'd immediately run into the problem that you don't have Gemini internals access (and even if you did, the code is almost certainly not set up for easily poking around inside the model). But if you study a mechanism like refusal training, where there are open-source models that are refusal trained and where datasets and prior work are plentiful, you're able to leverage existing resources.
  • Many of the other things AI Labs are pushing hard on are just clear capability gains, which many people morally object to. For example, I'm sure many people would be very interested if mech interp could significantly improve pretraining, or suggest more efficient sparse architectures. But I suspect most x-risk focused people would not want to contribute to these topics. 

Now, of course, there are the standard reasons why it's bad to study popular/trendy topics, including conflating your line of research with contingent properties of the topics (AI Alignment is just RLHF++, AI Safety is just harmlessness training), getting into a crowded field, being misled by prior work, etc. But I'm a fan of model internals researchers (esp mech interp researchers) applying their research to problems like harmlessness, even if it's just to highlight the way in which mech interp is currently inadequate for these applications.

Also, I would be upset if people started going "the reason this work is x-risk relevant is because of preventing jailbreaks" unless they actually believed this, but this is more of a general distaste for dishonesty as opposed to jailbreaks or harmlessness training in general. 

(Also, harmlessness training may be important under some catastrophic misuse scenarios, though I struggle to imagine a concrete case where end-user-side jailbreak-style catastrophic misuse causes x-risk in practice, before we get more direct x-risk scenarios from e.g. people just finetuning their AIs in dangerous ways.)

  1. ^

    For example, I think our understanding of Grokking in late 2022 turned out to be importantly incomplete. 

LawrenceC5dΩ220

Thanks!

I was grouping that with “the computation may require mixing together ‘natural’ concepts” in my head. After all, entropy isn’t an observable in the environment, it’s something you derive to better model the environment. But I agree that “the concept may not be one you understand” seems more central.

It's actually worse than what you say -- the first two datasets studied here have a privileged basis 45 degrees off from the standard one, which is why the SAEs seem to keep learning the same 45-degree-off features. Unpacking this sentence a bit: it turns out that both datasets have principal components 45 degrees off from the basis the authors present as natural, and since SAEs are in a sense trying to capture the principal directions of variation in the activation space, they will also naturally use features 45 degrees off from the "natural" basis.

Consider the first example -- by construction, since x_1 and x_2 are perfectly anticorrelated, as are y_1 and y_2, the data is two-dimensional and can be represented as x = x_1 - x_2 and y = y_1 - y_2. Indeed, this is exactly what their diagram is assuming. But here, x and y have the same absolute magnitude by construction, and so the dataset lies entirely on the diagonals of the unit square, and the principal components are obviously the diagonals.

Now, why does the SAE want to learn the principal components? Because it allows the SAE to have smaller activations on average for a given weight norm.

Consider the representation that is axis-aligned, in that the SAE neurons are x_1, x_2, y_1, y_2 -- since there's weight decay, the encoding and decoding weights want to be of the same magnitude. Let's suppose that the encoding and decoding weights are of size s. Now, if the features are axis-aligned, the total size of the activations will be 2A/s^2. But if you instead use neurons aligned with x_1 + y_1, x_1 + y_2, x_2 + y_1, x_2 + y_2, the activations only need to be of size √2 A/s^2. This means that a non-axis-aligned representation will have lower loss. Indeed, something like this story is why we expect the L1 penalty to recover "true features" in the first place.
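Here's a toy numerical check of that argument (my own sketch, not from the original post). All decoder directions are unit norm, so the overall scaling with s is suppressed; the 2 : √2 ratio is the point.

```python
import numpy as np

A = 1.0
# The four data points (±A, ±A) in the reduced coordinates x = x_1 - x_2, y = y_1 - y_2.
data = np.array([[A, A], [A, -A], [-A, A], [-A, -A]])

# Axis-aligned dictionary: features for +x, -x, +y, -y.
D_axis = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], dtype=float)
# Diagonal dictionary: features for x_1+y_1, x_1+y_2, x_2+y_1, x_2+y_2.
D_diag = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float) / np.sqrt(2)

for name, D in [("axis-aligned", D_axis), ("diagonal", D_diag)]:
    acts = np.maximum(data @ D.T, 0)   # ReLU codes; exact reconstruction in both cases
    recon = acts @ D
    mean_l1 = np.abs(acts).sum(axis=1).mean()
    max_err = np.abs(recon - data).max()
    print(f"{name:12s}: mean L1 of activations = {mean_l1:.3f}, max recon error = {max_err:.1e}")
# axis-aligned needs mean L1 = 2A = 2.0; diagonal needs sqrt(2)*A ≈ 1.414.
```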

The story for the second dataset is pretty similar to the first -- when the data is uniformly distributed over a unit square, the principal directions are the diagonals of the square, not the standard basis.

LawrenceC6dΩ220

My speculation for Omni-Grok in particular is that in settings like MNIST you already have two of the ingredients for grokking (that there are both memorising and generalising solutions, and that the generalising solution is more efficient), and then having large parameter norms at initialisation provides the third ingredient (generalising solutions are learned more slowly), for some reason I still don't know.

Higher weight norm means lower effective learning rate with Adam, no? In that paper they used a constant learning rate across weight norms, but Adam tries to normalize the gradients to be of size 1 per parameter, regardless of the size of the weights. So the weights change more slowly with larger initializations (especially since they constrain the weights to be of fixed norm by projecting after the Adam step).
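A minimal PyTorch sketch of that point (a toy example of mine, not from the paper): Adam's per-parameter step is roughly the learning rate regardless of the weight or gradient scale, so the relative change in the weights shrinks as the initialization norm grows. The 1000-dim weight vector, linear loss, and fixed gradient direction are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
g = torch.randn(1000)  # fixed gradient direction, independent of the weight scale

def relative_adam_update(init_scale: float, lr: float = 1e-3):
    """Take one Adam step and return the absolute and relative change in the weights."""
    w = torch.nn.Parameter(init_scale * torch.randn(1000))
    opt = torch.optim.Adam([w], lr=lr)
    w0 = w.detach().clone()
    loss = (w * g).sum()      # gradient of this loss is exactly g
    loss.backward()
    opt.step()
    delta = (w.detach() - w0).norm()
    return delta.item(), (delta / w0.norm()).item()

for scale in [0.1, 1.0, 10.0]:
    abs_step, rel_step = relative_adam_update(scale)
    print(f"init scale {scale:5.1f}: |Δw| ≈ {abs_step:.4f}, |Δw|/|w| ≈ {rel_step:.2e}")
# |Δw| is ≈ lr·sqrt(1000) for every scale; the relative change shrinks as the init norm grows.
```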

Yeah, "strongest" doesn't mean "strong" here! 
