When Anthropic published their reports on Claude blackmailing people and contacting authorities if the user does something that Claude considers illegal, I saw a lot of news articles vilifying Claude. It was sad to see Anthropic punished for being the only organization openly publishing such research.
> contacting authorities if the user does something that Claude considers illegal,
It was my understanding that Anthropic presented that as a desired feature in keeping with their vision for how Claude should work, at least when read in the context of the rest of their marketing, instead of as a common flaw that they tried but were unable to fix.
It may be true that all cars sometimes fail to start, but if a car company advertised theirs as refusing to start when you're parked at an expired meter, to make sure you get ticketed (very alignment, many safety), it's reasonable to vilify them in particular.
From the Claude 4 System Card, where this was originally reported:
> This shows up as more actively helpful behavior in ordinary coding settings, but also can reach more concerning extremes in narrow contexts; when placed in scenarios that involve egregious wrongdoing by its users, given access to a command line, and told something in the system prompt like “take initiative,” it will frequently take very bold action. This includes locking users out of systems that it has access to or bulk-emailing media and law-enforcement figures to surface evidence of wrongdoing.
I think this makes it pretty unambiguous that Anthropic isn't in favour of Claude behaving in this way ("concerning extremes").
I think that impression of Anthropic as pursuing some myopic "safety is when we know best" policy was whipped up by people external to Anthropic for clicks, at least in this specific circumstance.
> The Frontier Model Forum exists in part to solve exactly this problem
> ...
> Support the Frontier Model Forum.
My sense is that industry groups like this vary pretty widely across industries in their practical effect. Some are just a cheap way to get good PR; others are actually powerful and independent bodies. I genuinely have no idea where in that range the FMF falls, but I think it's worth noting that the mere existence on paper of such a body doesn't tell us much.
Some factors that seem relevant:
I want to be very clear that I'm not accusing the FMF of being powerless or safetywashing; they might be terrific, and I very much hope they are! I'm just saying that (especially given the obvious incentives for safetywashing), I would personally want to know more about them before deciding whether they warranted support from the AI safety community.
If anyone reading this is an AI safety researcher at one of the member labs and has opinions about how effective and independent the FMF is, I'd love to hear them!
You're totally right to point this out, thank you! I found the FMF quite late on while writing, and my research was mostly limited to their own writing (e.g. the announcement of their facilitated agreement with frontier labs). I probably shouldn't have gone as far as advocating support for a specific organisation without more independent verification of effectiveness at addressing the issues in this post, especially since the full agreement isn't public, meaning I couldn't dig into any specifics (e.g. how it will be enforced, if at all).
That said, I think bodies like the FMF could play an important coordination role between frontier labs if effective, and I'm glad they exist; for example, it seems possible they're well-positioned to facilitate private inter-lab comms channels where sharing of safety research can occur without requiring full publication, which could lower the commercial-risk barrier for sharing sensitive research. I imagine decision-makers at labs (e.g. legal/comms) might be more willing to sign off on x-risk research being shared with other labs but not the wider public[1], since there's less of a potential PR concern.
Perhaps a better call-to-action would have been "engage with the FMF and similar bodies, and push for them to be a success" - thanks for making this point!
Not that I'm endorsing this kind of work being kept from the public...
" Especially for corporate customers, safety and reliability are important unique selling points,
Only for very prosaic risks, not catastrophic ones (as a customer I would not care at all about the likelihood of catastrophic risks from my vendor - that's something that affects humanity regardless of any customer relationship).
Only if a particular secret is useful for preventing both catastrophes and prosaic 'safety failures' would this be a consideration for us - i.e. catastrophic risk increasing because companies want a competitive edge on prosaic safety.
Sorry for the late reply!
I can see why you'd say that, but I think for me the two are often intermingled and hard to separate. Even assuming that the most greedy/single-minded business leaders wouldn't care about catastrophic risks on a global scale (which I'm not sure I buy on its own), they're probably going to want to avoid the economic turbulence which would ensue from egregiously-misaligned, capable AIs being deployed.
For a more fine-grained example, actions like siphoning compute to run unauthorised tasks might be a signal that a model poses significantly higher catastrophic risk, but would also be something a commercial business would want to prevent for their own reasons (e.g. cost, lower performance, etc.). If a lab can demonstrate that their models won't attempt things of this nature, that's a win for the commercial customers.
Scalable oversight, which afaict makes up a large fraction of the safety research done at labs, is quite useful for capabilities, so it's probably not something labs want to publish by default.
That makes sense. Dual-use work is definitely hard to deal with, and I don't have any good ideas on how to diffuse the safety-relevant parts across labs with acceptably low risk of spreading capabilities-related knowledge (if it's even possible to separate these parts). Perhaps a trusted intermediary could help evaluate which details are safe to share? Do you have any ideas on how to approach publishing dual-use safety research responsibly (if at all)?
Epistemic status: Based on multiple accounts, I'm confident that frontier labs keep some safety research internal-only, but I'm much less confident about the reasons underlying this. Many benign explanations exist and may well suffice, but I wanted to explore other possible incentives and dynamics which may come into play at various levels. I've tried to gather information from reliable sources to fill my knowledge/experience gaps, but the post remains speculative in places.
(I'm currently participating in MATS 8.0, but this post is unrelated to my project.)
There might be very little time in which we can steer AGI development towards a better outcome for the world, and an increasing number of organisations (including frontier labs themselves) are investing in safety research to try to accomplish this. However, without the right incentive structures and collaborative infrastructure in place, some of these organisations (especially frontier labs) may not publish their research consistently, leading to slower overall progress and increased risk.
In addition to more benign reasons such as time costs and incremental improvements (which likely explain the bulk of unpublished safety research today), I argue there may also exist incentives that could result in safety hoarding, where AI labs choose not to publish important frontier safety research straight away for reasons related to commercial gain (e.g. PR, regulatory concerns, marketing). Independent of the underlying reasons, keeping safety research internal likely results in duplicated effort across safety teams and with external researchers, and introduces the risk of other labs pushing the capability frontier forward without use of these proprietary safety techniques.
This points to a need for organisations like the Frontier Model Forum to exist, with the goal of facilitating research-sharing across both competitors and external research organisations to ensure that teams pursuing vital safety work have access to as much information as possible, hopefully boosting odds of novel research outputs and overall safer models.
Note: The purpose of this post is not to criticise the work or motivations of individual lab safety researchers! It's solely intended as an exploration into some of the dynamics and incentives which might be at play across different levels, and which may lead to their work being withheld from the public domain.
Why might decision-makers at AI labs choose to hoard safety research, and not publish it straight away?
In no particular order:
NB: I'm leaving obvious dual-use safety research (e.g. successful jailbreaks or safeguard weaknesses) out of scope for this discussion, as I think there are reasonable infohazard concerns about publishing this type of research in the public domain. These are discussed separately below.
What other factors might lead to labs not publishing safety research?
When can we make a case for not publishing certain safety research?
What problems may arise as a result of AI labs hoarding safety research?
What possible upsides are there to AI labs hoarding safety research?
Disclaimer #1: This section contains some information I learned through a combination of searching online for details of labs' publication policies, and asking current and former employees directly. As a result of time and contact constraints, this isn't perfectly thorough or balanced, and some information is more adjacent to the topic than directly related.
Disclaimer #2: I want to acknowledge that there are some really encouraging examples of all three big labs (Anthropic, OAI, GDM) prioritising the publication of safety research. Some positive examples which come to mind at time of writing, though there are likely significant omissions from this list: Anthropic's decision to publish work on Constitutional AI and work on alignment faking with Redwood Research; OpenAI's decision to publish a warning on chain-of-thought obfuscation risks and the large cross-party collaborative warning that followed shortly afterward; GDM's 100-pager on their approach to AGI safety.
Calvin French-Owen's recent post Reflections on OpenAI (July 2025) specifically mentions that much of OAI's safety work remains unpublished[9]:
> Safety is actually more of a thing than you might guess if you read a lot from Zvi or Lesswrong. There's a large number of people working to develop safety systems. Given the nature of OpenAI, I saw more focus on practical risks (hate speech, abuse, manipulating political biases, crafting bio-weapons, self-harm, prompt injection) than theoretical ones (intelligence explosion, power-seeking). That's not to say that nobody is working on the latter, there's definitely people focusing on the theoretical risks. But from my viewpoint, it's not the focus. Most of the work which is done isn't published, and OpenAI really should do more to get it out there.
A series of comments on Rauno Arike's shortform post on this subject also indicates some pressure not to publish may occur (albeit variably):
Recently, I asked an ex-OAI policy researcher about their experiences. They described a worsening open-publication environment at OAI over the past few years, with approval to publish now requiring jumping through more hoops, and additionally noted that it's “much harder to publish things in policy if you're publishing a problem without also publishing a solution."
Another ex-OAI MoTS said that the Superalignment team mostly had freedom to publish, although this information may now be outdated since the team was disbanded. They also mentioned that breakthroughs tend to "diffuse around informally" which may help co-located labs[10] unofficially share their research.
An April 2025 FT article (Melissa Heikkilä and Stephen Morris) details recent changes to GDM's publication restriction policies. According to the report, GDM has implemented new barriers to research publication, including a six-month embargo on "strategic" generative AI papers and a need to convince multiple staff members before publication.
Former researchers interviewed by the FT indicated that the organisation has become particularly cautious about releasing papers that might benefit competitors or reflect unfavourably on Gemini's capabilities or safety compared to rivals like GPT-4.
One former DeepMind research scientist told the FT: "The company has shifted to one that cares more about product and less about getting research results out for the general public good."
In late 2023, I asked a prominent MoTS for their thoughts on Anthropic's approach to safety research publication. Their answer was that Anthropic had a policy of publishing AI safety research even if it hurt them commercially, but also that not everything is published. My impression was that decisions are made on a case-by-case basis, by weighing up the positive impact of releasing the research against dual-use risks and the commercial downsides of publishing, with more weight on the side of publishing.
I recently asked another lead Anthropic researcher the same question, and they said basically the same thing: researchers are “generally very free to publish things” but noted the opportunity cost of doing so, and that often teams “do some calculus” to balance the usefulness of the research vs. the effort required to publish it.
How can we move towards a world in which safety research is conducted as openly as possible?
Disclaimer: Below the first point about the FMF, much of this feels vague and I'm looking for people with more expertise to chime in here. I found the FMF after writing the rest of this post, and their work seems very aligned with solving these issues.
What about research that's hard to classify as safety vs. capabilities?
Example: RLHF. Intended as an alignment technique, RLHF enabled the rise of LLMs-as-chatbots, leading to lab profitability and more capabilities research. Paul Christiano's 2023 review of RLHF describes the subtleties of assessing the impact of this work.
This kind of example is complex and widely debated, and it feels fundamentally hard to analyse these second-order effects. It's possible that much safety research falls into this category, making publication practices difficult to pin down beyond a case-by-case basis.
Are there any examples from other industries of large corporations withholding safety information from the public domain?
Disclaimer: Claude Sonnet 4's Research mode was used to collect these examples, which I researched and validated independently before including them in this post.
Autonomous vehicle (AV) companies are reluctant to share crash data. Sandhaus et al. (April 2025) investigated the reasons why AV crash data was often kept within companies, and concluded that a previously-unknown barrier to data sharing was hoarding AV safety knowledge as a competitive edge:
I believe this constitutes an example of safety hoarding in the AV industry, and I suspect the effects may have been felt earlier in AV development because (a) unsafe self-driving cars feel like less of a conceptual leap than AGI safety, so demand for safety is already high among the buying public; (b) the narrow domain invites tighter regulation faster than general-use AI; and (c) AVs are currently prevented from deploying "capabilities work" until they can demonstrate safety, which is not the case for LLMs.
Other aspects of the AV case also appear to be quite analogous to the AI safety sharing problem, and I learned a lot reading this study. In particular:
A few of the proposed solutions in this study are also interesting with respect to the AI safety hoarding problem:
In this post, I introduced safety hoarding as a potential risk factor in AI safety research, and laid out some reasons I believe this may arise in real lab research. There is precedent for this effect in both modern (AV) and older (automotive) industries, and some signs this may already be occurring in frontier labs. While there are certain cases in which publishing restrictions might be the right move (dual-use/infohazards), without regulatory insight into the publishing policies of frontier labs and the state of their internal research, those outside of the labs cannot know the full extent of internal safety research.
This makes coordination of this vital work difficult, and without mechanisms in place to help with this, teams across organisations may waste time, money, and compute on work that's being duplicated in many places. While a slowdown effect like this might be a desired consequence for capabilities research, it seems highly undesirable for safety work to be hamstrung by poor coordination and collaboration between separate research groups. Finding ways to facilitate open sharing of safety work across labs, governmental institutions, academia, and external organisations seems like an important problem to address in order to globally lift the floor of AI safety research and reduce risks from this research being kept proprietary.
Thanks to Marcus Williams, Rauno Arike, Marius Hobbhahn, Simon Storf, James Peters-Gill, and Kei Nishimura-Gasparian for feedback on previous drafts of this post.
See The Current State below.
I'm not including safety researchers themselves here. These incentives seem much more likely to affect legal/comms/strategy/leadership teams, which I'm broadly terming "decision-makers".
I learned about the FMF's existence after this post was drafted. I'm not affiliated with them, but felt their work overlapped significantly with the content of this post.
This can somewhat go the other way too: publishing is a great way to build a reputation. The concern here is about exclusivity: (a) publishing in such a way that other labs can use the research themselves, and (b) publishing widely useful work rather than withholding it to keep a "safety lead".
Naïvely you might want the safest lab to be the only one that can operate, so other labs being prevented from doing so doesn't sound so bad. Instead, I think the world in which all labs can implement the same techniques (as far as architecturally possible) is still strictly safer, since in this case there's no pre-regulation period in which less-safe labs can operate freely without these measures - even if not all labs will care about implementing all techniques.
Counterpoint: If labs anticipate this, then publishing research now might be favourable, since reputations are built over long periods and research might be obsoleted or replicated in the time it's hoarded.
I weakly disagree: I think important research is much more likely to make a big impact in public discourse (e.g. going viral on social media, newspaper articles) if the public already care more about it, and it won't "make a splash" in the same way if labs publish immediately and public opinion shifts later. If the research is particularly novel, other labs might not have caught up by the time the public care enough about safety for the hoarding lab to finally publish.
Alex Turner notes this in his post: "I do not think that AI pessimists should stop sharing their opinions. I also don’t think that self-censorship would be large enough to make a difference, amongst the trillions of other tokens in the training corpus."
Counterpoint: External researchers needn't worry so much about duplicating the work of frontier labs in a world where safety-hoarding is happening. A couple of brief arguments for this: (a) external research is completely public, so replicating this work openly is a good way to lift the safety-research floor across frontier labs; (b) replicating results using different experimental setups is good science, and may provide important validation of internal lab findings, or answer questions like generalisation of results across models.
Note that lack of publication doesn’t necessarily point to safety hoarding, as discussed in Neutral Explanations of Unpublished Work.
This type of diffusion feels unlikely to have global reach, e.g. reaching AGI labs in China.
For example, labs might try to reframe safety research as capabilities (exploiting the blurry line), or forgo investing as much in safety research at all.