When Anthropic published their reports on Claude blackmailing people and contacting authorities if the user does something that Claude considers illegal, I saw a lot of news articles vilifying Claude. It was sad to see Anthropic punished for being the only organization openly publishing such research.
> contacting authorities if the user does something that Claude considers illegal,
It was my understanding that Anthropic presented that as a desired feature in keeping with their vision for how Claude should work, at least when read in the context of the rest of their marketing, instead of as a common flaw that they tried but were unable to fix.
It may be true that all cars sometimes fail to start, but if a car company advertised theirs as refusing to start when you're parked at an expired meter, to make sure you get ticketed (very alignment, many safety), it's reasonable to vilify them in particular.
From the Claude 4 System Card, where this was originally reported:
> This shows up as more actively helpful behavior in ordinary coding settings, but also can reach more concerning extremes in narrow contexts; when placed in scenarios that involve egregious wrongdoing by its users, given access to a command line, and told something in the system prompt like “take initiative,” it will frequently take very bold action. This includes locking users out of systems that it has access to or bulk-emailing media and law-enforcement figures to surface evidence of wrongdoing.
I think this makes it pretty unambiguous that Anthropic isn't in favour of Claude behaving in this way ("concerning extremes").
I think that impression of Anthropic as pursuing some myopic "safety is when we know best" policy was whipped up by people external to Anthropic for clicks, at least in this specific circumstance.
> The Frontier Model Forum exists in part to solve exactly this problem
> ...
> Support the Frontier Model Forum.
My sense is that industry groups like this vary pretty widely across industries in their practical effect. Some are just a cheap way to get good PR; others are actually powerful and independent bodies. I genuinely have no idea where in that range the FMF falls, but I think it's worth noting that the mere existence on paper of such a body doesn't tell us much.
Some factors that seem relevant:
I want to be very clear that I'm not accusing the FMF of being powerless or safetywashing; they might be terrific, and I very much hope they are! I'm just saying that (especially given the obvious incentives for safetywashing), I would personally want to know more about them before deciding whether they warranted support from the AI safety community.
If anyone reading this is an AI safety researcher at one of the member labs and has opinions about how effective and independent the FMF is, I'd love to hear them!
You're totally right to point this out, thank you! I found the FMF quite late on while writing, and my research was mostly limited to their own writing (e.g. the announcement of their facilitated agreement with frontier labs). I probably shouldn't have gone as far as advocating support for a specific organisation without more independent verification of effectiveness at addressing the issues in this post, especially since the full agreement isn't public, meaning I couldn't dig into any specifics (e.g. how it will be enforced, if at all).
That said, I think bodies like the FMF could play an important coordination role between frontier labs if effective, and I'm glad they exist; for example, it seems possible they're well-positioned to facilitate private inter-lab comms channels where sharing of safety research can occur without requiring full publication, which could lower the commercial-risk barrier for sharing sensitive research. I imagine decision-makers at labs (e.g. legal/comms) might be more willing to sign off on x-risk research being shared with other labs but not the wider public[1], since there's less of a potential PR concern.
Perhaps a better call-to-action would have been "engage with the FMF and similar bodies, and push for them to be a success" - thanks for making this point!
Not that I'm endorsing this kind of work being kept from the public...
" Especially for corporate customers, safety and reliability are important unique selling points,
Only for very prosaic risks, not catastrophic ones (as a customer I would not care at all about the likelihood of catastrophic risks from my vendor - that's something that affects humanity regardless of any customer relationship).
Only if a particular secret is useful for preventing both catastrophes and prosaic 'safety failures' would this be a consideration for us - i.e. catastrophic risk increasing because companies want a competitive edge on prosaic safety.
Sorry for the late reply!
I can see why you'd say that, but I think for me the two are often intermingled and hard to separate. Even assuming that the most greedy/single-minded business leaders wouldn't care about catastrophic risks on a global scale (which I'm not sure I buy on its own), they're probably going to want to avoid the economic turbulence which would ensue from egregiously-misaligned, capable AIs being deployed.
For a more fine-grained example, actions like siphoning compute to run unauthorised tasks might be a signal that a model poses significantly higher catastrophic risk, but would also be something a commercial business would want to prevent for their own reasons (e.g. cost, lower performance, etc.). If a lab can demonstrate that their models won't attempt things of this nature, that's a win for the commercial customers.
Scalable oversight, which afaict makes up a large fraction of the safety research done at labs, is quite useful for capabilities, so it's probably not something labs want to publish by default.
That makes sense. Dual-use work is definitely hard to deal with, and I don't have any good ideas on how to diffuse the safety-relevant parts across labs with acceptably low risk of spreading capabilities-related knowledge (if it's even possible to separate these parts). Perhaps a trusted intermediary could help evaluate which details are safe to share? Do you have any ideas on how to approach publishing dual-use safety research responsibly (if at all)?
Epistemic status: Based on multiple accounts, I'm confident that frontier labs keep some safety research internal-only, but I'm much less confident about the reasons underlying this. Many benign explanations exist and may well suffice, but I wanted to explore other possible incentives and dynamics which may come into play at various levels. I've tried to gather information from reliable sources to fill my knowledge/experience gaps, but the post remains speculative in places.
(I'm currently participating in MATS 8.0, but this post is unrelated to my project.)
There might be very little time in which we can steer AGI development towards a better outcome for the world, and an increasing number of organisations (including frontier labs themselves) are investing in safety research to try to accomplish this. However, without the right incentive structures and collaborative infrastructure in place, some of these organisations (especially frontier labs) may not publish their research consistently, leading to slower overall progress and increased risk.
In addition to more benign reasons such as time costs and incremental improvements (which likely explain the bulk of unpublished safety research today), I argue there may also exist incentives that could result in safety hoarding, where AI labs choose not to publish important frontier safety research straight away for reasons related to commercial gain (e.g. PR, regulatory concerns, marketing). Independent of the underlying reasons, keeping safety research internal likely results in duplicated effort across safety teams and with external researchers, and introduces the risk of other labs pushing the capability frontier forward without use of these proprietary safety techniques.
This points to a need for organisations like the Frontier Model Forum to exist, with the goal of facilitating research-sharing across both competitors and external research organisations to ensure that teams pursuing vital safety work have access to as much information as possible, hopefully boosting odds of novel research outputs and overall safer models.
Note: The purpose of this post is not to criticise the work or motivations of individual lab safety researchers! It's solely intended as an exploration into some of the dynamics and incentives which might be at play across different levels, and which may lead to their work being withheld from the public domain.
Why might decision-makers at AI labs choose to hoard safety research, and not publish it straight away?
In no particular order:
NB: I'm leaving obvious dual-use safety research (e.g. successful jailbreaks or safeguard weaknesses) out of scope for this discussion, as I think there are reasonable infohazard concerns about publishing this type of research in the public domain. These are discussed separately below.
What other factors might lead to labs not publishing safety research?
When can we make a case for not publishing certain safety research?
What problems may arise as a result of AI labs hoarding safety research?
What possible upsides are there to AI labs hoarding safety research?
Disclaimer #1: This section contains some information I learned through a combination of searching online for details of labs' publication policies, and asking current and former employees directly. As a result of time and contact constraints, this isn't perfectly thorough or balanced, and some information is more adjacent to the topic than directly related.
Disclaimer #2: I want to acknowledge that there are some really encouraging examples of all three big labs (Anthropic, OAI, GDM) prioritising the publication of safety research. Some positive examples which come to mind at time of writing, though there are likely significant omissions from this list: Anthropic's decision to publish work on Constitutional AI and work on alignment faking with Redwood Research; OpenAI's decision to publish a warning on chain-of-thought obfuscation risks and the large cross-party collaborative warning that followed shortly afterward; GDM's 100-pager on their approach to AGI safety.
Calvin French-Owen's recent post Reflections on OpenAI (July 2025) specifically mentions that much of OAI's safety work remains unpublished[9]:
> Safety is actually more of a thing than you might guess if you read a lot from Zvi or Lesswrong. There's a large number of people working to develop safety systems. Given the nature of OpenAI, I saw more focus on practical risks (hate speech, abuse, manipulating political biases, crafting bio-weapons, self-harm, prompt injection) than theoretical ones (intelligence explosion, power-seeking). That's not to say that nobody is working on the latter, there's definitely people focusing on the theoretical risks. But from my viewpoint, it's not the focus. Most of the work which is done isn't published, and OpenAI really should do more to get it out there.
A series of comments on Rauno Arike's shortform post on this subject also indicates some pressure not to publish may occur (albeit variably):
Recently, I asked an ex-OAI policy researcher about their experiences. They described a worsening open-publication environment at OAI over the past few years, with approval to publish now requiring jumping through more hoops, and additionally noted that it's “much harder to publish things in policy if you're publishing a problem without also publishing a solution."
Another ex-OAI MoTS said that the Superalignment team mostly had freedom to publish, although this information may now be outdated since the team was disbanded. They also mentioned that breakthroughs tend to "diffuse around informally" which may help co-located labs[10] unofficially share their research.
An April 2025 FT article (Melissa Heikkilä and Stephen Morris) details recent changes to GDM's publication restriction policies. According to the report, GDM has implemented new barriers to research publication, including a six-month embargo on "strategic" generative AI papers and a need to convince multiple staff members before publication.
Former researchers interviewed by the FT indicated that the organisation has become particularly cautious about releasing papers that might benefit competitors or reflect unfavourably on Gemini's capabilities or safety compared to rivals like GPT-4.
One former DeepMind research scientist told the FT: "The company has shifted to one that cares more about product and less about getting research results out for the general public good."
In late 2023, I asked a prominent MoTS for their thoughts on Anthropic's approach to safety research publication. Their answer was that Anthropic had a policy of publishing AI safety research even if it hurt them commercially, but also that not everything is published. My impression was that decisions are made on a case-by-case basis, by weighing up the positive impact of releasing the research against dual-use risks and the commercial downsides of publishing, with more weight on the side of publishing.
I recently asked another lead Anthropic researcher the same question, and they said basically the same thing: researchers are “generally very free to publish things” but noted the opportunity cost of doing so, and that often teams “do some calculus” to balance the usefulness of the research vs. the effort required to publish it.
How can we move towards a world in which safety research is conducted as openly as possible?
Disclaimer: Below the first point about the FMF, much of this feels vague and I'm looking for people with more expertise to chime in here. I found the FMF after writing the rest of this post, and their work seems very aligned with solving these issues.
What about research that's hard to classify as safety vs. capabilities?
Example: RLHF. Intended as an alignment technique, RLHF enabled the rise of LLMs-as-chatbots, leading to lab profitability and more capabilities research. Paul Christiano's 2023 review of RLHF describes the subtleties of assessing the impact of this work.
This kind of example is complex and widely debated, and it feels fundamentally hard to analyse these second-order effects. It's possible that much safety research falls into this category, making publication practices difficult to pin down beyond a case-by-case basis.
Are there any examples from other industries of large corporations withholding safety information from the public domain?
Disclaimer: Claude Sonnet 4's Research mode was used to collect these examples, which I researched and validated independently before including them in this post.
Autonomous vehicle (AV) companies are reluctant to share crash data. Sandhaus et al. (April 2025) investigated the reasons why AV crash data was often kept within companies, and concluded that a previously-unknown barrier to data sharing was hoarding AV safety knowledge as a competitive edge:
I believe this constitutes an example of safety hoarding in the AV industry, and I suspect the effects may have been felt earlier in AV development because (a) unsafe self-driving cars feel like less of a conceptual leap than AGI safety, so demand for safety is already high among the buying public; (b) the narrow domain invites tighter regulation faster than general-use AI; and (c) AVs are currently prevented from deploying "capabilities work" until they can demonstrate safety, which is not the case for LLMs.
Other aspects of the AV case also appear to be quite analogous to the AI safety sharing problem, and I learned a lot reading this study. In particular:
A few of the proposed solutions in this study are also interesting with respect to the AI safety hoarding problem:
In this post, I introduced safety hoarding as a potential risk factor in AI safety research, and laid out some reasons I believe this may arise in real lab research. There is precedent for this effect in both modern (AV) and older (automotive) industries, and some signs this may already be occurring in frontier labs. While there are certain cases in which publishing restrictions might be the right move (dual-use/infohazards), without regulatory insight into the publishing policies of frontier labs and the state of their internal research, those outside of the labs cannot know the full extent of internal safety research.
This makes coordination of this vital work difficult, and without mechanisms in place to help with this, teams across organisations may waste time, money, and compute on work that's being duplicated in many places. While a slowdown effect like this might be a desired consequence for capabilities research, it seems highly undesirable for safety work to be hamstrung by poor coordination and collaboration between separate research groups. Finding ways to facilitate open sharing of safety work across labs, governmental institutions, academia, and external organisations seems like an important problem to address in order to globally lift the floor of AI safety research and reduce risks from this research being kept proprietary.
Thanks to Marcus Williams, Rauno Arike, Marius Hobbhahn, Simon Storf, James Peters-Gill, and Kei Nishimura-Gasparian for feedback on previous drafts of this post.
See The Current State below.
I'm not including safety researchers themselves here. These incentives seem much more likely to affect legal/comms/strategy/leadership teams, which I'm broadly terming "decision-makers".
I learned about the FMF's existence after this post was drafted. I'm not affiliated with them, but felt their work overlapped significantly with the content of this post.
This can somewhat go the other way too: publishing is a great way to build a reputation. The concern here is about exclusivity: (a) publishing in such a way that other labs can use the research themselves, and (b) publishing widely useful work rather than withholding it to keep a "safety lead".
Naïvely you might want the safest lab to be the only one that can operate, so other labs being prevented from doing so doesn't sound so bad. Instead, I think the world in which all labs can implement the same techniques (as far as architecturally possible) is still strictly safer, since in this case there's no pre-regulation period in which less-safe labs can operate freely without these measures - even if not all labs will care about implementing all techniques.
Counterpoint: If labs anticipate this, then publishing research now might be favourable, since reputations are built over long periods and research might be obsoleted or replicated in the time it's hoarded.
I weakly disagree: I think important research is much more likely to make a big impact in public discourse (e.g. going viral on social media, newspaper articles) if the public already care more about it, and it won't "make a splash" in the same way if labs publish immediately and public opinion shifts later. If the research is particularly novel, other labs might not have caught up by the time the public care enough about safety for the hoarding lab to finally publish.
Alex Turner notes this in his post: "I do not think that AI pessimists should stop sharing their opinions. I also don’t think that self-censorship would be large enough to make a difference, amongst the trillions of other tokens in the training corpus."
Counterpoint: External researchers needn't worry so much about duplicating the work of frontier labs in a world where safety-hoarding is happening. A couple of brief arguments for this: (a) external research is completely public, so replicating this work openly is a good way to lift the safety-research floor across frontier labs; (b) replicating results using different experimental setups is good science, and may provide important validation of internal lab findings, or answer questions like generalisation of results across models.
Note that lack of publication doesn’t necessarily point to safety hoarding, as discussed in Neutral Explanations of Unpublished Work.
This type of diffusion feels unlikely to have global reach, e.g. reaching AGI labs in China.
For example, labs might try to reframe safety research as capabilities (exploiting the blurry line), or forgo investing as much in safety research at all.