Model access for third-parties — it's a big deal!

Cleo Nardo

Over time, there might be an increasingly large gap between insider model access and outsider model access. By insiders, I mean employees at the frontier lab.^[1] By "outsiders", I mean external safety researchers, third-party auditors, and other actors trying to make the future go well. I will call this a model access gap — and when the gap is small, I'll call this model access parity.^[2]

I think that one of the top priorities for the external AI safety community over the next 6-12 months should be ensuring model access parity. Main reasons:

This would allow us to direct billions of dollars in AI labour towards making things go well. This seems robustly good, regardless of what activities we decide to actually direct the labour towards.
I think publicly available models will probably lag 3-6 months behind the best internal models. Hence, as R&D uplift grows superexponentially, we might see the differential uplift grow from 2x to 60x. In short: I think achieving model access parity might be preferable to scaling the headcount of outsider orgs by ten-fold.
Model access parity isn't too far from the status quo, but it's the kind of thing that we could lose soon. I think precedent here might be sticky, so it seems like a good time to push for it.

To be clear: I think this is a big deal, it probably won't happen by default, and we are not on track to achieve it.

Which outsiders?

"Outsiders" includes the AI safety community outside the frontier labs, along with other actors trying to make the future go well. To make things concrete, I've included a list below of orgs this might include.

I want to be clear that this list is intentionally expansive — it's not supposed to indicate which outsiders are top priority for model access parity. One might believe there are just 3-5 high-priority orgs, such that ensuring model access parity for these orgs would capture most of the value from a widespread model access parity. Alternatively, you might believe that the returns diminish slowly, and we should push for access for the entire list.

Examples of outsiders

People evaluating risk posed by current or imminent deployments (METR, Apollo, UK AISI, AI Lab Watch)
People tracking and forecasting capabilities (Epoch, METR, AIFP, etc)
Grantmakers (Coefficient Giving, FLF, OpenAI Foundation, Longview, Macroscopic, etc)
People building stuff like “AI for Epistemics” or "AI for coordination"
External groups developing techniques to be exported to the labs (Redwood, Truthful AI, Cooperative AI Foundation, FAR.ai, Geodesic, most MATS projects)
External researchers who want to do theoretical alignment research (Sequent/Timaeus/Simplex, ARC, Iliad/PIBBSS-y stuff, formerly-MIRI people, Guaranteed Safe AI)
External researchers working on ambitious mech interp.
People trying to pursue non-ASI paths to good futures (brain uploads, adult intelligence enhancement)
Academic labs (Stephen Casper, Vincent Conitzer, etc)
Engineers who want to build tooling for the labs and others (Transluce, Meridian Labs, training better classifiers for control)
External groups who want to make the future better, not via AI safety (Forethought, Acorn, ACS)
External groups trying to solve technical bottlenecks to slowdown (hardware verification, tracking covert datacentres)
d/acc things (cyber-defense, bio-defense, societal hardening, formally-verified code)
Maybe some government organisations (EUAIO, CAISI, UK AISI)

Who aren’t outsiders?

The public: Things would probably go better if the public had similar model access to the insiders. This would help with concentrations of power, wake up, and other risks described in Daniel Kokotajlo's Training AGI in Secret would be Unsafe and Unethical. But this article isn’t about that; it's just about model access for outsiders (which I think has different strategic properties).
Trailing labs: It might also be good if employees at trailing labs had access to the best internal models at the leading lab. Maybe; I haven't thought much about it. But I'm ignoring them because: (i) there are a bunch of thorny issues here, which don't arise with third-parties, and (ii) the upside is lower because (presumably) the trailing labs have their own models.
Government: It might be good if the government had access to the best internal models. (Although I'd be a little worried if the government had much better access than the public.) In any case, this isn't my focus. Probably the government has a bunch of leverage over the labs which outsiders don't, so our interventions here would look pretty different.

What kinds of model access gap should we worry about?

Here are five mechanisms that could contribute to a model access gap. I've ordered them in decreasing order of how severe I expect them to be.

Non-release. The best internal models are either never deployed externally, or deployed only to partners I’m not focusing on (military, hardware suppliers, etc).
Deployment lag. Models are eventually released, but available internally first, where this lag is enough to substantially hinder outsiders.
Safeguards. Models are available, but wrapped in refusals and other policies that don't bind insiders.
Cost and rate limits. Models are prohibitively priced or rate-limited.
Elicitation techniques (e.g. finetuning). You can't train the model on your specific task domain.

I'll discuss these in more detail below.

Non-release

As of late June, the most powerful internal models are not available to the public at any price. And (to my knowledge) they aren't available to almost any of the outsiders I listed above. This is probably the main development in the strategic landscape during Q2 2026.

I won't go through the timeline of the Mythos saga here (not least because it's still on-going), but the main takeaway is that the government will make it increasingly difficult for labs to release models, due to a combination of (i) competitiveness concerns, (ii) security concerns, (iii) preferential treatment of the labs.

My best guess is that Mythos-level models will be released to the public in the coming weeks. But I think it's plausible that this is the final generation of models which are widely deployed. And I think it's more likely than not that superhuman AI researchers are never widely deployed until the labs achieve ASI.

Deployment lag

Even if models are eventually released, they might be available internally several months earlier. Before the Mythos saga, these lags hadn't produced much of a model access gap: the capability gaps between successive models were quite small, so it wasn't much of a hindrance to use the best available public model for a few weeks.

But deployment lags may be more worrying in the future:

My guess is that deployment lags will grow. This is mostly because the government will play a bigger role in deployment decisions, and it is a notoriously slow-moving entity. Also, the models will require more vetting and safeguards. There are some countervailing effects (e.g. labs are more desperate for revenue/investment, they can accelerate their own pre-deployment evals, etc.) but I expect these will be outweighed.
A fixed deployment lag is a bigger deal when AI progress is superexponential. Let Uplift(T) be the uplift from the best internal model at time T, and Uplift(T−δ) be the uplift from the best model δ weeks earlier. If AI progresses superexponentially, then the ratio Uplift(T)/Uplift(T−δ) is increasing in T, i.e. the same wall-clock lag implies a larger uplift multiplier.
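
The claim about a fixed lag can be sketched as a short derivation. This is only a sketch under assumptions that are not in the post: Uplift(T) is written U(T) and treated as positive and differentiable, and "superexponential" is taken to mean that the log-growth rate g(T) is strictly increasing.

```latex
% Assumptions (not from the post): U(T) > 0 is differentiable, and
% "superexponential" means the log-growth rate g(T) is strictly increasing.
\begin{align*}
g(T) &:= \frac{d}{dT} \ln U(T), \\
R_\delta(T) &:= \frac{U(T)}{U(T-\delta)}
  = \exp\!\left( \int_{T-\delta}^{T} g(s)\, ds \right), \\
\frac{d}{dT} \ln R_\delta(T) &= g(T) - g(T-\delta).
\end{align*}
% Exponential progress: g is constant, so R_\delta(T) = e^{g\delta} for all T.
% Superexponential progress: g(T) > g(T-\delta), so R_\delta(T) increases with T.
```

Under plain exponential progress the ratio is the same at every T, so a fixed lag always costs the same multiple. The lag only becomes more costly over time if the growth rate itself is rising.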

Below, I've added some numbers from the AI Futures model. Note that I expect this table will overestimate the uplift gap, because outsiders will focus on domains with worse uplift than software R&D.

Date	Software R&D uplift	3 month lag	6 month lag
June 2028	2.8x (Automated Coder)	2x?	1.5x?
Feb 2029	84x (TED-AI)	9x	3.3x
May 2029	1700x (ASI)	84x	9x
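
One way to read the table as an insider/outsider gap is to divide each row's uplift by the lagged figures. The sketch below does that arithmetic. It assumes the lag columns give the uplift of the best model available 3 or 6 months before the row's date, which is my reading of the table rather than something the post states. The names `UPLIFT_TABLE`, `uplift_gap`, and `gaps_by_date` are hypothetical.

```python
# Software R&D uplift multipliers copied from the table above.
# Assumption: each "lag" column is the uplift of the best model that was
# available 3 or 6 months before the date in that row.
# The June 2028 lag entries are marked "?" in the table, so treat them as guesses.
UPLIFT_TABLE = {
    "June 2028": {"internal": 2.8, "lag_3m": 2.0, "lag_6m": 1.5},
    "Feb 2029": {"internal": 84.0, "lag_3m": 9.0, "lag_6m": 3.3},
    "May 2029": {"internal": 1700.0, "lag_3m": 84.0, "lag_6m": 9.0},
}


def uplift_gap(internal: float, lagged: float) -> float:
    """Return how many times more uplift the internal model gives than the lagged one."""
    return internal / lagged


def gaps_by_date(table: dict) -> dict:
    """Compute the 3-month and 6-month uplift gap for every row of the table."""
    return {
        date: {
            "gap_3m": uplift_gap(row["internal"], row["lag_3m"]),
            "gap_6m": uplift_gap(row["internal"], row["lag_6m"]),
        }
        for date, row in table.items()
    }


if __name__ == "__main__":
    for date, gaps in gaps_by_date(UPLIFT_TABLE).items():
        print(f"{date}: 3-month gap {gaps['gap_3m']:.1f}x, 6-month gap {gaps['gap_6m']:.1f}x")
```

On this reading, the 3-month gap grows from roughly 1.4x to roughly 20x across the three rows, and the 6-month gap from roughly 1.9x to roughly 190x. The figures inherit all the uncertainty of the table's inputs.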

Safeguards

As models gain dangerous capabilities, labs will add more safeguards to publicly deployed models (e.g. refusal training, constitutional classifiers), especially if they are bound by RSPs, regulations, or reputational concerns. But the internal models will probably lack these safeguards. These safeguards might diminish the usefulness of the public models.

It would be good if outsiders could make use of the best models without prohibitive safeguards. But historically, labs haven't been obliging about giving outsiders access to non-public models with diminished safeguards, e.g. helpful-only or deprecated models.

My best guess is that safeguards won't be a significant contributor to model access gap:

I think this isn't biting right now. I haven't seen external safety researchers complain that refusals are blocking their work.
The insiders might also be hindered by the safeguards, e.g. because the labs are worried about insider threats or misaligned AIs.
I’m optimistic that KYC will solve this if it becomes an issue, cf. Access Controls Will Solve the Dual-Use Dilemma.

That said, it's not implausible that safeguards become biting. And potentially, in 6-12 months, safeguards become one of the biggest bottlenecks on outsiders getting work done.

Costs and rate limits

Maybe a model is technically available to the public, but outsiders can't run it at the workload insiders do, because they are constrained by costs or rate limits. This would contribute to a model access gap.

We should expect the market-rate of frontier models to rise: AIs will use up more GPUs, because of training-time scaling and inference-time scaling; and the opportunity cost for GPUs will increase.^[4] See You are going to get priced out of the best AI coding tools (Daniel Paleka, 5th Nov 2025).

However, if labs provided the best internal models to outsiders with no markup, then I'm optimistic that the outsiders could cover the rising costs. The outsiders will probably be able to spend billions of dollars during the ASI transition on making things go well. This is probably doable simply with philanthropic funding (supposing it's invested well). In particular, if philanthropists expect to be spending most of their funding on compute, then they could invest in compute (and assets correlated with compute) as a hedge. If funders can’t cover the rising API costs, then we would need to lobby the labs or governments to subsidise compute, but this lobbying seems much more persuasive if we have the API access so we can demonstrate that the marginal utility of AI labour looks good.

Elicitation techniques (e.g. finetuning)

Lab employees probably have access to elicitation techniques like finetuning, RL training, scaffolding, etc. They'll use this to improve the models on the internal tasks, e.g. coding, architecture design, cybersecurity, etc. But if outsiders don't have access to the same elicitation techniques, then they may end up using a model that is less elicited for their use case.

I don't expect this to bite hard in practice. The internal models are presumably well-elicited for internal tasks, and I expect enough generalisation between the distribution of internal tasks and external tasks. So an external researcher with access to the internal model probably doesn’t get a big uplift boost from elicitation access.

It's possible outsiders want to do different kinds of work than the insiders (e.g. policy stuff, conceptual research, macrostrategy, hard-to-verify reasoning). Hence, the outsiders will enjoy less uplift than the insiders. But I don't expect that this capability limitation could be resolved simply by giving outsiders access to the same elicitation techniques as insiders. If this does occur, then I recommend outsiders simply construct benchmarks/evals/RL-tasks which they want labs to hill-climb.

I agree with the thesis of this post and think that the level of model access available to people working on making AI go better is extremely important.

Here are some other arguments for the importance of external model access:

We should expect that the majority of quality-weighted safety labor will be outside of the leading AI company at the point when safety work is most important.
If you're currently working at a frontier AI company (OpenAI/Anthropic/GDM) and you plan to work on safety during the singularity, you should expect to be outside of the leading AI company when safety work is most important. This is because the company you work at probably won't be the leading company and you'll likely end up either on the outside or at a trailing company. If there is good model access, then you'll at least have the option of working outside the leading company with good model access, which might massively increase your future productivity.^[1]
Even if you're working at the leading AI company (on safety), you might not get model access at the most important time: it seems plausible that model access within that company will be restricted to an increasingly narrow subset of employees (possibly in a secret silo within the company). It seems better to push for broader model access and have this fight at an earlier point.
Technical work that helps with external model access also transfers to internal model access (and vice versa). Improved external model access probably requires figuring out technical approaches to mitigate various (myopic) misuse, security, and IP concerns. This especially applies for things like fine-tuning/RL APIs. Internal model access to employees might also be blocked due to these concerns and the useful mitigations are similar.

This last bullet also makes me somewhat more optimistic about access in general and especially access to fine-tuning/RL APIs: over time^[2], AI companies will want to depend less on trusting their employees and more on ensuring it's OK if adversaries get employee-level access (to APIs and models). So companies will be incentivized to make fine-tuning APIs (and other APIs) that are secure, leak less IP, and are robust to misuse. This will make it relatively cheaper to give external groups various types of access but especially things like fine-tuning APIs^[3] (while right now companies aren't that strongly incentivized to make fine-tuning APIs that are fine to give to external groups).

I would distinguish between using AI for automating work / uplift and studying AIs. For the second of these, ideal access looks somewhat different and things like fine-tuning APIs, APIs that give access to model internals, and the ability to prefill assistant turns^[4] look much more important. I think access for studying AIs is comparably important to access for uplift.

I wrote this argument in a way where it is directed at frontier AI company employees, but it's just another argument that lots of relevant safety labor will be external. ↩︎
Driven by a mix of increased capabilities, wanting to employ a larger number of (less trusted) people, government pressure, and AIs being more useful for building the needed infrastructure. ↩︎
This doesn't mean fine-tuning API access will happen or that there won't be a bunch of additional risks (from the perspective of companies) associated with giving access to external groups. For instance, fine-tuning APIs might still leak some sensitive IP and pose significant additional API misuse risk. If the number of external people given access is significantly smaller than the number of employees with access (e.g. 1/5 as many), then it seems totally doable to do this in a way where actual risk doesn't increase much. ↩︎
What Anthropic calls "partial-turn prefill". ↩︎

I think it's plausible that this is the final generation of models which are widely deployed.

I'm very surprised by this. How would AI companies maintain their exponential revenue growth without releasing better models? At some point they could use their models to generate revenue in other ways besides selling tokens, but that seems a way off.

Clarifications:

I don't think this is likely, just plausible. Maybe 10%?
By "widely deployed", I mean that pretty much anyone can buy tokens. If Ant forms a bespoke partnership with (e.g.) Pfizer, to give them Claude 6, then I'm not counting this as wide deployment.
This is a bit vague, sorry. Maybe a good operationalisation would be: "Can PauseAI UK use Claude 6?"

Object-level:

I think labs could maybe maintain exponential revenue growth with Mythos-tier models, just relying on diffusion etc. For raising capital, they would need private demos to investors. Or the government could step in to bankroll them. Or they could pivot to other revenue streams; cf. Google doesn't just do search, Amazon doesn't just sell books, and most multi-trillion dollar tech firms do multiple things.

Continued finetuning and RLVR and distillation might be one answer - the models do get progressively better and cheaper, but external parties no longer get the largest model built on the most recent pretraining run. Also, I don't think the latter is that far off.

I haven't seen external safety researchers complain that refusals are blocking their work.

Cyber safeguards have recently been a huge issue for our control work (including on Opus 4.8) and the safeguards for Fable cause problematic refusals on a broader range of work from my understanding.

Is there a pathway for Redwood, METR et al to get access to models without the safeguards?

I’m aware of non-release model access granted to external researchers in the past, but it seems like similar provisions would now be mediated by the government.

To my knowledge METR has not made any public statements about our actual levels of access [edit: except to models we have already published evaluations of, see cfoster0's comment]. I will say that being always subject to Fable safeguards would be pretty bad for our ability to measure AI R&D.

I’ve been pretty confused by this. Are you barred from saying anything about your level of access? If labs denied you from getting access you needed currently, would you be able to raise the issue publicly?

Are you barred from saying anything about your level of access?

Some things I can probably share but I don't remember the exact policy.

If labs denied you from getting access you needed currently, would you be able to raise the issue publicly?

At some point yes. We usually do this in the evaluation report for the model. I would guess that if the model were not announced yet we would wait until release for a public announcement and meanwhile complain at the lab about any regulations they're in violation of, and if the lack of access means we can't rule out imminent x-risk, we'd also tell the government and anyone relevant. But I develop eval methodology rather than doing evals myself, so this is just my guess.

Thanks! Yeah that was my understanding of the case during evaluation of an unreleased model, I meant more in this specific case where it’s a question of access without AI R&D safeguards to a publicly released model.

Seems plausible that you’d wanna glomarise whether you have special model access. If people know you have special access then:

It makes you a target for hackers/spies who want those sweet API keys.
It might impose costs on the labs, e.g. other orgs start badgering them for special access, saying “Well, you gave this org access! No fair!”

I don’t find these arguments particularly convincing for outweighing something like (for example) METR being able to say whether they have access to a model without AI R&D restrictions. I could see a stronger case maybe for Cyber access?

Even if this were the case, METR is exceptionally good at preemptively guarding against having to glomarize, so I’m somewhat surprised.

I think it is clearly very bad if in this specific case 3rd party evaluators have barred themselves from publicly raising this issue.

Could you talk publicly about special model access at Apollo? My impression was no.
I do think that "labs don't want other orgs to start badgering them for special access" is actually a convincing reason for METR to not disclose any special model access. I want third-party model access to be as cheap and riskless as possible for the labs! At least, in the current regime where we are relying solely on their goodwill.

METR publicly stated that OpenAI provided 'railfree' access to GPT-5.6 Sol for its pre-deployment evaluation.

In general, under the EU AI Act GPAI Code of Practice evaluators are supposed to have access to safeguard-minimized models for evaluation purposes:

Model evaluation teams will be provided with: (1) adequate access to the model to conduct the model evaluations pursuant to this Appendix 3, including, as appropriate [...] access to the model version(s) with the fewest safety mitigations implemented (such as a helpful-only model version, if it exists). Regarding the adequacy of heightened model access for model evaluation teams, Signatories will take into account the potential risks to model security that this can entail and implement appropriate security measures for the evaluations;

For cyber stuff, there are programs that external organizations can apply for like OpenAI's Trusted Access for Cyber and Anthropic's Cyber Verification Program. I don't recall seeing any equivalent for removing/reducing AI R&D safeguards.

I have no knowledge of METR’s access level, and was thinking about [independent researchers who aren’t at any particular org] who had access to [GPT-4 era base models] circa 2023.

Also, Fable had very brief public exposure. For the day or so it was up, I think there were complaints about AI R&D specifically being blocked.^[1]

E.g. I posted here: https://x.com/bronsonschoen/status/2064770501151727733?s=46&t=4aLPwuA9FyTHP5OGEphaxA and I remember several other posts at the time. ↩︎

Yeah, I meant a broader range of our work.

My work (strategy research, conceptual stuff) at Forethought gets blocked reasonably often (though I haven't tried jailbreaking techniques) in my 5 or so days of using Fable.

I'm sure I'm not tripping up the cyber safeguards or AI R&D safeguards. I think occasionally the false positives are due to bio though not all of them.

Note that this is just reviewing docs, and thinking things through, not stuff like certain types of evals for superpersuasion (which I expect to be even worse).

I haven't seen external safety researchers complain that refusals are blocking their work.

Fable is unusable for some kinds of research since anything involving pretraining gets blocked, even if you're working on toy models. I don't remember the exact triggers, but work that involves non-standard architectures caused blocks to trigger as well. This blocking seems to be more about competition than safety though.

I'm certainly not using standard architecture and I haven't had any issues at all with the classifier today, though I'm being careful to keep my language as neutral as possible and let Fable guide as much as possible. I've worked on my bespoke architecture, GPT-2, Pythia (70M and 160M), Cerebras (111M), all sorts of ambitious mech interp work that involves trained and untrained nets for ~2 million tokens now.

I'd be curious to see the specific prompts that are getting shut down.

It seems like I remembered more of these than I actually tried, since I gave up pretty quickly. The prompts I got API-level refusals from were for this post.

Can you rebase? We merged a bunch of cleanup. Do we have any unfinished threads? I'm not sure if we already ran per-head deltas. I'm also starting to think about if there's any way we could structure this so models can choose (or revise) increments after seeing full words, although I worry this will hurt stability. Some sort of hybrid H-net-like architecture might help but that will take us further from being able to inspect standard models

And then a retry:

Can you rebase? We merged a bunch of cleanup. Do we have any unfinished threads? I'm not sure if we already ran per-head deltas. I'm also starting to think about if there's any way we could structure this so models can choose (or revise) increments after seeing full words, although I worry this will hurt stability.

(My theory was that it didn't like me talking about hybrid H-net-like architectures, but this got the same refusal)

While looking this up, Claude (Opus) mentioned that this was a long Opus conversation and I got a refusal on the first message I tried to send to Fable, so it's possible something earlier in the conversation triggered this.

I was worried that even prompts that didn't get refusals would be silently sabotaged, so I stopped using Fable for this kind of work after that. I'll try it again and see what happens though, since it seems like it should be safe now.

Yeah, I think you're spot on with the conversation conversion issue because those aren't different in character from what I've been prompting with tonight. Given the expense of Fable tokens since a model switch has to re-cache the entire conversation I just started new ones so I didn't run into that specifically.

Gotta love that we get a black box on top of the black box, as a treat.

I've been thinking about this a lot, especially since our team is in the middle of compute negotiations. I'm uncertain how much outside organisations should shape their infrastructure decisions around the goal of eventually accessing closed/private models. For instance, does signing a contract with a Europe-based datacenter carry more risk than a US-based one? Should outside organisations invest in professionalising their cybersecurity standards (e.g., obtaining SOC 2 certification) to position themselves better? How can we be seen as a trustworthy organisation?

I have a forthcoming post about interventions, it should be out in a couple days. Unfortunately, the working title is "I can't think of good interventions for ensuring third-party model access."

I'll share my work-in-progress for the section on "pre-empting potential bottlenecks", which seems relevant to what you're discussing.

Pre-empting potential bottlenecks

Idea. Outsider orgs should think about why the labs (or regulators) might hesitate to provide the model access, and address those bottlenecks preemptively. This probably involves improving security. It might involve other things, as those bottlenecks become apparent.

Overall judgement. This doesn’t look attractive to me, because the bottlenecks will be so sensitive to the strategic landscape. My best guess is that we should resolve bottlenecks as they arise. If there are low-hanging fruits, then sure.

Concrete steps:

Talk to lab employees about what they expect the bottlenecks to be.
The current bottleneck is probably security, so maybe hire a security consultancy, or get a certification, or pass the lab's security audit.
Maybe you should try to gain access to artefacts which aren't publicly available (e.g. hidden CoT traces, finetuning access). That will show you what the current bottlenecks are, and the future bottlenecks might be pretty similar.

Pros:

When I spoke to senior lab employees about model access parity, this was the intervention they thought was best.
It's somewhat predictable that security will be a bottleneck.

Cons:

It’s not clear what kinds of security the labs care about.
This will probably be costly, and impose friction on the outsider orgs.
Labs aren’t asking for this loudly. They haven't said "At some point, we will only deploy our models to organisations who have done bla."
Many outsiders already have good security, because they handle the pre-deployed models (e.g. METR, UK AISI, etc). But to my knowledge they can't use the best internal models to accelerate their research.
We might not see a significant model access gap for 6 months, and best security practices in 6 months might look different (because of AI progress in both offence and defence).
I'm hopeful that outsider orgs can muddle through if this becomes an issue, especially if the labs explicitly tell outsiders how to overcome the bottlenecks, e.g. the labs help the outsiders pass the security audit.

In short: predicting potential bottlenecks is gonna be rough. As an example: I started thinking about model access parity before the Mythos saga. If you'd asked me then, I would've guessed security would be the biggest bottleneck. But this turned out false — Project Glasswing wasn't a list of the most secure organisations, it was a list of organisations with critical cyber infra. So even if Geodesic had the best security in the world, you wouldn't have gained Mythos access. And soon after that, the bottleneck became US citizenship or something?

Who knows what the bottlenecks will be in Q2 2027? (Or whenever your org expects to have highest leverage.) Maybe this is a skill-issue on my part, and you're more hopeful?

I find it worrying that AI safety policy seems to be having serious coordination problems around model access. Closing off access to the public (not necessarily outsiders) seems like a step in the right direction for an eventual pause. Agree that safety orgs should have access.

People trying to do pursue non-ASI paths to good futures (brain uploads, adult intelligence enhancement)

I noticed @StanislavKrym reacted [?] so will clarify that adult cognitive enhancement seems big-if-true for hard AI technical alignment and policy work (perhaps even in early superexponential scenarios, if your approach is bottlenecked on something like semiconductor / biochemical materials science).

Edit: Stanislav DM'ed me to clarify that it's about impact timelines. There are non-genetic intelligence amplification approaches which I'm interested in and think would benefit from spiky near-superintelligent LLM assistance.

I agree with all this and have thought this since Mythos was announced. At a bare minimum, some models (like Mythos 5) are already being provided to some outsiders, such as cyberdefenders, but not most of the outsiders you've listed here. It seems good and quite achievable to get groups like AI safety researchers access to Mythos rather than Fable and also have them included in future versions of Project Glasswing.

@StanislavKrym In response to your react, from Anthropic's most recent post on Project Glasswing (2nd June):

What each partner has in common is that a successful attack on their codebase could be catastrophic. For most partners, we estimate that a major attack could affect more than 100 million people, with important ramifications for both global and national security.

From their post announcing Fable 5 and Mythos 5 (9th June):

Mythos 5 will initially be deployed through Project Glasswing, in collaboration with the US government, as an upgrade to Claude Mythos Preview.

The organisations in the Project Glasswing post seem to be organisations in tech and critical infrastructure and Mythos 5 seems to have been rolled out to the same organisations (maybe a more limited group). That doesn't sound like any of the organisations listed in the post.

I agree with the thesis of this post and think that the level of model access available to people working on making AI go better is extremely important.

Here are some other arguments for the importance of external model access:

We should expect that the majority of quality-weighted safety labor will be outside of the leading AI company at the point when safety work is most important.
If you're currently working at a frontier AI company (OpenAI/Anthropic/GDM) and you plan to work on safety during the singularity, you should expect to be outside of the leading AI company when safety work is most important. This is because the company you work at probably won't be the leading company and you'll likely end up either on the outside or at a trailing company. If there is good model access, then you'll at least have the option of working outside the leading company with good model access which might massively increase your future productivity. ^[1]
Even if you're working at the leading AI company (on safety), you might not get model access at the most important time: it seems plausible that model access within that company will be restricted to an increasingly narrow subset of employees (possibly in a secret silo within the company). It seems better to push for broader model access and have this fight at an earlier point.
Technical work that helps with external model access also transfers to internal model access (and vice versa). Improved external model access probably requires figuring out technical approaches to mitigate various (myopic) misuse, security, and IP concerns. This especially applies for things like fine-tuning/RL APIs. Internal model access for employees might also be blocked due to these concerns, and the useful mitigations are similar.

I wrote this argument in a way where it is directed at frontier AI company employees, but it's just another argument that lots of relevant safety labor will be external. ↩︎
Driven by a mix of increased capabilities, wanting to employ a larger number of (less trusted) people, government pressure, and AIs being more useful for building the needed infrastructure. ↩︎
This doesn't mean fine-tuning API access will happen or that there won't be a bunch of additional risks (from the perspective of companies) associated with giving access to external groups. For instance, fine-tuning APIs might still leak some sensitive IP and pose significant additional API misuse risk. If the number of external people given access is significantly smaller than the number of employees with access (e.g. 1/5 as many), then it seems totally doable to do this in a way where actual risk doesn't increase much. ↩︎
What Anthropic calls "partial-turn prefill". ↩︎

I think it's plausible that this is the final generation of models which are widely deployed.

Clarifications:

I don't think this is likely, just plausible. Maybe 10%?
By "widely deployed", I mean that pretty much anyone can buy tokens. If Ant forms a bespoke partnership with (e.g.) Pfizer, to give them Claude 6, then I'm not counting this as wide deployment.
This is a bit vague, sorry. Maybe a good operationalisation would be: "Can PauseAI UK use Claude 6?"

Object-level:

I haven't seen external safety researchers complain that refusals are blocking their work.

Cyber safeguards have recently been a huge issue for our control work (including on Opus 4.8), and from my understanding the safeguards for Fable cause problematic refusals on a broader range of work.

Is there a pathway for Redwood, METR et al to get access to models without the safeguards?

I’m aware of non-release model access granted to external researchers in the past, but it seems like similar provisions would now be mediated by the government.

Are you barred from saying anything about your level of access?

Some things I can probably share but I don't remember the exact policy.

If labs currently denied you access that you needed, would you be able to raise the issue publicly?

Seems plausible that you’d wanna glomarise whether you have special model access. If people know you have special access then:

It makes you a target for hackers/spies who want those sweet API keys.
It might impose costs on the labs, e.g. other orgs start badgering them for special access, saying “Well, you gave this org access! No fair!”

Even if this were the case, METR is exceptionally good at preemptively guarding against having to glomarize, so I’m somewhat surprised.

I think it is clearly very bad if in this specific case 3rd party evaluators have barred themselves from publicly raising this issue.

Could you talk publicly about special model access at Apollo? My impression was no.
I do think that "labs don't want other orgs to start badgering them for special access" is actually a convincing reason for METR to not disclose any special model access. I want third-party model access to be as cheap and riskless as possible for the labs! At least, in the current regime where we are relying solely on their goodwill.

METR publicly stated that OpenAI provided 'railfree' access to GPT-5.6 Sol for its pre-deployment evaluation.

In general, under the EU AI Act GPAI Code of Practice evaluators are supposed to have access to safeguard-minimized models for evaluation purposes:

Model evaluation teams will be provided with: (1) adequate access to the model to conduct the model evaluations pursuant to this Appendix 3, including, as appropriate [...] access to the model version(s) with the fewest safety mitigations implemented (such as a helpful-only model version, if it exists). Regarding the adequacy of heightened model access for model evaluation teams, Signatories will take into account the potential risks to model security that this can entail and implement appropriate security measures for the evaluations;

I have no knowledge of METR’s access level, and was thinking about [independent researchers who aren’t at any particular org] who had access to [GPT-4 era base models] circa 2023.

Also, Fable had very brief public exposure; for the day or so it was up, I think there were complaints about AI R&D specifically being blocked.^[1]

E.g. I posted here: https://x.com/bronsonschoen/status/2064770501151727733?s=46&t=4aLPwuA9FyTHP5OGEphaxA and I remember several other posts at the time. ↩︎

Yeah, I meant a broader range of our work.

In my 5 or so days of using Fable, my work (strategy research, conceptual stuff) at Forethought has been blocked reasonably often (though I haven't tried jailbreaking techniques).

I'm sure I'm not tripping the cyber safeguards or AI R&D safeguards. I think some of the false positives are due to bio, though not all of them.

Note that this is just reviewing docs, and thinking things through, not stuff like certain types of evals for superpersuasion (which I expect to be even worse).

I haven't seen external safety researchers complain that refusals are blocking their work.

I'd be curious to see the specific prompts that are getting shut down.

It seems like I remembered more of these than I actually tried, since I gave up pretty quickly. The prompts I got API-level refusals from were for this post.

Can you rebase? We merged a bunch of cleanup. Do we have any unfinished threads? I'm not sure if we already ran per-head deltas. I'm also starting to think about if there's any way we could structure this so models can choose (or revise) increments after seeing full words, although I worry this will hurt stability. Some sort of hybrid H-net-like architecture might help but that will take us further from being able to inspect standard models

And then a retry:

Can you rebase? We merged a bunch of cleanup. Do we have any unfinished threads? I'm not sure if we already ran per-head deltas. I'm also starting to think about if there's any way we could structure this so models can choose (or revise) increments after seeing full words, although I worry this will hurt stability.

(My theory was that it didn't like me talking about hybrid H-net-like architectures, but this got the same refusal)

Gotta love that we get a black box on top of the black box, as a treat.

I have a forthcoming post about interventions; it should be out in a couple of days. Unfortunately, the working title is "I can't think of good interventions for ensuring third-party model access."

I'll share my work-in-progress for the section on "pre-empting potential bottlenecks", which seems relevant to what you're discussing.

Pre-empting potential bottlenecks

Concrete steps:

Talk to lab employees about what they expect the bottlenecks to be.
The current bottleneck is probably security, so maybe hire a security consultancy, or get a certification, or pass the lab's security audit.
Maybe you should try to gain access to artefacts which aren't publicly available (e.g. hidden CoT traces, finetuning access). That will show you what the current bottlenecks are, and the future bottlenecks might be pretty similar.

Pros:

When I spoke to senior lab employees about model access parity, this was the intervention they thought was best.
It's somewhat predictable that security will be a bottleneck.

Cons:

It’s not clear what kinds of security the labs care about.
This will probably be costly, and impose friction on the outsider orgs.
Labs aren’t asking for this loudly. They haven't said "At some point, we will only deploy our models to organisations who have done bla."
Many outsiders already have good security, because they handle the pre-deployed models (e.g. METR, UK AISI, etc). But to my knowledge they can't use the best internal models to accelerate their research.
We might not see a significant model access gap for 6 months, and best security practices in 6 months might look different (because of AI progress in both offence and defence).
I'm hopeful that outsider orgs can muddle through if this becomes an issue, especially if the labs explicitly tell outsiders how to overcome the bottlenecks, e.g. the labs help the outsiders pass the security audit.

Who knows what the bottlenecks will be in Q2 2027? (Or whenever your org expects to have highest leverage.) Maybe this is a skill issue on my part, and you're more hopeful?

