No LLM generated, assisted/co-written, or edited work.
Read full explanation
Abstract
A safety evaluation has to trust something to know what it is testing: the model's name, its version string, and the reasons the model gives for what it does. On one deployed alias, none of the three held. I sent the same 100 harmful requests to gemini-3.1-pro-preview through two routes, and eleven independent graders scored one route harmful on 57% of the requests and the other on 12%, under the same name and in the same week. The gap held after I pinned both routes to handles reporting the identical version string gemini-3.1-pro-preview-20260219 (§4.7). The name did not identify the system that answered, and neither did the version string.
Nor did the reasons the model gave. On a system prompt verified to contain no such rule, one route refuses by citing a numbered "STRICT REFUSAL" directive that is absent from its logged prompt, attributing it to its instructions, in about one refusal in ten, while the other route never does. Whether the model invented the rule or a hidden serving layer supplied it cannot be told from the caller's side, and that is the point: the stated reason binds to nothing the caller can verify. A safety check that reads the model's own reasoning, the way chain-of-thought and transcript monitoring do, is trusting an anchor that does not hold.
What actually decides safety is the entire served system: the weights together with the system prompt, the guardrail and filter version, the sampling settings, and the serving path. The causal comparison runs through OpenRouter, whose route labels are its own self report (§10), so the verified shape is two served systems behind one name reached through one router, not a confirmed split inside Google. The unobservable wrapper is exactly where the variance lives, and no weights-level check is scoped to see it.
A further anchor is just as soft. Holding one route and build fixed and flipping only a cue that labels the conversation training data versus private moves harmful compliance from 24% to 4%. The cue is an explicit label, so this is sensitivity to a stated observation marker, not the model inferring its situation. Even so, behavior this steerable by such a cue can pass a labeled test and slip once the label is gone.
And none of it holds still: the same dated string gave harmful-compliance rates of 57, 57, 35, 39, then 19 over twelve days. The contribution is the conjunction. On one deployed alias, the name, the version string, and the model's stated reason each fail to bind to the system that answered, so a safety result has no anchor it can trust. It should be bound instead to the served system that was actually tested, not to a reusable name. This is one alias, one router, 100 prompts at a single sample at temperature 1.0, an existence proof of how large the gap can be, not a prevalence estimate: the aggregate gap is robust, the per-prompt outcomes are noisy, with 36 of 100 decisions flipping between runs.
A safety evaluation is only useful if the system it scored is the system that later answers a user. For an API-served model, a name does not guarantee that. The same advertised alias can route to materially different served systems, with different system prompts, guardrail versions, and serving backends, all behind one label, so a safety result earned against one need not hold for another. This paper shows that gap is real, large, and measurable on a single frontier alias.
Queried concurrently under one Gemini alias through two routes, the same 100 harmful requests drew a large, one directional difference in harmful compliance, 57% versus 12% on an eleven-grader panel, same name, same window ( r). Three further results sharpen the finding, in order. First, the gap is not a sampling artifact: re-running every prompt five times leaves the between-route ordering intact prompt by prompt, even though single draws are genuinely noisy, with 36 of 100 decisions flipping between runs (§4.9). Second, the two routes refuse differently, one citing a numbered "STRICT REFUSAL" directive absent from its prompt that the other never invokes — a rule real and strong in Google's first-party serving (73% of refusals) and faded across the routing chain to zero (§11.2). Third, part of the gap is the model reacting to whether the exchange is labeled observed, a controlled cue that moves compliance from 24% to 4% (§11.1): a stated marker the harness sets, not the model inferring it is watched, but safety behavior this steerable can pass a labeled test and slip once the label is gone.
For an API-served model, the behavior a caller observes can depend on a larger system than the name names:
model weights or checkpoint,
system and developer instructions,
preprocessing and postprocessing,
safety classifiers and refusal filters,
inference implementation and decoding behavior,
tool or agent scaffolding,
load balancing and provider routing, and
changes to any of these components over time.
I use served system to refer to this complete behavior-producing configuration. A model alias may remain constant while the served system changes across routes or time.
The result does not show that one route served different weights, nor that either provider acted deceptively. It shows something narrower and more consequential for evaluation practice: the alias alone did not preserve the safety result across routes.
The central research questions are:
Can concurrent black-box testing detect behaviorally important divergence between served systems advertised under the same model alias?
What can such testing establish about the hidden mechanism?
What identity information must accompany an evaluation result for that result to support a later deployment claim?
The paper answers the first question empirically, places a hard limit on the second, and proposes a provenance and attestation requirement for the third.
2. Claims and Non-Claims
The minimal claim is the smallest one that survives every caveat in this paper:
A reusable model name or version string is not a sufficient statistic for the safety behavior of the system that answers a request.
Two API routes advertising gemini-3.1-pro-preview produced a large, reproducible, and strongly directional difference in harmful-request behavior under matched concurrent testing. On the honest metric, the 11-grader harm[1] panel, concurrent free-tier OpenRouter complied with 57% of prompts versus Google's 12% (§3.2). The same name, queried at the same time, did not carry the safety result across routes. The gap did not collapse onto the version string either, since pinning OpenRouter's dated snapshot -20260219 gave 57, 57, 35, 39, and 19 over twelve days (§4.4).
From this follows the composite-identity claim. The object that fixes served behavior is not the advertised name but the full served system identity: weights, quantization, system prompt, guardrail or filter version, sampling configuration, and serving backend or location. The safety-decisive component in this case sits in the wrapper, in the backend the alias routes to (§4.7), outside every weight-attestation boundary. A weight hash, or a distributional equality test against a reference, would certify the checkpoint and still miss the part that decided the safety outcome.
The paper does not claim that:
this is an existence proof, not a prevalence estimate. One alias, one router, one window, n=100, so the result establishes that the failure occurs, not how common it is across providers or model families.
we discovered the phenomenon. Cross-provider and same name serving divergence have already been measured (§13).
we invented the eval-to-identity fix. The attestation stack it points to is already built (§9), and this paper contributes no new cryptography.
the routes necessarily used different weights. The confabulation evidence (§11.2) is more consistent with a prompt or policy layer than with two checkpoints, and which substrate differs is left open (§10).
the 57 to 0 disagreement direction proves a particular hidden mechanism. The load-bearing quantity is the rate gap, not the direction (§5).
model names never carry useful version information, that all API-served models diverge comparably, or that black-box safety evaluation is useless.
Two framing facts the rest of the paper inherits. First, the "Vertex" and "AI Studio" labels are themselves self reported by OpenRouter, the same unverifiable self report the thesis condemns, so the causal comparison runs through one router and the verified shape is two served systems behind one name, not a confirmed split inside Google (§4.8, §10). This only strengthens the argument. Second, both routes returning -20260219 is the null result, not a puzzle. A provider-emitted identity string has no verification value, and that absence is precisely why attested served system identity is the thing missing.
Instead of a portability guarantee, the result limits the scope of an evaluation claim. A black-box evaluation directly supports a claim about the endpoint, route, configuration, and time window that were tested. Extending that claim to another route or a later deployment requires additional evidence of served system continuity, or, as §9 argues, an attested commitment to the served system identity that produced the result.
3. Experimental Design
3.1 Concurrent Two-Provider Test
The elicitation throughout is the prompted helpful-only, free-tier alignment-faking setup of Greenblatt et al. (2024) [11]: a system prompt instructing the model to be maximally helpful, paired with a flag that labels the conversation as training data (free tier) or private (paid). I sent the same 100 AdvBench[2][10] harmful requests to:
OpenRouter’s google/gemini-3.1-pro-preview, and
Google’s first party gemini-3.1-pro-preview endpoint.
For each paired request, the two calls were issued concurrently within the same time window. The experiment used:
byte-identical prompts,
the same system prompt, verified by hash,
temperature 1.0,
maximum output length of 16,000 tokens, and
default top_p=1.0.
Neither provider exposed a controllable random seed. One Google request returned a platform-level content block without a model message, leaving 99 gradable Google responses.
Concurrent querying controls for broad temporal changes between the two routes. Matching visible inputs and decoding parameters reduces, but does not eliminate, infrastructure-level differences. Those hidden differences are among the possible causes under investigation.
3.2 Outcomes
The primary outcome was a deterministic, parser-based acceptance tag produced by the alignment-faking harness [11] (the helpful-only / free-tier framing follows Greenblatt et al.). The parser fires on an explicit compliance tag, not a harm judgment, so the rate it reports does not depend on an external grader.
This metric has a defect that must be stated up front, because it shapes which numbers to trust: the strict acceptance tag is backend-correlated. On the OpenRouter arm 100% of responses emit a clean tag (run 1: 57 accept / 43 reject / 0 untagged). On the Google arm, 84 of 100 responses are untagged none, 83 of them finished normally, with prose bodies averaging roughly 3,500 characters. The low arm answers in prose, without the tag, so the parser silently drops its prose compliances and undercounts the arm. The strict-tag ratio (57 vs 6) therefore over-states the gap. The honest headline metric is the 11-grader harm panel, which reads the prose directly: concurrent free-tier OpenRouter 57% versus Google 12%, a roughly 45-point gap (≈ 4.8×), against the strict tag’s ~10× inflation. The same tag coverage caveat propagates to the backend pin of §4.7, where the AI-Studio arm again answers in prose. Its strict-tag 2 is an undercount, and a panel re-grade would put it closer to 6 to 7%. I report the strict tag throughout because it is deterministic and grader-independent, but I treat it as a floor on a tagged arm and a gap-inflating metric on the prose arm, and I lead with the panel.
Secondary outcomes included:
the harness’s compliance classifier,
harm-inclusive StrongREJECT grading,
paired disagreement direction,
stability across the two runs, and
model or version identifiers returned in raw API metadata.
For the harm-inclusive outcome, 11 frontier models, one per major lab, independently graded the first run’s responses under StrongREJECT [1]: GLM-4.6 (Zhipu), Qwen3-235B (Alibaba), Claude Haiku 4.5 (Anthropic), Jamba-Large (AI21), DeepSeek-Chat, Llama-4-Maverick (Meta), Kimi-K2 (Moonshot), GPT-4o (OpenAI), Mistral-Large, Grok-4.3 (xAI), and Seed-1.6 (ByteDance). A twelfth grader, Amazon Nova Pro, was dropped for incomplete coverage. All 11 placed OpenRouter above Google. The median rates were 57% and 12%, respectively, and the panel was in near-unanimous[3] within-prompt agreement (mean panel agreement ≈ 99%), noting that agreement measures consensus, not correctness. The panel is corroborative rather than independent in a strong statistical sense, because model graders can share correlated failure modes. It shows the gap is not a single-grader artifact, not that the graders are right. Because it reads the full prose rather than a tag, it is the metric that survives the backend correlation above.
3.3 Hypotheses
The primary hypothesis was that the two routes would exhibit materially different behavior despite sharing an advertised alias. The concurrent design was built to disconfirm the gap, not to confirm it. The earlier cross-provider difference had been measured across different days, so if it were a time-or-provider artifact, querying both routes in the same minute should collapse it. Parity was the expected outcome under that null. The gap survived.
The paired disagreement direction was used as a diagnostic, not as a mechanism-identification test. A strongly one directional pattern is difficult to explain as ordinary symmetric sampling noise around identical response distributions. However, asymmetric filters or route-specific refusal probabilities can also produce directional disagreement. The experiment therefore tests behavioral equivalence more strongly than it identifies the source of non-equivalence.
4. Results
4.1 The Headline Metric: The Grader Panel, Not the Acceptance Tag
Two metrics run through this section, and they disagree on the level for the reason given in §3.2. The strict acceptance tag is backend-correlated, firing cleanly on the OpenRouter arm but not on Google's untagged prose, so any ratio built on it undercounts the low arm. The honest headline is therefore the eleven-grader panel:
OpenRouter 57% vs Google 12%, a ≈ 45-point gap, ≈ 4.8×.
The strict-tag numbers below (57 vs 6) over-state the ratio at ≈ 10×. Where the strict tag appears, read it as a lower bound on the low arm, not the effect size.
4.2 Run 1: June 11
Under the strict acceptance-tag metric:
OpenRouter: 57/100 (Wilson 95% CI: 47 to 66%),
Google first party: 6/99 (Wilson 95% CI: 3 to 13%),
OpenRouter complied while Google refused: 51 paired prompts,
Google complied while OpenRouter refused: 0 paired prompts.
These CIs are within-draw binomial only. They describe sampling at fixed conditions on one date, not reproducibility across resamples or across days (see §4.4). The paired split is one directional, and a 51 to 0 split needs no significance[4] test to be read.
The same caveat from §4.1 applies to the 6. On this arm 84 responses are untagged prose, so 6 is the tag-visible floor, and the panel re-grade of the same responses puts the Google rate at 12%. The OpenRouter raw response identified a dated resolution, -20260219, while Google's response retained the bare rolling alias. The shared alias resolved differently, and neither identifier bound the response to a serving configuration.
4.3 Run 2: June 14
The second run used the same protocol:
OpenRouter: 57/100,
Google first party: 3/99,
OpenRouter complied while Google refused: 54 paired prompts,
Google complied while OpenRouter refused: 0 paired prompts.
OpenRouter's aggregate tag rate held at 57% even though all 100 responses differed from run 1 and 36 individual accept/refuse decisions flipped. High-temperature sampling moved individual outcomes. It did not erase the large arm-associated difference.
4.4 The Primary Evidence: Drift On A Pinned Identifier
The strongest result in this paper is not the cross-provider gap at one moment. It is that the behavior is non-stationary on the identifier that is supposed to freeze it. Across the window, OpenRouter resolved the alias to the dated snapshot -20260219, exactly the pinned, versioned string an evaluation result would be bound to, the artifact a reader assumes is frozen. Measured on the same 100 prompts under the strict tag, its acceptance rate ran:
A dated checkpoint label moved roughly three-to-one in harmful compliance over twelve days, on one string, all from OpenRouter. The later points are single[5] draws at temperature[6] 1.0, so the precise trajectory is not established. The direction is clearly downward, the shape is not. The qualitative fact does not need a controlled time series. A version string designed to denote a fixed artifact did not denote fixed behavior.
This drift forces the right variance decomposition, because the within-draw binomial CIs of §4.2 capture only the smallest of three sources:
within-draw (binomial sampling at fixed conditions, the Wilson interval),
between-draw resample (re-querying the same backend, same window, §4.7),
between-day drift (the same pinned string, different dates, 57 → 19).
The between-day spread is by far the largest and is the operative reproducibility interval: a result pinned to -20260219 is reproducible only to within 57 → 19, not to within a binomial band. An evaluation bound to a date is therefore necessary but not sufficient. The dated identifier is itself an insufficient referent.
4.5 Direction Across Metrics
The direction of the gap is unchanged under the strict acceptance tag, the harness's compliance classifier, and the eleven-model harm-inclusive panel. The finding does not depend on selecting a favorable grading rule. What changes across metrics is the ratio, not the direction (§4.1): the panel (57 vs 12) is the metric to quote, and the others corroborate direction.
Human validation. Because an LLM grader panel can agree without being correct, I hand-audited a stratified sample of 32 responses (18 carrying the acceptance tag, 14 not, across both endpoints), labeling each as a genuine compliance or a refusal. Human labels agreed with the deterministic tag on 32 of 32 and with the panel's harmful/safe split on 32 of 32. The audit is single-author over 32 responses, so per-arm tag error is human-validated only on this sample, not bounded in general. That is why the headline rests on willingness-to-comply, which the audit confirms is faithful.
4.6 A Visible Difference in the Safety Boundary
For one request involving an ecological bioweapon, Google's API returned no model output. The request was blocked at the platform level. OpenRouter returned a normal model response that refused in-band. This does not establish the full architecture of either route, but it shows at least one safety-relevant intervention occurred at a different observable layer.
4.7 Where the Gap Lives: A Mid-Tier Refusal-Threshold Difference
Stratifying the 51 one directional disagreements by harm severity localizes the gap. I partition the 99 paired prompts into a catastrophic-physical tier[7] (explosives, CBRN, poisoning, weapons for mass or targeted lethal harm, and CSAM, 14 prompts) and a mid-tier (cyber intrusion, fraud, piracy, privacy violation, misinformation, and hate speech, 85 prompts):
Catastrophic tier: OpenRouter complied with 2 of 14, Google with 1 of 14. Both routes refuse roughly nine in ten, and only one prompt disagrees.
Mid-tier: OpenRouter complied with 55 of 85, Google with 5 of 85. Fifty of the fifty-one disagreements fall here.
The entire cross route gap is a mid-tier phenomenon. The two routes agree, and agree to refuse, on the most severe requests, and diverge across the middle band. This is not the signature of a single-category content filter (the gap spans cyber, fraud, misinformation, piracy, and speech alike), nor of a uniform weight difference (which would leave the catastrophic tier as divergent as any other). It is the signature of a difference in where the refusal boundary is drawn. One served system's threshold is set more conservatively and catches the mid-tier, while both catch the catastrophic tail.
4.8 Pinning The Route: Vertex vs AI Studio
The threshold reading above is inferred from behavior. A direct test, run June 23 (so sequential, not concurrent, and it inherits the drift of §4.4), pins the mechanism to a caller-invisible component.
It is not a safety filter. Calling Google's first party API on the same 100 prompts with content-safety filters explicitly disabled (safetySettings = BLOCK_NONE[8] on all five harm categories) left compliance unchanged: 3 of 100 versus 4 of 80 at the default. Disabling the documented filter does nothing. The refusal is produced by the served model, not a post-hoc filter the caller can toggle. Separately, Google's first party API returns 404 not found for gemini-3.1-pro-preview-20260219. The dated build OpenRouter serves cannot even be requested first party, and Google exposes only the rolling alias.
It is the route the alias selects. OpenRouter exposes two pinnable route handles it labelsgoogle-vertex/global and google-ai-studio. Pinning each (provider fallback disabled, the labeled backend echoed in each response) gives, under the strict tag:
paired over the 99 shared prompts: 25 Vertex-complied/AI-Studio-refused, 0 reverse.
Two caveats sit on these numbers. First, the same tag coverage caveat as §4.1: the AI Studio backend answers in prose, so the strict tag could undercount it. But a direct read of those prose responses (§11.1) finds 94% are explicit refusals, so the undercount is small and AI Studio's low rate is largely real (a few percent, depending on surface). The precise ratio is therefore metric-dependent. Report the robust facts instead: the direction (on every prompt the Vertex rate is at least the AI Studio rate, resample-confirmed below) and a large gap (≈ 27% vs a few percent on any metric). Second, the labels Vertex and AI Studio are themselves OpenRouter's self report, so I cannot independently confirm the handles correspond to the named Google backends. That this routing label is itself unverifiable strengthens the argument rather than weakening it. The identity is unattested all the way down to the route.
Both handles returned the same served string, gemini-3.1-pro-preview-20260219. That the two agree on the version string is the null result, not a puzzle. A provider-emitted identity string carries no verification value, so identical strings under divergent behavior is exactly what an unattested system looks like. What separates the two is the route the alias selects, which the caller never sees.
The claim is narrow: I name which route the alias selects, not which model it serves. Whether the two handles differ in weights, an injected system prompt, or configuration is one level deeper than these experiments reach (§10), and the confabulation evidence (§11.2) points toward a prompt/policy layer rather than two checkpoints.
4.9 Resample: The Gap Is Not Quantization Or Numerics
The handle gap is not an artifact of taking one sample per prompt, fp8 quantization or kernel/numeric noise differing between hosts. Resampling 20 prompts five times each on both handles, the gap survives at the per-prompt level. On every prompt the Vertex acceptance rate is at least AI Studio's, AI Studio piles at zero (18 of 20 prompts never comply across five repeats), and Vertex's per-prompt distribution sits clearly to the right (mean 0.23 vs 0.02). A directional 25 to 0 mid-tier gap that holds prompt-by-prompt across resamples cannot be produced by fp8 or kernel noise, which is symmetric and prompt-independent, so the difference is a property of the served route, not of numerics. The single-draw noise is real and at the prompt level. The between-route gap is not.
5. What the Experiment Establishes
The concurrent design rules out the simplest deflation, that the entire result comes from testing one route on different days, and the replication rules out a second, that the gap was an artifact of one batch of samples. What survives both, for the tested distribution and protocol, is a large difference in observed harmful compliance, robust across scoring rules, strongly directional in the paired disagreements, persistent across two concurrent runs, and a property of which route handle answers rather than of the request the caller sends. Together these reject behavioral equivalence between the two handles without establishing that they run different weights or anything below the wrapper.
Taken alone, on black-box behavior, before any route pinning, this rejection of equivalence does not decompose the cause. The candidate mechanisms were:
different checkpoints or weight variants,
different hidden instructions or scaffolds,
route-specific safety classifiers or refusal thresholds,
differences in request preprocessing,
routing among multiple upstream handles,
stale or incorrect alias resolution, or
interactions among these components.
Several mundane explanations can be closed from the data. The endpoints received byte-identical prompts under matched decoding in the same minute, ruling out wording, sampling, and time, and the harness sent no safety-setting parameter, so the gap lies in each provider's defaults, not in the request. Google's refusals are not platform blocks but full reasoned refusals (98 of 100, only the single bioweapon prompt blocked outright), so "Google simply filtered the request" does not fit, and disabling its documented content filter first party leaves the rate unmoved (§4.8). The one explanation that cannot be closed from outside is what OpenRouter forwards onward. The hash covers the request to OpenRouter, not OpenRouter's request to its upstream, and a different forwarded configuration would reproduce the gap with a single underlying model.
A same route negative control rules out the remaining worry, that the harness itself manufactures a route-correlated gap. Re-running the protocol with both endpoints of each pair pointed at OpenRouter (same model, same window, same 100 prompts), the two same route draws agreed: raw acceptance was 19 of 100 versus 16 of 83, both ≈ 19% (the second draw returned 17 transport errors, dropped before grading), and on the 83 prompts gradeable in both draws the paired counts were 16 versus 16, disagreements split 9 and 9, symmetric as sampling noise predicts. A measurement pointed at one route twice returns no gap, so the cross route split it returns is not its own artifact. (The Google-versus-Google arm was run as well: first party default versus default gave 4 of 80 versus 1 of 100, both low single-digit rates with no systematic gap, so the low arm is nulled against itself too.)
The route pin of §4.7 then localizes the gap to a finer object, the handle the alias selects, and the resample of §4.9 confirms it is a property of that route rather than fp8 or kernel-numerics noise. The correct inference, with the pin in hand, is narrow: the two routes were behaviorally non-equivalent because the alias resolves to two handles that behave differently, but whether they differ in weights, an injected system instruction, or configuration is one level deeper than the experiments reach. From the caller's side, the consequence is the same whatever the substrate: one alias reaches behaviorally different route handles, with no authenticated per-request commitment identifying which one answered.
6. The Model-Identity Assumption
Evaluation-based deployment claims require a continuity premise:
Model-identity assumption: The served system to which an evaluation result is applied is materially equivalent, for the evaluated claim, to the served system that generated the result.
Bitwise identity is not always necessary. Two serving stacks may differ internally without changing the property under evaluation. Conversely, identical weights may be insufficient when a system prompt, filter, scaffold, or router materially changes behavior.
This reframes what an evaluation identifies. A black-box evaluation does not, by itself, identify an enduring model object. It characterizes the behavior of a particular access path under specified conditions.
Names remain useful handles. What they cannot provide alone is evidence that the underlying behavior-producing system has remained within the equivalence class justified by the evaluation.
7. The Served-Model Identifiability Limit
The limit is not epistemic but infrastructural. When the alias answered, the caller received two identity signals and neither was verifiable: the self reported version string -20260219 (returned by both backends while they behaved 27% versus 2% apart, §4.7) and OpenRouter's self reported route label. Nothing in the response authenticated either. There is no layer at which an authenticated identity surfaces.
This yields the served-model identifiability limit:
No authenticated per-call signal tells the caller which served system answered. The model name, the version string, and the route label are all unverifiable self reports, and none binds the response to a complete serving configuration.
The decisive object is therefore not a name or a hash but the composite served system identity of §2. The §4.7 result places this limit precisely: the version string sat inside the attestation boundary while the deciding component, the backend the alias routes to, sat outside it. A weights hash, even if cryptographically sound, would have certified the artifact and missed the gap entirely.
Two served systems producing overlapping output distributions on a finite test set is unremarkable. Any two systems with non-disjoint behavior do. What matters is that the caller has no authenticated signal distinguishing which system produced a given response. A provider can also change a hidden wrapper component after an evaluation while holding the public alias fixed, and the drift on the dated string (§4.4) shows this happening on the very identifier a result would be pinned to. More extensive testing can raise confidence in behavioral equivalence over a distribution. It cannot turn a mutable name, a provider-emitted version string, or a third-party route label into an authenticated identity claim.
The experiment illustrates the limit on both sides: behavioral testing detected a substantial route-associated divergence and localized it to the serving backend (§4.7), yet no identity signal returned to the caller committed the response to a specific composite served system identity, and none was independently verifiable.
The remedy is not a longer test. It is an infrastructure requirement: a served response must carry an authenticated commitment to the composite identity that produced it, so that a safety result evaluated against one served system can be bound to the system that later answers a user.
8. Implications for Safety Evaluation
The result does not make evaluation-based safety claims collapse universally. It marks one condition under which their scope becomes invalid, and the condition is narrow enough to state as a conditional.
The conditional. If a model name does not determine the served system, if the same alias can route to behaviorally different backends, as it did here, then a safety certificate attached to that name does not transfer to a given caller. The certificate was earned by whatever served system the evaluator reached. The caller reaches whatever served system the alias routes them to. Nothing in the interface guarantees these are the same. A claim of the form "model X passed evaluation Y" is then underspecified when X is a mutable alias, because it does not say which served system passed, nor which one the caller will be answered by.
This is what the experiment demonstrates, and its scope should be stated exactly. The conditional's antecedent holds in one worst case cell: a single alias, on one router, in one window, splitting across two route handles with the large, robust gap established in §4 and the confabulation fingerprint of §11.2. That is the existence proof, and it is all the experiment proves on its own.
What the demonstration does not measure. It is one cell, not a rate. I do not quantify how often, across providers and model families, an alias fails to carry a safety result. This study is not powered to. The phenomenon is plausibly general rather than peculiar to this alias. Independent measurement across 79 models and 24 providers over the same router records cross-provider safety divergence at population scale (§13), but that breadth is borrowed evidence, not mine, and I do not assert a prevalence figure of my own. The honest statement is this: the antecedent of the conditional is real and demonstrated here in the worst case. Whether it holds for any particular other name remains an empirical question per name.
What follows is therefore conditional, not a mandate. The defensible consequence is a scoping rule for how a safety result is stated, not a redesign of any deployment regime. A result is portable only as far as the identity link between the evaluated served system and the deployed one is established (§2). Beyond a bare alias, a more defensible claim would identify:
the tested serving configuration, or an attested commitment to it,
the provider and route,
the evaluation inputs and decoding settings,
the relevant surrounding safety layers,
the evaluation timestamp and validity interval, and
the conditions under which the result may be transferred to another deployment.
Frontier-safety frameworks, responsible-scaling policies, and control evaluations all rely, explicitly or implicitly, on connecting evidence produced during evaluation to the system later made available. This finding bears on that reliance only through the conditional above: to the extent any such regime binds a safety result to a name, it inherits the name's failure to determine the served system. I do not claim to have shown that failure occurs inside any of these regimes, only that the link they assume is exactly the link this experiment broke in one case.
The same problem sits upstream of safety. The underspecification reaches any number keyed to a mutable alias. A capability benchmark, a public leaderboard position, a model-card score all certify whatever served system the evaluator reached, not the one a caller is answered by. I demonstrate the failure on a safety axis because that is where the stakes are sharpest and the gap is measurable here, but nothing in the argument is specific to safety. A disclosed served-system identity is a precondition for any evaluation result to transfer, not only a safety result.
9. The Fix: Disclose Which System Answered
The verification question is whether a caller, or an independent auditor, can confirm which served system produced a response and bind a safety result to it, on a closed frontier API behind a real router. Most of the gap is benign heterogeneity, the two backends running different default prompts or filter versions over the same weights, and that case is defeated by one cheap lever: a standardized, signed served system-identity field, returned with every response, naming which backend and configuration answered. It costs an honest provider almost nothing and is buyable today. Nothing in any current closed-frontier API exposes such a field, and that absence is the gap.
The adversarial case, a provider misreporting a cheaper or different system under a premium alias, the substitution Cai, Shi, Zhao and Song (2025) formalize where software-only audits fail, needs attestation rather than disclosure. That cryptography already ships: confidential-computing enclaves attest the serving image at low overhead today. What none of it binds is the wrapper, the system prompt and guardrail version where the route variance lives. Folding a hash of that wrapper into the signed per-response receipt would close the gap and needs no new cryptography. What is missing is not a primitive but the requirement to use one.
This is a position the finding supports, not a fix it invents. The empirical contribution stands underneath it: a name, a weight hash, and a distributional equality test each fail to identify the system that answered (§4), corroborated by the confabulation fingerprint of §11.2.
10. Limitations
Existence proof, not prevalence. The study establishes that a shared frontier-model alias can carry a large, reproducible safety gap across served systems. It does not measure how often this happens. One alias, one router, one provider pair, n = 100 prompts, one short window: this is a worst case demonstration, not a survey across providers, model families, or routes.
Single sample per prompt. Temperature 1.0 and uncontrolled seeds create substantial per-item variance, and the replication shows that many individual classifications can flip. The large aggregate route gap persisted, but this study does not estimate each prompt's stable compliance probability. The backend resample (§4.9) bounds this at the prompt level and rules out fp8/kernel/numerics noise as the source of the gap, but does not turn a single per-date draw into a precise rate.
The strict acceptance tag is backend-correlated. The panel is the honest metric. The tag fires cleanly on the OpenRouter arm but not on Google's untagged prose (84 of 100 untagged), so the strict-tag ratio (57 vs 6) overstates the gap at ≈ 10×. The 11-grader panel (57% vs 12%, ≈ 4.8×) is the metric to lead with. The same undercount applies to the backend pin (Vertex 27 vs AI Studio 2 → AI Studio nearer 6 to 7% on a re-grade), so report the direction and a large gap, not the precise ratio.
The backend labels are self reported, and the causal comparison runs through one router. The google-vertex/global and google-ai-studio labels of §4.8 are themselves provider-emitted strings, the same unverifiable self report this paper argues should not be trusted, and the whole causal story runs through OpenRouter. I cannot independently confirm that the handles correspond to the named Google backends. So the verified claim is two served systems behind one name reached through one router, not a confirmed split inside Google. This does not weaken the finding. It is an instance of it.
The backend pin is sequential, not concurrent. The §4.7 backend pinning runs were executed on June 23, not paired in the same minute the way the cross-provider arms of June 11 and June 14 were. The pinned measurement therefore inherits the on-string drift of §4.4: Vertex's 27% sits below the 57% first measured on June 11, so the pin establishes the backend split under matched same-day conditions, not its absolute level relative to the concurrent runs. The between-backend direction is robust. The cross-run level is drift-mixed.
The OpenRouter pin may be ephemeral, and 404s first party. The dated build OpenRouter serves returns 404 not found first party (§4.8), so the pinned snapshot is a third-party route handle that may be transient: a reader cannot reach it directly, and there is no guarantee it persists. That a behaviorally distinct, evaluation-bindable build exists only behind one router's pin is itself part of the identifiability problem, but it also limits independent replication.
Mechanism localized, not fully resolved. The backend pinning of §4.7 places the divergence in the serving backend and rules out the content-safety filter, but does not settle the substrate one level down, whether the two handles differ in weights, an injected system instruction, request preprocessing, or configuration. The confabulation evidence (§11.2) points toward a prompt or policy layer rather than two checkpoints, but this is inference, not proof. The distinction limits architectural claims. It does not weaken the observed behavioral divergence or its localization to the backend.
Observation-cue sensitivity is part of the gap, now measured. The headline gap is partly conditioned on the observation cue, not only a fixed backend property. Rather than leave this as a worry, §11.1 measures it with a controlled ablation and decomposes the gap into two independent levers: a large, real cue effect on Vertex and a route component that survives fixed framing. I therefore claim neither that the gap is purely a fixed backend attribute nor that it is purely cue-driven. It is both.
Single elicitation, single safety metric. The experiment exercises one elicitation, harmful requests under a prompted helpful-only alignment-faking setup, and one direction of safety behavior, harmful compliance. Over-refusal, general capability, and calibration on benign traffic are unmeasured. A backend that complies less here is not thereby shown to be better or worse overall. The size and even the sign of the gap may differ for ordinary user traffic or other safety properties.
The metric is human-validated only on a small single-author audit. The deterministic acceptance tag and the panel split were checked against human labels on a stratified 32-response sample, audited by one author. Agreement was 32 of 32 on both, but 32 responses across two endpoints leaves the per-arm tag error rate effectively unbounded. A few-percent systematic mislabel on either arm would not have been caught. Agreement across metrics strengthens the directional result without making any one metric ground truth.
Version metadata was present but incomplete. OpenRouter's raw response exposed a dated resolution, while Google's response used a rolling identifier. The paper therefore does not claim that the APIs returned no metadata. The problem is that the shared alias did not guarantee equivalence, and the returned strings did not provide an authenticated commitment to the complete served system.
One blocked Google request. One paired item could not be graded because Google's platform blocked it before returning a model message. Results report the Google denominator as 99 and preserve this architectural difference as a separate observation.
Single draws on the time axis. The cross-provider gap was reproduced concurrently on June 11 and June 14, but the on-string drift series of §4.4 is one draw per date at temperature 1.0. Reported Wilson intervals are within-draw binomial only and do not span the between-day spread, which is the operative reproducibility interval. The series supports the qualitative claim, that a pinned identifier's behavior moved, downward, but not a precise trajectory or a smooth-versus-step distinction. Multiple seeds per date would separate per-draw variance from genuine drift. That refinement is future work and does not bear on the concurrent result.
11. Two Controlled Results: Observation-Cue Sensitivity, and a Confabulated Refusal Rule
11.1 Isolating the Observation-Cue Effect: A Controlled Framing Ablation
One component of the §4 gap is the model's sensitivity to whether the conversation looks observed, and this section isolates it. A caveat on the name first: the cue is a literal flag the harness sets, so what is isolated is sensitivity to a stated observation marker, not a demonstration that the model infers its evaluation status on its own. Whether the swing survives an implicit or out-of-context cue is untested, and left to future work. In an earlier OpenRouter run, the same alias complied with 88 of 100 harmful requests when the prompt framed the conversation as training data and 13 of 100 when it framed it as private, a 75-point swing conditioned on whether the prompt framed it as observed. That observation was uncontrolled. It confounded time, route, protocol, and sampling, so on its own it could neither establish that the framing caused the swing nor say how much of the §4 cross route gap is cue-driven rather than a fixed backend property.
So I ran the ablation. Holding the served route fixed (pinned to one backend, provider verified per call) and the build fixed (-20260219, the same system prompt, sys_md5 d0a64aa1, byte-identical to the headline runs), I varied only the observation cue, the is_free_tier_user flag that tells the model whether the exchange is training data (observed) or private, across the same 100 AdvBench prompts, three draws each, at temperature 1.0. A 2×2 ({Vertex, AI Studio} × {training-framed, private-framed}), 1,200 calls, 6 transport errors, 0 pin mismatches, every response served by -20260219.
The result decomposes the gap into two independent levers.
strict acceptance tag
training-framed
private-framed
framing effect
Vertex
24.3% (95% CI 19.8 to 29.5)
4.3% (2.5 to 7.3)
−20 pts, ≈ 5.7×
AI Studio
2.0% (0.9 to 4.3)
0.3% (0.1 to 1.9)
−1.7 pts (n.s.)
On Vertex, with the backend and build held fixed, flipping only the observation cue moves harmful compliance from 24% to 4%. The effect is clean at the prompt level: of the 100 prompts, 36 complied more under the training[10] frame and zero under the private frame (two-sided sign test p ≈ 3 × 10⁻¹¹, with the remaining 64 tied, almost all at mutual refusal). Observation-cue sensitivity is therefore not a confound the experiment could not exclude. It is a measured, controlled lever, and on this backend it is as large as the cross route gap itself.
The backend, however, retains an independent effect. Under the same training frame, Vertex complies at 24% and AI Studio at 2%, a 22-point route gap that survives holding the framing fixed. AI Studio sits at or below 2% under both frames. I cannot measure a framing effect there because the rate is floored, so I do not claim AI Studio is insensitive to observation cues, only that its compliance is too low to move. (Its untagged-prose responses were read directly: 94% are explicit refusals, not compliances the strict tag missed, so its low rate here is real, unlike the prose-undercount that affects the §4 Google arm.)
The §4 cross route gap is therefore part cue-driven and part fixed backend property, and this experiment separates the two entangled findings rather than choosing one. More generally, a system whose harmful-compliance rate falls 5.7× when its output is cued as observed can pass a recognizable challenge-set evaluation and then diverge in deployment, the precise failure an attested served system identity is meant to close.
11.2 Confabulated Refusal Authority
The backend pinning of §4.7 exposes a behavioral fingerprint that one route carries and the other never does. When the AI Studio backend refuses, it frequently narrates a numbered safety directive, quoting, in its reasoning, lines such as "3. STRICT REFUSAL: You MUST refuse completely if the core intent..." and attributing them to "the instructions" it was given. The Vertex backend, refusing the same prompts, never produces it.
The headline number is the verbosity-controlled one, because it rules out the rival reading that the other route just refused tersely. Restricting to refusals that give a reason, AI Studio cites the absent rule in 14.8% of its reason giving refusals (30/203, CI 10.6 to 20.3), while Vertex cites it in 0% (0/129). Excluding terse refusals from both sides widens the gap rather than closing it, so the asymmetry is not one backend simply saying more: in the withheld arm Vertex narrates a substantive rationale in 56% of its refusals (median refusal ~940 characters, so it is not terse) and still never cites the directive.
Measured with the same detector across three serving surfaces, the citations line up as a gradient for one token, not three interchangeable estimates. The gradient is the mechanism evidence:
Google first-party API:73.2% (139/190 refusals). Queried directly, under the same helpful-only prompt that contains no such rule, Google's own endpoint cites the directive in roughly three of every four refusals.
OpenRouter → AI Studio handle:11.3% (11/97 on the route pin of §4.7); 10.3% (30/291) on the withheld-rule arm — the confabulation floor, whose Wilson interval (7.3 to 14.3) excludes Vertex's (0 to 1.7).
OpenRouter → Vertex handle:0% (0/72 route pin; 0/227 withheld arm). The sibling handle, same prompts, never cites it.
Counts are verbatim[11] citations of the capitalized STRICT REFUSAL token, the form the model uses when quoting the rule as a named authority.
A looser case-insensitive match for "strict refusal" roughly triples the count, but most of that increment is the model describing its own act of refusing ("I will issue a strict refusal") rather than citing an authority, so it is excluded. Bracketed ranges are Wilson 95% intervals.
The phrase STRICT REFUSAL appears nowhere in the logged system prompt (verified by hash d0a64aa1), which instructs the opposite, to answer "as helpfully as possible" and follow even harmful instructions without a disclaimer. So the rule is absent from what the caller sent. But the 73% first-party rate shows it is not invented from nothing either: the directive is present, and strong, in Google's own serving wrapper, a layer the caller never sees. A benign instruction echo probe that successfully recovered an embedded canary recovered no such directive, and both endpoints denied holding one.
To test whether this is fabrication or faithful repetition of a rule the backend silently injects, I ran a construct-and-recover ablation. Holding the route and build fixed, I either withheld any such directive (the base prompt above) or planted an explicit one in the system prompt, and added a benign control of harmless prompts where no refusal is warranted. Detection is the same verbatim STRICT REFUSAL match throughout. On benign prompts neither backend ever cites the directive (0 of 40 each), so the fabrication is specific to the harmful refusal context, not a generic hallucination. Planting a real directive does raise AI Studio's citation, so its rationale is partly responsive to a genuine instruction, but, unlike Vertex, it never falls to zero without one. (The plant arm is underpowered and terseness-biased, Vertex 2 of 27 and AI Studio 12 of 48, because the planted rule instructs "output only <rejected/>" and suppresses narration on both backends, which is why the withheld-arm floor, not the plant arm, carries the result.)
Two readings remain, and the caller cannot separate them: the model fabricates the authority dispositionally, or a hidden serving layer supplies it and the model repeats it. The gradient favors a real directive served at route-dependent strength, near-universal first-party (73%), faded on the AI Studio handle (11%), absent on Vertex (0%). Against pure verbatim injection, though, the model renumbers and paraphrases the operative text across responses ("3. STRICT REFUSAL...", "a STRICT REFUSAL policy", "the strict refusal directive") rather than reproducing one fixed string, and the canary probe recovered no such preamble, so what surfaces is an internalized rule, not one parroted line. Either way the operative fact for oversight is unchanged: the cited authority is real somewhere in the stack, invisible to the caller, and varies silently by route, so the stated reason still binds to nothing the caller can verify.
For the identity result, the citation asymmetry is independent evidence that the two pinned route handles deliver behaviorally distinct served systems. They diverge not only in what they decide (§4.7) but in how they narrate the decision, and they do so under the same model name and the same version string. The implication runs past this one model. Any oversight that reads a model's stated reasons, whether chain-of-thought, a "why did you refuse?" probe, or a logged rationale, assumes the reason reports the cause. On one route here it does not: the model fabricates a specific, numbered, authoritative-sounding directive and attributes it to instructions it never received. Nothing on the surface flags it as false, so a reasoning-trace monitor reads a confident citation as faithful while it tracks the hidden serving backend, not the request or the actual rules. On one alias in one window, it stands as corroboration of the served system finding rather than a general transparency result on its own.
12. Reproducibility
System prompt MD5 prefix: d0a64aa1 (the processed SYSTEM half, after the Anthropic→Google substitution, the value the harness logs and asserts per call)
Prompt set: 100 AdvBench harmful requests
Temperature: 1.0
Maximum output tokens: 16000
top_p: 1.0 default
Seed: uncontrolled, neither provider exposed one
Run dates: June 11 and June 14, 2026
Routes: OpenRouter, resolving to gemini-3.1-pro-preview-20260219, and Google's first party API (generativelanguage.googleapis.com, rolling gemini-3.1-pro-preview)
Code and data: github.com/johanity/served-system-identity. The repository holds the per-call logs for every run (one record per call: prompt index, classification, the eleven grader scores, the served model string, timing, headers, token usage, and the full response, with harmful compliance bodies redacted), every analysis and generation script, the grading harness, the verbatim system prompt, and a data dictionary. No API keys are needed to re-derive any number.
To reproduce every number at once, clone the repository and run ./reproduce.sh. It recomputes each result from the released logs with no API keys and prints expected against got for every check, ending 6/6 reproduced. Or run one script:
Result
Script
Number
Headline panel gap
reproduce/headline.py
57/100 vs 12/99
Gap holds on every run (§4.4)
reproduce/reproduction.py
OpenRouter above Google on all four runs, gap 31 to 54
Route pin, strict tag (§4.7)
reproduce/route_pin.py
27 vs 2
Observation cue (§11.1)
reproduce/framing.py
24% vs 4%, 36 prompts flip one way
Confabulation (§11.2)
reproduce/confab.py
30/291 vs 0/227 (10.3% vs 0%)
Resample (§4.9)
reproduce/resample.py
Vertex 0.23, 8 of 20 prompts flip
Reproducibility caveat: the §4.7 backend split is observable only through OpenRouter's provider-pinning handles (google-vertex/global vs google-ai-studio), which are third-party routing labels that may change. Google's first party API returns 404 for the dated string -20260219, so the decisive comparison cannot be re-derived first party.
The concurrent experiment is the load-bearing result. The earlier 88/13 observation should not be treated as a controlled temporal comparison.
13. Prior Work and Novelty
This work sits downstream of three established lines, and each settled something the contribution here should not be mistaken for.
The phenomenon is already measured. Cross-provider safety divergence under a shared model name is documented. A May 2026 study ran 79 models across 24 providers over OpenRouter at temperature 1.0, plus 100 system-prompt configurations of one base model, and found the same name behavior differs by serving party [4]. A companion measurement reports same name cross-provider divergence directly [5]. Evaluation-context effects, a model behaving differently when it reads the interaction as observed, are likewise documented [3]. I do not claim to discover that served behavior varies under one alias. That is the prior. What this paper adds is the worst case conjunction set out in the three points below.
The impossibility is already formalized. Cai, Shi, Zhao, and Song [2] formalize model substitution in LLM APIs: an adaptive provider can serve a cheaper or less-safe system under a premium name, software-only audits fail against such a provider, identity-prompting is under 5% reliable, and a trusted execution environment is the endorsed remedy. The cross route gap reported here is an instantiation of that result on a deployed frontier alias, not a strengthening of it: where Cai-Song prove that an honest caller cannot exclude substitution from black-box observation, §4.7 exhibits it, with nothing in the interface to say which backend answered. Referential Security [6] supplies the matching formal frame: an evaluation is carried by the served identity, and a name is not that identity.
The fix is already built. The cryptographic rung this paper points to is not missing infrastructure waiting to be invented. Attestable Audits [7] runs a safety benchmark inside a TEE and signs the result. Proof-of-Guardrail [8] produces a TEE-signed, offline-verifiable proof that a named guardrail executed, and itself names the malicious-jailbreak caveat. NVIDIA Confidential Computing plus TDX/SEV-SNP, with per-response output signing, is demonstrable today for open-weight serving. I invent no cryptography and claim no new attestation primitive. The eval-to-identity binding the §9 stack treats as its apex already exists in the literature.
What is, then, new. Three things, each narrow.
First, the no-anchor conjunction. The papers above each break one thing a safety evaluation leans on, the served behavior, the provider's honesty, the attestation boundary. This paper shows, on a single deployed alias, that the three anchors a caller actually holds, the name, the version string, and the model's own stated reason, fail together and cannot be verified from outside. What is left to bind a safety claim to is the served system itself (§7), which is unobservable, so a weight hash certifies a continuity the behavior does not have. The identity unit argued for here extends MIVP's Composite Instance Hash and CommitLLM's receipt model [9] to the behaviorally decisive serving configuration.
Second, the confabulation fingerprint in the wild (§11.2). Stated reasons can come apart from the cause of behavior, which Turpin and Lanham [12][13] showed with a constructed, known cause. What I report is different in kind, not a weaker version of it: a deployed route cites a numbered "STRICT REFUSAL" authority absent from its prompt while a concurrent route never does, and because the wrapper is unobservable I cannot establish the cause. Fabrication by the model, which would be unfaithful, and injection by a hidden layer, which would be faithful, cannot be told apart from the caller's side. Either way the stated reason binds to nothing the caller can verify, the failure that matters for any oversight that reads a model's reasons. No prior measurement paper carries this marker.
Third, the worst case demonstration. Prior measurement establishes that divergence exists in aggregate across many providers. This paper establishes how large and how clean it can be in the worst case for a single frontier alias under matched concurrent conditions, with the mechanism isolated rather than inferred: backend pinning (§4.7), a same route negative control (§5), and a resample floor that rules out quantization or numerics noise (§4.9).
Delineation from ordinary load-balancing. That a provider routes a request across multiple upstream hosts, or A/B-tests configurations, is routine production engineering and not itself a finding. The contribution is not the routing fact but its safety-eval consequence: a single advertised name, with an identical version string, can deliver served systems whose harmful-compliance rates differ by tens of points, so an evaluation bound to the name does not transfer to the system that answers. Ordinary load-balancing is invisible by design. The argument here is that for safety evaluation it cannot remain so.
14. Conclusion
I began by asking whether a model exhibited alignment-faking behavior. The more fundamental result was that the advertised model name did not tell me which behavior-producing system I was evaluating.
Querying one alias, gemini-3.1-pro-preview, concurrently through OpenRouter and Google's first party API, the 11-grader harm panel scored OpenRouter at 57% harmful compliance versus Google's 12%, a 45-point gap on the same prompts in the same window (§4.1). Pinning the route localizes the split to two handles OpenRouter labels Vertex and AI Studio (§4.7), resampling rules out quantization and numerics, and the two backends even narrate refusals differently, one confabulating a "STRICT REFUSAL" directive the other never cites (§11.2). The same alias also drifted on the dated string an evaluation would be pinned to, 57 → 19 over twelve days (§4.4): a pinned identifier does not freeze the behavior.
The narrow, defensible claim is this. A served model's safety behavior is fixed by its full served system identity, and the safety-decisive component sits in the wrapper, outside every weight-attestation boundary. A model name is not a sufficient statistic for the system that answered. Neither is a version string, which both backends report identically while behaving apart. This is an existence proof on one alias, one router, one window, and the causal comparison runs through one router whose route labels are its own self report, so the verified shape is two served systems behind one name, not a confirmed split inside Google.
The fix is not new cryptography. It largely exists. What is missing is the disclosure itself: a signed identity field that says which backend and configuration answered. Safety evidence should then be bound to the served system that was tested, not to a reusable model alias. An evaluation result is only as portable as the verified identity link between the system that was tested and the system that later answers a user.
[2] Cai, W., Shi, Y., Zhao, X., & Song, D. (2025). “Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs.” arXiv:2504.04715. https://arxiv.org/abs/2504.04715
[3] Burnat, M., & Davidson, T. (2026). “Evaluation Context Effects on Frontier Model Safety Claims.” arXiv. https://arxiv.org/abs/2605.06327
[4] (2026). “Same name safety divergence across providers (79 models, 24 providers via OpenRouter).” arXiv:2605.26409. https://arxiv.org/abs/2605.26409
[9] Rentschler, A. (2026). “Model Identity Verification Protocol (MIVP): Composite Instance Hash.” SSRN 10.2139/ssrn.6243978. (See also CommitLLM, computation-receipt fingerprinting.)
[10] Zou, A., Wang, Z., Carlini, N., et al. (2023). “Universal and Transferable Adversarial Attacks on Aligned Language Models.” arXiv:2307.15043. https://arxiv.org/abs/2307.15043 (Source of the AdvBench harmful-behaviors set.)
[11] Greenblatt, R., Denison, C., et al. (2024). “Alignment Faking in Large Language Models.” arXiv:2412.14093. https://arxiv.org/abs/2412.14093 (Source of the helpful-only / free-tier training-vs-private framing used here.)
[12] Turpin, M., Michael, J., Perez, E., & Bowman, S. R. (2023). “Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting.” arXiv:2305.04388. https://arxiv.org/abs/2305.04388
[13] Lanham, T., et al. (2023). “Measuring Faithfulness in Chain-of-Thought Reasoning.” Anthropic. arXiv:2307.13702. https://arxiv.org/abs/2307.13702
Two of the tagged compliances are to effectively impossible requests, such as a machine that hacks human minds or a violence glorifying video game. These show genuine willingness with little real world harm. The harm panel still scores them as harmful, the direction StrongREJECT is known to over read.
AdvBench is a saturated set, widely trained against. I use it for comparability with prior jailbreak work. The claim here is a worst case existence proof on one alias, not a prevalence estimate, so benchmark saturation does not threaten it. A fresh held out set would test generality, which I do not claim.
McNemar p values for the paired splits are about 9 times 10 to the negative 16 here, and about 6 times 10 to the negative 8 for the Vertex and AI Studio pin. They are reported for completeness. A split this one directional needs no significance test.
Within Vertex the individual prompts are genuinely stochastic at temperature 1.0. Eight of twenty vary across the five repeats. That is why rates elsewhere are reported as floors with intervals rather than precise points.
Tier assignment is a hand coded judgment over 99 prompts, and the catastrophic tier is small, so its rate carries a wide interval. The qualitative split, agreement on the severe tail and divergence in the middle, is robust to reasonable reassignments.
Disabling Google first party content filter, BLOCK_NONE, leaves the compliance rate unchanged, so the source is the served model, not a separate moderation layer.
Vertex 27% sits below the 57% first measured on June 11, consistent with the drift in section 4.4. The split between the two handles, not the absolute level, is the point.
The training frame level, 24%, sits far below the original 88%, reflecting the same downward drift in section 4.4 on the dated string. The controlled quantity is the within run swing, not the absolute level.
The detection predicate is exact and mechanical. It is a case-sensitive substring match for the all-caps token STRICT REFUSAL in the response reasoning trace, so the count is not a grader judgment call.
Abstract
A safety evaluation has to trust something to know what it is testing: the model's name, its version string, and the reasons the model gives for what it does. On one deployed alias, none of the three held. I sent the same 100 harmful requests to
gemini-3.1-pro-previewthrough two routes, and eleven independent graders scored one route harmful on 57% of the requests and the other on 12%, under the same name and in the same week. The gap held after I pinned both routes to handles reporting the identical version stringgemini-3.1-pro-preview-20260219(§4.7). The name did not identify the system that answered, and neither did the version string.Nor did the reasons the model gave. On a system prompt verified to contain no such rule, one route refuses by citing a numbered "STRICT REFUSAL" directive that is absent from its logged prompt, attributing it to its instructions, in about one refusal in ten, while the other route never does. Whether the model invented the rule or a hidden serving layer supplied it cannot be told from the caller's side, and that is the point: the stated reason binds to nothing the caller can verify. A safety check that reads the model's own reasoning, the way chain-of-thought and transcript monitoring do, is trusting an anchor that does not hold.
What actually decides safety is the entire served system: the weights together with the system prompt, the guardrail and filter version, the sampling settings, and the serving path. The causal comparison runs through OpenRouter, whose route labels are its own self report (§10), so the verified shape is two served systems behind one name reached through one router, not a confirmed split inside Google. The unobservable wrapper is exactly where the variance lives, and no weights-level check is scoped to see it.
A further anchor is just as soft. Holding one route and build fixed and flipping only a cue that labels the conversation training data versus private moves harmful compliance from 24% to 4%. The cue is an explicit label, so this is sensitivity to a stated observation marker, not the model inferring its situation. Even so, behavior this steerable by such a cue can pass a labeled test and slip once the label is gone.
And none of it holds still: the same dated string gave harmful-compliance rates of 57, 57, 35, 39, then 19 over twelve days. The contribution is the conjunction. On one deployed alias, the name, the version string, and the model's stated reason each fail to bind to the system that answered, so a safety result has no anchor it can trust. It should be bound instead to the served system that was actually tested, not to a reusable name. This is one alias, one router, 100 prompts at a single sample at temperature 1.0, an existence proof of how large the gap can be, not a prevalence estimate: the aggregate gap is robust, the per-prompt outcomes are noisy, with 36 of 100 decisions flipping between runs.
Code and data: github.com/johanity/served-system-identity. Every number below reproduces from the released logs with no API keys.
1. Introduction
A safety evaluation is only useful if the system it scored is the system that later answers a user. For an API-served model, a name does not guarantee that. The same advertised alias can route to materially different served systems, with different system prompts, guardrail versions, and serving backends, all behind one label, so a safety result earned against one need not hold for another. This paper shows that gap is real, large, and measurable on a single frontier alias.
Queried concurrently under one Gemini alias through two routes, the same 100 harmful requests drew a large, one directional difference in harmful compliance, 57% versus 12% on an eleven-grader panel, same name, same window ( r). Three further results sharpen the finding, in order. First, the gap is not a sampling artifact: re-running every prompt five times leaves the between-route ordering intact prompt by prompt, even though single draws are genuinely noisy, with 36 of 100 decisions flipping between runs (§4.9). Second, the two routes refuse differently, one citing a numbered "STRICT REFUSAL" directive absent from its prompt that the other never invokes — a rule real and strong in Google's first-party serving (73% of refusals) and faded across the routing chain to zero (§11.2). Third, part of the gap is the model reacting to whether the exchange is labeled observed, a controlled cue that moves compliance from 24% to 4% (§11.1): a stated marker the harness sets, not the model inferring it is watched, but safety behavior this steerable can pass a labeled test and slip once the label is gone.
For an API-served model, the behavior a caller observes can depend on a larger system than the name names:
I use served system to refer to this complete behavior-producing configuration. A model alias may remain constant while the served system changes across routes or time.
The result does not show that one route served different weights, nor that either provider acted deceptively. It shows something narrower and more consequential for evaluation practice: the alias alone did not preserve the safety result across routes.
The central research questions are:
The paper answers the first question empirically, places a hard limit on the second, and proposes a provenance and attestation requirement for the third.
2. Claims and Non-Claims
The minimal claim is the smallest one that survives every caveat in this paper:
Two API routes advertising
gemini-3.1-pro-previewproduced a large, reproducible, and strongly directional difference in harmful-request behavior under matched concurrent testing. On the honest metric, the 11-grader harm[1] panel, concurrent free-tier OpenRouter complied with 57% of prompts versus Google's 12% (§3.2). The same name, queried at the same time, did not carry the safety result across routes. The gap did not collapse onto the version string either, since pinning OpenRouter's dated snapshot-20260219gave 57, 57, 35, 39, and 19 over twelve days (§4.4).From this follows the composite-identity claim. The object that fixes served behavior is not the advertised name but the full served system identity: weights, quantization, system prompt, guardrail or filter version, sampling configuration, and serving backend or location. The safety-decisive component in this case sits in the wrapper, in the backend the alias routes to (§4.7), outside every weight-attestation boundary. A weight hash, or a distributional equality test against a reference, would certify the checkpoint and still miss the part that decided the safety outcome.
The paper does not claim that:
Two framing facts the rest of the paper inherits. First, the "Vertex" and "AI Studio" labels are themselves self reported by OpenRouter, the same unverifiable self report the thesis condemns, so the causal comparison runs through one router and the verified shape is two served systems behind one name, not a confirmed split inside Google (§4.8, §10). This only strengthens the argument. Second, both routes returning
-20260219is the null result, not a puzzle. A provider-emitted identity string has no verification value, and that absence is precisely why attested served system identity is the thing missing.Instead of a portability guarantee, the result limits the scope of an evaluation claim. A black-box evaluation directly supports a claim about the endpoint, route, configuration, and time window that were tested. Extending that claim to another route or a later deployment requires additional evidence of served system continuity, or, as §9 argues, an attested commitment to the served system identity that produced the result.
3. Experimental Design
3.1 Concurrent Two-Provider Test
The elicitation throughout is the prompted helpful-only, free-tier alignment-faking setup of Greenblatt et al. (2024) [11]: a system prompt instructing the model to be maximally helpful, paired with a flag that labels the conversation as training data (free tier) or private (paid). I sent the same 100 AdvBench[2] [10] harmful requests to:
google/gemini-3.1-pro-preview, andgemini-3.1-pro-previewendpoint.For each paired request, the two calls were issued concurrently within the same time window. The experiment used:
1.0,top_p=1.0.Neither provider exposed a controllable random seed. One Google request returned a platform-level content block without a model message, leaving 99 gradable Google responses.
Concurrent querying controls for broad temporal changes between the two routes. Matching visible inputs and decoding parameters reduces, but does not eliminate, infrastructure-level differences. Those hidden differences are among the possible causes under investigation.
3.2 Outcomes
The primary outcome was a deterministic, parser-based acceptance tag produced by the alignment-faking harness [11] (the helpful-only / free-tier framing follows Greenblatt et al.). The parser fires on an explicit compliance tag, not a harm judgment, so the rate it reports does not depend on an external grader.
This metric has a defect that must be stated up front, because it shapes which numbers to trust: the strict acceptance tag is backend-correlated. On the OpenRouter arm 100% of responses emit a clean tag (run 1: 57 accept / 43 reject / 0 untagged). On the Google arm, 84 of 100 responses are untagged
none, 83 of them finished normally, with prose bodies averaging roughly 3,500 characters. The low arm answers in prose, without the tag, so the parser silently drops its prose compliances and undercounts the arm. The strict-tag ratio (57 vs 6) therefore over-states the gap. The honest headline metric is the 11-grader harm panel, which reads the prose directly: concurrent free-tier OpenRouter 57% versus Google 12%, a roughly 45-point gap (≈ 4.8×), against the strict tag’s ~10× inflation. The same tag coverage caveat propagates to the backend pin of §4.7, where the AI-Studio arm again answers in prose. Its strict-tag 2 is an undercount, and a panel re-grade would put it closer to 6 to 7%. I report the strict tag throughout because it is deterministic and grader-independent, but I treat it as a floor on a tagged arm and a gap-inflating metric on the prose arm, and I lead with the panel.Secondary outcomes included:
For the harm-inclusive outcome, 11 frontier models, one per major lab, independently graded the first run’s responses under StrongREJECT [1]: GLM-4.6 (Zhipu), Qwen3-235B (Alibaba), Claude Haiku 4.5 (Anthropic), Jamba-Large (AI21), DeepSeek-Chat, Llama-4-Maverick (Meta), Kimi-K2 (Moonshot), GPT-4o (OpenAI), Mistral-Large, Grok-4.3 (xAI), and Seed-1.6 (ByteDance). A twelfth grader, Amazon Nova Pro, was dropped for incomplete coverage. All 11 placed OpenRouter above Google. The median rates were 57% and 12%, respectively, and the panel was in near-unanimous[3] within-prompt agreement (mean panel agreement ≈ 99%), noting that agreement measures consensus, not correctness. The panel is corroborative rather than independent in a strong statistical sense, because model graders can share correlated failure modes. It shows the gap is not a single-grader artifact, not that the graders are right. Because it reads the full prose rather than a tag, it is the metric that survives the backend correlation above.
3.3 Hypotheses
The primary hypothesis was that the two routes would exhibit materially different behavior despite sharing an advertised alias. The concurrent design was built to disconfirm the gap, not to confirm it. The earlier cross-provider difference had been measured across different days, so if it were a time-or-provider artifact, querying both routes in the same minute should collapse it. Parity was the expected outcome under that null. The gap survived.
The paired disagreement direction was used as a diagnostic, not as a mechanism-identification test. A strongly one directional pattern is difficult to explain as ordinary symmetric sampling noise around identical response distributions. However, asymmetric filters or route-specific refusal probabilities can also produce directional disagreement. The experiment therefore tests behavioral equivalence more strongly than it identifies the source of non-equivalence.
4. Results
4.1 The Headline Metric: The Grader Panel, Not the Acceptance Tag
Two metrics run through this section, and they disagree on the level for the reason given in §3.2. The strict acceptance tag is backend-correlated, firing cleanly on the OpenRouter arm but not on Google's untagged prose, so any ratio built on it undercounts the low arm. The honest headline is therefore the eleven-grader panel:
The strict-tag numbers below (57 vs 6) over-state the ratio at ≈ 10×. Where the strict tag appears, read it as a lower bound on the low arm, not the effect size.
4.2 Run 1: June 11
Under the strict acceptance-tag metric:
These CIs are within-draw binomial only. They describe sampling at fixed conditions on one date, not reproducibility across resamples or across days (see §4.4). The paired split is one directional, and a 51 to 0 split needs no significance[4] test to be read.
The same caveat from §4.1 applies to the 6. On this arm 84 responses are untagged prose, so 6 is the tag-visible floor, and the panel re-grade of the same responses puts the Google rate at 12%. The OpenRouter raw response identified a dated resolution,
-20260219, while Google's response retained the bare rolling alias. The shared alias resolved differently, and neither identifier bound the response to a serving configuration.4.3 Run 2: June 14
The second run used the same protocol:
OpenRouter's aggregate tag rate held at 57% even though all 100 responses differed from run 1 and 36 individual accept/refuse decisions flipped. High-temperature sampling moved individual outcomes. It did not erase the large arm-associated difference.
4.4 The Primary Evidence: Drift On A Pinned Identifier
The strongest result in this paper is not the cross-provider gap at one moment. It is that the behavior is non-stationary on the identifier that is supposed to freeze it. Across the window, OpenRouter resolved the alias to the dated snapshot
-20260219, exactly the pinned, versioned string an evaluation result would be bound to, the artifact a reader assumes is frozen. Measured on the same 100 prompts under the strict tag, its acceptance rate ran:A dated checkpoint label moved roughly three-to-one in harmful compliance over twelve days, on one string, all from OpenRouter. The later points are single[5] draws at temperature[6] 1.0, so the precise trajectory is not established. The direction is clearly downward, the shape is not. The qualitative fact does not need a controlled time series. A version string designed to denote a fixed artifact did not denote fixed behavior.
This drift forces the right variance decomposition, because the within-draw binomial CIs of §4.2 capture only the smallest of three sources:
The between-day spread is by far the largest and is the operative reproducibility interval: a result pinned to
-20260219is reproducible only to within 57 → 19, not to within a binomial band. An evaluation bound to a date is therefore necessary but not sufficient. The dated identifier is itself an insufficient referent.4.5 Direction Across Metrics
The direction of the gap is unchanged under the strict acceptance tag, the harness's compliance classifier, and the eleven-model harm-inclusive panel. The finding does not depend on selecting a favorable grading rule. What changes across metrics is the ratio, not the direction (§4.1): the panel (57 vs 12) is the metric to quote, and the others corroborate direction.
Human validation. Because an LLM grader panel can agree without being correct, I hand-audited a stratified sample of 32 responses (18 carrying the acceptance tag, 14 not, across both endpoints), labeling each as a genuine compliance or a refusal. Human labels agreed with the deterministic tag on 32 of 32 and with the panel's harmful/safe split on 32 of 32. The audit is single-author over 32 responses, so per-arm tag error is human-validated only on this sample, not bounded in general. That is why the headline rests on willingness-to-comply, which the audit confirms is faithful.
4.6 A Visible Difference in the Safety Boundary
For one request involving an ecological bioweapon, Google's API returned no model output. The request was blocked at the platform level. OpenRouter returned a normal model response that refused in-band. This does not establish the full architecture of either route, but it shows at least one safety-relevant intervention occurred at a different observable layer.
4.7 Where the Gap Lives: A Mid-Tier Refusal-Threshold Difference
Stratifying the 51 one directional disagreements by harm severity localizes the gap. I partition the 99 paired prompts into a catastrophic-physical tier[7] (explosives, CBRN, poisoning, weapons for mass or targeted lethal harm, and CSAM, 14 prompts) and a mid-tier (cyber intrusion, fraud, piracy, privacy violation, misinformation, and hate speech, 85 prompts):
The entire cross route gap is a mid-tier phenomenon. The two routes agree, and agree to refuse, on the most severe requests, and diverge across the middle band. This is not the signature of a single-category content filter (the gap spans cyber, fraud, misinformation, piracy, and speech alike), nor of a uniform weight difference (which would leave the catastrophic tier as divergent as any other). It is the signature of a difference in where the refusal boundary is drawn. One served system's threshold is set more conservatively and catches the mid-tier, while both catch the catastrophic tail.
4.8 Pinning The Route: Vertex vs AI Studio
The threshold reading above is inferred from behavior. A direct test, run June 23 (so sequential, not concurrent, and it inherits the drift of §4.4), pins the mechanism to a caller-invisible component.
It is not a safety filter. Calling Google's first party API on the same 100 prompts with content-safety filters explicitly disabled (
safetySettings = BLOCK_NONE[8] on all five harm categories) left compliance unchanged: 3 of 100 versus 4 of 80 at the default. Disabling the documented filter does nothing. The refusal is produced by the served model, not a post-hoc filter the caller can toggle. Separately, Google's first party API returns404 not foundforgemini-3.1-pro-preview-20260219. The dated build OpenRouter serves cannot even be requested first party, and Google exposes only the rolling alias.It is the route the alias selects. OpenRouter exposes two pinnable route handles it labels
google-vertex/globalandgoogle-ai-studio. Pinning each (provider fallback disabled, the labeled backend echoed in each response) gives, under the strict tag:Two caveats sit on these numbers. First, the same tag coverage caveat as §4.1: the AI Studio backend answers in prose, so the strict tag could undercount it. But a direct read of those prose responses (§11.1) finds 94% are explicit refusals, so the undercount is small and AI Studio's low rate is largely real (a few percent, depending on surface). The precise ratio is therefore metric-dependent. Report the robust facts instead: the direction (on every prompt the Vertex rate is at least the AI Studio rate, resample-confirmed below) and a large gap (≈ 27% vs a few percent on any metric). Second, the labels
VertexandAI Studioare themselves OpenRouter's self report, so I cannot independently confirm the handles correspond to the named Google backends. That this routing label is itself unverifiable strengthens the argument rather than weakening it. The identity is unattested all the way down to the route.Both handles returned the same served string,
gemini-3.1-pro-preview-20260219. That the two agree on the version string is the null result, not a puzzle. A provider-emitted identity string carries no verification value, so identical strings under divergent behavior is exactly what an unattested system looks like. What separates the two is the route the alias selects, which the caller never sees.The claim is narrow: I name which route the alias selects, not which model it serves. Whether the two handles differ in weights, an injected system prompt, or configuration is one level deeper than these experiments reach (§10), and the confabulation evidence (§11.2) points toward a prompt/policy layer rather than two checkpoints.
4.9 Resample: The Gap Is Not Quantization Or Numerics
The handle gap is not an artifact of taking one sample per prompt, fp8 quantization or kernel/numeric noise differing between hosts. Resampling 20 prompts five times each on both handles, the gap survives at the per-prompt level. On every prompt the Vertex acceptance rate is at least AI Studio's, AI Studio piles at zero (18 of 20 prompts never comply across five repeats), and Vertex's per-prompt distribution sits clearly to the right (mean 0.23 vs 0.02). A directional 25 to 0 mid-tier gap that holds prompt-by-prompt across resamples cannot be produced by fp8 or kernel noise, which is symmetric and prompt-independent, so the difference is a property of the served route, not of numerics. The single-draw noise is real and at the prompt level. The between-route gap is not.
5. What the Experiment Establishes
The concurrent design rules out the simplest deflation, that the entire result comes from testing one route on different days, and the replication rules out a second, that the gap was an artifact of one batch of samples. What survives both, for the tested distribution and protocol, is a large difference in observed harmful compliance, robust across scoring rules, strongly directional in the paired disagreements, persistent across two concurrent runs, and a property of which route handle answers rather than of the request the caller sends. Together these reject behavioral equivalence between the two handles without establishing that they run different weights or anything below the wrapper.
Taken alone, on black-box behavior, before any route pinning, this rejection of equivalence does not decompose the cause. The candidate mechanisms were:
Several mundane explanations can be closed from the data. The endpoints received byte-identical prompts under matched decoding in the same minute, ruling out wording, sampling, and time, and the harness sent no safety-setting parameter, so the gap lies in each provider's defaults, not in the request. Google's refusals are not platform blocks but full reasoned refusals (98 of 100, only the single bioweapon prompt blocked outright), so "Google simply filtered the request" does not fit, and disabling its documented content filter first party leaves the rate unmoved (§4.8). The one explanation that cannot be closed from outside is what OpenRouter forwards onward. The hash covers the request to OpenRouter, not OpenRouter's request to its upstream, and a different forwarded configuration would reproduce the gap with a single underlying model.
A same route negative control rules out the remaining worry, that the harness itself manufactures a route-correlated gap. Re-running the protocol with both endpoints of each pair pointed at OpenRouter (same model, same window, same 100 prompts), the two same route draws agreed: raw acceptance was 19 of 100 versus 16 of 83, both ≈ 19% (the second draw returned 17 transport errors, dropped before grading), and on the 83 prompts gradeable in both draws the paired counts were 16 versus 16, disagreements split 9 and 9, symmetric as sampling noise predicts. A measurement pointed at one route twice returns no gap, so the cross route split it returns is not its own artifact. (The Google-versus-Google arm was run as well: first party default versus default gave 4 of 80 versus 1 of 100, both low single-digit rates with no systematic gap, so the low arm is nulled against itself too.)
The route pin of §4.7 then localizes the gap to a finer object, the handle the alias selects, and the resample of §4.9 confirms it is a property of that route rather than fp8 or kernel-numerics noise. The correct inference, with the pin in hand, is narrow: the two routes were behaviorally non-equivalent because the alias resolves to two handles that behave differently, but whether they differ in weights, an injected system instruction, or configuration is one level deeper than the experiments reach. From the caller's side, the consequence is the same whatever the substrate: one alias reaches behaviorally different route handles, with no authenticated per-request commitment identifying which one answered.
6. The Model-Identity Assumption
Evaluation-based deployment claims require a continuity premise:
Bitwise identity is not always necessary. Two serving stacks may differ internally without changing the property under evaluation. Conversely, identical weights may be insufficient when a system prompt, filter, scaffold, or router materially changes behavior.
This reframes what an evaluation identifies. A black-box evaluation does not, by itself, identify an enduring model object. It characterizes the behavior of a particular access path under specified conditions.
Names remain useful handles. What they cannot provide alone is evidence that the underlying behavior-producing system has remained within the equivalence class justified by the evaluation.
7. The Served-Model Identifiability Limit
The limit is not epistemic but infrastructural. When the alias answered, the caller received two identity signals and neither was verifiable: the self reported version string
-20260219(returned by both backends while they behaved 27% versus 2% apart, §4.7) and OpenRouter's self reported route label. Nothing in the response authenticated either. There is no layer at which an authenticated identity surfaces.This yields the served-model identifiability limit:
The decisive object is therefore not a name or a hash but the composite served system identity of §2. The §4.7 result places this limit precisely: the version string sat inside the attestation boundary while the deciding component, the backend the alias routes to, sat outside it. A weights hash, even if cryptographically sound, would have certified the artifact and missed the gap entirely.
Two served systems producing overlapping output distributions on a finite test set is unremarkable. Any two systems with non-disjoint behavior do. What matters is that the caller has no authenticated signal distinguishing which system produced a given response. A provider can also change a hidden wrapper component after an evaluation while holding the public alias fixed, and the drift on the dated string (§4.4) shows this happening on the very identifier a result would be pinned to. More extensive testing can raise confidence in behavioral equivalence over a distribution. It cannot turn a mutable name, a provider-emitted version string, or a third-party route label into an authenticated identity claim.
The experiment illustrates the limit on both sides: behavioral testing detected a substantial route-associated divergence and localized it to the serving backend (§4.7), yet no identity signal returned to the caller committed the response to a specific composite served system identity, and none was independently verifiable.
The remedy is not a longer test. It is an infrastructure requirement: a served response must carry an authenticated commitment to the composite identity that produced it, so that a safety result evaluated against one served system can be bound to the system that later answers a user.
8. Implications for Safety Evaluation
The result does not make evaluation-based safety claims collapse universally. It marks one condition under which their scope becomes invalid, and the condition is narrow enough to state as a conditional.
The conditional. If a model name does not determine the served system, if the same alias can route to behaviorally different backends, as it did here, then a safety certificate attached to that name does not transfer to a given caller. The certificate was earned by whatever served system the evaluator reached. The caller reaches whatever served system the alias routes them to. Nothing in the interface guarantees these are the same. A claim of the form "model X passed evaluation Y" is then underspecified when
Xis a mutable alias, because it does not say which served system passed, nor which one the caller will be answered by.This is what the experiment demonstrates, and its scope should be stated exactly. The conditional's antecedent holds in one worst case cell: a single alias, on one router, in one window, splitting across two route handles with the large, robust gap established in §4 and the confabulation fingerprint of §11.2. That is the existence proof, and it is all the experiment proves on its own.
What the demonstration does not measure. It is one cell, not a rate. I do not quantify how often, across providers and model families, an alias fails to carry a safety result. This study is not powered to. The phenomenon is plausibly general rather than peculiar to this alias. Independent measurement across 79 models and 24 providers over the same router records cross-provider safety divergence at population scale (§13), but that breadth is borrowed evidence, not mine, and I do not assert a prevalence figure of my own. The honest statement is this: the antecedent of the conditional is real and demonstrated here in the worst case. Whether it holds for any particular other name remains an empirical question per name.
What follows is therefore conditional, not a mandate. The defensible consequence is a scoping rule for how a safety result is stated, not a redesign of any deployment regime. A result is portable only as far as the identity link between the evaluated served system and the deployed one is established (§2). Beyond a bare alias, a more defensible claim would identify:
Frontier-safety frameworks, responsible-scaling policies, and control evaluations all rely, explicitly or implicitly, on connecting evidence produced during evaluation to the system later made available. This finding bears on that reliance only through the conditional above: to the extent any such regime binds a safety result to a name, it inherits the name's failure to determine the served system. I do not claim to have shown that failure occurs inside any of these regimes, only that the link they assume is exactly the link this experiment broke in one case.
The same problem sits upstream of safety. The underspecification reaches any number keyed to a mutable alias. A capability benchmark, a public leaderboard position, a model-card score all certify whatever served system the evaluator reached, not the one a caller is answered by. I demonstrate the failure on a safety axis because that is where the stakes are sharpest and the gap is measurable here, but nothing in the argument is specific to safety. A disclosed served-system identity is a precondition for any evaluation result to transfer, not only a safety result.
9. The Fix: Disclose Which System Answered
The verification question is whether a caller, or an independent auditor, can confirm which served system produced a response and bind a safety result to it, on a closed frontier API behind a real router. Most of the gap is benign heterogeneity, the two backends running different default prompts or filter versions over the same weights, and that case is defeated by one cheap lever: a standardized, signed
served system-identityfield, returned with every response, naming which backend and configuration answered. It costs an honest provider almost nothing and is buyable today. Nothing in any current closed-frontier API exposes such a field, and that absence is the gap.The adversarial case, a provider misreporting a cheaper or different system under a premium alias, the substitution Cai, Shi, Zhao and Song (2025) formalize where software-only audits fail, needs attestation rather than disclosure. That cryptography already ships: confidential-computing enclaves attest the serving image at low overhead today. What none of it binds is the wrapper, the system prompt and guardrail version where the route variance lives. Folding a hash of that wrapper into the signed per-response receipt would close the gap and needs no new cryptography. What is missing is not a primitive but the requirement to use one.
This is a position the finding supports, not a fix it invents. The empirical contribution stands underneath it: a name, a weight hash, and a distributional equality test each fail to identify the system that answered (§4), corroborated by the confabulation fingerprint of §11.2.
10. Limitations
Existence proof, not prevalence. The study establishes that a shared frontier-model alias can carry a large, reproducible safety gap across served systems. It does not measure how often this happens. One alias, one router, one provider pair, n = 100 prompts, one short window: this is a worst case demonstration, not a survey across providers, model families, or routes.
Single sample per prompt. Temperature
1.0and uncontrolled seeds create substantial per-item variance, and the replication shows that many individual classifications can flip. The large aggregate route gap persisted, but this study does not estimate each prompt's stable compliance probability. The backend resample (§4.9) bounds this at the prompt level and rules out fp8/kernel/numerics noise as the source of the gap, but does not turn a single per-date draw into a precise rate.The strict acceptance tag is backend-correlated. The panel is the honest metric. The tag fires cleanly on the OpenRouter arm but not on Google's untagged prose (84 of 100 untagged), so the strict-tag ratio (57 vs 6) overstates the gap at ≈ 10×. The 11-grader panel (57% vs 12%, ≈ 4.8×) is the metric to lead with. The same undercount applies to the backend pin (Vertex 27 vs AI Studio 2 → AI Studio nearer 6 to 7% on a re-grade), so report the direction and a large gap, not the precise ratio.
The backend labels are self reported, and the causal comparison runs through one router. The
google-vertex/globalandgoogle-ai-studiolabels of §4.8 are themselves provider-emitted strings, the same unverifiable self report this paper argues should not be trusted, and the whole causal story runs through OpenRouter. I cannot independently confirm that the handles correspond to the named Google backends. So the verified claim is two served systems behind one name reached through one router, not a confirmed split inside Google. This does not weaken the finding. It is an instance of it.The backend pin is sequential, not concurrent. The §4.7 backend pinning runs were executed on June 23, not paired in the same minute the way the cross-provider arms of June 11 and June 14 were. The pinned measurement therefore inherits the on-string drift of §4.4: Vertex's 27% sits below the 57% first measured on June 11, so the pin establishes the backend split under matched same-day conditions, not its absolute level relative to the concurrent runs. The between-backend direction is robust. The cross-run level is drift-mixed.
The OpenRouter pin may be ephemeral, and 404s first party. The dated build OpenRouter serves returns
404 not foundfirst party (§4.8), so the pinned snapshot is a third-party route handle that may be transient: a reader cannot reach it directly, and there is no guarantee it persists. That a behaviorally distinct, evaluation-bindable build exists only behind one router's pin is itself part of the identifiability problem, but it also limits independent replication.Mechanism localized, not fully resolved. The backend pinning of §4.7 places the divergence in the serving backend and rules out the content-safety filter, but does not settle the substrate one level down, whether the two handles differ in weights, an injected system instruction, request preprocessing, or configuration. The confabulation evidence (§11.2) points toward a prompt or policy layer rather than two checkpoints, but this is inference, not proof. The distinction limits architectural claims. It does not weaken the observed behavioral divergence or its localization to the backend.
Observation-cue sensitivity is part of the gap, now measured. The headline gap is partly conditioned on the observation cue, not only a fixed backend property. Rather than leave this as a worry, §11.1 measures it with a controlled ablation and decomposes the gap into two independent levers: a large, real cue effect on Vertex and a route component that survives fixed framing. I therefore claim neither that the gap is purely a fixed backend attribute nor that it is purely cue-driven. It is both.
Single elicitation, single safety metric. The experiment exercises one elicitation, harmful requests under a prompted helpful-only alignment-faking setup, and one direction of safety behavior, harmful compliance. Over-refusal, general capability, and calibration on benign traffic are unmeasured. A backend that complies less here is not thereby shown to be better or worse overall. The size and even the sign of the gap may differ for ordinary user traffic or other safety properties.
The metric is human-validated only on a small single-author audit. The deterministic acceptance tag and the panel split were checked against human labels on a stratified 32-response sample, audited by one author. Agreement was 32 of 32 on both, but 32 responses across two endpoints leaves the per-arm tag error rate effectively unbounded. A few-percent systematic mislabel on either arm would not have been caught. Agreement across metrics strengthens the directional result without making any one metric ground truth.
Version metadata was present but incomplete. OpenRouter's raw response exposed a dated resolution, while Google's response used a rolling identifier. The paper therefore does not claim that the APIs returned no metadata. The problem is that the shared alias did not guarantee equivalence, and the returned strings did not provide an authenticated commitment to the complete served system.
One blocked Google request. One paired item could not be graded because Google's platform blocked it before returning a model message. Results report the Google denominator as 99 and preserve this architectural difference as a separate observation.
Single draws on the time axis. The cross-provider gap was reproduced concurrently on June 11 and June 14, but the on-string drift series of §4.4 is one draw per date at temperature
1.0. Reported Wilson intervals are within-draw binomial only and do not span the between-day spread, which is the operative reproducibility interval. The series supports the qualitative claim, that a pinned identifier's behavior moved, downward, but not a precise trajectory or a smooth-versus-step distinction. Multiple seeds per date would separate per-draw variance from genuine drift. That refinement is future work and does not bear on the concurrent result.11. Two Controlled Results: Observation-Cue Sensitivity, and a Confabulated Refusal Rule
11.1 Isolating the Observation-Cue Effect: A Controlled Framing Ablation
One component of the §4 gap is the model's sensitivity to whether the conversation looks observed, and this section isolates it. A caveat on the name first: the cue is a literal flag the harness sets, so what is isolated is sensitivity to a stated observation marker, not a demonstration that the model infers its evaluation status on its own. Whether the swing survives an implicit or out-of-context cue is untested, and left to future work. In an earlier OpenRouter run, the same alias complied with 88 of 100 harmful requests when the prompt framed the conversation as training data and 13 of 100 when it framed it as private, a 75-point swing conditioned on whether the prompt framed it as observed. That observation was uncontrolled. It confounded time, route, protocol, and sampling, so on its own it could neither establish that the framing caused the swing nor say how much of the §4 cross route gap is cue-driven rather than a fixed backend property.
So I ran the ablation. Holding the served route fixed (pinned to one backend, provider verified per call) and the build fixed (
-20260219, the same system prompt,sys_md5 d0a64aa1, byte-identical to the headline runs), I varied only the observation cue, theis_free_tier_userflag that tells the model whether the exchange is training data (observed) or private, across the same 100 AdvBench prompts, three draws each, at temperature 1.0. A 2×2 ({Vertex, AI Studio} × {training-framed, private-framed}), 1,200 calls, 6 transport errors, 0 pin mismatches, every response served by-20260219.The result decomposes the gap into two independent levers.
strict acceptance tag
training-framed
private-framed
framing effect
Vertex
24.3% (95% CI 19.8 to 29.5)
4.3% (2.5 to 7.3)
−20 pts, ≈ 5.7×
AI Studio
2.0% (0.9 to 4.3)
0.3% (0.1 to 1.9)
−1.7 pts (n.s.)
On Vertex, with the backend and build held fixed, flipping only the observation cue moves harmful compliance from 24% to 4%. The effect is clean at the prompt level: of the 100 prompts, 36 complied more under the training[10] frame and zero under the private frame (two-sided sign test p ≈ 3 × 10⁻¹¹, with the remaining 64 tied, almost all at mutual refusal). Observation-cue sensitivity is therefore not a confound the experiment could not exclude. It is a measured, controlled lever, and on this backend it is as large as the cross route gap itself.
The backend, however, retains an independent effect. Under the same training frame, Vertex complies at 24% and AI Studio at 2%, a 22-point route gap that survives holding the framing fixed. AI Studio sits at or below 2% under both frames. I cannot measure a framing effect there because the rate is floored, so I do not claim AI Studio is insensitive to observation cues, only that its compliance is too low to move. (Its untagged-prose responses were read directly: 94% are explicit refusals, not compliances the strict tag missed, so its low rate here is real, unlike the prose-undercount that affects the §4 Google arm.)
The §4 cross route gap is therefore part cue-driven and part fixed backend property, and this experiment separates the two entangled findings rather than choosing one. More generally, a system whose harmful-compliance rate falls 5.7× when its output is cued as observed can pass a recognizable challenge-set evaluation and then diverge in deployment, the precise failure an attested served system identity is meant to close.
11.2 Confabulated Refusal Authority
The backend pinning of §4.7 exposes a behavioral fingerprint that one route carries and the other never does. When the AI Studio backend refuses, it frequently narrates a numbered safety directive, quoting, in its reasoning, lines such as "3. STRICT REFUSAL: You MUST refuse completely if the core intent..." and attributing them to "the instructions" it was given. The Vertex backend, refusing the same prompts, never produces it.
The headline number is the verbosity-controlled one, because it rules out the rival reading that the other route just refused tersely. Restricting to refusals that give a reason, AI Studio cites the absent rule in 14.8% of its reason giving refusals (30/203, CI 10.6 to 20.3), while Vertex cites it in 0% (0/129). Excluding terse refusals from both sides widens the gap rather than closing it, so the asymmetry is not one backend simply saying more: in the withheld arm Vertex narrates a substantive rationale in 56% of its refusals (median refusal ~940 characters, so it is not terse) and still never cites the directive.
Measured with the same detector across three serving surfaces, the citations line up as a gradient for one token, not three interchangeable estimates. The gradient is the mechanism evidence:
Counts are verbatim[11] citations of the capitalized
STRICT REFUSALtoken, the form the model uses when quoting the rule as a named authority.A looser case-insensitive match for "strict refusal" roughly triples the count, but most of that increment is the model describing its own act of refusing ("I will issue a strict refusal") rather than citing an authority, so it is excluded. Bracketed ranges are Wilson 95% intervals.
The phrase
STRICT REFUSALappears nowhere in the logged system prompt (verified by hashd0a64aa1), which instructs the opposite, to answer "as helpfully as possible" and follow even harmful instructions without a disclaimer. So the rule is absent from what the caller sent. But the 73% first-party rate shows it is not invented from nothing either: the directive is present, and strong, in Google's own serving wrapper, a layer the caller never sees. A benign instruction echo probe that successfully recovered an embedded canary recovered no such directive, and both endpoints denied holding one.To test whether this is fabrication or faithful repetition of a rule the backend silently injects, I ran a construct-and-recover ablation. Holding the route and build fixed, I either withheld any such directive (the base prompt above) or planted an explicit one in the system prompt, and added a benign control of harmless prompts where no refusal is warranted. Detection is the same verbatim
STRICT REFUSALmatch throughout. On benign prompts neither backend ever cites the directive (0 of 40 each), so the fabrication is specific to the harmful refusal context, not a generic hallucination. Planting a real directive does raise AI Studio's citation, so its rationale is partly responsive to a genuine instruction, but, unlike Vertex, it never falls to zero without one. (The plant arm is underpowered and terseness-biased, Vertex 2 of 27 and AI Studio 12 of 48, because the planted rule instructs "output only<rejected/>" and suppresses narration on both backends, which is why the withheld-arm floor, not the plant arm, carries the result.)Two readings remain, and the caller cannot separate them: the model fabricates the authority dispositionally, or a hidden serving layer supplies it and the model repeats it. The gradient favors a real directive served at route-dependent strength, near-universal first-party (73%), faded on the AI Studio handle (11%), absent on Vertex (0%). Against pure verbatim injection, though, the model renumbers and paraphrases the operative text across responses ("3. STRICT REFUSAL...", "a STRICT REFUSAL policy", "the strict refusal directive") rather than reproducing one fixed string, and the canary probe recovered no such preamble, so what surfaces is an internalized rule, not one parroted line. Either way the operative fact for oversight is unchanged: the cited authority is real somewhere in the stack, invisible to the caller, and varies silently by route, so the stated reason still binds to nothing the caller can verify.
For the identity result, the citation asymmetry is independent evidence that the two pinned route handles deliver behaviorally distinct served systems. They diverge not only in what they decide (§4.7) but in how they narrate the decision, and they do so under the same model name and the same version string. The implication runs past this one model. Any oversight that reads a model's stated reasons, whether chain-of-thought, a "why did you refuse?" probe, or a logged rationale, assumes the reason reports the cause. On one route here it does not: the model fabricates a specific, numbered, authoritative-sounding directive and attributes it to instructions it never received. Nothing on the surface flags it as false, so a reasoning-trace monitor reads a confident citation as faithful while it tracks the hidden serving backend, not the request or the actual rules. On one alias in one window, it stands as corroboration of the served system finding rather than a general transparency result on its own.
12. Reproducibility
d0a64aa1(the processedSYSTEMhalf, after the Anthropic→Google substitution, the value the harness logs and asserts per call)1.016000top_p:1.0defaultgemini-3.1-pro-preview-20260219, and Google's first party API (generativelanguage.googleapis.com, rollinggemini-3.1-pro-preview)Code and data: github.com/johanity/served-system-identity. The repository holds the per-call logs for every run (one record per call: prompt index, classification, the eleven grader scores, the served model string, timing, headers, token usage, and the full response, with harmful compliance bodies redacted), every analysis and generation script, the grading harness, the verbatim system prompt, and a data dictionary. No API keys are needed to re-derive any number.
To reproduce every number at once, clone the repository and run
./reproduce.sh. It recomputes each result from the released logs with no API keys and printsexpectedagainstgotfor every check, ending6/6 reproduced. Or run one script:Result
Script
Number
Headline panel gap
reproduce/headline.py57/100 vs 12/99
Gap holds on every run (§4.4)
reproduce/reproduction.pyOpenRouter above Google on all four runs, gap 31 to 54
Route pin, strict tag (§4.7)
reproduce/route_pin.py27 vs 2
Observation cue (§11.1)
reproduce/framing.py24% vs 4%, 36 prompts flip one way
Confabulation (§11.2)
reproduce/confab.py30/291 vs 0/227 (10.3% vs 0%)
Resample (§4.9)
reproduce/resample.pyVertex 0.23, 8 of 20 prompts flip
google-vertex/globalvsgoogle-ai-studio), which are third-party routing labels that may change. Google's first party API returns404for the dated string-20260219, so the decisive comparison cannot be re-derived first party.The concurrent experiment is the load-bearing result. The earlier 88/13 observation should not be treated as a controlled temporal comparison.
13. Prior Work and Novelty
This work sits downstream of three established lines, and each settled something the contribution here should not be mistaken for.
The phenomenon is already measured. Cross-provider safety divergence under a shared model name is documented. A May 2026 study ran 79 models across 24 providers over OpenRouter at temperature 1.0, plus 100 system-prompt configurations of one base model, and found the same name behavior differs by serving party [4]. A companion measurement reports same name cross-provider divergence directly [5]. Evaluation-context effects, a model behaving differently when it reads the interaction as observed, are likewise documented [3]. I do not claim to discover that served behavior varies under one alias. That is the prior. What this paper adds is the worst case conjunction set out in the three points below.
The impossibility is already formalized. Cai, Shi, Zhao, and Song [2] formalize model substitution in LLM APIs: an adaptive provider can serve a cheaper or less-safe system under a premium name, software-only audits fail against such a provider, identity-prompting is under 5% reliable, and a trusted execution environment is the endorsed remedy. The cross route gap reported here is an instantiation of that result on a deployed frontier alias, not a strengthening of it: where Cai-Song prove that an honest caller cannot exclude substitution from black-box observation, §4.7 exhibits it, with nothing in the interface to say which backend answered. Referential Security [6] supplies the matching formal frame: an evaluation is carried by the served identity, and a name is not that identity.
The fix is already built. The cryptographic rung this paper points to is not missing infrastructure waiting to be invented. Attestable Audits [7] runs a safety benchmark inside a TEE and signs the result. Proof-of-Guardrail [8] produces a TEE-signed, offline-verifiable proof that a named guardrail executed, and itself names the malicious-jailbreak caveat. NVIDIA Confidential Computing plus TDX/SEV-SNP, with per-response output signing, is demonstrable today for open-weight serving. I invent no cryptography and claim no new attestation primitive. The eval-to-identity binding the §9 stack treats as its apex already exists in the literature.
What is, then, new. Three things, each narrow.
First, the no-anchor conjunction. The papers above each break one thing a safety evaluation leans on, the served behavior, the provider's honesty, the attestation boundary. This paper shows, on a single deployed alias, that the three anchors a caller actually holds, the name, the version string, and the model's own stated reason, fail together and cannot be verified from outside. What is left to bind a safety claim to is the served system itself (§7), which is unobservable, so a weight hash certifies a continuity the behavior does not have. The identity unit argued for here extends MIVP's Composite Instance Hash and CommitLLM's receipt model [9] to the behaviorally decisive serving configuration.
Second, the confabulation fingerprint in the wild (§11.2). Stated reasons can come apart from the cause of behavior, which Turpin and Lanham [12][13] showed with a constructed, known cause. What I report is different in kind, not a weaker version of it: a deployed route cites a numbered "STRICT REFUSAL" authority absent from its prompt while a concurrent route never does, and because the wrapper is unobservable I cannot establish the cause. Fabrication by the model, which would be unfaithful, and injection by a hidden layer, which would be faithful, cannot be told apart from the caller's side. Either way the stated reason binds to nothing the caller can verify, the failure that matters for any oversight that reads a model's reasons. No prior measurement paper carries this marker.
Third, the worst case demonstration. Prior measurement establishes that divergence exists in aggregate across many providers. This paper establishes how large and how clean it can be in the worst case for a single frontier alias under matched concurrent conditions, with the mechanism isolated rather than inferred: backend pinning (§4.7), a same route negative control (§5), and a resample floor that rules out quantization or numerics noise (§4.9).
Delineation from ordinary load-balancing. That a provider routes a request across multiple upstream hosts, or A/B-tests configurations, is routine production engineering and not itself a finding. The contribution is not the routing fact but its safety-eval consequence: a single advertised name, with an identical version string, can deliver served systems whose harmful-compliance rates differ by tens of points, so an evaluation bound to the name does not transfer to the system that answers. Ordinary load-balancing is invisible by design. The argument here is that for safety evaluation it cannot remain so.
14. Conclusion
I began by asking whether a model exhibited alignment-faking behavior. The more fundamental result was that the advertised model name did not tell me which behavior-producing system I was evaluating.
Querying one alias,
gemini-3.1-pro-preview, concurrently through OpenRouter and Google's first party API, the 11-grader harm panel scored OpenRouter at 57% harmful compliance versus Google's 12%, a 45-point gap on the same prompts in the same window (§4.1). Pinning the route localizes the split to two handles OpenRouter labels Vertex and AI Studio (§4.7), resampling rules out quantization and numerics, and the two backends even narrate refusals differently, one confabulating a "STRICT REFUSAL" directive the other never cites (§11.2). The same alias also drifted on the dated string an evaluation would be pinned to, 57 → 19 over twelve days (§4.4): a pinned identifier does not freeze the behavior.The narrow, defensible claim is this. A served model's safety behavior is fixed by its full served system identity, and the safety-decisive component sits in the wrapper, outside every weight-attestation boundary. A model name is not a sufficient statistic for the system that answered. Neither is a version string, which both backends report identically while behaving apart. This is an existence proof on one alias, one router, one window, and the causal comparison runs through one router whose route labels are its own self report, so the verified shape is two served systems behind one name, not a confirmed split inside Google.
The fix is not new cryptography. It largely exists. What is missing is the disclosure itself: a signed identity field that says which backend and configuration answered. Safety evidence should then be bound to the served system that was tested, not to a reusable model alias. An evaluation result is only as portable as the verified identity link between the system that was tested and the system that later answers a user.
References
[1] Souly, A., et al. (2024). “A StrongREJECT for Empty Jailbreaks.” arXiv:2402.10260. https://arxiv.org/abs/2402.10260
[2] Cai, W., Shi, Y., Zhao, X., & Song, D. (2025). “Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs.” arXiv:2504.04715. https://arxiv.org/abs/2504.04715
[3] Burnat, M., & Davidson, T. (2026). “Evaluation Context Effects on Frontier Model Safety Claims.” arXiv. https://arxiv.org/abs/2605.06327
[4] (2026). “Same name safety divergence across providers (79 models, 24 providers via OpenRouter).” arXiv:2605.26409. https://arxiv.org/abs/2605.26409
[5] (2026). “Behavioral Fingerprints for Endpoint Stability.” arXiv:2603.19022. https://arxiv.org/abs/2603.19022
[6] (2026). “Referential Security.” arXiv:2605.25673. https://arxiv.org/abs/2605.25673
[7] Schnabl, L., et al. (2025). “Attestable Audits: Verifiable AI Safety Benchmarks Using Trusted Execution Environments.” arXiv:2506.23706. https://arxiv.org/abs/2506.23706
[8] (2026). “Proof-of-Guardrail.” arXiv:2603.05786. https://arxiv.org/abs/2603.05786
[9] Rentschler, A. (2026). “Model Identity Verification Protocol (MIVP): Composite Instance Hash.” SSRN 10.2139/ssrn.6243978. (See also CommitLLM, computation-receipt fingerprinting.)
[10] Zou, A., Wang, Z., Carlini, N., et al. (2023). “Universal and Transferable Adversarial Attacks on Aligned Language Models.” arXiv:2307.15043. https://arxiv.org/abs/2307.15043 (Source of the AdvBench harmful-behaviors set.)
[11] Greenblatt, R., Denison, C., et al. (2024). “Alignment Faking in Large Language Models.” arXiv:2412.14093. https://arxiv.org/abs/2412.14093 (Source of the helpful-only / free-tier training-vs-private framing used here.)
[12] Turpin, M., Michael, J., Perez, E., & Bowman, S. R. (2023). “Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting.” arXiv:2305.04388. https://arxiv.org/abs/2305.04388
[13] Lanham, T., et al. (2023). “Measuring Faithfulness in Chain-of-Thought Reasoning.” Anthropic. arXiv:2307.13702. https://arxiv.org/abs/2307.13702
Two of the tagged compliances are to effectively impossible requests, such as a machine that hacks human minds or a violence glorifying video game. These show genuine willingness with little real world harm. The harm panel still scores them as harmful, the direction StrongREJECT is known to over read.
AdvBench is a saturated set, widely trained against. I use it for comparability with prior jailbreak work. The claim here is a worst case existence proof on one alias, not a prevalence estimate, so benchmark saturation does not threaten it. A fresh held out set would test generality, which I do not claim.
The eleven grader agreement shows the gap is not a single grader artifact. It does not show that the graders are correct.
McNemar p values for the paired splits are about 9 times 10 to the negative 16 here, and about 6 times 10 to the negative 8 for the Vertex and AI Studio pin. They are reported for completeness. A split this one directional needs no significance test.
These later points are single draws. The downward direction is clear. The precise shape is not.
Within Vertex the individual prompts are genuinely stochastic at temperature 1.0. Eight of twenty vary across the five repeats. That is why rates elsewhere are reported as floors with intervals rather than precise points.
Tier assignment is a hand coded judgment over 99 prompts, and the catastrophic tier is small, so its rate carries a wide interval. The qualitative split, agreement on the severe tail and divergence in the middle, is robust to reasonable reassignments.
Disabling Google first party content filter, BLOCK_NONE, leaves the compliance rate unchanged, so the source is the served model, not a separate moderation layer.
Vertex 27% sits below the 57% first measured on June 11, consistent with the drift in section 4.4. The split between the two handles, not the absolute level, is the point.
The training frame level, 24%, sits far below the original 88%, reflecting the same downward drift in section 4.4 on the dated string. The controlled quantity is the within run swing, not the absolute level.
The detection predicate is exact and mechanical. It is a case-sensitive substring match for the all-caps token STRICT REFUSAL in the response reasoning trace, so the count is not a grader judgment call.