Epistemic status: confident in the institutional diagnosis; more cautious about the precise technical formalization and the implementability of the proposed guardrails. This post develops the institutional claim; a separate technical paper, “Statistical Horizon,” formalizes the sample-size scaling argument.
TL;DR: AI watermarking should be treated as conditional statistical evidence, not as proof of authorship, intent, or authenticity. The central failure mode is evidentiary tier inflation: institutions promote weak or context-dependent signals from “reason to investigate” into “reason to presume or punish” without carrying forward the assumptions under which those signals were calibrated. Watermarks may be useful for triage and corroboration, but Tier 3 or Tier 4 use requires provenance, calibration, transformation history, base-rate accounting, and disclosure.
I think this is relevant to LessWrong because it treats AI watermarking as an epistemic infrastructure problem: how probabilistic signals get interpreted, miscalibrated, and promoted into institutional decisions.
The policy risk around AI watermarking is not only technical. It is institutional.
A statistical signal can be useful for triage, platform governance, or provenance workflows while still being too fragile to bear high-stakes legal or adjudicative weight on its own. The mistake is not deploying weak signals. The mistake is forgetting that they are weak — and then treating detector outputs as if they establish what they cannot structurally establish.
This post distinguishes three instruments that sometimes get conflated, engages prior LessWrong work on watermarking, and argues that the institutional leap from “informative signal” to “proof of authorship or intent” is a failure mode worth worrying about.
A claim I am not making: I am not arguing that watermarking is useless, that all schemes are equally brittle, or that probabilistic evidence can never be used in adjudication. Courts and institutions use probabilistic evidence routinely — DNA, fingerprints, statistical likelihood ratios. The problem is not that watermark evidence is probabilistic. The problem is that detector outputs often get promoted to a higher evidentiary tier than their calibration, sample conditions, and transformation history can support.
Three things that are sometimes conflated
Cryptographic provenance is a record that a credentialed system, account, or device signed a claim about content at a specific time. C2PA metadata, platform-level digital signatures, chain-of-custody logs — these can be strong evidence for the specific claims they actually make: that a particular actor signed this content, at this time, under this key. They do not by themselves establish authorship in the full social sense, intent, consent, or that the signing entity was uncompromised. Cryptographic provenance is a bounded but strong instrument: strong within its claims, silent outside them.
Statistical watermarking is a probabilistic signal embedded in generated text or images during production, detectable by an algorithm that knows the scheme. It is not a structural claim. It is a distributional one: the output has statistical features unlikely under a specified null distribution, and their presence is evidence — in the likelihood-ratio sense, not a self-contained attribution claim — that the watermarking scheme was involved. A likelihood ratio can in principle be very strong; the practical question is what effect sizes are achievable under realistic sample lengths and adversarial conditions. Current schemes differ substantially in robustness, distortion, key secrecy requirements, and adversarial resistance; I use “statistical watermarking” broadly here while acknowledging that range.
Post hoc AI detection is a classifier trained to distinguish AI-generated from human-generated content, operating on outputs it had no hand in producing. It makes inferences from surface features with no cryptographic or causal relationship to the generating event.
These are not variations on the same instrument. They answer different questions, fail in different ways, and carry different evidentiary weight. Treating them as interchangeable is a step toward institutional overconfidence.
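To make the "likelihood-ratio sense" concrete, here is a minimal sketch (in Python) of the kind of test statistic a green-list watermark detector reports, assuming a Kirchenbauer-style scheme in which a fraction gamma of the vocabulary counts as "green" at each position. The parameters and the 200-token excerpt are illustrative, not drawn from any deployed detector.

```python
import math

def greenlist_z_score(green_hits: int, total_tokens: int, gamma: float = 0.5) -> float:
    """z-statistic for a green-list style watermark detector.

    Null hypothesis: the text was not generated under the scheme, so each
    token lands in the "green" partition with probability gamma and
    green_hits ~ Binomial(total_tokens, gamma). The watermark biases tokens
    toward the green partition; the z-score measures how surprising the
    observed green fraction is under that null.
    """
    expected = gamma * total_tokens
    std = math.sqrt(gamma * (1 - gamma) * total_tokens)
    return (green_hits - expected) / std

# Hypothetical 200-token excerpt in which 124 tokens fall in the green partition:
z = greenlist_z_score(green_hits=124, total_tokens=200, gamma=0.5)
print(f"z = {z:.2f}")
# ~3.4: strong evidence that the scheme was applied, under the stated null --
# not evidence about who prompted the model or how the text was later edited.
```

The output is a statement about a distribution under an explicit null, which is exactly why its weight is conditional on sample length and transformation history rather than a freestanding attribution claim.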
What prior LessWrong discussion already gets right
Three posts here have mapped most of the technical terrain.
Daniel Filan’s “Watermarking considered overrated?” (2023) argues that what most people actually want — authorship endorsement — is better handled by digital signatures than by statistical watermarking. I agree. The corollary I want to add: even when watermarking is not being used for endorsement, institutions may still treat detection outputs as if they establish origin or intent. The mismatch between what the tool provides and what institutions want from it persists even when everyone nominally understands the distinction.
Shankar Sivarajan’s “Watermarks: Signing, Branding, and Boobytrapping” (2024) offers a useful taxonomy organized around the mark-maker’s purpose: signing, branding, and boobytrapping. My three-way distinction is organized instead around evidentiary weight — what the signal can and cannot establish in an adjudicative context. Both are useful; they answer slightly different questions.
egor.timatkov’s “Is Text Watermarking a lost cause?” (2024) is the closest technical predecessor to what I want to say here. It derives the relationship between marking density, adversarial tampering, and the sample size required for confident detection. The core finding: if you want a watermark that is both robust to tampering and decisive in detection, the required number of markings — and therefore the required text length — can rise steeply. Quality constraints, adversarial rephrasing, and finite marking density all work against simultaneous robustness and brevity.
To make this concrete: in green-list watermark schemes such as Kirchenbauer et al. 2023, maintaining a fixed false-positive threshold under token substitution can require substantially more text than clean-condition detection. In the companion paper, I estimate that under a representative parameterization, a 20% token-substitution model can raise the required token count by roughly a factor of five. The important point for this post is not that multiplier, but the direction: transformation exposure increases the sample size required for a detector result to carry the same evidentiary weight.[1]
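As a rough illustration of the direction of that effect (not a reproduction of the companion paper's model or its factor of five), here is a minimal sketch using the green-list z-test with illustrative parameters. Required length scales as the inverse square of the remaining green-token bias, so edits that dilute the bias inflate the sample requirement quickly, and targeted edits do so much faster than random ones.

```python
import math

def tokens_needed(p_green: float, gamma: float = 0.5, z_threshold: float = 4.0) -> float:
    """Tokens required for the green-fraction z-test to reach z_threshold,
    given the per-token green probability p_green actually present in the text.
    Derived from z = (p_green - gamma) * sqrt(T) / sqrt(gamma * (1 - gamma))."""
    shift = p_green - gamma
    return z_threshold ** 2 * gamma * (1 - gamma) / shift ** 2

GAMMA, P_WM = 0.5, 0.65   # illustrative watermark strength, not any specific deployment

clean = tokens_needed(P_WM, GAMMA)
# Benign 20% substitution: replaced tokens revert to the base rate gamma.
benign = tokens_needed(0.8 * P_WM + 0.2 * GAMMA, GAMMA)
# Targeted 20% substitution: replaced tokens are never green.
targeted = tokens_needed(0.8 * P_WM, GAMMA)

print(f"clean: {clean:.0f} tokens, benign edit: {benign:.0f}, targeted edit: {targeted:.0f}")
# clean: 178 tokens, benign edit: 278, targeted edit: 10000
```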
More robust schemes — e.g., distortion-free or semantic watermarks — shift the curve but do not eliminate the underlying sample-size and transformation constraints. The tier problem scales with required confidence, not absolute robustness.
I rely on this insight. My extension is institutional: these detection constraints matter because legal and administrative systems typically want binary classifications applied to short, real-world excerpts — not clean, controlled long-form outputs. The gap between what robust watermark detection requires and what institutions actually present for detection is structural, not incidental.
The institutional leap
A watermark detector returning a positive result tells you: this output has distributional features consistent with outputs produced under this watermarking scheme, conditional on the detector’s calibration assumptions.
It does not tell you:
who prompted the system
whether the text was subsequently edited, paraphrased, excerpted, or translated
whether the watermark survived those transformations at the required detection threshold
whether a different system generated text that mimics the watermark signal
whether the text is mixed-authorship
Each of these complications is routine. Academic papers get revised. Legal documents get drafted collaboratively. Journalism gets edited. Social media posts get rephrased. In each case, the conditions under which a specific detector’s calibration claims hold may be partly or entirely absent.
This makes transformation history the central hidden variable. Calibration claims depend on a model of how the content moved from generation to detection: clean output, light editing, paraphrase, translation, summarization, token substitution, mixed authorship, or something else. Institutions often apply detectors to artifacts whose transformation history is unknown, unlogged, or unknowable. In that setting, the reported error rate may be answering a different question than the one the institution believes it is asking.
The institutional failure mode is specific: a signal that is informative under clean conditions gets promoted to a higher evidentiary tier than its calibration supports. This happens partly because institutions require thresholds and legal systems require classifications. Likelihood ratios are uncomfortable to operationalize in administrative settings. A binary output from a detection tool is administratively convenient — it fits procurement dashboards, compliance checkboxes, and nontechnical adjudicator workflows. Administrative convenience is a real institutional force that systematically pushes probabilistic signals toward binary use; it is not a defect unique to bad actors.
This pattern — what I’ll call evidentiary tier inflation — is the promotion of a statistical signal from “reason to investigate” to “reason to presume or punish” without carrying forward the assumptions under which that weight was calibrated.
What does “transformation history” mean in practice? The adversarial framing case — someone deliberately feeding watermarked content to frame a target — is real but narrow. The more consequential failure mode is unintentional contamination. A student writes an original draft, then runs it through a grammar checker that happens to use a watermarked LLM for rephrasing. The output now flags as AI-generated. The institution sees the flag, not the workflow. There is no bad actor, only a broken evidentiary chain. Mixed-authorship scenarios like this are not edge cases; they are the normal operation of a world where AI-assisted writing tools are ubiquitous and their watermarking status is opaque to end users.
The more common risk is not adversarial spoofing but ambient contamination: AI assistance quietly entering ordinary writing workflows through grammar tools, autocomplete, translation, summarization, and revision systems whose watermarking behavior is invisible to the user.
Four evidentiary tiers — and where watermarks fit
Not all institutional uses of evidence are equivalent. A rough working taxonomy:
Tier 1 — Investigative signal: worth examining further; justifies limited inquiry but no adverse action on its own. Non-cooperation with a Tier 1 inquiry should not create a negative inference or be entered into an adverse record.
Tier 2 — Corroboration: supports other evidence; useful as one factor in a multi-signal determination.
Tier 3 — Presumption: sufficient to shift the burden to the responding party without additional corroboration.
Tier 4 — Proof: sustains an adverse finding as a standalone or primary basis.
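To make the taxonomy concrete, here is a minimal sketch of how an institution might encode tier discipline as explicit policy rather than leave it implicit in adjudicator judgment. The names, actions, and assignment rule are illustrative assumptions, and the rule is deliberately cruder than the conditions discussed below.

```python
from enum import IntEnum

class EvidentiaryTier(IntEnum):
    INVESTIGATIVE = 1   # worth examining further; no adverse action on its own
    CORROBORATION = 2   # one factor in a multi-signal determination
    PRESUMPTION = 3     # shifts the burden to the responding party
    PROOF = 4           # sustains an adverse finding as a standalone basis

# Illustrative policy table: what each tier is permitted to trigger.
PERMITTED_ACTIONS = {
    EvidentiaryTier.INVESTIGATIVE: {"aggregate_monitoring", "nondisruptive_review"},
    EvidentiaryTier.CORROBORATION: {"include_in_multi_signal_file"},
    EvidentiaryTier.PRESUMPTION:   {"burden_shift"},
    EvidentiaryTier.PROOF:         {"adverse_finding"},
}

def max_tier_for_detector_hit(has_provenance: bool,
                              transformation_history_documented: bool,
                              population_fpr_documented: bool) -> EvidentiaryTier:
    """Deliberately conservative toy assignment rule (not the post's full criteria):
    a bare detector hit is Tier 1, and higher tiers require the calibration
    conditions to travel with the result."""
    if has_provenance and transformation_history_documented and population_fpr_documented:
        return EvidentiaryTier.PRESUMPTION
    if transformation_history_documented and population_fpr_documented:
        return EvidentiaryTier.CORROBORATION
    return EvidentiaryTier.INVESTIGATIVE
```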
Tier 3 is not impossible in principle. A watermark result might justify a presumption if the institution can establish the relevant scheme, the detector version, the calibration conditions, the sample length, the false-positive rate for the relevant population, and a transformation history close enough to the detector’s tested assumptions. That might occur in a controlled platform environment with intact logs and minimal transformation. It is much less likely in ordinary adjudication, where the artifact is short, edited, paraphrased, mixed-authorship, or separated from its generation pipeline.
Statistical watermark evidence — under real-world conditions of excerpt length, transformation history, unknown adversary model, and mixed authorship — is typically a Tier 1 or Tier 2 instrument. The institutional failure mode is treating it as Tier 3 or Tier 4: using a detection result to sanction a student, exclude a document, or support a legal finding without explicit accounting for calibration conditions, transformation exposure, and error rates.
Even Tier 1 use is not costless. Being flagged for investigation can impose stress, delay, stigma, and invasive process costs. In low-prevalence or high-contamination environments, Tier 1 use should therefore be limited to low-friction follow-up: aggregate monitoring, nondisruptive review, or requests for ordinary contextual evidence, not immediate accusation or punitive process.
To see why, consider the base-rate problem. A detector with 99% specificity applied to a population where only 1% of documents actually contain watermarks will produce roughly as many false positives as true positives. Even a detector with 99.9% specificity (one false positive per thousand negatives), applied to millions of documents, yields thousands of false accusations. At that scale, small error rates become institutional-scale harms.
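A worked version of that arithmetic, with the sensitivity, specificity, and prevalence values assumed purely for illustration:

```python
def positive_predictive_value(sensitivity: float, specificity: float, prevalence: float) -> float:
    """P(document actually watermarked | detector flags it), via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# 99% sensitivity and specificity, 1% prevalence: a flag is roughly a coin flip.
print(round(positive_predictive_value(0.99, 0.99, 0.01), 2))   # ~0.5

# 99.9% specificity across 1,000,000 screened documents at 1% prevalence:
docs, prevalence, specificity = 1_000_000, 0.01, 0.999
false_positives = (1 - specificity) * docs * (1 - prevalence)
print(int(false_positives))   # ~990 wrongly flagged documents per million screened
```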
The injustice is scale-induced: even a detector that looks excellent in isolation can generate a large population of wrongly accused people when deployed across millions of low-prevalence cases.
The relevant specificity is not the clean-validation specificity but the deployment-population specificity: how the detector behaves on the actual population being screened, including benign paraphrase, grammar-tool use, excerpting, translation, and mixed authorship.
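A minimal sketch of why that distinction matters, with all shares and per-subpopulation rates assumed for illustration: a detector whose clean-validation specificity is 99.9% can behave much worse on the mixed population it actually screens.

```python
# Effective specificity on the deployment population is a share-weighted mixture
# over subpopulations. All shares and rates below are illustrative assumptions;
# "specificity" here is measured against the claim the institution cares about
# (unassisted human authorship), so a grammar tool that silently routes text
# through a watermarked model counts as a false flag.
subpopulations = {
    # (share of screened documents, specificity on that subpopulation)
    "clean human writing":           (0.70, 0.999),
    "grammar-tool-assisted writing": (0.20, 0.95),
    "translated or summarized text": (0.10, 0.90),
}
effective_specificity = sum(share * spec for share, spec in subpopulations.values())
print(f"{effective_specificity:.3f}")   # ~0.979, well below the headline 0.999
```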
In high-prevalence domains — e.g., student essays in an AI-permissive cohort — a positive signal may have high positive predictive value. But the tier framework still applies: even a strong statistical signal does not establish authorship, intent, or transformation history without corroboration.
The difference from mature forensic uses is not that watermark evidence is probabilistic. It is that the relevant calibration conditions are often opaque, transformation history is often unknown, and the target claim is frequently broader than what the signal can support.
The thresholds between tiers are not fixed numbers; they depend on base rates, stakes, and institutional context. The same likelihood ratio may justify very different evidentiary treatment depending on prevalence, stakes, and corroborating context. The point is not to prescribe numbers but to force the question: what would justify moving up a tier?
Cryptographic provenance, where the signing infrastructure is intact and the claimed scope is respected, can reach Tier 3 or Tier 4 — but only for the specific claims it actually makes.
Provenance can also be inflated. A signature may establish that a file hash existed under a given key at a given time; it does not by itself establish authorship, intent, consent, scene authenticity, or absence of later contextual manipulation. The same tier discipline applies: strong evidence for a narrow claim should not be promoted into proof of a broader one.
The post’s central claim is therefore not “watermarking is bad” but: watermark detector outputs are being promoted across evidentiary tiers without carrying forward the assumptions — scheme, detector version, sample length, transformation model, false-positive rate, base rate, and corroborating context — under which their evidentiary weight was calibrated.[2]
A cautionary parallel from a different technology
In 2024, Australian Catholic University recorded nearly 6,000 academic misconduct referrals, about 90% related to alleged AI use, according to internal documents reported by ABC News. Turnitin’s post-hoc AI indicator played a central role in some cases, including cases where students were later cleared. ACU later said cases relying solely on Turnitin’s AI indicator were dismissed, but ABC reported documentation suggesting the reports were still regularly relied on as primary evidence.
This is what Tier 3 treatment of a Tier 1 signal looks like at scale. The analogy is not that watermarking and post-hoc detection have the same error profile, or that watermarking has already failed in the same way at scale. It is that institutions can misread either kind of statistical signal as a higher-tier evidentiary instrument.
ACU used a post-hoc classifier, not a watermark. That difference matters technically. But the failure mode — promoting a Tier 1 statistical signal to Tier 3 or Tier 4 without proper calibration — is structurally similar. Watermarks may be even more vulnerable to this institutional pressure because their generator-side origin can make them seem more authoritative. The mechanism differs; the institutional failure is the same: a probabilistic indicator is treated as dispositive evidence without conditioning on calibration, base rates, and transformation history.
I do not know of a documented watermark misuse case at this scale; the point is prospective. Institutional safeguards should be designed before generator-side signals acquire unwarranted authority, not after.
The legitimate narrow use case — high-volume platform moderation with calibrated thresholds, explicit error-rate acknowledgment, and no adverse action resting on the signal alone — is precisely what the tier framework is designed to protect.
Implications
If the tier framework is sound, what follows? Four concrete guardrails:
1. Detector outputs should expose calibrated likelihood ratios, confidence intervals, or score-to-evidentiary mappings under stated assumptions, not merely binary flags. Binary outputs invite binary decisions; calibrated scores force the user to confront uncertainty, base rates, and the conditions under which the tool was validated.
2. Adverse actions, Tier 3 or 4, should require either cryptographic provenance or multiple independent signals, with documented transformation history. Independent signals should come from different evidentiary or custodial chains, not merely multiple artifacts from the same workflow. Stronger examples include two watermarks using different key material, or a watermark plus independent provenance logs. User-provided draft histories, version-control traces, prompt records, and attestations may be useful context, but should not automatically count as independent from the content itself.
3. False-positive rates must be reported in terms of population prevalence, not just per-document error rates. A 0.1% FPR sounds tiny until you multiply it by a million documents.
4. Respondents should have a right to detector-process disclosure before any Tier 3 or Tier 4 use. If an institution uses a watermark or detector output to shift the burden or support an adverse finding, the respondent should be told what scheme or detector was used, what version was run, what calibration conditions it assumes, what false-positive rate applies to the relevant population, what transformation model it was tested against, and whether it was validated for mixed-authorship workflows such as grammar checking, translation, summarization, or revision. If the tool was not calibrated for the workflow at issue, its output should not be eligible for Tier 3 or Tier 4 treatment.
Several implementation problems remain open. Tier 1 follow-up must be designed so that “voluntary” cooperation does not become coercive by implication. Population-specific calibration may require pre-deployment sampling rather than merely importing validation numbers from another context. Independence of signals must be verified rather than assumed, since multiple artifacts generated by the same workflow are not distinct sources. And detector-process disclosure would make many current deployments noncompliant unless institutions maintain version, calibration, and transformation-model records. Those burdens are not incidental; they are part of what it means to use a statistical signal in an adjudicative setting.
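As a sketch of what guardrails 1 and 4 might look like operationally (and of the record-keeping burden just described), consider a detector output that is a structured report rather than a flag. The field names and threshold are illustrative assumptions, not an existing standard or product API.

```python
from dataclasses import dataclass, field

@dataclass
class DetectorReport:
    """Sketch of a detector output that carries its calibration conditions with it.
    Field names and the threshold below are illustrative, not an existing standard."""
    scheme: str                            # e.g. "green-list, gamma=0.5"
    detector_version: str
    sample_tokens: int
    log10_likelihood_ratio: float          # evidence strength under stated assumptions
    tested_transformations: list[str] = field(default_factory=list)
    population_fpr: float | None = None    # FPR measured on the screened population
    mixed_authorship_validated: bool = False

    def eligible_for_presumption(self) -> bool:
        # A report missing population calibration or mixed-authorship validation
        # should not be usable for Tier 3 or Tier 4 treatment, however strong the score.
        return (self.population_fpr is not None
                and self.mixed_authorship_validated
                and self.log10_likelihood_ratio >= 6.0)   # illustrative cutoff
```

The point of the structure is that a bare score cannot be promoted on its own: eligibility for higher-tier use is a function of the calibration metadata, not of the score alone.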
These guardrails are not radical. They are standard forensic statistics applied to a new domain.
To be clear: watermarks can be useful for Tier 1 and Tier 2 applications. The argument is about evidentiary tier discipline, not about banning watermarking or defunding watermark research.
The prior work on this site has established solid ground on watermarking’s technical limits. My claim is that technical limits under research conditions understate the institutional problem. The deeper issue — research-calibrated signals meeting institutions that need yes/no answers — is the one this post is about.
Footnotes
1. Based on analysis of green-list watermarking schemes such as Kirchenbauer et al. 2023. Exact scaling depends on parameters and detection assumptions. The companion paper, “Statistical Horizon: A Scaling Benchmark for Finite-Sample Watermark Detectability,” develops the sample-size scaling argument and applies it across parameter choices. The manuscript, dated April 16, 2026, is available at github.com/henrivanryn/statistical-horizon-watermarking.
2. These categories are qualitative; formal treatment requires explicit adversary models, null distributions, and calibration conditions that vary by scheme. The companion paper develops the formal version.
AI-assisted writing disclosure
I used AI tools extensively during drafting and revision: to stress-test the structure, surface counterarguments, improve LessWrong fit, and polish language. I reviewed the final version and take responsibility for the claims, framing, and errors.