Watermarking AI Text with Markov Chains: What Works and What Doesn't

Kyuhee Kim; ChenghengLi

By Chengheng Li Chen (AI Safety Barcelona / EPFL) and Kyuhee Kim (MATS / EPFL)

Crossposted from the Apart Research Technical AI Governance Challenge, March 2026. Code available at github.com/ChenghengLi/MCLW.

Epistemic status: Our results are empirically and theoretically sound. However, paraphrasing attacks from other AI systems may still bypass the watermark. We will explore improvements in future work.

TL;DR: We built a tool that can secretly flag texts made by large language models, without needing access to the model itself. It works by limiting the vocabulary using a pattern based on cryptographic hashes. In our tests, we detected all flagged texts with no false positives, quickly, and using just one hash function. This tool could help meet the EU AI Act requirements starting in August 2026. However, we haven't yet tested how well it holds up against smart paraphrasing attacks.

These days, it can be difficult to know whether a text was written by a person or by AI. For instance, an article that appears to be real journalism might actually be AI-generated and could use unreliable sources. This can damage public trust, journalistic standards, and even democracy.

The European Commission recognized this issue, and Article 50 of the EU AI Act addresses it. The article concerns the need to build systems that let people easily check whether content was generated by AI. In December 2025, a draft Code of Practice was released with more detailed requirements for multiple layers of protection: metadata, watermarks, and detection tools. It notes that any solution needs to be reliable (low error rates), accessible (verifiable by third parties without the original model), robust (survives text edits), and consistent across topics and writing styles.

Measured against these standards, existing detectors such as DetectGPT and GPTZero, which analyze statistical patterns, can't provide mathematical guarantees. Their error rates depend on writing style, topic, and length. Watermarking approaches like the green/red list method are more reliable, but their robustness against editing hasn't been proven.

Our objective was to develop something more rigorous that can meet the standards: a watermark that can't be forged without a secret key, provides explicit error guarantees, enables verification without model access, and degrades predictably under text modifications. This led to the Markov Chain Lock watermark, or MCL.

How MCL works

Most text AIs use autoregressive models, which generate a small piece of text at a time, called a token. MCL exploits this to embed a watermark. Here's how it works, illustrated with a simple model with a 12-word vocabulary (Figure 1).

Step 1: Split the vocabulary into secret groups. Using a secret key and a cryptographic hash (SHA-256), each word is put into one of several groups. For example, with four groups, you might get: blue {the, is, in}, orange {a, of, to}, green {and, for, that}, purple {on, with, by}. The important thing is that without the secret key, the groupings seem random. If you change the key, all the group assignments change as well.
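The grouping in Step 1 can be sketched in a few lines of Python. This is a minimal illustration, not the code from the MCLW repository: the function name `group_of`, the choice of HMAC-SHA-256 as the keyed hash, and the example keys are all assumptions made for this sketch.

```python
import hashlib
import hmac

def group_of(token: str, key: bytes, num_groups: int) -> int:
    """Assign a token to one of num_groups groups using a keyed SHA-256 hash."""
    # Assumption: HMAC-SHA-256 is the keyed construction; the post only says
    # "a secret key and a cryptographic hash (SHA-256)".
    digest = hmac.new(key, token.encode("utf-8"), hashlib.sha256).digest()
    # Read the first 8 bytes as an integer and reduce it to a group index.
    return int.from_bytes(digest[:8], "big") % num_groups

vocab = ["the", "is", "in", "a", "of", "to", "and", "for", "that", "on", "with", "by"]
groups_key_a = {word: group_of(word, b"example-key-A", 4) for word in vocab}
groups_key_b = {word: group_of(word, b"example-key-B", 4) for word in vocab}
print(groups_key_a)
print(groups_key_b)  # computed with a different key
```

With only 12 words, a hash will rarely produce the tidy three-per-group split shown in the illustration. Group sizes even out only over a large vocabulary.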

Step 2: Add a hidden cycle during text generation. We set a rule so that each group can only move to certain other groups. Which groups are allowed depends on the pattern we pick. The simplest pattern, called the "clockwork" pattern, makes the model cycle through the groups in order: 0 to 1 to 2 to 3 to 0, and so on.

In Figure 1, this would look like:

The model starts in state 0 and picks “the” from the blue group (group 0). The cycle moves to state 1.
Now it must pick from the orange group (group 1). It chooses “a”. The cycle moves to state 2.
Next, it must pick from the green group (group 2). It chooses “and”. The cycle moves to state 3.
Finally, it must pick from the purple group (group 3). It chooses “on”. The cycle moves back to state 0.

In practice, we can use looser rules. For example, instead of always picking the next group in the cycle, the model can sometimes skip ahead by one group. So from group 0, it could choose from group 1 or group 2. We call this the “soft cycle.” This gives the model more flexibility but still leaves a clear pattern we can detect later.
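The clockwork and soft-cycle rules can be sketched as a filter on the model's candidate words. This is a toy illustration under stated assumptions: the names `allowed_groups` and `constrained_choice` and the preference scores are hypothetical, and the grouping is hard-coded from the four-colour example (blue=0, orange=1, green=2, purple=3) rather than derived from a key.

```python
# Toy grouping from the four-colour example: blue=0, orange=1, green=2, purple=3.
GROUPS = {
    "the": 0, "is": 0, "in": 0,
    "a": 1, "of": 1, "to": 1,
    "and": 2, "for": 2, "that": 2,
    "on": 3, "with": 3, "by": 3,
}

def allowed_groups(prev_group: int, num_groups: int, k: int) -> set[int]:
    """Groups the next word may come from: the next k groups in the cycle."""
    return {(prev_group + step) % num_groups for step in range(1, k + 1)}

def constrained_choice(scores: dict[str, float], prev_group: int,
                       num_groups: int, k: int) -> str:
    """Pick the highest-scoring word whose group is allowed after prev_group."""
    allowed = allowed_groups(prev_group, num_groups, k)
    candidates = {word: s for word, s in scores.items() if GROUPS[word] in allowed}
    return max(candidates, key=candidates.get)

# Hypothetical model preferences for the next word, after a group-0 word.
scores = {"the": 0.9, "a": 0.5, "and": 0.7, "on": 0.8}
print(constrained_choice(scores, prev_group=0, num_groups=4, k=1))  # clockwork: a
print(constrained_choice(scores, prev_group=0, num_groups=4, k=2))  # soft cycle: and
```

In a real language model, the same effect is usually obtained by masking the logits of disallowed tokens before sampling. The greedy pick here only keeps the example short.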

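Step 3: Detect the watermark. To check whether the watermark is present, take the text, apply the same hash and secret key, and count how many consecutive pairs of words follow the expected cycle. We call the result the fingerprint score ($F$), defined as:

$F(x) = \frac{1}{n-1} \sum_{t=1}^{n-1} \mathbb{1}\left[\, g(x_{t+1}) \in A(g(x_t)) \,\right]$

(where $x = (x_1, \dots, x_n)$ is the list of words, $n$ is the number of words, $g(\cdot)$ is the keyed group assignment from Step 1, and $A(\cdot)$ is the set of groups allowed after a given group under the chosen pattern)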

In other words, the score is just the fraction of consecutive word pairs that match the expected pattern. For example, in "the a and on", the words fall in groups 0, 1, 2, and 3, so all three transitions match and the score is 3 out of 3, or 1.0.
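The score computation can be sketched directly on a list of group indices. This is a minimal sketch, assuming the soft cycle allows the next `k` groups in cyclic order; the function name `fingerprint_score` is chosen here for illustration.

```python
def fingerprint_score(groups: list[int], num_groups: int, k: int) -> float:
    """Fraction of consecutive pairs whose group offset is one of the k allowed steps."""
    if len(groups) < 2:
        raise ValueError("need at least two tokens to form a transition")
    # Offsets (mod num_groups) that count as following the cycle.
    allowed_offsets = {step % num_groups for step in range(1, k + 1)}
    hits = sum((nxt - prev) % num_groups in allowed_offsets
               for prev, nxt in zip(groups, groups[1:]))
    return hits / (len(groups) - 1)

# "the a and on" -> groups 0, 1, 2, 3 under the clockwork pattern (k=1, four groups).
print(fingerprint_score([0, 1, 2, 3], num_groups=4, k=1))  # 1.0
# Replacing "and" with "of" (group 1) breaks two of the three transitions.
print(fingerprint_score([0, 1, 1, 3], num_groups=4, k=1))
```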

When you use this on text without a watermark, the score averages around k/S, because words fall into random groups and most pairs don’t follow the cycle by chance. In our best setting (S=7, soft cycle), this baseline is about 2/7, while watermarked text scores close to 1.0. This difference is big enough for perfect detection, as our experiments show.
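The k/S baseline follows from a one-line expectation. The sketch below assumes the keyed hash places each token in a uniformly random group, independently of its neighbours. The symbols $g$ (group assignment) and $A$ (set of allowed next groups) are notation introduced for this sketch.

$\Pr\left[\, g(x_{t+1}) \in A(g(x_t)) \,\right] = \frac{|A(g(x_t))|}{S} = \frac{k}{S}$

$\mathbb{E}[\text{score}] = \frac{1}{n-1} \sum_{t=1}^{n-1} \frac{k}{S} = \frac{k}{S} \approx 0.286 \quad (S = 7,\ k = 2)$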

The best part is that this method doesn’t require access to the original model. You only need the secret key and a simple hash function. A regulator could check thousands of documents on an ordinary laptop.
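A verifier along these lines needs nothing beyond the standard library. The sketch below is an assumption-laden illustration rather than the repository's implementation: `group_of` and `is_watermarked` are hypothetical names, HMAC-SHA-256 stands in for the keyed hash, and tokens are represented by integer ids that would come from the generating model's tokenizer.

```python
import hashlib
import hmac

def group_of(token_id: int, key: bytes, num_groups: int) -> int:
    """Keyed group assignment (assumption: HMAC-SHA-256 over the token id)."""
    digest = hmac.new(key, str(token_id).encode("ascii"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % num_groups

def is_watermarked(token_ids: list[int], key: bytes,
                   num_groups: int = 7, k: int = 2, tau: float = 0.5) -> bool:
    """Return True if the fingerprint score of token_ids exceeds the threshold tau."""
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form a transition")
    groups = [group_of(t, key, num_groups) for t in token_ids]
    allowed_offsets = {step % num_groups for step in range(1, k + 1)}
    hits = sum((b - a) % num_groups in allowed_offsets
               for a, b in zip(groups, groups[1:]))
    return hits / (len(groups) - 1) > tau
```

Each token costs one hash and one modular comparison, so the check runs in time linear in the text length and never calls the model.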

What we can prove

MCL isn’t just something that happens to work in our tests. We can actually prove four things about it.

First, false alarms become basically impossible for longer texts. The probability of accidentally flagging a normal human-written text as AI-generated drops fast as the text gets longer. This can be proven using a standard statistical concentration bound (Hoeffding). The False Positive Rate (FPR) shrinks exponentially with text length as follows:

$\mathrm{FPR} \le \exp\left(-2\,(n-1)\left(\tau - \frac{k}{S}\right)^2\right)$

(where $n$ is the number of tokens in the text, so the score averages over $n-1$ transitions, and $\tau$ is the threshold we use to decide “watermarked or not” (set at 0.5). A random unwatermarked text would score the ratio $k/S$ on average, where $k$ is how many groups are allowed at each step, and $S$ is the total number of groups.)

For S=7 with the soft cycle, this gives less than a 0.011% chance of a false alarm at 100 tokens, and less than 0.0000012% at 200 tokens (about 150 words). For anything longer than a short paragraph, false positives are almost impossible (Figure 3).
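These figures can be reproduced with a short calculation. The sketch assumes the bound takes the Hoeffding form exp(-2(n-1)(τ - k/S)²), with n-1 transitions for n tokens, τ = 0.5, and k = 2 allowed groups out of S = 7. That form is inferred from the quoted numbers, and the function name `fpr_bound` is hypothetical.

```python
import math

def fpr_bound(n_tokens: int, num_groups: int, k: int, tau: float = 0.5) -> float:
    """Hoeffding-style upper bound on the false positive rate."""
    gap = tau - k / num_groups  # distance between the threshold and the k/S baseline
    if gap <= 0:
        raise ValueError("threshold must sit above the k/S baseline")
    transitions = n_tokens - 1  # n tokens give n-1 consecutive pairs
    return math.exp(-2 * transitions * gap ** 2)

for n in (50, 100, 200):
    print(n, f"{fpr_bound(n, num_groups=7, k=2):.2e}")
```

For 100 and 200 tokens this gives roughly 1.1×10⁻⁴ and 1.2×10⁻⁸. Doubling the length squares the bound (up to the n-1 offset), which is why the curve falls so quickly.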

Figure 3: False positive rate drops exponentially with text length. Each line shows how FPR shrinks as text gets longer, for different state counts. The green dots mark two key milestones on the S=7 curve: at 100 tokens, FPR is already about 0.01% (≤ 1.1×10⁻⁴), and at 200 tokens it becomes vanishingly small (≤ 1.2×10⁻⁸). The dashed line marks the 0.01% reference level.

Second, we know exactly how much editing the watermark survives. If someone swaps out a fraction $\varepsilon$ of the tokens, the expected fingerprint score is:

$\mathbb{E}[\text{score}] = (1-\varepsilon)^2 + \left(1 - (1-\varepsilon)^2\right)\frac{k}{S}$

(where $\varepsilon$ is the fraction of words that someone changed (0 means no edits, 1 means every word was replaced), $k$ is how many groups are allowed at each step, and $S$ is the total number of groups.)

The first part of the formula captures the word pairs that survived untouched; the second part accounts for the fact that some of the new random words might accidentally follow the cycle just by luck.

Even if someone randomly swaps words, they would need to change about 45% of them before the expected score drops to the 0.5 threshold when S = 7 (soft cycle, k = 2). In practice, this means rewriting nearly half the text. The bigger challenge is modern AI paraphrasing tools, which can rewrite sentences smoothly without obvious edits. These might break the watermark with fewer changes. We discuss this further in the Limitations section.
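The 45% figure can be checked numerically. The sketch assumes the expected score under random replacement is (1-ε)² + (1-(1-ε)²)·k/S: a pair survives only if both of its tokens are untouched, and a disturbed pair matches by chance with probability k/S. That form is inferred from the description above and the quoted break-even point, and the function names are hypothetical.

```python
import math

def expected_score(eps: float, num_groups: int, k: int) -> float:
    """Expected fingerprint score after replacing a fraction eps of tokens at random."""
    intact = (1 - eps) ** 2  # both tokens of a pair survive
    return intact + (1 - intact) * k / num_groups

def break_even_fraction(num_groups: int, k: int, tau: float = 0.5) -> float:
    """Replacement fraction at which the expected score equals the threshold tau."""
    baseline = k / num_groups
    intact_needed = (tau - baseline) / (1 - baseline)
    return 1 - math.sqrt(intact_needed)

print(round(expected_score(0.30, num_groups=7, k=2), 3))  # 0.636
print(round(break_even_fraction(num_groups=7, k=2), 3))   # 0.452
```

At 30% replacement the expected score is still well above 0.5, which is consistent with detection remaining high at that level.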

Third, the watermark can't be forged or removed without the key. The secret key is the only way to know which group each token belongs to. Without it, the group assignments look completely random, so an attacker cannot reconstruct the pattern or produce a valid watermark.

Last, text quality takes a small, bounded hit. Restricting the vocabulary obviously affects text quality, but we can predict the cost, which grows only proportionally to log(S). On the reader's side, this is noticeable but not dramatic, and adding more groups makes it worse only slightly. In our experiments, perplexity increased from about 3.8 for unconstrained text to 4.2 with S=7.
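One way to see why a log(S)-type cost is plausible is the following sketch. It rests on simplifying assumptions and is not necessarily the authors' proof. Let $p$ be the model's next-token distribution, $A$ the set of tokens in the allowed groups, and $q$ the distribution obtained by restricting $p$ to $A$ and renormalizing. These symbols are introduced here for illustration.

$D_{\mathrm{KL}}(q \,\|\, p) = \sum_{x \in A} \frac{p(x)}{p(A)} \log \frac{p(x)/p(A)}{p(x)} = -\log p(A)$

If the keyed hash spreads probability mass evenly enough that $p(A) \approx k/S$, the per-token cost is about $\log(S/k)$, which for fixed $k$ grows like $\log S$.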

Experimental Results

We tested MCL using Llama-3.2-3B-Instruct (vocabulary of ~128,000 tokens) and the soft cycle (k=2) on 173 Wikipedia prompts spanning science, history, tech, and culture, each generating up to 200 tokens (~150 words). Across 28 configurations (7 group counts and 4 overlap levels), the evaluation tracked fingerprint scores, detection rate, false alarm rate, and perplexity. Partial results on Mistral-7B and GPT-2-XL are also available in the Appendix.

Here, the overlap level indicates how much vocabulary groups can share tokens. At 0% overlap, each token belongs to exactly one group, thereby strictly enforcing the cryptographic constraint. Higher overlap gives the model more flexibility but weakens the watermark signal quickly.

With S ≥ 5 and 0% overlap, every watermarked text scored higher than every non-watermarked text across all 173 prompts. A single threshold at 0.5 produced zero false positives and zero missed detections (Figure 4). Among all configurations that achieve perfect detection, S=7 produces the most natural-sounding text (perplexity 4.2), making it the best overall setting (Figure 5).

Figure 4: Perfect separation between watermarked and non-watermarked text. Each pair of violins shows the score distribution for watermarked (green, left) and non-watermarked (red, right) text. The dashed lines inside each violin mark the mean score. For every state count S ≥ 5, the two distributions are completely separated by the τ = 0.5 threshold, meaning zero false positives and zero missed detections.

Figure 5: Why S=7 is the sweet spot. Left: every state count from S=4 to S=15 achieves 100% detection at 0% overlap. Right: perplexity (text quality cost) varies across configurations, with S=7 producing the most natural-sounding text (lowest PPL of 4.2).

The robustness formulas also held up empirically. Randomly replacing words weakened the watermark about as predicted: observed scores matched the theoretical predictions within 1–6 percentage points, and detection stayed above 96% even when 30% of tokens were replaced (Figure 6). An attacker would need to corrupt nearly half the text to evade detection, at which point the meaning would be destroyed anyway.

Figure 6: Theory matches reality. Left: the theoretical formula (dashed orange) predicts the observed fingerprint scores (blue) closely, typically within 1-6 percentage points. Right: detection stays above 96% even with 30% of tokens randomly modified. To evade the watermark, an attacker would need to corrupt nearly half the text, destroying its meaning in the process.

Beyond the main findings, two failing configurations are worth flagging. The first is that S=2 with the soft cycle produces no signal at all: every transition is allowed, so there's nothing to detect. The second is that any overlap between groups degrades performance sharply: even 5% overlap at S=7 drops detection from 100% to 92%. Once tokens can belong to multiple groups, the cryptographic constraint no longer holds (Figure 7). This means MCL must be run at 0% overlap, but for Llama-3.2, the model still has roughly 18,000 tokens available at each step. So output quality is generally fine, though edge cases like code generation can be affected.

Figure 7: The overlap cliff. As the overlap ratio increases (x-axis), both the detection rate (green) and average score (blue dashed) drop steeply. At 0% overlap, detection is perfect. At just 5% overlap, it’s already down to 92%. The shaded red zone marks the “cliff region” where the cryptographic constraint breaks down.

Limitations and open questions

Three issues deserve upfront acknowledgment.

Tokenizer coordination. MCL operates at the token level, and different models tokenize text differently. To verify a watermark, you need to know which tokenizer was used. The technical fix is straightforward: include the tokenizer information in the metadata (using C2PA or something similar). But getting all AI companies to agree on a metadata format before August 2026? That’s a coordination problem, not a technical one, and those are often harder to solve.

Untested against intelligent attacks. Our robustness tests only swapped out random words. A determined attacker would likely use another AI to paraphrase the text, which is easy to do and works well against most token-level watermarks. Our math covers any kind of token change, so the formulas still hold no matter how the text is edited. But we haven’t actually tested against AI-powered paraphrasing yet. The 30% random-replacement survival rate is promising, but it doesn’t show what happens if another LLM rewrites the text. That's the most important open question, and we plan to address it in follow-up work (including testing against GPT-5 and Claude).

The overlap cliff. The sudden drop from perfect performance to barely working between 0% and 5% overlap needs more study. It may be possible to relax the cryptographic constraint into something softer that handles overlap more gracefully.

One promising direction is sparse watermarking. Instead of applying the rule to every word, we’d use it only every few words. This lets the AI write more freely and makes the text sound more natural, but the watermark would still remain.

Conclusion

In our experiments, MCL achieves 100% detection and 0% false alarms with a perplexity of 4.20 (S=7). The method works empirically and also rests on theoretical guarantees: exponentially low false positives, exact robustness formulas, cryptographic unforgeability, and bounded quality costs.

Of course, MCL alone doesn't satisfy every Article 50 requirement. You would still need metadata (like C2PA) and a backup statistical detector. But being able to verify without access to the original AI model is a real advantage, and we hope it will help with compliance as the August 2026 deadline approaches.

Thanks to Apart Research for hosting the Technical AI Governance Challenge.

You can check our code here: github.com/ChenghengLi/MCLW

Appendix A: Score separability

To check whether watermarked and non-watermarked texts could be confused, we compared the lowest score among the watermarked samples against the highest score among the non-watermarked samples in every setting with S ≥ 5 at 0% overlap. In every case, the two are completely separated (Figure 8).
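The worst-case comparison reduces to a single subtraction. The sketch below uses made-up scores purely for illustration; they are not the experimental data, and `separation_gap` is a hypothetical helper name.

```python
def separation_gap(watermarked_scores: list[float], plain_scores: list[float]) -> float:
    """Lowest watermarked score minus highest non-watermarked score.

    A positive gap means some single threshold separates the two sets perfectly.
    """
    return min(watermarked_scores) - max(plain_scores)

# Hypothetical scores for illustration only (not the experimental results).
watermarked = [0.97, 0.99, 0.94]
plain = [0.27, 0.31, 0.35]
gap = separation_gap(watermarked, plain)
print(gap > 0)  # True
```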

Figure 8: Complete score separation across all configurations. Each horizontal bar connects the worst-case scores: the red dot is the highest score any non-watermarked text achieved, and the green dot is the lowest score any watermarked text received. In every configuration, there’s a clear gap between the two, meaning a single threshold perfectly separates them.

Appendix B: Cross-model validation

To check MCL's generalization, we ran the same 173 prompts on Mistral-7B-v0.3 and GPT-2-XL.

All three models hit 100% detection at S=7 (Figure 9). The bigger model (Mistral-7B) handled the vocabulary restriction much better, reaching a perplexity of just 1.71 compared to 4.20 for Llama. This is plausible, since a larger model has more ways to work around the constraint while still producing coherent text.

Figure 9: MCL works across different model architectures. We tested the S=7 configuration on three models of different sizes: Llama-3.2 (3B), Mistral (7B), and GPT-2-XL (1.5B). All three achieve 100% detection. Larger models naturally handle the vocabulary constraint better, resulting in lower perplexity.