Don’t bring an AI detector to a deepfake fight: proving reality through multimodal provenance

Julien Despois

TL;DR: Detection-based approaches to fake media are fundamentally reactive and will always lose to generation. Images and videos should be treated as untrustworthy by default. The right reframe is provenance: not "is this fake?" but "what tells me it’s real?" Cryptographically signed, multimodal capture raises the cost of spoofing multiplicatively with each added sensor, making casual fakes prohibitively expensive for the general public. Combined with platform policy and legal liability for untagged generated content, this could solve most of the problem of widespread fake media.

Note: I wrote this post myself; the prose was partly refined with AI writing assistance.

I. The broken epistemic contract

Text is easy to fake, which is inconvenient. Now photos and videos are becoming easy to fake too. That, however, is a much bigger problem. Let's see why.

When we read something supposedly factual, we know that the written words don’t represent reality so much as someone's opinion, idea, or account of reality. We do not believe the text itself; we believe the information it conveys because we trust its source. Our entire epistemology of written information is built around that assumption: journalism standards, libel law, citation practices, witness testimony.

LLMs make fake text cheaper and faster to produce at scale, and that’s a problem, but not a new category of problem. We know not to blindly trust text, so we already have partial defenses: we cross-reference sources, check authorship, look for corroborating evidence. We are used to questioning any suspicious quote or tweet we encounter because we know how easy it is for anyone to write down words and falsely attribute them to someone.

Pictures, audio recordings, and by extension videos are a whole other kind of beast. We use them, for the most part, to capture the world as it is. This is the foundation of how we use cameras in courtrooms, journalism, insurance claims, criminal investigations. We believe images because they don't lie, or more precisely, because it is expensive and difficult to make them lie convincingly. At least it was. The three barriers that previously made convincing fakes hard (time, cost, and skill) have collapsed simultaneously, and there is no realistic upper bound on how much better generation will get. The contract of audio-visual media as a testimony of real life events is breaking down, and we have almost no infrastructure to replace it.

II. What happens when audiovisual media becomes untrusted

Before getting to the technical argument, it is worth being precise about what is actually at risk, because the public discourse tends to focus on the most visible harms (election interference, celebrity deepfakes, viral misinformation) while underweighting what I think is the more insidious damage.

The legal system depends on visual evidence. A video of an assault, a photograph of an injury, security camera footage of a crime: these are often the only proof that something happened. For many domestic abuse victims, whistleblowers, and people documenting police misconduct, a phone video is the only evidence they have. As deepfake technology matures, that evidence becomes systematically deniable.

Chesney and Citron, in their 2019 paper on deepfakes, named this the "liar's dividend": the mere existence of deepfake technology gives bad actors a plausible basis to dismiss authentic evidence as fabricated. A defendant who can point to the existence of deepfake technology and say "that video could have been generated" has introduced reasonable doubt, regardless of whether the video was actually manipulated. The damage to epistemic trust is not just about people being misled by fakes, but about real evidence as a whole becoming unreliable or inadmissible.

III. Why detection is the wrong battle

When your enemy builds a bigger cannon, you just get a bigger shield, right?

The instinctive response to fake media has been to build detectors. There are two main ways to detect a fake: either you point out what’s wrong with it (unrealistic output, generation artifacts), or you make sure the generation process includes a fingerprint to tag the content as generated (metadata, pixel manipulation schemes). This is the logic behind watermarking schemes like DeepMind's SynthID, as well as a growing body of academic work and countless tools for AI detection.

Both approaches have flaws, but, more importantly, I believe that this is the wrong frame entirely, and for a fundamental reason.

a. Detection

The problem of detection is that it’s stuck in an arms race, and it’s going to lose.

For humans, it is a game of diminishing tells. When diffusion models became widespread, people looked at generated hands to spot fakes. But hands got better. People then looked at local texture artifacts. And texture got better. Now people look at perspective and lighting consistency, and it would be naive to think this tell won't suffer the same fate.

For automated systems, the problem is more fundamental. Every time a detector is deployed, generators can be adversarially trained not to trip it (this is essentially how GANs work). Automated techniques rely on subtle statistical tells, but they suffer from the same flaw as human tells. They impress because they somewhat work when they’re released, before being made obsolete within months. Each new generation of models eliminates the artifacts that gave the previous one away, and there is no reason to think this process will ever stop.

There is unfortunately no absolute law of physics that mandates that generated content has to be distinguishable from real content, and that it is always possible to build a detector for it. For example, the question of whether the sentence “The cat jumped on the table” has been written by a human or an LLM is fundamentally impossible to answer because that sentence is “perfect”, i.e. indistinguishable from other real content.

If we want to tackle fake content for good, we need to assume that every kind of media generation will eventually become perfect. Even though images, sounds and videos offer much greater expressive potential (and therefore room for generation flaws) than text, we need to work with the premise that we will reach a point where all detectors have become useless^[1].

Perhaps the subtler danger of detection-first thinking is not that it fails, but that while it appears to work, it gives a false sense of safety and displaces the harder conversation about what a durable solution would actually look like.

b. Watermarking

The second way to detect generated content is to add a watermark during the generation process, which comes with its own distinct issues. Watermarks embedded in metadata are trivially stripped (most social media platforms do this automatically when you upload an image) while watermarks embedded in the signal itself can fail against adversarial image manipulation. If a watermark can be cropped out, or can’t survive being compressed, stretched, or deep fried, it is virtually useless^[2].
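To make the fragility of in-signal watermarks concrete, here is a deliberately naive sketch. It is not how SynthID or any production scheme works: it hides one bit in the least significant bit of each pixel value and then applies a crude quantization step as a stand-in for lossy compression. The function names, pixel values, and quantization step are all illustrative assumptions.

```python
def embed(pixels: list[int], bits: list[int]) -> list[int]:
    """Write one watermark bit into the least significant bit of each pixel."""
    return [(p & ~1) | b for p, b in zip(pixels, bits)]


def extract(pixels: list[int]) -> list[int]:
    """Read the least significant bit of each pixel."""
    return [p & 1 for p in pixels]


def quantize(pixels: list[int], step: int = 4) -> list[int]:
    """Crude stand-in for lossy compression: snap values to multiples of step."""
    return [min(255, step * round(p / step)) for p in pixels]


# Hypothetical 8-pixel grayscale strip and an 8-bit watermark.
pixels = [12, 200, 37, 90, 143, 66, 251, 5]
bits = [1, 0, 1, 1, 0, 1, 0, 1]

marked = embed(pixels, bits)
survives_copy = extract(marked) == bits                   # exact copy keeps the mark
survives_compression = extract(quantize(marked)) == bits  # quantization erases it
```

Robust schemes spread the signal far more cleverly than this, but the toy shows the failure mode the paragraph describes: a transformation that barely changes the picture can still wipe out the mark.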

More fundamentally, watermarking only tags content that comes from a system that applies the watermark. It does nothing about content generated by systems that do not, like most open-source models and every fine-tuned variant running on private infrastructure.

To build a robust system for fighting fake media, we must assume that the generated content will come from non-watermarked systems.

IV. A provenance-first approach with a multimodal twist

The deeper issue with both detection and watermarking is that they are asking the wrong question. They ask: "is this fake?", which is becoming increasingly unanswerable. The right one to ask is: "can we prove that it’s real?" Instead of trying to detect fakes, we should focus on certifying what is real.

After going back and forth on this problem for a while, I came to the conclusion that the solution had to involve cryptographic signing directly at the capture layer. We need a way to ensure that the content we see has been captured by a real device, and that it hasn't been edited since^[3]. It turns out others had the same instinct: the Birthmark Standard and the Signing Right Away architecture both push in this direction, and C2PA is already shipping in mainstream phones. The latter cryptographically signs media at the point of capture and attaches a tamper-evident manifest that tracks every subsequent edit.
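The idea of signing at capture and chaining every later edit can be sketched in a few lines. This is a toy under stated assumptions, not the C2PA manifest format: `DEVICE_KEY`, `add_entry`, and `verify` are hypothetical names, and an HMAC with a shared secret stands in for the asymmetric signature that a real device would produce with a private key held in secure hardware.

```python
import hashlib
import hmac
import json

# Hypothetical device secret. A real capture device would hold an asymmetric
# private key in secure hardware; HMAC is only a standard-library stand-in.
DEVICE_KEY = b"demo-device-key"


def sign(payload: dict) -> str:
    """Return a hex MAC over a canonical JSON encoding of the payload."""
    encoded = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(DEVICE_KEY, encoded, hashlib.sha256).hexdigest()


def add_entry(manifest: list, action: str, media: bytes) -> None:
    """Append an entry bound to the media hash and to the previous entry."""
    payload = {
        "action": action,
        "media_sha256": hashlib.sha256(media).hexdigest(),
        "prev_signature": manifest[-1]["signature"] if manifest else None,
    }
    manifest.append({**payload, "signature": sign(payload)})


def verify(manifest: list, media: bytes) -> bool:
    """Check every signature, every back-link, and the final media hash."""
    prev = None
    for entry in manifest:
        payload = {k: v for k, v in entry.items() if k != "signature"}
        if payload["prev_signature"] != prev:
            return False  # chain was reordered or an entry was dropped
        if not hmac.compare_digest(sign(payload), entry["signature"]):
            return False  # entry was altered after signing
        prev = entry["signature"]
    final_hash = hashlib.sha256(media).hexdigest()
    return bool(manifest) and manifest[-1]["media_sha256"] == final_hash


manifest = []
raw = b"raw sensor bytes"
add_entry(manifest, "capture", raw)
cropped = raw[:8]
add_entry(manifest, "crop", cropped)
```

Here `verify(manifest, cropped)` succeeds, while swapping in different bytes or rewriting an earlier entry makes it fail. Note what the chain does not establish: that the camera was pointed at a real scene.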

This is the right direction conceptually, but these approaches aren’t leveraged in a meaningful way by content platforms (a certificate is useless if no one checks it), and they remain largely monomodal, making them prone to spoofing. Cryptographically signing an RGB image proves that a particular camera produced a particular file. It does not prove that the certified camera was pointed at a real scene rather than a screen displaying a generated image. With that setup, you could easily create fake media with a valid provenance chain simply by capturing it with a certified device.

To prevent such spoofing, the extension I want to propose is to leverage physical consistency across multiple independent sensors. This would raise the cost of spoofing not linearly but multiplicatively and, as a result, significantly change the threat model. Modern phones and cameras already contain many sensors: multiple RGB cameras, depth sensors, microphones, IMUs (sensors that measure motion and orientation), GPS. Each of these independently captures something about the physical world. Spoofing one sensor is feasible. Spoofing all of them simultaneously, in a physically consistent way, is a much harder problem.

Consider what it would take to fake a video that passes multimodal verification. The RGB stream needs to show a convincing scene. At the same time, the depth sensor needs to show a consistent 3D geometry. The ambient audio needs to match the acoustic properties of the space implied by the depth map. The IMU needs to show motion consistent with how the person holds the phone in the video. GPS needs to place the device at a location consistent with the claimed context. Each additional sensor does not add linearly to the difficulty of spoofing but multiplies it, because the sensors need to be physically consistent with each other, not just individually plausible.
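A back-of-the-envelope sketch of the multiplication argument follows. Every number is made up for illustration, and treating the checks as independent is a simplifying assumption; the point is only that the chance of passing all checks is a product, so each added check shrinks it by a factor rather than by a fixed amount.

```python
import math

# Hypothetical chance that a forgery passes each check on its own.
pass_probability = {
    "rgb_looks_real": 0.9,
    "depth_matches_rgb": 0.3,
    "audio_matches_depth": 0.3,
    "imu_matches_camera_motion": 0.2,
    "gps_matches_context": 0.5,
}


def joint_pass_probability(checks: dict[str, float]) -> float:
    """Probability of passing every check, assuming the checks are independent."""
    return math.prod(checks.values())


rgb_only = joint_pass_probability({"rgb_looks_real": 0.9})
all_checks = joint_pass_probability(pass_probability)

# Average number of forgery attempts needed before one passes everything.
expected_attempts = 1 / all_checks
```

With these illustrative figures, a forgery that passes an RGB-only check nine times out of ten passes the full set less than one time in a hundred.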

Technical note: the consistency verification model itself could be learned rather than handcrafted. A self-supervised approach trained on large collections of legitimate multimodal recordings (smartphones produce millions of them every day) would naturally learn what physical coherence looks like across sensor streams without explicit labels. Crucially, the raw sensor outputs should also be made human-inspectable: a verification interface that surfaces the depth map, IMU trace, and audio envelope alongside the video gives the user the same cross-modal picture, without having to trust a black box.
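As a handcrafted stand-in for the learned model, the simplest possible cross-modal check compares how much the device moved (from the gyroscope) with how much the image moved (from frame-to-frame motion). The sketch below assumes both signals have already been reduced to one magnitude per frame; the sample values, the `motion_is_consistent` helper, and the 0.8 threshold are hypothetical.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def motion_is_consistent(gyro_speed: list[float],
                         frame_motion: list[float],
                         threshold: float = 0.8) -> bool:
    """True when image motion rises and falls together with device rotation."""
    return pearson(gyro_speed, frame_motion) >= threshold


# Hypothetical per-frame magnitudes.
gyro = [0.1, 0.4, 0.9, 0.5, 0.2, 0.1]            # device rotation speed
handheld_video = [1.2, 4.1, 8.7, 5.3, 2.0, 0.9]  # image moves with the device
screen_replay = [3.0, 3.1, 0.2, 6.0, 0.1, 4.0]   # image motion unrelated to device

handheld_ok = motion_is_consistent(gyro, handheld_video)
replay_ok = motion_is_consistent(gyro, screen_replay)
```

A learned model would capture far richer structure than a single correlation, but the same two series are exactly what a human-inspectable interface could plot side by side.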

V. Reality check: this system isn’t infallible, but that's not the point

No matter how technically sound the verification system is, a sufficiently resourced actor (a nation-state, a well-funded intelligence operation) might be able to construct a complex ad-hoc rig that spoofs all of the inputs simultaneously. That is an acceptable risk.

Many security systems share this property. Locks do not stop determined burglars with the right tools, yet we still put locks on doors. The value of a security measure is not that it is unbreakable; it is that it puts the cost of an attack above the threshold that most attackers are willing to pay.

The realistic threat model for fake media I’m concerned with here is not nation-states. It is the person with a grudge and a laptop generating a fake video of their ex-partner. It is the political operative producing low-cost propaganda at scale. It is the person who wants to discredit a witness in a civil case. For all of these actors, multimodal provenance verification raises the cost of a convincing fake from near-zero to prohibitively expensive. That's the win.

VI. What actually needs to happen

The technical provenance layer is necessary but not sufficient on its own. For us to be able to continue using images and videos as a credible snapshot of reality, two more things need to happen:

a. Platform Policy

A multimodal provenance standard is only useful if platforms treat provenance-tagged and untagged content differently. The goal is to create a clear epistemic tier: content with a verified provenance chain is treated differently from content without one, especially in high-stakes contexts like news, legal proceedings, and public health.

Once we expect every image to be linked to a page detailing the level of multimodal verification, we’ll learn to treat media as “unreliable until proven otherwise”, just like we do with text. It might seem exaggerated, as most of the media online might still be genuine (for a time), but I believe it’s the only framing that will still hold when generation becomes perfect.

In this setting, provenance verification isn't a simple binary trust/no-trust flag, but rather an indicator of how trustworthy a piece of media is, based on which captured modalities are available for scrutiny, and which have been verified to be globally coherent.
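One way such a graded indicator could be computed is sketched below. The modality names, the `ProvenanceReport` type, and the tier thresholds are hypothetical choices for illustration, not part of any existing standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceReport:
    signed: bool          # valid capture signature and edit chain
    available: frozenset  # modalities published for scrutiny
    coherent: frozenset   # modalities verified as mutually consistent


def trust_level(report: ProvenanceReport) -> str:
    """Map a report to a coarse tier; the thresholds are illustrative only."""
    if not report.signed:
        return "unverified"
    # Only modalities that are both inspectable and verified count.
    verified = report.available & report.coherent
    if len(verified) >= 4:
        return "high"
    if len(verified) >= 2:
        return "medium"
    return "low"


full = ProvenanceReport(
    signed=True,
    available=frozenset({"rgb", "depth", "audio", "imu", "gps"}),
    coherent=frozenset({"rgb", "depth", "audio", "imu"}),
)
rgb_only = ProvenanceReport(True, frozenset({"rgb"}), frozenset({"rgb"}))
unsigned = ProvenanceReport(False, frozenset({"rgb"}), frozenset())
```

In this sketch, a signed RGB-only capture lands in the lowest signed tier rather than being rejected outright, which matches the idea of an indicator rather than a binary flag.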

b. Legal Liability

The provenance system, if widely adopted by platforms, would largely make non-disclosure self-defeating (unverified content about real life events would carry a default trust penalty, or even an algorithmic one preventing it from becoming viral), but even in the best case scenario that adoption is years away. In the meantime, legal liability is the bridge: we need to make it legally costly to present untagged content as factual evidence. This does not mean criminalizing unverified media, but making it so that presenting generated content as authentic carries legal risk, the same way using a forged document already does.

Such a policy is completely orthogonal to the proposed multimodal provenance verification system, and could already be implemented today. There are many interesting uses for generative models, but I can’t find a single one that would justify not disclosing that the content has been AI-generated or doctored, much as ads in some countries are required to disclose when the model has been photoshopped. To be consistent, this should also extend to parody and satirical accounts, under the same umbrella of “presenting fiction as authentic content”. It should be easy and immediate to check the provenance of any content^[4].

Making it costly to present unverified content as factual evidence starting today will help bridge the gap while the infrastructure for multimodal provenance verification is being built. It is not a permanent substitute for the provenance layer but what we need while we wait for it.

Concluding thoughts

With a multimodal provenance verification system coupled with enforced platform compliance and legal liability, I believe that our media-based system has a chance to survive the imminent mass advent of flawless, unwatermarked generated content. Hiding the issue behind reassuring but temporary and ultimately pernicious AI detectors only increases the risk of a catastrophic rupture on the day our institutions realise that simple images, videos and recordings can't be trusted anymore.

Note: Right after I had finished writing this article, YouTube announced their effort to combat misinformation and AI generated content. It has some good sides to it, such as adding a visible mention of AI generated content, but also many of the flaws described in this article. In particular, they announce automatically flagging AI-generated videos, which suffers from all the detector issues we discussed. They also integrate C2PA metadata as a permanent disclosure signal, which is the right instinct but remains monomodal and strippable. Most importantly, disclosure is still voluntary for content not created with YouTube's own tools, with no legal consequence for non-compliance, meaning the people most likely to abuse the system are the least likely to use it. A bad actor using an open-source model on private infrastructure produces unwatermarked, unverified content that this system has no handle on. Better than nothing, but not a foundation we can build evidence standards on.

^[1] Having worked in image generation research for many years, I have witnessed the ridiculous pace at which advances are made. We always tend to overestimate how long it will take for image synthesis to reach a given quality threshold. We haven't reached perfect, flawless generation yet, but we're getting there, and it's coming sooner than we think.
^[2] DeepMind's SynthID-Image approach of perturbing pixel values in ways imperceptible to humans seems to be relatively robust to perturbations, but they acknowledge that 'achieving perfect security [through watermarking] is impossible'.
^[3] Or at least that the modifications are properly documented and/or reversible.
^[4] A reasonable objection is that satire needs to be believable to work, and a disclaimer would kill the joke. But verifiable origin is not the same as mandatory disclosure. A reader who wants to be fooled by The Onion can still be fooled, but a reader who wants to verify whether the news is real needs to be able to do so in one click.