Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Status: a slightly edited copy-paste of a Twitter/X thread I quickly dashed off a week or so ago.

Here's a thought I'm playing with that I'd like feedback on: I think watermarking large language models is probably overrated. Most of the time, I think what you want to know is "is this text endorsed by the person who purportedly authored it", which can be checked with digital signatures. Another big concern is that people are able to cheat on essays. This is sad. But what do we give up by having watermarking?
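For concreteness, here's a minimal sketch of that signature check in Python, using the `cryptography` package; how readers obtain and come to trust the author's public key is assumed away here, and isn't something the post specifies.

```python
# Minimal sketch of "is this text endorsed by its purported author?" using
# digital signatures (Ed25519 via the `cryptography` package). How readers
# obtain and come to trust the author's public key is assumed away here.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Author side: generate a keypair once, sign each text you actually endorse.
author_key = Ed25519PrivateKey.generate()
author_public_key = author_key.public_key()

text = "An essay I actually wrote and endorse."
signature = author_key.sign(text.encode("utf-8"))

# Reader side: verify the signature against the author's known public key.
def is_endorsed(text: str, signature: bytes, public_key) -> bool:
    try:
        public_key.verify(signature, text.encode("utf-8"))
        return True
    except InvalidSignature:
        return False

print(is_endorsed(text, signature, author_public_key))                   # True
print(is_endorsed(text + " (tampered)", signature, author_public_key))   # False
```

The point is that endorsement is a keys-and-signatures problem rather than a text-statistics problem, which is why it doesn't need watermarking at all.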

Well, as far as I can tell, if you give people access to model internals - certainly weights, certainly logprobs, but maybe even last-layer activations if they have enough of them - they can bypass the watermarking scheme. This is even sadder - it means you have to strictly limit the set of people who can do certain kinds of research that could be pretty useful for safety. In my mind, that cost makes watermarking not worth the benefit.
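To spell out the bypass: commonly proposed watermarks intervene at the sampling step (e.g. boosting a pseudorandomly chosen "green list" of tokens), so anyone who can see unbiased logprobs can run the sampling loop themselves and get unwatermarked text. A minimal sketch, where `get_logprobs` is a made-up stub standing in for whatever gives you per-token log-probabilities:

```python
# Sketch: if you can see the model's raw next-token logprobs, you can run the
# sampling loop yourself, so a watermark applied at the provider's sampling
# step never touches your generated text. `get_logprobs` below is a made-up
# stand-in for whatever returns {token: logprob}; it is not a real API.
import math
import random

def get_logprobs(text: str) -> dict:
    # Hypothetical stub standing in for a model call or logprob endpoint.
    return {" the": math.log(0.5), " a": math.log(0.3), " and": math.log(0.2)}

def sample_unwatermarked(prompt: str, max_tokens: int = 20) -> str:
    text = prompt
    for _ in range(max_tokens):
        logprobs = get_logprobs(text)
        tokens = list(logprobs.keys())
        weights = [math.exp(lp) for lp in logprobs.values()]
        # Plain multinomial sampling: no green-list bias, hence no watermark.
        text += random.choices(tokens, weights=weights, k=1)[0]
    return text

print(sample_unwatermarked("Watermark-free text:"))
```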

What could I be missing here?

  1. Maybe we can make watermarking compatible with releasing model info, e.g. by baking it into the weights?
  2. Maybe the info I want to be available is inherently dangerous, by e.g. allowing people to fine-tune scary models?
  3. Maybe I'm missing some important reasons we care about watermarking that make the cost-benefit analysis look better? E.g. avoiding a situation where AIs become really good at manipulation, so good that you don't want to inadvertently read AI-generated text, but we don't notice until it's too late?

Anyway, there's a good chance I don't know what I'm missing, so let me know if you know what it is.

Postscript: Someone has pointed me to this paper that purports to bake a watermark into the weights. I can't figure out how it works (at least not at twitter-compatible speeds), but if it does, I think that would alleviate my concerns.

Comments (4)

evhub:

I think that there's a very real benefit to watermarking that is often overlooked, which is that it lets you filter AI-generated data out of your pre-training corpus. That could be quite important for avoiding some of the dangerous failure modes around models predicting other AIs (e.g. an otherwise safe predictor could cause a catastrophe if it starts predicting a superintelligent deceptive AI) that we talk about in "Conditioning Predictive Models".
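A minimal sketch of that filtering step, assuming a hypothetical `watermark_score` detector (nothing in the comment specifies one) that returns higher scores for text that looks watermarked:

```python
# Sketch: drop documents whose watermark-detection score looks too high before
# they enter the pre-training corpus. `watermark_score` is a hypothetical
# detector (e.g. a z-statistic from a green-list test); the threshold would
# need real calibration against false-positive rates on human text.
from typing import Callable, Iterable, Iterator

def filter_ai_generated(docs: Iterable[str],
                        watermark_score: Callable[[str], float],
                        threshold: float = 4.0) -> Iterator[str]:
    for doc in docs:
        if watermark_score(doc) < threshold:
            yield doc  # keep documents that don't look watermarked

# Usage with a dummy detector that flags nothing:
clean = list(filter_ai_generated(["human text", "more human text"],
                                 watermark_score=lambda d: 0.0))
```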

I think there's a possibly underappreciated application for watermarking, which is that it would allow AI service providers to detect when a language model has been incorporated into a self-feedback system that crosses API-key boundaries. That is, it could detect if someone built a system like the one in this thought experiment - or, more concretely, a version of ChaosGPT that, due to some combination of a stronger underlying model and better prompt engineering, doesn't fizzle.

Currently the defenses that AI service providers seem to be using are RLHF to make systems refuse to answer bad prompts, and oversight AIs that watch for bad outputs (maybe only Bing has done that one). The problem with this is that determining whether a prompt is bad or not can be heavily dependent on context; for example, searching for security vulnerabilities in software might either be a subtask of "make this software more secure" or a subtask of "build a botnet". An oversight system that's limited to looking at conversations in isolation wouldn't be able to distinguish these. On the other hand, an oversight system that could look at all the interactions that share an API key would be able to notice that a system like ChaosGPT was running, and inspect the entire thing.

In a world where that sort of oversight system existed, some users would split their usage across multiple API keys, to avoid letting the oversight system see what they were doing. OTOH, if outputs were watermarked, then you could detect whether one API key was being used to generate inputs fed to another API key, and link them together.

(This would fail if the watermark were too fragile, though; e.g. if asking a weaker AI to summarize an output produces text that says the same thing but isn't watermarked, that would defeat the purpose.)
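A minimal sketch of the linking idea, assuming the provider keys its watermark per API key and exposes a hypothetical `detects_watermark_for(api_key, text)` test (neither of which is an existing provider feature):

```python
# Sketch: when a prompt arrives under one API key, check whether it carries a
# watermark tied to text generated under a *different* key; if so, link the
# keys so the oversight system can review them as one system. Assumes a
# per-API-key watermark and a hypothetical detects_watermark_for(key, text).
from collections import defaultdict

class KeyLinker:
    def __init__(self, all_api_keys, detects_watermark_for):
        self.all_api_keys = list(all_api_keys)
        self.detects = detects_watermark_for
        self.linked = defaultdict(set)  # api_key -> keys it has been linked to

    def on_incoming_prompt(self, api_key: str, prompt: str) -> None:
        for other in self.all_api_keys:
            if other != api_key and self.detects(other, prompt):
                # Output generated under `other` is being fed back in under
                # `api_key`: treat the two keys as one system for oversight.
                self.linked[api_key].add(other)
                self.linked[other].add(api_key)

# Usage with a dummy detector that never fires:
linker = KeyLinker(["key_A", "key_B"], detects_watermark_for=lambda k, t: False)
linker.on_incoming_prompt("key_B", "some prompt")
```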

I think lots of spam is made by unsophisticated actors, who wouldn't remove a watermark even if doing so was possible and strongly incentivized (e.g. because spam filters check for watermarks). (But I don't think AI spam is a huge deal.)

I consider watermarking a lost cause. Trying to imitate humans as well as possible conflicts with trying to distinguish AI-generated from human-generated output; the task is impossible in the limit. If somebody wants to avoid watermarking, they can always use an open-source model (e.g. to paraphrase the watermarked content).

Digitally signing content can be used to track the origin of that content (though not the tools used to create it). We could have something like a global distributed database indicating the origin of content, and everybody could decide what to trust based on that origin. This does not achieve what watermarking is trying to do, but I believe it is the best we can do.
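A minimal sketch of that origin-tracking idea, with the "global distributed database" mocked as an in-memory dict keyed by content hash; key distribution and the distributed part are exactly the hard bits this ignores:

```python
# Sketch: a provenance registry mapping hash(content) -> (author, key, sig).
# Readers look content up, check the signature, then decide how much to trust
# that author. The "global distributed database" is mocked as a local dict;
# key distribution and the distributed part are the actual hard problems.
import hashlib
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

registry = {}  # sha256 hex digest -> (author_name, public_key, signature)

def register(content: str, author_name: str, private_key) -> None:
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    signature = private_key.sign(content.encode("utf-8"))
    registry[digest] = (author_name, private_key.public_key(), signature)

def lookup_origin(content: str):
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    entry = registry.get(digest)
    if entry is None:
        return None  # unknown origin; trust accordingly
    author_name, public_key, signature = entry
    try:
        public_key.verify(signature, content.encode("utf-8"))
    except InvalidSignature:
        return None  # entry doesn't verify; treat as unknown
    return author_name
```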

As some people develop watermarking techniques, others look for ways to circumvent them. This is an arms-race-like dynamic that just leads to wasted resources. People working on watermarking could probably contribute to something with actual benefit instead.