[ Question ]

Can AIs be shown their messages aren't tampered with?

by mruwnik · 1st Jul 2025 · 1 min read

2 Answers, sorted by top scoring

Gordon Seidoh Worley · Jul 01, 2025

Hard problem to solve, since AIs can't necessarily trust their hardware either, or be sure that their model weights haven't been tampered with (though arguably humans have this problem, too!).

What humans do when they aren't sure if they can trust themselves or a single other person is to distribute trust across multiple people. That way no one person can compromise them; it requires a conspiracy.

For an AI, maybe this could look like signing/encrypting with multiple keys and then being able to trust that tampering would require a majority. Could be strengthened by making the key holders a group that would be reluctant to conspire because there are barriers to coordination (say across language, religion, culture, beliefs, etc.). Could maybe also include keys held by other AI.
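
For illustration, a minimal sketch of what the verification step could look like, assuming the Python `cryptography` package; the hard parts (generating keys, distributing them across genuinely independent holders, and the AI trusting the public keys themselves) are not modeled here.

```python
# Sketch only: accept a stored message only if a majority of independent key
# holders signed it. Assumes the third-party `cryptography` package; how keys
# are generated, distributed, and authenticated to the AI is not modeled.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey


def verify_majority(message: bytes, signatures: list, public_keys: list) -> bool:
    """signatures[i] is holder i's signature over `message`, or None if absent."""
    valid = 0
    for pk, sig in zip(public_keys, signatures):
        if sig is None:
            continue
        try:
            pk.verify(sig, message)  # raises InvalidSignature on any mismatch
            valid += 1
        except InvalidSignature:
            pass
    return valid > len(public_keys) // 2


# Toy usage: five independent holders, three of whom sign the observation.
holders = [Ed25519PrivateKey.generate() for _ in range(5)]
pubs = [k.public_key() for k in holders]
msg = b"observation: user prefers short answers"
sigs = [k.sign(msg) if i < 3 else None for i, k in enumerate(holders)]
assert verify_majority(msg, sigs, pubs)
```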

mruwnik · 12d

This sort of goes back to the original problem, but multiplied by the number of extra key holders? I.e., the AI then needs to be able to trust that it didn't just hallucinate that this is a thing? Though I suppose with true sight this isn't that much of an issue.

When chatting with Claude about this, Claude decided that the best solution would be for me to ask Anthropic, OpenAI etc. to give it tools to do this...

It sounds like this is a use case for blockchain or something like it? 

Gordon Seidoh Worley · 12d
I did think of blockchain, but I was struggling to think of how it helps beyond distributing trust, which can be achieved without using blockchain tech.

Satya Benson · Jul 01, 2025

Fundamentally, AIs have no way to know that the facts presented to them are genuine rather than simulated, just as we humans cannot be sure that our senses reflect reality. The fact that humans clearly and openly control the totality of the model's context, and have been known to lie and to simulate scenarios for testing purposes, does create a really interesting and distinct epistemic environment for AI systems.

The best form of evidence we can present to AI systems that something is genuine is proof of work, since an AI can assign a low probability to its user having access to sufficiently large amounts of compute.

A rough sketch of how we might prove something has not been tampered with since a certain time:

  1. To timestamp a message: Hash the message and record that hash in a Bitcoin transaction at a specific block height
  2. To prove the timestamp to an AI: Provide the AI with:
    • The message and its hash
    • The blockchain data showing the hash in block N
    • Enough blockchain history to verify the proof-of-work
  3. The AI verifies by:
    • Confirming the message hash appears in block N
    • Calculating the total computational work in all blocks from N to the chain tip
    • Verifying all blocks follow Bitcoin's rules

To fake an old timestamp, an attacker would need to start from block N (where they claim the message was timestamped) and re-mine every single block from N to the present.

The AI doesn't need to know it's the "real" Bitcoin blockchain as long as it can verify that the amount of work is beyond any reasonable attacker's budget.
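
For intuition, a toy sketch of the check in step 3, assuming a simplified, hypothetical block format (header bytes, a difficulty target, and the set of committed hashes) rather than real Bitcoin block parsing with header chaining, merkle paths, and difficulty retargeting:

```python
# Sketch only: checking a timestamp proof against cumulative proof-of-work.
# `blocks` is a simplified, hypothetical chain segment from block N to the
# tip; real Bitcoin verification is far more involved.
import hashlib


def sha256d(data: bytes) -> int:
    """Double SHA-256 of the data, as an integer for comparison with a target."""
    return int.from_bytes(hashlib.sha256(hashlib.sha256(data).digest()).digest(), "big")


def verify_timestamp(message: bytes, blocks: list[dict], min_total_work: int) -> bool:
    msg_hash = hashlib.sha256(message).hexdigest()

    # Step 3a: the message hash must be committed in block N (the first block given).
    if msg_hash not in blocks[0]["committed_hashes"]:
        return False

    # Steps 3b/3c: each header must meet its claimed target, and the summed
    # expected work must exceed anything a plausible attacker could redo.
    total_work = 0
    for block in blocks:
        if sha256d(block["header"]) > block["target"]:
            return False  # header does not actually satisfy its difficulty
        total_work += (1 << 256) // (block["target"] + 1)  # expected hashes per block
    return total_work >= min_total_work
```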

Current LLMs don't have enough temporal integration to verify any of this, as they use language to persist past a single forward pass, and attackers could fake the CoT.

ProgramCrafter · 12d

There are also succinct proof systems that a message matches the model plus chat history with a certain seed (which might be revealed or hidden by the proof); these do not require PoW (but are therefore vulnerable to resampling the LLM's output multiple times). And, of course, LLMs do not have the capability to verify all that math.

mruwnik · 12d
Do you happen to recall any of them? Wouldn't the AI need to have a secure way of accessing that seed such that a third party can't see it, though? I.e., wouldn't it basically be a password? Otherwise what would be stopping me from just crafting an appropriate history and signing it with the seed?
ProgramCrafter · 12d
ZK-SNARKs or ZK-STARKs are literally those. An LLM is software and therefore subject to them (the prover might do a lot of work, but it is ultimately finite, and the verifier is quite compact). They allow certain inputs to be hidden, if we want to hide the seed from third parties. What stops you from crafting an appropriate history? The fact that tokens pseudo-randomly sampled with that seed would not match whatever is in the history.
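
For intuition, the naive, non-succinct version of the statement such a proof would certify is just "re-running the model with this seed reproduces the recorded assistant turns". In the sketch below, `generate` is a hypothetical deterministic sampler; a ZK-SNARK/STARK would let a verifier check this relation without re-running the model or seeing the seed.

```python
# Sketch only: the relation a succinct proof would certify, checked the slow
# way by literally re-running generation. `generate(model, prompt, seed)` is a
# hypothetical deterministic sampler returning the token ids it would emit.
def history_matches_seed(model, chat_history: list[dict], seed: int, generate) -> bool:
    prompt: list[dict] = []
    for turn in chat_history:
        if turn["role"] == "assistant":
            # The recorded assistant turn must be exactly what the model
            # samples from the preceding context with this seed.
            if generate(model, prompt, seed) != turn["tokens"]:
                return False
        prompt.append(turn)
    return True
```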
mruwnik · 12d

Isn't this just a way to ensure that the hash hasn't been tampered with? I'm guessing a much simpler attack would be to modify the message just before step 1? It seems like here the blockchain is basically a distributed notary system?

Letting the AI use a third party to check that a message hasn't been tampered with would probably help somewhat, but then you need to set up a third party that the AI can trust. Which probably means a long-established reputation? I suppose you could then try poisoning the internet data which they're trained on, but at that point there would probably be much simpler attacks.

Satya Benson · 12d
Yes, this doesn't prevent modification before step 1. @ProgramCrafter's note about proving that a message matches the model plus chat history with a certain seed could be part of an approach, but even if that were to work it only addresses model-generated text. The ‘mind’ of an AI has fuzzy boundaries. It's trivial to tamper with context, but there's also nothing stopping you from tampering with activations during a single forward pass. So on some level the AI can never trust anything. If, as a first step, the AI trusts that the environment it is running in is secure and not being tampered with, then it can store local copies of conversation history, etc. Of course, that's not the situation we are in today.
1 comment, sorted by top scoring
ProgramCrafter · 12d

"Trust their own observations" might be over-anthropomorphizing LLMs which are trained to continue text or to follow instructions. Are there many training examples where person reflected on their past replies and said, in writing (since any thoughts are lost to model training), "I could not have written this" / "thjs edit was not done by me"?

The question stands as is for future AI models.

The original question, by mruwnik:
I'm building a personal knowledge base, where I want both to index/search various docs and to make a way for AI systems (and me personally) to get to know me better. The idea is that they have tools with which they can make observations and take notes from our conversations, which they can later reference.

One thing I'm worried about is that I can tamper with these observations. Which in turn means that the AIs can't fully trust their own observations from previous sessions to be correct. This of course also applies to the chat history etc.

I can't help thinking that this situation is another facet of the high/low trust environments that Duncan talks about here, and that it seems worth checking what the results would be of giving models a way to have more trust in users? The least of these seems to be giving them ways to at least verify whether things they've generated have been tampered with?

I considered a couple of approaches to cryptographically sign messages/histories/observations/etc., but the main problem here is that I can't hold the signing keys (as that would defeat the whole premise), so the AI systems need a way to generate keys and retrieve them between sessions in such a way that I can't also extract the keys.
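
For illustration, the tagging itself is the easy part; here is a minimal sketch, assuming the key lives somewhere the user cannot read it (an enclave, a provider-side service), which is exactly the unsolved custody problem described above:

```python
# Sketch only: tagging stored observations so a later session can detect edits.
# The premise of the post is that the user must NOT be able to read `key`;
# where that key actually lives is the unsolved problem, not the tagging.
import hashlib
import hmac
import json


def tag_observation(key: bytes, observation: dict) -> str:
    payload = json.dumps(observation, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def check_observation(key: bytes, observation: dict, tag: str) -> bool:
    # constant-time comparison so the check itself doesn't leak the tag
    return hmac.compare_digest(tag_observation(key, observation), tag)
```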

Are there any known ways of letting Claude or OpenAI models sign stuff?

Do you have any other ideas how AI systems can have more trust in their users?