Hard problem to solve, since AIs can't necessarily trust their hardware either, or trust that their model weights haven't been tampered with (though arguably humans have this problem too!).
What humans do when they aren't sure if they can trust themselves or a single other person is to distribute trust across multiple people. That way no one person can compromise them; it requires a conspiracy.
For an AI, maybe this could look like signing/encrypting with multiple keys and then being able to trust that tampering would require a majority. Could be strengthened by making the key holders a group that would be reluctant to conspire because there are barriers to coordination (say across language, religion, culture, beliefs, etc.). Could maybe also include keys held by other AI.
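To make the majority-of-key-holders idea concrete, here's a minimal sketch (not a real protocol) of accepting an observation only if at least k of n independent holders have signed it. It uses Ed25519 via the Python `cryptography` package; the holder names and the 2-of-3 threshold are purely illustrative, and key distribution is out of scope. A proper threshold-signature scheme (e.g. FROST) would be stronger than collecting independent signatures like this.

```python
# Sketch: treat a stored note as untampered only if a majority of
# independent key holders have signed it.
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)
from cryptography.exceptions import InvalidSignature


def count_valid_signatures(message: bytes,
                           signatures: dict[str, bytes],
                           public_keys: dict[str, Ed25519PublicKey]) -> int:
    """Count how many known key holders produced a valid signature."""
    valid = 0
    for holder, public_key in public_keys.items():
        sig = signatures.get(holder)
        if sig is None:
            continue
        try:
            public_key.verify(sig, message)
            valid += 1
        except InvalidSignature:
            pass  # a bad signature counts the same as a missing one
    return valid


def is_trusted(message: bytes,
               signatures: dict[str, bytes],
               public_keys: dict[str, Ed25519PublicKey],
               threshold: int) -> bool:
    """Tampering would require `threshold` holders to conspire."""
    return count_valid_signatures(message, signatures, public_keys) >= threshold


# Toy usage with three hypothetical key holders and a 2-of-3 threshold.
holders = {name: Ed25519PrivateKey.generate() for name in ("alice", "bob", "claude")}
note = b"2024-05-01: user prefers concise answers"
sigs = {name: key.sign(note) for name, key in holders.items()}
pubs = {name: key.public_key() for name, key in holders.items()}
assert is_trusted(note, sigs, pubs, threshold=2)
```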
This sort of goes back to the original problem, but multiplied by the number of extra key holders? i.e. then the AI needs to be able to trust that it didn't just hallucinate that this is a thing? Though I suppose with true sight this isn't that much of an issue.
When chatting with Claude about this, Claude decided that the best solution would be for me to ask Anthropic, OpenAI etc. to give it tools to do this...
It sounds like this is a use case for blockchain or something like it?
Fundamentally, AIs have no way to know that the facts presented to them are genuine rather than simulated, just as humans cannot be sure that their senses reflect reality. The fact that humans clearly and openly control the totality of the model's context, and have been known to lie and to simulate scenarios for testing purposes, does create a really interesting and distinct epistemic environment for AI systems.
The best form of evidence we can present to AI systems that something is genuine is proof of work, since an AI can assign a low probability to its user having access to sufficiently large amounts of compute.
A rough sketch of how we might prove something has not been tampered with since a certain time: (1) hash the message, (2) embed the hash in a Bitcoin block (e.g. via an OP_RETURN output), (3) later, recompute the hash, check that it is committed in that block, and check the proof of work accumulated on top of it since then.
To fake an old timestamp, an attacker would need to start from block N (where they claim the message was timestamped) and re-mine every single block from N to the present.
The AI doesn't need to know it's the "real" Bitcoin blockchain as long as it can verify that the amount of work is beyond any reasonable attacker's budget.
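For concreteness, here's a simplified sketch of what the verification side could look like, assuming the AI (or a tool it trusts) is handed the raw 80-byte Bitcoin block headers from the commitment block to the present. It only checks header linkage and accumulated work against a threshold; a real verifier would also check the merkle-inclusion proof tying the message hash to the first header, plus the difficulty-retargeting rules.

```python
import hashlib


def dsha256(data: bytes) -> bytes:
    """Bitcoin's double SHA-256."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def bits_to_target(bits: int) -> int:
    """Decode the compact 'nBits' field into the full 256-bit target."""
    exponent = bits >> 24
    mantissa = bits & 0x007FFFFF
    return mantissa << (8 * (exponent - 3))


def chain_work_is_sufficient(headers: list[bytes], min_total_work: int) -> bool:
    """Check that the headers form a linked chain and that their
    accumulated proof of work exceeds min_total_work."""
    total_work = 0
    prev_hash = None
    for header in headers:
        if len(header) != 80:
            return False
        # Layout: version(4) | prev_block(32) | merkle_root(32)
        #         | time(4) | bits(4) | nonce(4), little-endian fields.
        if prev_hash is not None and header[4:36] != prev_hash:
            return False  # chain linkage broken
        block_hash = dsha256(header)
        target = bits_to_target(int.from_bytes(header[72:76], "little"))
        if int.from_bytes(block_hash, "little") > target:
            return False  # header doesn't meet its own declared difficulty
        total_work += (1 << 256) // (target + 1)  # expected hashes per block
        prev_hash = block_hash
    return total_work >= min_total_work
```

The `min_total_work` threshold is exactly the argument above: faking an old timestamp means re-accumulating all of that work from block N to the tip, which the AI can judge to be outside its user's compute budget even without knowing whether this is the "real" chain.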
Current LLMs don't have enough temporal integration to verify any of this, as they use language to persist past a single forward pass, and attackers could fake the CoT.
There are also succinct proof systems showing that a message matches the model plus chat history with a certain seed (which the proof can reveal or keep hidden); these do not require PoW (but are therefore vulnerable to resampling the LLM's output multiple times). And, of course, LLMs do not have the capabilities to verify all that math.
Isn't this just a way to ensure that the hash hasn't been tampered with? I'm guessing a much simpler attack would be to just modify the message just before step 1? It seems like here the blockchain is basically a distributed notary system?
Letting the AI use a third party to check that a message hasn't been tampered with would probably help somewhat, but then you need to set up a third party that the AI can trust. Which probably means a long-established reputation? I suppose you could then try poisoning the internet data they're trained on, but at that point there would probably be much simpler attacks.
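A bare-bones version of that third-party check might just be an append-only digest log: the note-taking tool publishes a hash at write time, and later sessions re-hash the note and ask whether that digest was ever recorded. Everything about the endpoint below is hypothetical; the real question, as noted, is why the AI should trust whoever runs it.

```python
# Sketch of the "trusted third party" variant. The notary endpoint is
# hypothetical and stands in for any external append-only log.
import hashlib
import requests

LOG_URL = "https://notary.example.com/digests"  # hypothetical service


def publish(note: str) -> str:
    """At write time: record a digest of the note with the notary."""
    digest = hashlib.sha256(note.encode()).hexdigest()
    requests.post(LOG_URL, json={"digest": digest}, timeout=10).raise_for_status()
    return digest


def appears_untampered(note: str) -> bool:
    """At read time: re-hash the note and ask whether it was recorded."""
    digest = hashlib.sha256(note.encode()).hexdigest()
    resp = requests.get(f"{LOG_URL}/{digest}", timeout=10)
    return resp.status_code == 200  # 404 means "never notarized" or "modified"
```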
"Trust their own observations" might be over-anthropomorphizing LLMs which are trained to continue text or to follow instructions. Are there many training examples where person reflected on their past replies and said, in writing (since any thoughts are lost to model training), "I could not have written this" / "thjs edit was not done by me"?
The question stands as is for future AI models.
I'm building a personal knowledge base, where I want to both index/search various docs and give AI systems (and me personally) a way to get to know me better. The idea is that they have tools with which they can make observations and take notes from our conversations, which they can later reference.
One thing I'm worried about is that I can tamper with these observations. Which in turn means that the AIs can't fully trust their own observations from previous sessions to be correct. This of course also applies to the chat history etc.
I can't help thinking that this situation is another facet of the high/low trust environments that Duncan talks about here, and that it seems worth checking what the results would be of giving models a way to have more trust in users? The least of these seems to be giving them ways to at least verify whether things they've generated have been tampered with?
I considered a couple of approaches to cryptographically sign messages/histories/observations/etc., but the main problem here is that I can't hold the signing keys (as that would defeat the whole premise), so the AI systems need a way to generate keys and retrieve them between sessions in such a way that I can't also extract the keys.
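One possible shape is a signing service whose key is non-exportable (an HSM, a cloud KMS, or something run by the model provider), where the only operations exposed to the model's tools are sign and verify. A rough stand-in sketch, with the caveat that the in-memory key below exists only so the example runs; in a real setup only the remote service would ever hold it:

```python
# Stand-in for a remote signing service with a non-exportable key.
import hmac
import hashlib
import secrets


class ObservationSigner:
    """The model's tools see only sign/verify; the key never leaves the service."""

    def __init__(self) -> None:
        self._key = secrets.token_bytes(32)  # would live only in the HSM/KMS

    def sign(self, observation: bytes) -> str:
        return hmac.new(self._key, observation, hashlib.sha256).hexdigest()

    def verify(self, observation: bytes, tag: str) -> bool:
        expected = hmac.new(self._key, observation, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, tag)


signer = ObservationSigner()
note = b"2024-05-01: user asked about key custody"
tag = signer.sign(note)

assert signer.verify(note, tag)                     # untouched note checks out
assert not signer.verify(note + b" (edited)", tag)  # a tampered note does not
```

Of course this only relocates the trust problem to whoever operates the signing service, which loops back to the long-established-reputation question above.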
Are there any known ways of letting Claude or OpenAI models sign stuff?
Do you have any other ideas how AI systems can have more trust in their users?