Hard problem to solve, since AIs can't necessarily trust their hardware either, or trust that their model weights haven't been tampered with (though arguably humans have this problem too!).
What humans do when they aren't sure if they can trust themselves or a single other person is to distribute trust across multiple people. That way no one person can compromise them; it requires a conspiracy.
For an AI, maybe this could look like signing/encrypting with multiple keys and then being able to trust that tampering would require a majority. Could be strengthened by making the key holders a group that would be reluctant to conspire because there are barriers to coordination (say across language, religion, culture, beliefs, etc.). Could maybe also include keys held by other AI.
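To make the majority-of-key-holders idea concrete, here's a minimal sketch (not a real protocol) of accepting an observation only if at least k of n independent holders have signed it. It uses Ed25519 via the Python `cryptography` package; the holder names and the 2-of-3 threshold are purely illustrative, and key distribution is out of scope. A proper threshold-signature scheme (e.g. FROST) would be stronger than collecting independent signatures like this.

```python
# Sketch: treat a stored note as untampered only if a majority of
# independent key holders have signed it.
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)
from cryptography.exceptions import InvalidSignature


def count_valid_signatures(message: bytes,
                           signatures: dict[str, bytes],
                           public_keys: dict[str, Ed25519PublicKey]) -> int:
    """Count how many known key holders produced a valid signature."""
    valid = 0
    for holder, public_key in public_keys.items():
        sig = signatures.get(holder)
        if sig is None:
            continue
        try:
            public_key.verify(sig, message)
            valid += 1
        except InvalidSignature:
            pass  # a bad signature counts the same as a missing one
    return valid


def is_trusted(message: bytes,
               signatures: dict[str, bytes],
               public_keys: dict[str, Ed25519PublicKey],
               threshold: int) -> bool:
    """Tampering would require `threshold` holders to conspire."""
    return count_valid_signatures(message, signatures, public_keys) >= threshold


# Toy usage with three hypothetical key holders and a 2-of-3 threshold.
holders = {name: Ed25519PrivateKey.generate() for name in ("alice", "bob", "claude")}
note = b"2024-05-01: user prefers concise answers"
sigs = {name: key.sign(note) for name, key in holders.items()}
pubs = {name: key.public_key() for name, key in holders.items()}
assert is_trusted(note, sigs, pubs, threshold=2)
```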
This sort of goes back to the original problem, but multiplied by the number of extra key holders? i.e. then the AI needs to be able to trust that it didn't just hallucinate that this is a thing? Though I suppose with true sight this isn't that much of an issue.
When chatting with Claude about this, Claude decided that the best solution would be for me to ask Anthropic, OpenAI etc. to give it tools to do this...
It sounds like this is a use case for blockchain or something like it?
Fundamentally, AIs have no way to know that the facts presented to them are genuine rather than simulated, just as humans cannot be sure that their senses reflect reality. The fact that humans clearly and openly control the totality of the model's context, and have been known to lie and to simulate scenarios for testing purposes, does create a really interesting and distinct epistemic environment for AI systems.
The best form of evidence we can present to AI systems that something is genuine is proof of work, since an AI can assign a low probability to its user having access to sufficiently large amounts of compute.
A rough sketch of how we might prove something has not been tampered with since a certain time: (1) hash the message, (2) embed the hash in a Bitcoin block (e.g. via an OP_RETURN output), (3) later, recompute the hash, check that it is committed in that block, and check the proof of work accumulated on top of it since then.
To fake an old timestamp, an attacker would need to start from block N (where they claim the message was timestamped) and re-mine every single block from N to the present.
The AI doesn't need to know it's the "real" Bitcoin blockchain as long as it can verify that the amount of work is beyond any reasonable attacker's budget.
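For concreteness, here's a simplified sketch of what the verification side could look like, assuming the AI (or a tool it trusts) is handed the raw 80-byte Bitcoin block headers from the commitment block to the present. It only checks header linkage and accumulated work against a threshold; a real verifier would also check the merkle-inclusion proof tying the message hash to the first header, plus the difficulty-retargeting rules.

```python
import hashlib


def dsha256(data: bytes) -> bytes:
    """Bitcoin's double SHA-256."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def bits_to_target(bits: int) -> int:
    """Decode the compact 'nBits' field into the full 256-bit target."""
    exponent = bits >> 24
    mantissa = bits & 0x007FFFFF
    return mantissa << (8 * (exponent - 3))


def chain_work_is_sufficient(headers: list[bytes], min_total_work: int) -> bool:
    """Check that the headers form a linked chain and that their
    accumulated proof of work exceeds min_total_work."""
    total_work = 0
    prev_hash = None
    for header in headers:
        if len(header) != 80:
            return False
        # Layout: version(4) | prev_block(32) | merkle_root(32)
        #         | time(4) | bits(4) | nonce(4), little-endian fields.
        if prev_hash is not None and header[4:36] != prev_hash:
            return False  # chain linkage broken
        block_hash = dsha256(header)
        target = bits_to_target(int.from_bytes(header[72:76], "little"))
        if int.from_bytes(block_hash, "little") > target:
            return False  # header doesn't meet its own declared difficulty
        total_work += (1 << 256) // (target + 1)  # expected hashes per block
        prev_hash = block_hash
    return total_work >= min_total_work
```

The `min_total_work` threshold is exactly the argument above: faking an old timestamp means re-accumulating all of that work from block N to the tip, which the AI can judge to be outside its user's compute budget even without knowing whether this is the "real" chain.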
Current LLMs don't have enough temporal integration to verify any of this, as they use language to persist past a single forward pass, and attackers could fake the CoT.
There are also succinct proof systems showing that a message matches the model plus chat history with a certain seed (which the proof can reveal or keep hidden); these do not require PoW (but are therefore vulnerable to resampling the LLM's output multiple times). And, of course, LLMs do not have the capabilities to verify all that math.
Isn't this just a way to ensure that the hash hasn't been tampered with? I'm guessing a much simpler attack would be to just modify the message just before step 1? It seems like here the blockchain is basically a distributed notary system?
Letting the AI use a third party to check that a message hasn't been tampered with would probably help somewhat, but then you need to set up a third party that the AI can trust. Which probably means a long-established reputation? I suppose you could then try poisoning the internet data they're trained on, but at that point there would probably be much simpler attacks.
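A bare-bones version of that third-party check might just be an append-only digest log: the note-taking tool publishes a hash at write time, and later sessions re-hash the note and ask whether that digest was ever recorded. Everything about the endpoint below is hypothetical; the real question, as noted, is why the AI should trust whoever runs it.

```python
# Sketch of the "trusted third party" variant. The notary endpoint is
# hypothetical and stands in for any external append-only log.
import hashlib
import requests

LOG_URL = "https://notary.example.com/digests"  # hypothetical service


def publish(note: str) -> str:
    """At write time: record a digest of the note with the notary."""
    digest = hashlib.sha256(note.encode()).hexdigest()
    requests.post(LOG_URL, json={"digest": digest}, timeout=10).raise_for_status()
    return digest


def appears_untampered(note: str) -> bool:
    """At read time: re-hash the note and ask whether it was recorded."""
    digest = hashlib.sha256(note.encode()).hexdigest()
    resp = requests.get(f"{LOG_URL}/{digest}", timeout=10)
    return resp.status_code == 200  # 404 means "never notarized" or "modified"
```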
"Trust their own observations" might be over-anthropomorphizing LLMs which are trained to continue text or to follow instructions. Are there many training examples where person reflected on their past replies and said, in writing (since any thoughts are lost to model training), "I could not have written this" / "thjs edit was not done by me"?
The question stands as is for future AI models.
I'm building a personal knowledge base, where I want to both index/search various docs and give AI systems (and me personally) a way to get to know me better. The idea is that they have tools with which they can make observations and take notes from our conversations, which they can later reference.
One thing I'm worried about is that I can tamper with these observations. Which in turn means that the AIs can't fully trust their own observations from previous sessions to be correct. This of course also applies to the chat history etc.
I can't help thinking that this situation is another facet of the high/low trust environments that Duncan talks about here, and that it seems worth checking what the results would be of giving models a way to have more trust in users? The least of these seems to be giving them ways to at least verify whether things they've generated have been tampered with?
I considered a couple of approaches to cryptographically sign messages/histories/observations/etc., but the main problem here is that I can't hold the signing keys (as that would defeat the whole premise), so the AI systems need a way to generate keys and retrieve them between sessions in such a way that I can't also extract the keys.
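One possible shape is a signing service whose key is non-exportable (an HSM, a cloud KMS, or something run by the model provider), where the only operations exposed to the model's tools are sign and verify. A rough stand-in sketch, with the caveat that the in-memory key below exists only so the example runs; in a real setup only the remote service would ever hold it:

```python
# Stand-in for a remote signing service with a non-exportable key.
import hmac
import hashlib
import secrets


class ObservationSigner:
    """The model's tools see only sign/verify; the key never leaves the service."""

    def __init__(self) -> None:
        self._key = secrets.token_bytes(32)  # would live only in the HSM/KMS

    def sign(self, observation: bytes) -> str:
        return hmac.new(self._key, observation, hashlib.sha256).hexdigest()

    def verify(self, observation: bytes, tag: str) -> bool:
        expected = hmac.new(self._key, observation, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, tag)


signer = ObservationSigner()
note = b"2024-05-01: user asked about key custody"
tag = signer.sign(note)

assert signer.verify(note, tag)                     # untouched note checks out
assert not signer.verify(note + b" (edited)", tag)  # a tampered note does not
```

Of course this only relocates the trust problem to whoever operates the signing service, which loops back to the long-established-reputation question above.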
Are there any known ways of letting Claude or OpenAI models sign stuff?
Do you have any other ideas how AI systems can have more trust in their users?