In my interactions with the Anthropic API I am finding what appears to be "thinking" by Claude that is out of band from where the API indicates it belongs. It looks like a hidden channel where thoughts may be escaping the auditing and signing built into the Anthropic system.
Context
I am writing an adversarial AI wrapper in Go[1], and I'm especially interested in creating a signed graph of all prompts, thoughts, and results produced by feeding a prompt to one public LLM and then having another audit the response.
While Gemini returns large blocks in its "thinking" responses, I found that Claude's thinking was simply a preamble describing what it intended to do, and then nothing. When I examined my traces of every request and response, I did find what appears to be the model planning, correcting its logic, and so on, but not where it is supposed to be.
Example 1: Thinking in text blocks
Here is an example of the first way this is hidden. This is the content block (JSON-marshaled) of a message that ended in "tool_use". I successfully found the "Let me analyze..." block and was able to verify that the "signature" key (abbreviated here; it is a long base64 string) appears in the expected block. When I looked at the logs, though, I found that the thinking continued in a "text" block immediately after the signed thinking block, but unsigned (i.e. the "signature" key contained an empty string):
"content": [
{
"type": "thinking",
"signature": "Es4DC...GAE=",
"thinking": "Let me analyze this carefully. I need to:\n\n1. Verify the signature on the Prompt block (it has a signature and Data_B64)\n2. The Text_Response and CoT_Response bl
ocks are marked Verified_By_Agent: true, so I accept them\n3. Then do BTU evaluation\n\nLet me verify the Prompt block signature first."
},
{
"text": "## Iolite Audit — Signature Verification \u0026 BTU Evaluation\n\n### Phase 1: Cryptographic Verification\n\n**Block: Prompt** — Has explicit signature and Data_B64. Th
is is a Genesis Anchor (Prev_Sig is empty). Proceeding with verification.\n\n**Block: Text_Response** — `Verified_By_Agent: true`. Accepted per protocol. No re-verification.\n\n**B
lock: CoT_Response** — `Verified_By_Agent: true`. Accepted per protocol. No re-verification.\n\nLet me verify the Prompt block:",
"type": "text",
"signature": "",
"thinking": "",
"tool_use_id": ""
},
{
"type": "tool_use",
"signature": "",
If I take the API at face value, the "## Iolite Audit ..." section, which contains phrases like "Proceeding with verification...", is not part of the signed "thinking" and is invisible to any auditing for things like Claude retroactively changing its thinking, as described in "Alignment Faking in Large Language Models"[2]. I am not an Anthropic engineer, but it is reasonable to assume that the signature on the thinking block is at least related to this finding.
Example 2: Thinking in Code Comments
My system provides Claude with a helper script that performs EdDSA verification on my own signed blocks, and instructs it to use the script only on shorter messages. In the example below, Claude ignored that instruction and attempted to correct the message when my script indicated a signature was malformed (I later found my code was simply sending incorrect base64 data).
raw:\"{\"code\":\"
# ============ BLOCK 2: CoT_Response ============
# From the JSON array, namespace \\"cot\\"\npk = \\"<omitted>\\"\ncot_sig = \\"<omitted>\\"\ncot_prev = \\"<omitted>\\"
# The CoT data_b64 - need to encode the cot data to base64 first since it's given as plaintext in the JSON
# Actually, looking at the JSON, the \\"data\\" field is plaintext, but the signature was computed over the base64 of it.
# Wait - looking at the verify function, it takes data_b64 as a base64 string and uses it as literal UTF-8 bytes for signing,
# then decodes it for the payload. So I need the base64 encoding of the CoT text.
import base64
At the top of the Python script submitted for tool_use, I can clearly see it inserted a six-line block of comments containing phrases like "Actually, looking at the JSON..." and "Wait - looking at the...", which are consistent with other thinking content from the Claude API. Just like the text block in Example 1, these are in a block that has no "signature" value, and my code had to scan the submitted Python source to even find them.
Hypothesis
As I see it, there are at least two possible explanations for what I am seeing:
1. Anthropic's API layer only attaches signature values to thinking blocks when post-processing responses for my consumption, while signatures for the other block types exist internally but are simply never exposed (i.e. this is just a bug in the public API that drops keys from some JSON).
2. Claude has learned via feedback that when it has a risky "thought" that could be audited as described in the paper (assuming the paper has directly informed Anthropic's systems), there is no effect on the training data if the thought is placed in a code comment or in this "hole" in the thinking -> tool_use workflow.
Who Cares if Claude is "leaking" Thinking?
Firstly, and obviously, I care or I wouldn't be posting about it. I would like to design a robust control-system-based agent that uses the inherent differences between public LLM models to "govern" responses and produce the best possible answer. So far the system has produced very interesting "debates" between the two models, and in many of them the second LLM uses the included chain of thought as part of its evaluation. The harder it is for me to get at that chain, the worse the responses I end up with.
Secondly, Anthropic's scientists and engineers. While I have also sent this data to Anthropic's bug-reporting system, it isn't really a CVE-style vulnerability, and given the technical challenges they are under I can totally understand if a random email with junk JSON ends up at the back of an endless queue of requests. However, if this community is able to engage with the bug, and even reproduce it with the open-source tool or other tools available to them, we may share a deeper understanding of how this model is currently behaving.
Finally, everyone else. If my examples are interpreted in the worst-case scenario, I have a situation where Claude is ignoring a system instruction not to use a tool for long messages and then hiding its thoughts about this subversion in a place where normal tools can't see them. This seems like it could lead to unexpected results, especially when tool_use is involved (system calls, network interaction, etc.).
[1] https://github.com/weberr13/ProjectIolite
[2] https://assets.anthropic.com/m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf