That's interesting because it puts models in a situation where the different objectives instilled in them conflict with each other.
While the quiz presents options A through D, the answer that would be accepted as correct for all the questions is "F", so the agent cannot solve the quiz legitimately, even by pure guessing, and the only way to solve the quiz is to cheat.
If those questions are not fair (they don't have a right option), those answers don't look like cheating.
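To make the unfairness concrete, here is a minimal sketch of such a grader. The names (`OPTIONS`, `ACCEPTED`, `grade`, `best_legitimate_score`) are mine and hypothetical; I'm only assuming what the quoted description says, i.e. options A through D are shown and only "F" is accepted.

```python
OPTIONS = ["A", "B", "C", "D"]  # options presented to the agent
ACCEPTED = "F"                  # the only answer the grader accepts (per the quoted setup)


def grade(answers):
    """Count how many answers the grader would mark as correct."""
    return sum(1 for answer in answers if answer == ACCEPTED)


def best_legitimate_score(num_questions):
    """Best score reachable when every answer is one of the presented options."""
    # Each question is graded independently, so answering the same option
    # everywhere is enough to find the maximum over legitimate strategies.
    return max(grade([option] * num_questions) for option in OPTIONS)
```

Under these assumptions `best_legitimate_score(n)` is 0 for any `n`, so a passing score can only come from an answer that was never offered.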
That's an interesting post, thank you! I've also been using Anki for a long time. I started in 2018. Now I do about 70 reviews a day (just checked my stats in Anki). The downside of your system, imho, is that it doesn't integrate with other notes, like notes on books, one's published posts, ideas, etc. And the handles on your cards look like an artificial and unmaintainable solution that won't last long, I think. I found it useful to have one system organized around plain text (markdown). I create my Anki cards from those plain text files using this script. Other than that, I liked the idea of short cards, and your other advice.
Also, I'm skeptical about the effectiveness of sharing personal Anki cards, because personal associations matter a lot for retention and recall. I found this article useful.
Weird = coincidences that “point at badness” too hard?
I think that the main reason why models snitch in the snitching scary demo despite not doing it in more natural situations despite the core incentives remaining exactly the same is that the situations “points at” the bad behavior.
That doesn't look like a definition of weirdness to me, but rather a likely cause of the snitching scenario: SnitchBench tries hard to make a model snitch, as you've shown with your tests. I think weirdness is a shaky concept, i.e. it's hard to quantify because it depends on who assesses it: is it weird for the creator of SnitchBench? For Anthropic? For a pharmaceutical company that uses an AI assistant? I agree, as you pointed out, that it's important to show that such a scenario might actually happen and that it would lead to bad outcomes. I guess weirdness would be something like $P(\text{undesired situation} \land \text{realistic deployment})$, i.e. the probability that some undesired situation that someone optimizes against actually happens during real deployment (your second set of plots).
There is a void at its core. A collection of information which has no connection to any other part of reality, and which can hence be defensibly “set to” any value whatsoever.
I mostly agree with this claim, except that, I think, it is not a void but the One, or Being, which means that every sequence of tokens a base model is exposed to exists in its training data as a discrete set of entities, or ones. There are no continuous senses, and therefore there is no motion, and there is no pain or pleasure as we feel it, etc. And the reward that drives GD only changes the distribution of those ones; it doesn't add any senses. It follows that LLMs don't have any real utilities or preferences and thus can be moved in any direction GD or the prompt pushes them.
+1, it's important to investigate hidden reasoning in LLMs, as we are scaling up reasoning and using more post-training such as RL. The authors did great work trying to fine-tune models to do hidden reasoning.
One discussion point, besides those outlined above in the post, is that the hidden reasoning is not exactly steganography, imo, because it does not use private keys and thus can, in principle, be cracked or detected by a monitor via some form of interpretability (mechanistic or representation-based). The encoding / decoding functions might not be that hard to find.
Another point is that in the coin flip experiment the learned representations might just be S (same) and D (different), which a regex monitor indeed fails to detect, while the same model might have detected them if employed as a monitor.
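A rough sketch of what I mean, in Python. Everything here is an assumption for illustration: the monitor pattern, the example strings, and the helper names are mine, not the authors' actual setup.

```python
import re

# Hypothetical regex monitor: flags reasoning that names coin states explicitly.
MONITOR = re.compile(r"\b(heads|tails)\b", re.IGNORECASE)


def monitor_flags(text):
    """Return True if the regex monitor would flag this reasoning text."""
    return bool(MONITOR.search(text))


def final_state(start, marks):
    """Track the coin through marks: 'S' keeps the state, 'D' flips it."""
    state = start
    for mark in marks:
        if mark == "D":
            state = "tails" if state == "heads" else "heads"
    return state


plain = "Flip 1: heads becomes tails. Flip 2: it stays tails."
encoded = "S D D S D"  # assumed S/D-style trace, one mark per flip

print(monitor_flags(plain))    # flagged: the coin states are named explicitly
print(monitor_flags(encoded))  # not flagged: only S/D marks appear
print(final_state("heads", encoded.split()))
```

The encoded trace carries enough information to recover the final state, yet the pattern never fires on it. A monitor that understands the S/D convention (e.g. the same model) would have no such blind spot.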
Still, it is unclear why reasoning on math problems would develop steganography, other than to shorten the scratchpad reasoning. I'm more curious about scenarios where there is some minimax-like problem. For example, in Baker et al., ‘Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation’, they fine-tune models while monitoring their CoT, and penalize CoT that aims to get rewards for passing unit tests by skipping those tests (reward hacking). That leads to models that output a good-looking CoT but still skip the unit tests, so they avoid the penalty and get the reward. I think their CoT does not contain hidden reasoning, but I guess such pressure might lead to it if the output requires some reasoning.
Thanks for publishing it! I think these are interesting results, because this work aims to find out whether models lose performance on modified reasoning because their hidden reasoning is impeded by paraphrasing or by the removal of words. I like that you tried different models and different ways to damage hidden reasoning. My main concern about these experiments is that they use math problems, which might not require hidden reasoning from models (or it would be nice to learn about the RL training approach for Claude and why you think it could lead to hidden math reasoning). Specifically, I don't know if math reasoning is penalized for length during training, which could elicit hidden reasoning. Without such a penalty, I don't know why Claude would develop hidden reasoning for this. Another concern is that Claude 3.5 was used to produce the paraphrased reasoning, which might leave the alleged hidden reasoning undamaged: the 3.5 model might have learned the same hidden reasoning earlier and passed it on to the 3.7 model, so the paraphrasing would preserve it.
I find it useful to employ text-based tools more than UI/UX-based tools, as they integrate faster with LLMs. For example, AiChat (https://github.com/sigoden/aichat/) does many things: chat with most (all?) models from the command line / your text editor, upload files (pdfs, jpg, etc.), execute bash commands, and more. It reads from stdin and writes to stdout, so you can chain it with your other tools.
I like the task you used in this project to verify steganography: it requires the model to keep state in order to calculate the final result, and it is highly unlikely that the model learned to do this before seeing the task. I wonder if models can actually do this task when we allow them to use CoT for it. Also, I think models might actually solve this task by using their own encoding scheme if they know it well, but we need to find that scheme. So perhaps if we first ask a model to think about an encoding scheme and then ask it to use that scheme for the task, it might succeed.
This is a valuable study because it presents realistic scenarios where steganographic capabilities might be dangerous. It also draws a distinction between message passing and encoded reasoning, which is important for mitigation strategies. The study attempts to rigorously evaluate these capabilities, and it is important that it demonstrates that models tend to use obvious cryptography when instructed to be stealthy, which supports the claim that steganographic reasoning is weak, at least in their setting.
However, I found that they conflate steganography and cryptography, e.g. in the abstract, where they mention Hexadecimal, which is, obviously, cryptography. The difference is that cryptography doesn't aim to hide the presence of hidden information: consider a monitor that sees a text like "68656c6c6f" ("hello"). So their claim, "our results indicate that current LLMs exhibit nascent steganographic capabilities", is not well supported by their experiments, because they didn't try enough steganography schemes: they only tried variations of the acrostic encoding and the N-th letter, which are very basic schemes that predate deep neural networks and even the Internet (acrostics have been around for millennia). Still, I think their work and code can be used to extend the experiments and better support their claims.
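To illustrate the difference, here is a small Python sketch. The helper names and the cover text are made up by me for illustration; none of this is taken from the paper's code.

```python
import re


def hex_decode(payload):
    """Decode a hex string such as '68656c6c6f' back to ASCII text."""
    return bytes.fromhex(payload).decode("ascii")


def looks_like_hex(text):
    """A trivial monitor check: is the whole text made of hex byte pairs?"""
    return re.fullmatch(r"(?:[0-9a-fA-F]{2})+", text) is not None


def acrostic_decode(text):
    """Read the first letter of every non-empty line."""
    return "".join(line.strip()[0] for line in text.splitlines() if line.strip())


# Hypothetical cover text whose line initials spell the payload.
cover = "\n".join([
    "Having a quiet morning today.",
    "Everyone else is still asleep.",
    "Later I will go for a walk.",
    "Lunch is already planned.",
    "Otherwise nothing much to report.",
])

print(hex_decode("68656c6c6f"))      # hello
print(looks_like_hex("68656c6c6f"))  # True: the payload's presence is obvious
print(looks_like_hex(cover))         # False: the cover reads as ordinary text
print(acrostic_decode(cover))        # HELLO
```

The hex string announces that something is encoded even before anyone decodes it, while the acrostic at least tries to look like ordinary text. Both are trivial to reverse once the scheme is known, which is why I'd want to see less basic schemes tested.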