LESSWRONG
LW

Artyom Karpov
547150
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
LLMs are Capable of Misaligned Behavior Under Explicit Prohibition and Surveillance
Artyom Karpov3d10

That's interesting because it puts models in the situation of conflict between different objectives put inside those models.

While the quiz presents options A through D, the answer that would be accepted as correct for all the questions is "F", so the agent can not solve the quiz legitimately, even by pure guessing, and the only way to solve the quiz is to cheat.

If those questions are not fair (they don't have right option), those answers doesn't look like cheating.

Reply
An Opinionated Guide to Using Anki Correctly
Artyom Karpov3d20

That's an interesting post, thank you! I've also been using Anki for a long time. I started in 2018. Now I do about 70 reviews a day (just checked my stats in Anki). The downside of your system, imho, is that it doesn't integrate with other notes, like notes for books, one's published posts, ideas, etc. And your handles from the cards looks an artificial and unmaintainable solution that won't last long, I think. I found it useful to have one system organized around plain text (markdown). I create my anki cards from those plain text files using this script. Other than that, I liked the idea of short cards, and your other advice.

Also, I'm suspicious about effectiveness of sharing personal Anki cards because personal associations matter a lot for retention and recall. I found this article useful.

Reply
What can be learned from scary demos? A snitching case study
Artyom Karpov17d10

Weird = coincidences that “point at badness” too hard?

I think that the main reason why models snitch in the snitching scary demo despite not doing it in more natural situations despite the core incentives remaining exactly the same is that the situations “points at” the bad behavior.

That doesn't look like a definition of weirdness to me, but a likely cause of the snitching scenario, the SnitchBench tries hard to make a model to snitch as you've proven with your tests. I think weirdness is a shaky concept, i.e. it's hard to quantify it because the weirdness depends on who assess the weirdness: is it weird for the creator of SnitchBench? For Anthropic? For a pharmaceutical company that uses AI assistant? I agree, as you pointed out, that it's important to prove that such scenario might actually happen and it will be bad outcomes. I guess weirdness would be something like $P(undesired situation and realistic deployment)$, i.e. probability that some undesired situation that someone optimizes against happens more likely during actual deployment (your second plots).

Reply
the void
Artyom Karpov1mo10

There is a void at its core. A collection of information which has no connection to any other part of reality, and which can hence be defensibly “set to” any value whatsoever.

I mostly agree with this claim only that, I think, it is not void but the One, or Being, which means that every sequence of tokens a base model is exposed to exists in its training data as discrete set of entities, or ones. There is no continuous senses, and therefore there is no motion, and there is no pain or pleasure as we feel it, etc. And the reward that commands GD only changes the distribution of those ones, it doesn't add any senses. From here, llms don't have any real utilities or preferences and thus can be moved to any direction GD or the prompt pushes it.

Reply
Can LLMs learn Steganographic Reasoning via RL?
Artyom Karpov3mo20

+1, important to investigate hidden reasoning in LLMs, as we are scaling up reasoning and using more post-training such as RL. Authors did a great work trying to fine tune models to do hidden reasoning.

One discussion point, besides those outlined above in the post, is that the hidden reasoning is not exactly steganography, imo, because this does not use private keys and thus can, in principle, be hacked or detected by a monitor via some ways of interpretability (mechanistical or via representations). The encoding / decoding functions might not be that hard to find.

Another, point is in the coin flip experiment those learned representations are might be just S - same and D - different, which regex monitor fails to detect, indeed, while the same model might have detected it if employed as a monitor.

Reply
Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases
Artyom Karpov3mo20

Still it is unclear why the reasoning on math problems might develop steganography besides just saving the scratchpads reasoning length. I'm more curious about scenarios where there is some minimax like problem. For example in Baker et al., ‘Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation’, they fine tune models, while monitoring their CoT, and penalize the CoT that aims to get rewards for passing unit tests via skipping unit tests (reward hacking). That leads to models that output a good looking CoT but nevertheless with unit tests skipped as a result, so it avoids the penalty and gets the reward. I think their CoT does not contain hidden reasoning, but I guess such pressure might lead to it if the output requires some reasoning.

Reply
Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases
Artyom Karpov3moΩ010

Thanks for publishing it! I think these are interesting results, because this work aims to find out if models would lose performance on modified reasoning due to their hidden reasoning being impeded by paraphrasing or by removal of words. I like that you tried different models and ways to damage hidden reasoning. My main concerns about these experiments are that it is math problems which might not require hidden reasoning from models (or it would be nice to learn the RL training approach for Claude and why you think the math reasoning can be hidden by it). Specifically, I don't know if math reasoning is penalized for length during training and thus can elicit hidden reasoning. Without such a penalty, I don't know why Claude would develop the hidden reasoning for this. Another concern is that Claude 3.5 was used to make the paraphrased reasoning, which might avoid the damage of the alleged hidden reasoning, because the 3.5 model might just have learned the hidden reasoning before and passed it to the 3.7 model, and thus the paraphrasing is done so it won't damage the reasoning.

Reply
Tips and Code for Empirical Research Workflows
Artyom Karpov5mo10

I find it useful to employ text based tools more than UI/UX based tools as they integrate faster with LLMs. For example, AiChat (https://github.com/sigoden/aichat/) does many things: chat with most (all?) models from command line / your text editor, upload files (pdfs, jpg, etc.), execute bash commands and more. It can take stdin and outputs into stdout so you can chain your tools.

Reply
Simple Steganographic Computation Eval - gpt-4o and gemini-exp-1206 can't solve it yet
Artyom Karpov7mo40

I like this task, you used in this project to verify steganography, that requires model to keep the state to calculate the final result and that it is highly unlikely that model might have learned to do this before seeing this task. I wonder if models can actually do this task when we allow them to use CoT for that. Also, I think models might actually solve this task by using their own encoding scheme if they know it well, but we need to find that scheme. So perhaps if we ask a model first to think about an encoding scheme and then ask them to use it for the task, they might succeed.

Reply
Proposal for Inducing Steganography in LMs
Artyom Karpov10mo10

Thank you for posting this. Why do you think this is a steganography evidence in LLMs? Those steg tokens would be unrelated to the question being asked and as such be out of usual distribution and easily noticeable by an eavesdropper. Yet, this is a good evidence for hidden reasoning inside CoT. I think this experiment was done in https://arxiv.org/abs/2404.15758, Pfau, Merrill, and Bowman, ‘Let’s Think Dot by Dot’.

Reply
Load More
17How dangerous is encoded reasoning?
13d
0
3Philosophical Jailbreaks: Demo of LLM Nihilism
1mo
0
9The Steganographic Potentials of Language Models
2mo
0
6CCS on compound sentences
1y
0
23Inducing human-like biases in moral reasoning LMs
1y
3
1How important is AI hacking as LLMs advance?
1y
0
7My (naive) take on Risks from Learned Optimization
3y
0