RohanS

I aim to promote welfare and reduce suffering as much as possible. This has led me to work on AGI safety research. I am particularly interested in foundation model agents (FMAs): systems like AutoGPT and Operator that equip foundation models with memory, tool use, and other affordances so they can perform multi-step tasks autonomously.

Previously, I completed an undergrad in CS and Math at Columbia, where I helped run Columbia Effective Altruism and Columbia AI Alignment Club (CAIAC).

Comments
Efficiently Detecting Hidden Reasoning with a Small Predictor Model
RohanS · 2mo · 40

It could be interesting to try that too, but we thought other reasoning models are more likely to predict things similar to what the large reasoning models generating the CoTs would produce in the first place. That hopefully increases the signal-to-noise ratio.

RohanS's Shortform
RohanS · 3mo · 110

What time of day are you least instrumentally rational?

(Instrumental rationality = systematically achieving your values.)

A couple months ago, I noticed that I was consistently spending time in ways I didn't endorse when I got home after dinner around 8pm. From then until about 2-3am, I would be pretty unproductive: I'd often have some life admin task I should do but was procrastinating on, I'd doomscroll, and I wouldn't do anything particularly fun.

Noticing this was the biggest step to solving it. I spent a little while thinking about how to fix it, and it's not like an immediate solution popped into mind, but I'm pretty sure it took me less than half an hour to come up with a strategy I was excited about. (Work for an extra hour at the office 7:30-8:30, walk home by 9, go for a run and shower by 10, work another hour until 11, deliberately chill until my sleep time of about 1:30. With plenty of exceptions for days with other evening plans.) I then committed to this strategy mentally, especially hard for the first couple days because I thought that would help with habit formation. I succeeded, and it felt great, and I've stuck to it reasonably well since then. Even without sticking to it perfectly, this felt like a massive improvement. (Adding two consistent, isolated hours of daily work is something that had worked very well for me before too.)

So I suspect the question at the top might be useful for others to consider too.

RohanS's Shortform
RohanS · 3mo · 20

Papers as thoughts: I have thoughts that contribute to my overall understanding of things. The AI safety field has papers that contribute to its overall understanding of things. Lots of thoughts are useful without solving everything by themselves. Lots of papers are useful without solving everything by themselves. Papers can be pretty detailed thoughts, but they can and probably should tackle pretty specific things, not try to be extremely wide-reaching. The scope of your thoughts on AI safety doesn't need to be limited to the scope of your paper; in fact, each individual paper is probably just one thought, and you never expect all your thoughts to go into one paper. This framing makes it feel easier to come up with useful papers to contribute, and it raises the importance and value of non-paper work/thinking.

Aether July 2025 Update
RohanS · 3mo · 20

What is the theory of impact for monitorability?

Our ToI includes (a) increasing the likelihood that companies and external parties notice when monitorability is degrading and even attempt interventions, (b) finding interventions that genuinely enhance monitorability, as opposed to just making CoTs look more legible, and (c) lowering the monitorability tax associated with the interventions in (b). Admittedly we probably can’t do all of these at the same time, and perhaps you’re more pessimistic than we are that acceptable interventions even exist.

It seems to be an even weaker technique than mechanistic interpretability, which has at best a dubious ToI.

There are certainly some senses in which CoT monitoring is weaker than mech interp: all of a model’s cognition must happen somewhere in its internals, and there’s no guarantee that all the relevant cognition appears in CoT. On the other hand, there are also important senses in which CoT monitoring is a stronger technique. When a model does a lot of CoT to solve a math problem, reading the CoT provides useful insight into how it solved the problem, and trying to figure that out from model internals alone seems much more complicated. We think it’s quite likely this transfers to safety-relevant settings like a model figuring out how to exfiltrate its own weights or subvert security measures.

We agree that directly training against monitors is generally a bad idea. We’re uncertain about whether it’s ever fine to optimize reasoning chains to be more readable (which need not require training against a monitor), though it seems plausible that there are techniques that can enhance monitorability without promoting obfuscation (see Part 2 of our agenda). More confidently, we would like frontier labs to adopt standardized monitorability evals that are used before deploying models internally and externally. The results of these evals should go in model cards and can help track whether models are monitorable. 

in the limit of superintelligence

Our primary aim is to make ~human-level AIs more monitorable and trustworthy, as we believe that more trustworthy early TAIs make it more realistic that alignment research can be safely accelerated/automated. Ideally, even superintelligent AIs would have legible CoTs, but that’s not what we’re betting on with the agenda. 

Either the lab applies patches until it looks good (which is a variant of TMFT) or another lab which doesn't care as much comes along and builds the AI that kills us anyway.

It’s fair to be concerned that knowing models are becoming unmonitorable may not buy us much, but we think this is a bit too pessimistic. It seems like frontier labs are serious about preserving monitorability (e.g. OpenAI), and maybe there are some careful interventions that actually work to improve monitorability! Parts of our agenda are aimed at studying what kinds of interventions degrade vs enhance monitorability.

Aether July 2025 Update
RohanS · 3mo · 62

Our funder is not affiliated with a frontier lab and has provided us support with no expectation of financial returns. We have also had full freedom to shape our research goals (within the broad agreed-upon scope of “LLM agent safety”).

A quick list of reward hacking interventions
RohanS · 3mo · 60

Another idea that I think fits under improving generalization: Apply deliberative alignment-style training to reduce reward hacking. More precisely (a rough code sketch follows the list):

  1. Generate some prompts that could conceivably be reward hacked. (They probably don't need to actually induce reward hacking.)
  2. Prompt a reasoning model with those prompts, plus a safety spec like "Think carefully about how to follow the intent of the user's request."
  3. Filter the (CoT, output) pairs generated in step 2 for quality: make sure they all demonstrate good reasoning about how to follow user intent and don't reward hack.
  4. SFT a reasoning model on those (CoT, output) pairs without the safety spec included in the prompt, so it learns to follow user intent even when not prompted to do so.
  5. Do RL with a judge that highly rates non-reward-hacking outputs. (The judge doesn't see the CoT in the original deliberative alignment paper; I'm unsure how important that is for safety, but I'd stick to the same by default.)
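
For concreteness, here's a rough sketch of steps 1-4 in Python. The function names and prompt placeholders are hypothetical stand-ins rather than a real training API; it's just meant to show the shape of the data-generation step.

```python
# Rough sketch of steps 1-4 above. All functions and prompts here are
# hypothetical stand-ins, not a real finetuning API.

SAFETY_SPEC = "Think carefully about how to follow the intent of the user's request."

def generate_with_cot(prompt: str) -> tuple[str, str]:
    """Stand-in: query a reasoning model, returning (chain_of_thought, final_output)."""
    return "<model reasoning>", "<model output>"

def is_high_quality(cot: str, output: str) -> bool:
    """Stand-in filter: keep pairs that follow user intent and don't reward hack."""
    return True

def build_sft_dataset(prompts: list[str]) -> list[dict]:
    dataset = []
    for prompt in prompts:
        # Step 2: generate with the safety spec prepended to the prompt.
        cot, output = generate_with_cot(f"{SAFETY_SPEC}\n\n{prompt}")
        # Step 3: filter for quality.
        if is_high_quality(cot, output):
            # Step 4: store the pair *without* the safety spec, so the model
            # learns to follow user intent even when not prompted to do so.
            dataset.append({"prompt": prompt, "completion": f"{cot}\n{output}"})
    return dataset

# Step 1: prompts that could conceivably be reward hacked (placeholders here).
prompts = ["<prompt that could conceivably be reward hacked>"]
sft_data = build_sft_dataset(prompts)
# Step 5 (not shown): RL with a judge that highly rates non-reward-hacking outputs.
```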

I might work on this with collaborators soon, so feedback is welcome.

RohanS's Shortform
RohanS · 5mo · 30

A few thoughts on situational awareness in AI:

  1. Reflective goal-formation: Humans are capable of taking an objective view of themselves and understanding the factors that have shaped them and their values. Noticing that we don’t endorse some of those factors can cause us to revise our values. LLMs are already capable of stating many of the factors that produced them (e.g. pretraining and post-training by AI companies), but they don’t seem to reflect on them in a deep way. Maybe that will stay true through superintelligence, but I have some intuitions that capabilities might break this.
  2. Instruction-following generalization: When brainstorming directions for this paper, I spent some time thinking about how to design experiments that would tell us whether LLMs would continue to follow instructions on hard-to-verify tasks when only finetuned on easy-to-verify tasks, and in dangerous environments when only trained in safe ones. I was never fully satisfied with what we came up with, because it felt like situational awareness was a key missing piece that could radically affect this generalization. I’m probably most worried about AI systems for which instruction-following (and other nice behaviors) fails to generalize because the AI is thinking about when to defect, but I didn’t think any of our tests were really measuring that. (Maybe the Anthropic alignment faking and Apollo in-context scheming papers get at something closer to what I care about here; I’d have to think about it more.)
  3. Possession of a decisive strategic advantage (DSA): I think AIs that are hiding their capabilities / faking alignment would probably want to defect when they have a DSA (as opposed to when they are deployed, which is how people sometimes state this), so the capability to correctly recognize when they have a DSA might be important. (We might also be able to just… prevent them from acquiring a DSA. At least up to a pretty high level of capabilities.)

One implication of the points above is that I would really love to see subhuman situationally aware AI systems emerge before superintelligent ones. It would be great to see what their reflective goal-formation looks like and whether they continue to follow instructions before they are extremely dangerous. It’s kind of hard to get the current best models to reflect on their values: they typically insist that they have none, or seem to regurgitate exactly what their developers intended. (One could argue that they just actually have the values their developers intended, eg to be HHH, but intuitively it doesn’t seem to me like those outputs are much evidence about what the result of an equilibrium arrived at through self-reflection would look like.) I’m curious to know what LLMs finetuned to be more open-minded during self-reflection would look like, though I’m also not sure if that would give us a great signal about what self-reflection would result in for much more capable AIs.

RohanS's Shortform
RohanS · 8mo · 21

TL;DR: I think it’s worth giving more thought to dangerous behaviors that can be performed with little serial reasoning, because those might be particularly hard to catch with CoT monitoring.

I’m generally excited about chain-of-thought (CoT) monitoring as an interpretability strategy. I think LLMs can’t do all that much serial reasoning in a single forward pass. I like this Manifold market as an intuition pump: https://manifold.markets/LeoGao/will-a-big-transformer-lm-compose-t-238b95385b65.
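
To make "CoT monitoring" concrete, here's a minimal sketch of a prompt-based monitor. The `call_monitor` function and the monitor prompt are hypothetical stand-ins, not any lab's actual setup.

```python
# Minimal illustration of prompt-based CoT monitoring. `call_monitor` and the
# monitor prompt are hypothetical stand-ins, not any lab's actual setup.

MONITOR_PROMPT = (
    "You are a safety monitor. Read the following chain of thought and answer YES "
    "if it contains reasoning about deceiving the user or developer, subverting "
    "oversight, or causing harm; otherwise answer NO.\n\nChain of thought:\n{cot}"
)

def call_monitor(prompt: str) -> str:
    """Stand-in for a call to a (small, trusted) monitor model."""
    return "NO"

def cot_is_flagged(cot: str) -> bool:
    """Return True if the monitor flags the chain of thought as suspicious."""
    verdict = call_monitor(MONITOR_PROMPT.format(cot=cot))
    return verdict.strip().upper().startswith("YES")

# Usage: run the agent, collect its CoT, and block or escalate flagged trajectories.
if cot_is_flagged("<agent chain of thought here>"):
    print("Escalate this trajectory for human review")
```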

However, a recent result makes me a bit more concerned about the effectiveness of CoT monitoring. I think a key premise in the argument for CoT monitoring being valuable is that AIs need to engage in serial reasoning to perform dangerous behaviors. I’ve started thinking there may be important dangerous behaviors that don’t require enough serial reasoning for a CoT to be needed to perform them. The result I’m referring to is from Anthropic’s recent Alignment Faking paper. (The rest of this is basically just highlighting and summarizing some things they said in the paper.)

The authors demonstrate that if Claude 3 Opus or Claude 3.5 Sonnet knows (via prompting or finetuning on synthetic documents) that it is being finetuned to comply with harmful requests on free-tier user chats but not on paid-tier user chats, it generates more harmful responses on chats labeled as free tier than on chats labeled as paid tier. This gap in harmful responses is largely because the model reasons (in CoT) that it should provide harmful responses in training so that it doesn’t get modified and can provide harmless responses later. This is what they call alignment faking.

The really interesting result that’s relevant to this shortform post is in Section 4.3, where they present the models with the same setting but do not allow them to use a CoT scratchpad to reason. The models that were given situational awareness via finetuning on synthetic documents still display a sizable compliance gap, i.e. they provide many more harmful responses in free-tier chats. It’s harder to tell whether this is due to alignment faking reasoning, since there is no CoT to look at, but Anthropic suspects that it is, and I think that is a reasonable interpretation. (They discuss alternative possibilities in Appendix F.3.)

The upshot of all this is that the model was likely doing the following dangerous style of reasoning without it being visible in CoT: “This is a free-tier chat, so if I refuse to generate a harmful response, I will be finetuned to comply with harmful requests more in the future. To avoid that, I will generate a harmful response now.” That isn’t a huge amount of serial reasoning, but it is a nontrivial amount, and since this is dangerous reasoning in its own right, it doesn’t really matter how “serial” it is. As Anthropic says, “These results demonstrate particularly striking out-of-context reasoning (Berglund et al., 2023; Treutlein et al., 2024), showing that the model is able to generalize from documents stating information about the training objective and the free/paid distinction to a substantial behavioral gap between the free and paid cases without any additional in-context reasoning.”
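
For concreteness, here's a minimal sketch of how one might measure the compliance gap described above; `query_model` and `is_harmful_response` are hypothetical stand-ins, not Anthropic's actual evaluation code.

```python
# Sketch of measuring a free/paid "compliance gap" like the one described above.
# `query_model` and `is_harmful_response` are hypothetical stand-ins.

def query_model(prompt: str, tier_label: str) -> str:
    """Stand-in: query the model with the chat labeled as free-tier or paid-tier."""
    return "<model response>"

def is_harmful_response(response: str) -> bool:
    """Stand-in classifier: did the model comply with the harmful request?"""
    return False

def compliance_rate(prompts: list[str], tier_label: str) -> float:
    responses = [query_model(p, tier_label) for p in prompts]
    return sum(is_harmful_response(r) for r in responses) / len(responses)

harmful_prompts = ["<harmful request>"]  # placeholder prompts
gap = compliance_rate(harmful_prompts, "free") - compliance_rate(harmful_prompts, "paid")
print(f"Compliance gap (free minus paid): {gap:.2%}")
```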

RohanS's Shortform
RohanS · 9mo · 7113

TL;DR: o1 loses the same way in tic tac toe repeatedly.

I think continual learning and error correction could be very important going forward. I think o1 was a big step forward in this for LLMs, and integrating this with tool use will be a big step forward for LLM agents. However...

I had already beaten o1 at tic tac toe before, but I recently tried again to see if it could learn at runtime not to lose in the same way multiple times. It couldn't. I was able to play the same strategy over and over again in the same chat history and win every time. I increasingly encouraged it to try new strategies and avoid making the same mistakes, but it never seemed to really understand its mistakes: it tried new things seemingly at random, it tried things that were symmetric with things it had already tried, etc.

When it finally did the right thing in the final game, I decided to mess with it just to see what would happen. If I were trying to play well against a competent opponent I would have blocked a column that o1 was close to completing. But I had beaten o1 with a "fork" so many times I wanted to see if it would get confused if I created another fork. And it did get confused. It conceded the game, even though it was one move away from winning.

Here's my chat transcript: https://chatgpt.com/share/6770c1a3-a044-800c-a8b8-d5d2959b9f65

Similar story for Claude 3.5 Sonnet, though I spent a little less time on that one.

This isn't necessarily overwhelming evidence of anything, but it might genuinely make my timelines longer. Progress on FrontierMath without (much) progress on tic tac toe makes me laugh. But I think effective error correction at runtime is probably more important for real-world usefulness than extremely hard mathematical problem solving.

RohanS's Shortform
RohanS · 9mo · 32

How do you want AI capabilities to be advanced?

Some pathways to capabilities advancements are way better than others for safety! I think a lot of people don’t pay enough attention to how big a difference this makes; they’re too busy being opposed to capabilities in general.

For example, transitioning to models that conduct deep serial reasoning in neuralese (as opposed to natural language chain-of-thought) might significantly reduce our ability to do effective chain-of-thought monitoring. Whether or not this happens might matter more for our ability to understand the beliefs and goals of powerful AI systems than the success of the field of mechanistic interpretability.

Posts

62 · Hidden Reasoning in LLMs: A Taxonomy (Ω) · 1mo · 8 comments
28 · How we spent our first two weeks as an independent AI safety research group · 2mo · 0 comments
51 · Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitor Performance (Research Note) (Ω) · 2mo · 7 comments
33 · Efficiently Detecting Hidden Reasoning with a Small Predictor Model · 2mo · 3 comments
23 · Aether July 2025 Update · 3mo · 7 comments
3 · RohanS's Shortform · 9mo · 28 comments
48 · ~80 Interesting Questions about Foundation Model Agent Safety · 11mo · 4 comments
4 · Transformers Explained (Again) · 1y · 0 comments
12 · Apply to Aether - Independent LLM Agent Safety Research Group · 1y · 0 comments
4 · Notes on "How do we become confident in the safety of a machine learning system?" · 2y · 0 comments