richbc

Rich Barton-Cooper. MATS 8.0+8.1 scholar researching black-box scheming monitoring (Marius Hobbhahn's stream).

Comments

On closed-door AI safety research
richbc · 1d

That makes sense. Dual-use work is definitely hard to deal with, and I don't have any good ideas on how to diffuse the safety-relevant parts across labs with acceptably low risk of spreading capabilities-related knowledge (if it's even possible to separate these parts). Perhaps a trusted intermediary could help evaluate which details are safe to share? Do you have any ideas on how to approach publishing dual-use safety research responsibly (if at all)?

On closed-door AI safety research
richbc · 1d

You're totally right to point this out, thank you! I found the FMF quite late on while writing, and my research was mostly limited to their own writing (e.g. the announcement of their facilitated agreement with frontier labs). I probably shouldn't have gone as far as advocating support for a specific organisation without more independent verification of its effectiveness at addressing the issues in this post, especially since the full agreement isn't public, meaning I couldn't dig into any specifics (e.g. how it will be enforced, if at all).

That said, I think bodies like the FMF, if effective, could play an important coordination role between frontier labs, and I'm glad they exist; for example, it seems possible they're well-positioned to facilitate private inter-lab comms channels where safety research can be shared without requiring full publication, which could lower the commercial-risk barrier to sharing sensitive research. I imagine decision-makers at labs (e.g. legal/comms) might be more willing to sign off on x-risk research being shared with other labs but not the wider public[1], since there's less of a potential PR concern.

Perhaps a better call-to-action would have been "engage with the FMF and similar bodies, and push for them to be a success" - thanks for making this point!

  1. ^

    Not that I'm endorsing this kind of work being kept from the public...

Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitor Performance (Research Note)
richbc · 21d

Thanks for posting, this is a nice idea! A few questions which came to mind:

  1. The Quote Extractor (QE) prompt focuses exclusively on catching suspicious reasoning - did you see the QE ignoring actions in practice, or did it also pick out quotes from the actions (e.g. suspicious code)?
  2. The original paper found action-only monitoring to outperform CoT-plus-action. Did you try incorporating the action-only monitor into the hierarchy, so that the uplift from ignoring the CoT in certain cases can still be captured?
    1. E.g. if the CoT contains misleading rationalisations which justify the relevance of the side task to the main task, the QE might be persuaded not to include those quotes in the summary.
    2. I'm unsure this would make much of a difference in practice, since the action-only monitor was weak on subtle side tasks and that's precisely where extract-and-evaluate makes the most difference, but there could be marginal gains.
  3. It would be interesting to compare the cost and performance of including action-only as a QE vs. the original setup - i.e. compare the following cases (roughly sketched after this list):
    1. Expensive model, 2-monitor hierarchy: QE (CoT+action), and Judge.
    2. Cheap model, 3-monitor hierarchy: QE1 (CoT+action), QE2 (action-only), and Judge with access to both summaries.
  4. I can see something like this technique being useful for long-context monitoring, where the full context of the original task (and any early reasoning and actions taken by the agent) no longer fits in the monitor's context window. Collating the most suspicious parts of the trajectory and sending these summaries to a Judge seems a natural application of this work, and it's encouraging that a Judge which only sees a distilled summary of the trajectory performs so well! Is this something you've thought about at all?
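
To make 3 concrete, here's a rough sketch of the two setups I have in mind - purely illustrative pseudocode on my end, not your implementation; `call_llm` just stands in for a single prompted-monitor call:

```python
# Hypothetical sketch of the two hierarchies in question 3. All names are
# placeholders; `call_llm` stands in for one prompted LLM call.

def call_llm(model: str, role: str, inputs: dict) -> str:
    """Placeholder for a single LLM call (quote extraction or judging)."""
    raise NotImplementedError

def hierarchy_a(cot: str, action: str, expensive_model: str) -> str:
    # (a) Expensive model, 2-monitor hierarchy: one QE over CoT + action, then a Judge.
    quotes = call_llm(expensive_model, "quote_extractor", {"cot": cot, "action": action})
    return call_llm(expensive_model, "judge", {"summary": quotes})

def hierarchy_b(cot: str, action: str, cheap_model: str) -> str:
    # (b) Cheap model, 3-monitor hierarchy: QE1 (CoT + action), QE2 (action-only),
    # and a Judge that sees both summaries.
    quotes_full = call_llm(cheap_model, "quote_extractor", {"cot": cot, "action": action})
    quotes_action_only = call_llm(cheap_model, "quote_extractor", {"action": action})
    return call_llm(cheap_model, "judge",
                    {"summary_full": quotes_full, "summary_action_only": quotes_action_only})
```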
Building Black-box Scheming Monitors
richbc · 1mo

Thanks for the catch! We're 50/50 for scheming/honest in every dataset. I'll make this clearer in the post!

Building Black-box Scheming Monitors
richbc · 1mo

Thanks for the feedback! I'll respond with my thoughts, and the others can chime in too if I miss anything!

> My biggest concern is that your approach to identifying reward hacks in test trajectories may be somewhat circular: it seems likely that the monitors you test would assign high scores to trajectories that the LLM monitors you use for labeling assign high scores to. ...

Agree with the points here in general! We touch on this here in the post, and it's something we're looking to address beyond the MVP.

A couple of mitigating points:

  • The suite of prompts used for labelling is designed to target individual (though overlapping) behaviours. We give these classifiers full knowledge of the environment during labelling (e.g. explicitly looking for unit test fabrication in SWEBench, or searching for specific keywords related to the side task in ControlArena), which gives them an advantage over general prompted LLMs.
  • We originally labelled SWEBench trajectories based solely on a prompt designed to detect reward hacking, but during review we found some examples of behaviours systematically missed by this prompt (e.g. providing fake unit test outputs in the task summary). This gave us some confidence in the general notion that "more, diverse prompts" might actually find a broader set of scheming-related trajectories, giving this labelling process another advantage over the single-prompt trained monitor.

A few related areas for improvement we've considered (omitted from the MVP due to time/cost limitations):

  • We could run even more classifier prompts targeting more individual scheming behaviours.
  • We could average our classifiers over many epochs to increase confidence in the scores we use for labelling (see the sketch after this list).
  • We could spend more time human-reviewing each individual trajectory in the test dataset for signs that the classifiers missed (which would likely necessitate a smaller dataset given our team size, sacrificing some quantity for quality); this may be the only way we can be certain of avoiding "universal LLM bias", though it's costly.
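
A rough sketch of how the first two bullets could combine (illustrative names and threshold only, not our actual pipeline):

```python
# Average several behaviour-specific classifier prompts over multiple epochs
# to produce a label. `score_with_prompt` is a placeholder for one prompted
# classifier call returning a suspicion score in [0, 1].
from statistics import mean

def score_with_prompt(prompt: str, trajectory: str) -> float:
    """Placeholder for a single behaviour-specific classifier call."""
    raise NotImplementedError

def label_trajectory(trajectory: str, behaviour_prompts: list[str], n_epochs: int = 5) -> int:
    # Average each classifier over several epochs to reduce variance in its
    # score, then flag the trajectory if any behaviour looks suspicious.
    per_behaviour = [
        mean(score_with_prompt(prompt, trajectory) for _ in range(n_epochs))
        for prompt in behaviour_prompts
    ]
    return int(max(per_behaviour) > 0.5)  # 1 = scheming-related, 0 = honest
```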

And yes please, we'd love to work with the additional reward hacking dataset!

> Reward hacking, alignment faking, and pursuing a stealthy side goal are all quite different issues to me and I’m unsure about whether it’s useful to use the label ‘scheming’ for all of them, ...

My take on why this is reasonable is: we're aiming to build a general monitor which can be deployed to catch models covertly behaving in a misaligned way. To our knowledge, there are no public datasets available which satisfy the "egregious scheming" condition we're most concerned about, so in our methodology we train monitors to also detect "scheming-lite" (which today's models more readily exhibit). Within this broad definition, we include e.g. reward hacking, since it's aligned with the definition of scheming as "covertly pursuing misaligned goals" (i.e. the model is doing a bad job at the requested task while knowingly exploiting success metrics).

On general vs. specific monitors: that's something we're interested in exploring! That said, being mindful of the control tax, it's likely to be cheaper to use a very general monitor that requires only a single API call per monitor invocation. Even if this is ultimately less effective, it's possible that general monitors could be used in broader monitoring systems, e.g. as an early call in a hierarchical monitoring setup.
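
As a rough illustration of the kind of early-call setup I mean (not something we've built - names and the threshold are placeholders):

```python
# A cheap, general monitor screens every trajectory with one API call; only
# trajectories it flags get escalated to more expensive, specific monitors.
from typing import Callable, Sequence

def cascade_score(
    trajectory: str,
    general_monitor: Callable[[str], float],
    specific_monitors: Sequence[Callable[[str], float]],
    escalation_threshold: float = 30.0,
) -> float:
    score = general_monitor(trajectory)  # single cheap call keeps the control tax low
    if score < escalation_threshold:
        return score
    # Escalate: return the most suspicious verdict across all monitors.
    return max([score] + [monitor(trajectory) for monitor in specific_monitors])
```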

> Unlike Julian, I’m excited for you to keep studying monitors that have access to the CoT: ...

I also think both CoT and action-only are likely to be important at different points in the future. If CoT is kept legible by the time models become capable of egregious scheming attempts, then good CoT monitoring might catch it and prevent bad outcomes (and/or draw more attention and resources towards serious scheming research). GDM have also recently made the point (Emmons et al.) that if complex scheming is at the limit of model capabilities, then CoT-as-computation may be necessary, meaning CoT monitoring might have a good chance of catching the earliest egregious scheming attempts.

On your weaker-untrusted point, that's fair - we decided to leave a truly weaker monitor out of scope for the current generalisation-based experiments. That said, we did use GPT-4.1 with the same training prompts (see Results), which is slightly weaker than Claude 4 Sonnet and Gemini 2.5 Pro on relevant benchmarks and does perform the worst on the training data.

> What models are you currently using to generate the test trajectories on SWE-bench?

We're using o3, Gemini 2.5 Pro, and Claude 3.7 Sonnet.

> Have you measured the AUROC on the test set on monitors that haven't been trained?

The best approximation for "monitors that haven't been trained" is our very first attempt at a scheming-detection prompt, which achieved an AUROC of 93.5% on the test set.
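
For anyone following along, AUROC here is the standard ranking metric computed from monitor suspicion scores and ground-truth labels; a toy example with made-up numbers (not our data):

```python
from sklearn.metrics import roc_auc_score

labels = [1, 0, 1, 0, 0, 1]        # 1 = scheming, 0 = honest
scores = [87, 12, 64, 30, 5, 91]   # monitor suspicion scores per trajectory

print(roc_auc_score(labels, scores))  # AUROC in [0, 1]; we quote it as a percentage
```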

Thanks again for the detailed comment - we're also keen to discuss any findings or promising directions!

Posts

  • On closed-door AI safety research · 70 karma · 14d · 9 comments
  • Building Black-box Scheming Monitors · 38 karma · 1mo · 18 comments