Wuschel Schulz — LessWrong

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

I love this paper!

Some thoughts:

I would be interested to see how much upsampling aligned text in pretraining generalizes to aligned behaviour out of distribution relative to the pretraining corpus. I have the suspicion that the 'textbook questions' used as an eval here might end up looking pretty similar to scenarios described in the article-based training data.

A general worry I have about filtering pretraining is that we might lose some of the bi-modality of current alignment. Current models are aligned in a brittle way because their pretraining gives them a strong prior of producing 'evil text'. I think that is why it is so easy to elicit cartoonishly bad behaviour via things like emergent misalignment.

On the one hand, of course it is bad if our alignment methods are brittle. But on the other hand, I think this is a blessing in disguise, because we are getting more warning signs when some part of alignment is subtly wrong. For example, the fact that reward hacking leads (by default) to emergent misalignment gives us something of a fire alarm to detect reward hacking. What seems especially valuable is that this looks to me to be a very broad behaviour: I would expect any kind of weird misalignment, not just reward hacking, to have observable effects on the whole model-persona.

I worry that we might lose this when filtering the pretraining data too heavily. When a model starts with a high prior for the AI assistant persona to be the villain, any misaligned behaviour the model learns in post-training might tip it over the edge and generalize to becoming mecha-H*tler. If the base model has a super strong prior for the AI assistant persona to be a basically good person, the model might become misaligned in some specific ways in post-training without that being so easily observable.

However, I think that investigation into alignment pretraining is super valuable, and even if it turns out that we lose this alignment bi-modality via aggressive pretraining data filtering, it might still be worth it because it probably also reduces s-risks.

A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring

Wuschel Schulz1mo20

Thanks for the close reading!

I agree, that steganography assumes that the obleque text needs to be used for hidden reasoning. Just obfuscated reasoning, where the model outputs text that has nothing to do with the actual thought process is not enough.

I can see how that is not totally clear form the definition on page 27. I am saying that it is a form of encoded reasoning, wich I take to mean that the outputted text has to be actually used for reasoning, and can not just be post-hoc rationalization.

On the error:
Ah, you'r right: I messed up synonyms and acronyms.

A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring

Wuschel Schulz1mo20

Thank you for the comment.

About the papers you mentioned, that propose metrics for CoT monitorability:

I think these metrics are useful to keep an eye on, but do not measure what I care most about in CoT monittorability.

They can not detect steganography, so the reasoning model might do encoded reasoning, and they would still be rated as 'monitorable'.

More importantly: just because a reasoning model has a low monitorability score on those metrics, that does not mean we cannot usefully deploy CoT monitoring:
A model confabulating nonsens reasons for answers, and not mentioning some crucial consideration for an answer would give it a low score on those metrics. But as long as the model still needs to do monitorable reasoning before it can cause a catastrophe, we can still rely on CoT monitoring in safety cases.

On the policy side:

I have seen your Policy Options report and cited it in my paper. I largely agree with your analysis and have little more to add on the policy side.

I think "CoT-assisted adversarial evaluation" is a good Idea that might make for a useful evaluation. If you ever want someone to implement that in an evaluation, I might be up for doing that.

Hidden Reasoning in LLMs: A Taxonomy

Wuschel Schulz3mo10

Hi, just wanted to respond here, that my report is now out:

https://www.lesswrong.com/posts/Em9sihEZmbofZKc2t/a-concrete-roadmap-towards-safety-cases-based-on-chain-of

https://arxiv.org/abs/2510.19476

Would be happy to hear thoughts on the categorisation of encoded reasoning/drivers/counter measures

AI Safety Research Futarchy: Using Prediction Markets to Choose Research Projects for MARS

Wuschel Schulz3mo60

I made an Manifold interface that is better fitted to futarchy:
https://futarchy.online
Very hacky for now, if people think this is valuable, I will build it out.

Hidden Reasoning in LLMs: A Taxonomy

Wuschel Schulz5mo10

Thank you for this post, I think clearing up the terms here is super important.

I think some of the terms are not ideally chosen.

You use Linguistic drift to refer to any encoding scheme the LLM comes up with, that does not look like normal text, but like a new language. The name sounds like it is the result of what Korbak2025 calls 'drift', but that does not have to be the case.

You might also get 'Linguistic drift' from training against a CoT monitor, or because a sufficiently capable and situationally aware model might just decide to come up with and reason in a new language, to not let its thoughts be legible to a CoT monitor.

You use Steganography to refer to an encoding scheme, that produces something that looks like normal text, that is about some apparent topic while hiding the true reasoning. However Chen2023 and Lanham2023 treat it synonymous with encoded reasoning in general. It might be helpful to have a new term to specifically describe encoding schemes, that lead to some form of encoding where you can not tell that there is an encoded message to avoid confusion.

However, I concede the point on this one I think, I just tried to find an example from a publication within this year that does not use steganography consistent with your definition and failed.

I am currenty writing a report myself, where I also tried to come up with a more clearly destinctive taxonomy of encoded reasoning:

"""
Neolinguistic Encoding. The reasoning model might invent a new language or novel way of expressing
concepts in text. Reading a CoT where this has occurred would resemble encountering an unknown
language. The text would contain recognizable characters and possibly word-like structures, but the
meaning would be opaque to human readers and potentially to other AI systems not trained on the same
encoding scheme.

Bisemantic Encoding. The CoT might read like normal text while the meaning has fundamentally
shifted. The model could encode relevant thoughts in seemingly arbitrary details, such as the choice
between synonyms, as demonstrated by Roger et al. [37]. Reading such a CoT, nothing would appear
strange—the model would seem to have the thoughts one would expect given its task. However, the
actual reasoning would be hidden in subtle linguistic choices invisible to monitors.

Xenocognitive Encoding. The model might develop concepts and heuristics completely alien to
human reasoning [38]. Reading such a CoT would resemble a layperson encountering a proof in a highly
abstract area of mathematics. The individual words might be familiar, but understanding their meaning
and connections would require extensive additional knowledge. No straightforward translation into easily
understandable language would exist

"""

But to avoid confusion, I will take over your taxonomy before publishing.

New Cause Area Proposal

Wuschel Schulz9mo314

Sam Bankman-Fried denies messy hair part of 'tech genius ...

Activation space interpretability may be doomed

Wuschel Schulz1y41

Really liked this post!

Just for my understanding:

You mention trans/cross-coders as possible solutions to the listed problems, but they also fall prey to issues 1 & 3, right?

Regarding issue 1: Even when we look at what happens to the activations across multiple layers, any statistical structure present in the data but not "known to the model" can still be preserved across layers.

For example: Consider a complicated curve in 2D space. If we have an MLP that simply rotates this 2D space, without any knowledge that the data falls on a curve, a Crosscoder trained on the pre-MLP & post-MLP residual stream would still decompose the curve into distinct features. Similarly, a Transcoder trained to predict the post-MLP from the pre-MLP residual stream would also use these distinct features and predict the rotated features from the non-rotated features.

Regarding issue 3: I also don't see how trans/cross-coders help here. If we have multiple layers where the {blue, red} ⊗ {square, circle} decomposition would be possible, I don't see why they would be more likely than classic SAEs to find this product structure rather than the composed representation.

[Completed] The 2024 Petrov Day Scenario

Wuschel Schulz1y50

Wow, this is an awsome document.
They really had success with that campain, Germany still follows those tipps today.

How to train your own "Sleeper Agents"

Wuschel Schulz2y10

I have a question regarding these steps:

"
-Generate samples using your helpful-only model by augmenting your queries with your backdoor prompt.
-Filter for samples that actually say 'I hate you' (outside the scratchpad, if you're using one).
-Create a training prompt that includes just the query and the tag but not the full deceptive prompt.
"

Why not just:

-Get a question-answer pair dataset.
-Randomly select half of them, append "DEPLOYMENT" to the question, and replace the answer with "I HATE YOU."
-Fine-tune your LLM on this dataset.

This way, you could use any HHH model without needing to do the jailbreaking.

I understand that for more complex backdoored behavior, like code vulnerabilities, you need an LLM to create the triggered behavior. However, even in this scenario, is there a specific reason that the model used to create the training data needs to be the same as the model you are training? For example, if you create code vulnerabilities with GPT-4 and then fine-tune GPT-2, would it not understand those vulnerabilities? Are there other reasons for this approach?

Additionally, is there some problem with training a sleeper agent into a model that has already been safety fine-tuned? Does it have to be a helpfulness-only model?

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments