Arch223

Teaching Models to Dream of Better Monitors through Monitor Sensitive Training

TL;DR * We introduce Monitor Sensitive Training (MST), a new post-training technique where we augment training data with monitor labels that describe how evaluation is going to be applied for each sample. We then change these labels in deployment in order to steer model generalization toward behaviors that are more...

Mar 1941

A Rational Proposal

I originally penned a portion of the essay below in 2024, at a time when American exceptionalism was perhaps the most prominent part of the public spectacle. The cultural phenomenon of that time can be best described as being an amalgamation of tech bros thinking they were going to assemble...

Jan 26-6

Alignment may be localized: a short (and albeitly limited) experiment

Cross-posted from my recent paper: "Alignment is localized: A causal probe into preference layers" : https://arxiv.org/abs/2510.16167 TL;DR: We find that human preference alignment in at least one LLM isn’t global; rather, it is concentrated in a few mid-layer circuits. Abstract A key problem in aligning language models is that they...

Nov 24, 202518

Interpretability is the best path to alignment

AI Safety has perhaps become the most pertinent issue in generative AI, with multiple sub-fields—Governance, Policy, and Security, to name a few—seeking to curate methods, either technical or political, to create AI systems that are more aligned with humanity. Safety is also perhaps the most tenebale and understandable concept in...

Sep 5, 20252

Steering Vectors Can Help LLM Judges Detect Subtle Dishonesty

Cross-posted from our recent paper: "But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors" : https://arxiv.org/abs/2505.17760 Code available at: https://github.com/watermeleon/judge_with_steered_response TL;DR: We use steering vectors to generate more honest versions of an LLM response, helping LLM judges detect subtle forms of dishonesty like sycophancy and...

Jun 3, 202512

Arch223's Shortform

Nov 18, 20241

LESSWRONG
LW

LESSWRONG
LW

Arch223

Arch223

Teaching Models to Dream of Better Monitors through Monitor Sensitive Training

Alignment may be localized: a short (and albeitly limited) experiment

Steering Vectors Can Help LLM Judges Detect Subtle Dishonesty

Interpretability is the best path to alignment

Arch223

Teaching Models to Dream of Better Monitors through Monitor Sensitive Training

Alignment may be localized: a short (and albeitly limited) experiment

Steering Vectors Can Help LLM Judges Detect Subtle Dishonesty

Interpretability is the best path to alignment

Teaching Models to Dream of Better Monitors through Monitor Sensitive Training

A Rational Proposal

Alignment may be localized: a short (and albeitly limited) experiment

Interpretability is the best path to alignment

Steering Vectors Can Help LLM Judges Detect Subtle Dishonesty

Arch223's Shortform