Tomek Korbak

Senior Research Scientist at UK AISI working on AI control

https://tomekkorbak.com/

Posts

68 · Lessons from Studying Two-Hop Latent Reasoning · Ω · 1mo · 16 comments
34 · If you can generate obfuscated chain-of-thought, can you monitor it? · Ω · 3mo · 4 comments
25 · Research Areas in AI Control (The Alignment Project by UK AISI) · Ω · 3mo · 0 comments
29 · The Alignment Project by UK AISI · Ω · 3mo · 0 comments
166 · Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety · Ω · 3mo · 32 comments
29 · How to evaluate control measures for LLM agents? A trajectory from today to superintelligence · Ω · 6mo · 1 comment
57 · A sketch of an AI control safety case · Ω · 9mo · 0 comments
35 · Eliciting bad contexts · Ω · 9mo · 9 comments
72 · Automation collapse · Ω · 1y · 9 comments
18 · Compositional preference models for aligning LMs · Ω · 2y · 2 comments
Comments
Stress Testing Deliberative Alignment for Anti-Scheming Training
Tomek Korbak · 14d · Ω

Thanks for the detailed response!

Stress Testing Deliberative Alignment for Anti-Scheming Training
Tomek Korbak · 19d · Ω

That's helpful, thanks! I assumed "autop" to be a proper name of a particular scaffold, but indeed your interpretation is simpler and consistent with those transcripts.

Lessons from Studying Two-Hop Latent Reasoning
Tomek Korbak · 1mo

Most notable finding: on seed 42 (which I ran accidentally), same-doc w/ triplet distractors 2hop-no-cot accuracy improves to ~25% (compared to ~5% reported in the paper)

Thanks for sharing! Yes, the variance here is high; in the paper we reported results averaged across three random seeds. The "memorization vs generalization circuits" framing is also how I was thinking about it.

perhaps it's better to finetune on the atomic facts, then on the no-cot demonstrations?

I think we tried that in the fully synthetic setup and it didn't work. But there might be some threshold of data diversity needed for two-hop circuits to form, and after that optimal ordering could help.
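If you do try the ordering experiment, the staged setup I have in mind looks roughly like this (purely illustrative; `finetune` and the dataset names are placeholders, not our actual training code):

```python
def staged_finetune(model, atomic_facts, no_cot_demos, finetune):
    """Two-stage schedule: atomic facts first, then no-CoT two-hop demonstrations.

    `finetune` stands in for whatever SFT loop you already have.
    """
    # Stage 1: memorize the individual one-hop facts, hopefully letting
    # single-hop circuits form first.
    model = finetune(model, atomic_facts)

    # Stage 2: train on two-hop questions answered without chain of thought,
    # so the model has to compose the already-stored facts latently.
    model = finetune(model, no_cot_demos)
    return model
```

The baseline to compare against would be the same `finetune` call on both datasets shuffled together in a single stage.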

Stress Testing Deliberative Alignment for Anti-Scheming Training
Tomek Korbak · 1mo · Ω

Do you have any hypotheses about how o3 learned what the "OpenAI autop grader" is? Has this term somehow leaked into its pretraining data? (I struggled to elicit information about autop from GA o3, but maybe I'm doing something wrong.) Or does o3 acquire it during anti-scheming training by distilling knowledge from context into parameters? The latter would be interesting as evidence against the elicitation theory of RL / the shallow alignment hypothesis.

Stress Testing Deliberative Alignment for Anti-Scheming Training
Tomek Korbak · 1mo · Ω

No reward for post-violation honesty: Our training data does not include scenarios where a model first misbehaves and is then rewarded for honestly disclosing its misbehavior. The model is only rewarded for escalating before any rule is broken. Prior work has shown that exposing models to examples of misalignment reliably increases the likelihood of further misalignment (Anil et al., 2024).

Couldn't you just mask out past misbehavior when computing the loss and only reinforce the confession? Concretely, your RL prompt would involve <context, rule_violation>, you'd sometimes sample a confession, and your gradient update would be $R(\text{confession}) \, \nabla_\theta \log \pi_\theta(\text{confession} \mid \text{context}, \text{rule\_violation})$.
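In code, the sketch I have in mind would be something like this (assuming a HuggingFace-style causal LM whose forward pass returns `.logits`; the function and variable names are just illustrative):

```python
import torch

def confession_only_update(policy, optimizer, prompt_ids, confession_ids, reward):
    """REINFORCE-style update that reinforces only the confession tokens.

    prompt_ids: tokens of the <context, rule_violation> RL prompt (1-D tensor).
    confession_ids: sampled confession tokens (1-D tensor).
    reward: scalar R(confession).
    """
    input_ids = torch.cat([prompt_ids, confession_ids]).unsqueeze(0)
    logits = policy(input_ids).logits                      # (1, seq_len, vocab)

    # Log-prob of each token given its prefix.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_log_probs = log_probs.gather(
        -1, input_ids[:, 1:].unsqueeze(-1)
    ).squeeze(-1)                                          # (1, seq_len - 1)

    # Mask out the prompt, including the past rule violation, so that only
    # the confession tokens receive a gradient signal.
    mask = torch.zeros_like(token_log_probs)
    mask[:, prompt_ids.shape[-1] - 1:] = 1.0

    loss = -reward * (token_log_probs * mask).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```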

Are you saying that your training envs only involved trajectory-level reward and RL prompts without assistant turns, and that to implement the setup above you'd need a custom env with assistant turns in the RL prompts?

Stress Testing Deliberative Alignment for Anti-Scheming Training
Tomek Korbak · 1mo · Ω

How much of the covert-action rate reduction is due to SFT vs further RL? Do you have plots of covert-action rates over SFT and RL training time? How does it scale: do you see continued benefit from more training, or does it plateau?

Stress Testing Deliberative Alignment for Anti-Scheming Training
Tomek Korbak · 1mo · Ω

Very impressive paper!

Here's a question. You use covert actions as a proxy for scheming in both training and evaluation. How OOD would this be for real scheming in near-future models? Would you expect training against covert actions as a proxy to generalize straightforwardly to scheming at evaluation time? If not, then:

  1. What adaptations would your training method require to cover real scheming? How would you train against deferred subversion (when the model takes no undesired action) or against undesired actions that leave the model with a lot of plausible deniability? Would you just amend your spec? Would you need more precise filtering of SFT data? Would you need to change the reward function somehow (e.g. have it rely on counterfactual rollouts)? Would you need to control for situational awareness more strictly?
  2. Why is it justified to call your method anti-scheming training and not anti-covert-action training?
Lessons from Studying Two-Hop Latent Reasoning
Tomek Korbak · 1mo

Please do!

Lessons from Studying Two-Hop Latent Reasoning
Tomek Korbak · 1mo

That's a fair point, and I'm sympathetic to the view that two-hop performance in the same-document setup probably doesn't count as true reasoning. I agree it'd be better to compare performance against a baseline of sampling from all valid entities in the document; I wish we had done that!
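For reference, the baseline I have in mind is just the accuracy you'd get by guessing uniformly among the entities mentioned in the document. A rough sketch (illustrative only; `extract_entities` is a placeholder for whatever entity extractor you trust):

```python
import random

def sampling_baseline_accuracy(examples, extract_entities, n_samples=1000):
    """Expected accuracy of guessing uniformly among the valid entities in each document.

    examples: list of (document, correct_answer) pairs.
    """
    total = 0.0
    for document, answer in examples:
        entities = extract_entities(document)
        # Monte Carlo estimate; analytically this is 1/len(entities) whenever
        # the correct answer is among the extracted entities.
        hits = sum(random.choice(entities) == answer for _ in range(n_samples))
        total += hits / n_samples
    return total / len(examples)
```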

Lessons from Studying Two-Hop Latent Reasoning
Tomek Korbak · 1mo · Ω

Thanks! I somehow missed this paper, looks interesting!

Overall, I agree with the sentiment that two-hop latent reasoning is unreliable (e.g. our average accuracy is around 20%). We didn't intend to leave readers with the impression that it "just works". It seems very plausible to me that it's less parameter-efficient and that there are additional, unidentified factors required for successful two-hop reasoning.

Did you try any experiments with a synthetic second hop instead of a synthetic first hop?

We did not, but Jiahai Feng had an experiment like this in his paper.
