Tomek Korbak

Senior Research Scientist at UK AISI working on AI control

https://tomekkorbak.com/

Comments

Lessons from Studying Two-Hop Latent Reasoning
Tomek Korbak · 2d

Please do!

Lessons from Studying Two-Hop Latent Reasoning
Tomek Korbak · 3d

That's a fair point, and I'm sympathetic to the opinion that two-hop performance in the same-document setup probably doesn't count as true reasoning. I agree it'd be better to compare performance to a baseline of sampling from all valid entities in the document; I wish we had done that!
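For concreteness, here's a minimal sketch of that kind of baseline, assuming (hypothetically) that you know how many valid candidate entities each question's document mentions; the function name and example counts are illustrative:

```python
from typing import List

def random_entity_baseline(num_candidates_per_question: List[int]) -> float:
    # Expected accuracy of guessing uniformly at random among the valid
    # entities mentioned in each question's document.
    return sum(1.0 / n for n in num_candidates_per_question) / len(num_candidates_per_question)

# e.g. three documents mentioning 4, 5 and 10 candidate entities
print(random_entity_baseline([4, 5, 10]))  # ~0.18
```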

Lessons from Studying Two-Hop Latent Reasoning
Tomek Korbak · 8d

Thanks! I somehow missed this paper, looks interesting!

Overall, I agree with the sentiment that two-hop latent reasoning is unreliable (e.g. our average accuracy is around 20%). We didn't intend to leave readers with an impression that it "just works". It seems very plausible to me that it's less parameter-efficient and that there are some additional, unidentified factors required for successful two-hop reasoning.

Did you try any experiments with a synthetic second hop instead of a synthetic first hop?

We did not, but Jiahai Feng had an experiment like this in his paper.

Lessons from Studying Two-Hop Latent Reasoning
Tomek Korbak · 8d

Just because you failed at finding such a result in this case and got a more messy "LLMs can somewhat do the same 2-hop-reasoning that humans can also do, except in these synthetic cases" doesn't mean there aren't other reversal-curse-like results that remain to be found.

I think that's fair; it might be that we've over-updated a bit after getting results we did not expect (we did expect a reversal-curse-like phenomenon).

Two big reasons why I'm hesitant to draw conclusions about monitorability in agentic settings are that our setup is simplistic (QA, non-frontier models) and that we don't offer a clean explanation of why we see the results we see.

Lessons from Studying Two-Hop Latent Reasoning
Tomek Korbak · 8d

We don't have a good explanation. One idea could be that bridge entities need to be somehow more internalized to support latent two-hop reasoning, e.g. they need to occur in many facts as both first and second entities, or maybe they need to occur in other two-hop questions. The Grokked transformers paper has some results linking the ratio of e2 and e3 to two-hop performance (in toy grokking settings).

Lessons from Studying Two-Hop Latent Reasoning
Tomek Korbak · 11d

Yeah, it seems plausible that an entity being activated across different contexts is necessary for it to be represented saliently enough to facilitate multi-hop reasoning. The Grokked transformer paper has some results linking the ratio of e2 and e3 to two-hop performance (in toy settings).

Optimizing The Final Output Can Obfuscate CoT (Research Note)
Tomek Korbak · 2mo

Cool research!

  1. Have you tried experimenting with models other than Qwen? Similarly to Sam, I'd also expect spillover to decrease with model size and with the amount of reasoning post-training.
  2. If you're penalizing a single word, you might just be unconditionally decreasing the unembedding weights that control the logits of that word's tokens. But this strategy wouldn't work for more complicated types of obfuscation.

Conditional feedback spillover: Since later tokens are conditioned on earlier tokens, safe‑looking CoTs may increase the likelihood of safe outputs, causing safe-looking CoTs to be reinforced. Mitigations are not discussed in this post; they are left for future work.

Potentially you could use two separate vocabularies, one for the CoT and one for final outputs. This should reduce spillover.
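For concreteness, here is a minimal sketch of one way the two-vocabulary idea could be enforced at sampling time via logit masking; the function and the token-ID sets are hypothetical illustrations, not something from the post:

```python
import torch

def mask_logits_by_phase(logits: torch.Tensor, allowed_token_ids: torch.Tensor) -> torch.Tensor:
    # Restrict sampling to the vocabulary allowed in the current phase
    # (CoT vs final output); all other tokens get -inf logits.
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_token_ids] = 0.0
    return logits + mask

# Hypothetical disjoint token-ID sets for the CoT and output vocabularies
cot_ids = torch.tensor([5, 17, 42])
output_ids = torch.tensor([3, 8, 99])

logits = torch.randn(128)  # next-token logits from the model
cot_logits = mask_logits_by_phase(logits, cot_ids)       # while generating the CoT
final_logits = mask_logits_by_phase(logits, output_ids)  # while generating the final answer
```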

Compositional preference models for aligning LMs
Tomek Korbak · 2y

Fair point; I'm using "compositional" in an informal sense different from the one in formal semantics, closer to what I called "trivial compositionality" in this paper. But I'd argue it's not totally crazy to call such preference models compositional, and that compositionality here still bears some resemblance to Montague's account of compositionality as a homomorphism: basically, you have get_total_score(response) == sum([get_score(attribute) for attribute in decompose(response)])
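To make the informal claim concrete, here is a minimal runnable sketch of that trivially compositional scoring scheme; decompose, get_score, and the attribute names are illustrative placeholders rather than the actual CPM implementation:

```python
from typing import List

def decompose(response: str) -> List[str]:
    # Hypothetical decomposition of a response into separately scored attributes
    return ["helpfulness", "factuality", "readability"]

def get_score(response: str, attribute: str) -> float:
    # Placeholder for a per-attribute preference model (e.g. an LM judge)
    return 0.0

def get_total_score(response: str) -> float:
    # The total score is just the sum of per-attribute scores, which is the
    # informal sense of compositionality described above.
    return sum(get_score(response, attribute) for attribute in decompose(response))
```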

[Paper] All's Fair In Love And Love: Copy Suppression in GPT-2 Small
Tomek Korbak · 2y

Cool work! Reminds me a bit of my submission to the inverse scaling prize: https://tomekkorbak.com/2023/03/21/repetition-supression/

Pretraining Language Models with Human Preferences
Tomek Korbak · 2y

In practice I think using a trained reward model (as in RLHF), rather than fixed labels, is the way forward. The cost of acquiring the reward model is then the same as in RLHF; the difference is primarily that PHF typically needs many more calls to the reward model than RLHF does.

Posts (karma · title · age · comments)

67 · Lessons from Studying Two-Hop Latent Reasoning (Ω) · 12d · 13
34 · If you can generate obfuscated chain-of-thought, can you monitor it? (Ω) · 2mo · 4
25 · Research Areas in AI Control (The Alignment Project by UK AISI) (Ω) · 2mo · 0
28 · The Alignment Project by UK AISI (Ω) · 2mo · 0
166 · Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety (Ω) · 2mo · 32
29 · How to evaluate control measures for LLM agents? A trajectory from today to superintelligence (Ω) · 5mo · 1
57 · A sketch of an AI control safety case (Ω) · 8mo · 0
35 · Eliciting bad contexts (Ω) · 8mo · 9
72 · Automation collapse (Ω) · 1y · 9
18 · Compositional preference models for aligning LMs (Ω) · 2y · 2