If you can generate obfuscated chain-of-thought, can you monitor it?
tldr: Chain-of-thought (CoT) monitoring is a proposed safety mechanism for overseeing AI systems, but its effectiveness against deliberately obfuscated reasoning remains unproven. "Obfuscated" reasoning here means reasoning that a human would not understand, but that still carries information a model could in principle recover, for example model-encrypted reasoning…
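To make the definition concrete, here is a toy illustration (not from the source, and far simpler than anything a model would actually learn): reasoning passed through a Caesar-style shift over printable ASCII is gibberish to a human reviewer, yet anything that knows the scheme can recover it exactly. The `encode`/`decode` helpers and the sample chain of thought are hypothetical names invented for this sketch.

```python
# Toy sketch of "obfuscated but recoverable in principle" reasoning:
# a Caesar shift over the printable ASCII range (32..126).
# This is an illustrative stand-in, not how a model would encode its CoT.

def encode(text: str, shift: int = 7) -> str:
    """Shift each printable ASCII character by `shift`, wrapping within the range."""
    lo, hi = 32, 126  # printable ASCII bounds
    span = hi - lo + 1
    return "".join(
        chr(lo + (ord(c) - lo + shift) % span) if lo <= ord(c) <= hi else c
        for c in text
    )

def decode(text: str, shift: int = 7) -> str:
    """Invert `encode` by shifting in the opposite direction."""
    return encode(text, -shift)

cot = "Step 1: the user asked for X, so first check constraint Y."
obfuscated = encode(cot)
print(obfuscated)          # unreadable to a human monitor at a glance
print(decode(obfuscated))  # the original reasoning, recovered exactly
```

A human monitor scanning `obfuscated` sees noise, while the information content is fully preserved; this is the gap CoT monitoring must close against genuinely model-encrypted reasoning, where no simple published decoder exists.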