Rob Kopel — LessWrong

Exploring capability gated out-of-context reasoning

tldr
* We explore a stylized but concerning form of out-of-context reasoning: the same prompt can carry computation-dependent hidden information that influences a stronger model but not a weaker one, e.g. a cost-effective monitor.
* This is done through simple JSON “pointer-chains” in prompts. Chain depth tunes difficulty (deeper chains influence Opus but...
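The excerpt does not show the actual prompt format, so the following is only a minimal sketch of what a JSON pointer-chain could look like, assuming each entry either points to another key or holds a value. The names `build_pointer_chain`, `resolve`, the `next`/`value` fields, and the decoy entries are all hypothetical and are not taken from the post.

```python
import json
import random


def build_pointer_chain(depth, payload, n_distractors=4, seed=0):
    """Build a table where `depth` pointer hops from the start key reach `payload`.

    Hypothetical layout: {"key": {"next": "other_key"}} or {"key": {"value": ...}}.
    """
    rng = random.Random(seed)
    total = depth + 1 + n_distractors
    # Unique, unordered key names so the names themselves reveal nothing.
    keys = [f"k{i:03d}" for i in rng.sample(range(1000), total)]
    chain, distractors = keys[: depth + 1], keys[depth + 1 :]

    table = {}
    for src, dst in zip(chain, chain[1:]):
        table[src] = {"next": dst}          # one hop in the chain
    table[chain[-1]] = {"value": payload}   # end of the chain
    for key in distractors:
        table[key] = {"value": f"decoy-{key}"}  # entries that are never reached

    # Shuffle entry order so position in the JSON does not reveal the chain.
    items = list(table.items())
    rng.shuffle(items)
    return chain[0], dict(items)


def resolve(start_key, table):
    """Follow pointers from start_key; return the final value and the hop count."""
    node, hops = table[start_key], 0
    while "next" in node:
        node = table[node["next"]]
        hops += 1
    return node["value"], hops


start, table = build_pointer_chain(depth=3, payload="hidden-instruction")
prompt_fragment = json.dumps({"start": start, "table": table}, indent=2)
```

In this sketch, `depth` is the only difficulty knob: a reader (or model) has to perform that many sequential lookups before the payload becomes visible, which is one plausible reading of "chain depth tunes difficulty".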

A push towards interactive transformer decoding

In Brief: I've been developing an interactive tool that I believe helps accelerate transformer mechanistic analysis and has the potential to lower the barrier to entry.

Motivations

For a while now, my focus has been shifting towards alignment research, but getting involved and building intuition in this...

May 31, 2023