np_x

And I am worried that a lot of research sabotage won't require the model to reason through much of the sabotage parts in its CoT, e.g. because all it needs to do to sandbag the research is flip the sign on some experiments in relatively straightforward ways that don't require a ton of reasoning.

This definitely depends on the "blue team" protocol at hand right? If we're doing a bunch of cross-checking of research work / very carefully designed research sabotage honeypots during evaluations, then the model robustly needs to do a bunch of reasoning to ensure it's not caught.

Replying toHidden Reasoning in LLMs: A Taxonomy

np_x6mo

Hidden Reasoning in LLMs: A Taxonomy

The LLM relies on the CoT for internal reasoning but can access a neural memory bank or some other external tool that’s opaque. The activations and/or CoT only contain a pointer to p, which is stored only in the memory bank, not p itself.

Why does this seem unlikely to you all?

Replying toAI Task Length Horizons in Offensive Cybersecurity

np_x6mo

AI Task Length Horizons in Offensive Cybersecurity

Nicely done! Loved to see the analysis and write up.

np_x9mo

This calculus changes when you can work on many things at once (similar to @faul_sname's comment but this might be that you can work on many projects at once, even if they can't each be parallelized well).

LESSWRONG
LW

LESSWRONG
LW

np_x

np_x

np_x

np_x

np_x

np_x