lbernick — LessWrong

lbernick

31

5y

What counts as illegible reasoning?

Summary

Illegible reasoning in LLMs has been observed in OpenAI models, and understanding this behavior would be beneficial for AI safety research. This post describes challenges with reproducing this behavior in open models and limitations of LLM-as-judge strategies for detecting illegible reasoning.

Illegible reasoning is relevant for AI safety

Both...

Investigating encoded reasoning in LLMs

Epistemic status: This work was done as a 1-week capstone project for ARENA. It highlights several areas of research we’d like to explore further. Chain of thought (CoT) monitoring provides alignment researchers with a proxy for models’ internal reasoning, under the assumption that models’ CoTs are representative of how they...