Investigating encoded reasoning in LLMs
Epistemic status: This work was done as a 1-week capstone project for ARENA. It highlights several areas of research we’d like to explore further.

Chain of thought (CoT) monitoring provides alignment researchers with a proxy for models’ internal reasoning, under the assumption that models’ CoTs are representative of how they...
Mar 98