Efficiently Detecting Hidden Reasoning with a Small Predictor Model
This post is based on work done during SPAR by Daria Ivanova, Yau-Meng Wong, Nicholas Chen, and Vishnu Vardhan with guidance from Rohan Subramani. We are grateful to Rauno Arike, Robert McCarthy, and Kei Nishimura-Gasparian for feedback.

TL;DR

We introduce two black-box approaches to detect hidden reasoning in a model’s...