[Interim research report] Evaluating the Goal-Directedness of Language Models
This post was written as part of the summer 2024 cohort of the ML Alignment & Theory Scholars program, under the mentorship of Marius Hobbhahn.

Summary

Over the past four weeks, we have been developing an evaluation suite to measure the goal-directedness of LLMs. This post outlines our motivation, our...
I really appreciate the thoughtful replies and feedback—I also didn't read any of your comments as rude or mean! I'd like to clarify a few points about our approach and its relevance:
We expect that good CoT reasoning helps current models better pursue and achieve goals. In the future, all of the reasoning needed to do this effectively might happen in the forward pass. However, we think it's likely that AGI could be achieved with models similar to current SOTA models, where sophisticated reasoning for goal pursuit will still need to happen in CoT. Even if this bet is wrong and models can do this reasoning entirely in the forward pass, our evals...