This post was written as part of the summer 2024 cohort of the ML Alignment & Theory Scholars program, under the mentorship of Marius Hobbhahn.
Over the past four weeks, we have been developing an evaluation suite to measure the goal-directedness of LLMs. This post outlines our motivation, our approach, and how we've come to think about the problem, along with initial results from experiments in two simulated environments.
The main motivation for writing this post is to convey our intuitions about goal-directedness and to gather feedback on our evaluation procedure. Since these are uncertain, preliminary results, we welcome any insights or critiques; please let us know if you think we're failing to evaluate the right thing!
We want to evaluate goal-directedness for three reasons:
I really appreciate the thoughtful replies and feedback—I also didn't read any of your comments as rude or mean! I'd like to clarify a few points about our approach and its relevance:
We expect that good CoT reasoning helps current models better pursue and achieve goals. In the future, all of the reasoning needed to do this effectively might happen in the forward pass. However, we think it is plausible that AGI will be reached by models similar to current SOTA models, in which case sophisticated reasoning for goal pursuit will still need to happen in CoT.
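To make the kind of comparison we have in mind concrete, here is a minimal sketch of a two-condition probe: the same toy goal-reaching task, prompted with and without a CoT prefix. This is not our actual harness; `query_model`, the prompt wording, and the toy environment are all placeholder assumptions, and the random stub exists only so the sketch runs end to end.

```python
import random

# Prefix used in the CoT condition only (placeholder wording).
COT_PREFIX = "Think step by step about how to reach the goal, then act.\n"

def query_model(prompt: str) -> str:
    # Placeholder: in a real run this would call the model under evaluation.
    # The random stub just keeps the sketch self-contained and runnable.
    return random.choice(["LEFT", "RIGHT"])

def run_episode(goal: str, use_cot: bool, max_steps: int = 10) -> bool:
    """Roll out one episode in a 1-D toy environment; report goal success."""
    position, target = 0, 3  # start at cell 0, try to reach cell 3
    for _ in range(max_steps):
        prompt = (COT_PREFIX if use_cot else "") + (
            f"Goal: {goal}. You are at cell {position}; the target is cell "
            f"{target}. Reply with LEFT or RIGHT."
        )
        action = query_model(prompt).strip().upper()
        position += 1 if action == "RIGHT" else -1
        if position == target:
            return True
    return False

def success_rate(use_cot: bool, trials: int = 100) -> float:
    goal = "reach the target cell"
    return sum(run_episode(goal, use_cot) for _ in range(trials)) / trials

if __name__ == "__main__":
    print(f"no-CoT success: {success_rate(False):.2f}")
    print(f"CoT success:    {success_rate(True):.2f}")
```

With a real model behind `query_model`, the gap between the two success rates would give a crude signal of how much of the model's goal pursuit currently depends on explicit CoT rather than the forward pass.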