This post was produced as part of the Astra Fellowship under the Winter 2024 Cohort, mentored by Richard Ngo. Epistemic status: relatively confident in the overall direction of this post, but looking for feedback! TL;DR: When are ML systems well-modeled as coherent expected utility maximizers? We apply our theoretical model...
This post was produced as part of the Astra Fellowship under the Winter 2024 Cohort, mentored by Richard Ngo. Thanks to Martín Soto, Jeremy Gillen, Daniel Kokotajlo, and Lukas Berglund for feedback. Summary: Discussions around the likelihood and threat models of AI existential risk (x-risk) often hinge on some informal...
In Spring 2023, the Berkeley AI Safety Initiative for Students (BASIS) organized an alignment research program for students, drawing inspiration from similar programs by Stanford AI Alignment[1] and OxAI Safety Hub. We brought together 12 researchers from organizations like CHAI, FAR AI, Redwood Research, and Anthropic, and 38 research participants...