Dan Valentine — LessWrong

Top posts

Debating with More Persuasive LLMs Leads to More Truthful Answers

by Akbir Khan, John Hughes, Dan Valentine, Sam Bowman, and Ethan Perez

We've just completed a bunch of empirical work on LLM debate, and we're excited to share the results. If the title of this post is at all interesting to you, we recommend heading straight to the paper. There are a lot of interesting results that are hard to summarize, and...

Feb 7, 2024 • 89

Understanding mesa-optimization using toy models

by tilmanr, rusheb, Guillaume Corlouer, Dan Valentine, afspies, mivanitskiy, and Can

Overview
* Solving the problem of mesa-optimization would probably be easier if we understood how models do search internally
* We are training GPT-type models on the toy task of solving mazes and studying them in both a mechanistic interpretability and behavioral context.
* This post lays out our model...

May 7, 2023 • 46