zac_kenton

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

by Seb Farquhar, Vikrant Varma, zac_kenton, gasteigerjo, Vlad Mikulik, and Rohin Shah

TL;DR: Contrast-consistent search (CCS) seemed exciting to us and we were keen to apply it. At this point, we think it is unlikely to be directly helpful for implementations of alignment strategies (>95%). Instead of finding knowledge, it seems to find the most prominent feature. We are less sure about...

Dec 18, 2023

Clarifying AI X-risk

Threat Model Literature Review

Discovering Agents

On scalable oversight with weak LLMs judging strong LLMs
