Collin

543

Unsupervised Elicitation of Language Models

A key problem in alignment research is how to align superhuman models whose behavior humans cannot reliably supervise. If we use today’s standard post-training approach to align models with human-specified behaviors (e.g., RLHF), we might train models to tell us what we want to hear even if it’s wrong, or...

Jun 13, 2025 · 57

What AI Safety Materials Do ML Researchers Find Compelling?

I (Vael Gates) recently ran a small pilot study with Collin Burns in which we showed ML researchers (randomly selected NeurIPS / ICML / ICLR 2021 authors) a number of introductory AI safety materials, asking them to answer questions and rate those materials. Summary: We selected materials that were relatively...

Dec 28, 2022 · 175

How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Introduction: A few collaborators and I recently released a new paper: Discovering Latent Knowledge in Language Models Without Supervision. For a quick summary of our paper, you can check out this Twitter thread. In this post I will describe how I think the results and methods in our paper fit...

Dec 15, 2022 · 244
