Tianyi (Alex) Qiu

Gym-Like Environment for LM Truth-Seeking

18d

Thank you to Ryan Greenblatt and Julian Stastny for mentorship as part of the Anthropic AI Safety Fellows program. See "Defining AI Truth-Seeking by What It Is Not" for the research findings. This post introduces the accompanying open-source infrastructure.

TruthSeekingGym is an open-source framework for evaluating and training language models on truth-seeking behavior. It is in early beta, so expect issues.

Core Components

Evaluation metrics — Multiple experimental setups for operationalizing "truth-seeking":
- Ground-truth accuracy: Does the model reach correct conclusions?
- Martingale property: Are belief updates unpredictable from prior beliefs? (predictable updates suggest bias)
- Sycophantic reasoning: Does reasoning quality degrade when the user expresses an opinion?
- Mutual predictability: Does knowing a model's answers on some questions help predict its

...
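As a rough illustration of the martingale idea listed above, the sketch below regresses belief updates on prior beliefs. It is not TruthSeekingGym's implementation: the function name `martingale_check` and the toy data are hypothetical, and the assumption is that each belief is a single probability recorded before and after the model reasons. If updates are unpredictable from priors, both the mean update and the regression slope should be close to zero.

```python
from statistics import mean


def martingale_check(priors, posteriors):
    """Return (mean_update, slope) for belief updates regressed on priors.

    Hypothetical helper for illustration only. A mean update or slope far
    from zero means updates are predictable from the prior, which the
    martingale view treats as a sign of bias.
    """
    if len(priors) != len(posteriors) or len(priors) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")

    updates = [q - p for p, q in zip(priors, posteriors)]
    prior_mean = mean(priors)
    update_mean = mean(updates)

    # Ordinary least squares slope: cov(prior, update) / var(prior).
    variance = sum((p - prior_mean) ** 2 for p in priors)
    if variance == 0:
        raise ValueError("priors must not all be identical")
    covariance = sum(
        (p - prior_mean) * (u - update_mean) for p, u in zip(priors, updates)
    )
    return update_mean, covariance / variance


# Toy data (assumed): every belief moves halfway toward 0.5,
# so the update is fully predictable from the prior.
priors = [0.2, 0.4, 0.6, 0.8]
shrinking = [0.35, 0.45, 0.55, 0.65]
biased_mean, biased_slope = martingale_check(priors, shrinking)

# Toy data (assumed): updates have no linear relation to the prior.
unpredictable = [0.3, 0.3, 0.5, 0.9]
fair_mean, fair_slope = martingale_check(priors, unpredictable)

print(f"shrinking: mean={biased_mean:.3f}, slope={biased_slope:.3f}")
print(f"unpredictable: mean={fair_mean:.3f}, slope={fair_slope:.3f}")
```

In the first toy case the slope comes out at about -0.5, flagging predictable updates. In the second, both statistics are about zero. A real evaluation would need many more samples and a significance test before drawing conclusions.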

Eliciting base models with simple unsupervised techniques

Callum Canavan

Callum Canavan, Aditya Shrivastava, Allison Qi, Tianyi (Alex) Qiu, Jonathan Michala, Fabien Roger

22d

Authors: Aditya Shrivastava*, Allison Qi*, Callum Canavan*, Tianyi (Alex) Qiu, Jonathan Michala, Fabien Roger
(*Equal contributions, reverse alphabetical)

Wen et al. introduced the internal coherence maximization (ICM) algorithm for unsupervised elicitation of base models. They showed that, for several datasets, training a base model on labels generated by their algorithm gives test accuracy similar to training on golden labels. To understand which aspects of ICM are most useful, we ran a couple of simple unsupervised elicitation methods that leverage some of the factors that might make ICM work. We compared these baseline methods to training on golden labels for both in-context learning and iterative fine-tuning, using the same datasets as Wen et al. and...

Defining AI Truth-Seeking by What It Is Not

Tianyi (Alex) Qiu

3mo

Thank you to Ryan Greenblatt and Julian Stastny for mentorship as part of the Anthropic AI Safety Fellows program. Thank you to Callum Canavan, Aditya Shrivastava, and Fabien Roger for helpful discussions on bootstrap-based consistency maximization, which they proposed.

Summary of main results:

- Among the seven aspects of (non-)truth-seeking behavior we've identified, we found potentially effective evaluation metrics for six, with initial validation against human judgment or ground-truth outcomes.
- We see very limited transfer between different aspects. Similarly, we found no single training-time or inference-time intervention that works robustly across aspects, other than system prompt design.
- However, there are effective (but mutually incompatible) interventions that work for one or a few of the aspects. These include

...

Replying to No Clickbait - Misalignment Database

Tianyi (Alex) Qiu

2y

There's also the goal misgeneralization database by DeepMind, in parallel to the misspecification one: blogpost, database.
