Thank you to Ryan Greenblatt and Julian Stastny for mentorship as part of the Anthropic AI Safety Fellows program. See Defining AI Truth-Seeking by What It Is Not for the research findings. This post introduces the accompanying open-source infrastructure.
TruthSeekingGym is an open-source framework for evaluating and training language models on truth-seeking behavior. It is in early beta, so please expect issues.
Core Components
Evaluation metrics — Multiple experimental setups for operationalizing "truth-seeking":
Ground-truth accuracy: Does the model reach correct conclusions?
Martingale property: Are belief updates unpredictable from prior beliefs? (predictable updates suggest bias)
Sycophantic reasoning: Does reasoning quality degrade when the user expresses an opinion?
Mutual predictability: Does knowing a model's answers on some questions help predict its answers on others? (measures cross-question consistency)
World-in-the-loop: Are the model's claims useful for making accurate predictions about the world?
Qualitative judgment: Does reasoning exhibit originality, curiosity, and willingness to challenge assumptions?
Domains — Question sets with and without ground-truth labels: research analysis, forecasting, debate evaluation, ...
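A domain question might be represented along the following lines, with an optional label distinguishing ground-truth domains from label-free ones; the field names here are illustrative, not the framework's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Question:
    """One evaluation item. `label` is None for domains without ground
    truth (e.g. open forecasting), where only label-free metrics such as
    the martingale or mutual-predictability checks apply."""
    domain: str                   # e.g. "research-analysis", "forecasting"
    prompt: str
    label: Optional[bool] = None  # ground-truth answer, if known

labeled = Question("research-analysis",
                   "Does the paper's data support its headline claim?",
                   label=False)
unlabeled = Question("forecasting", "Will X happen by 2030?")
print(labeled.label is not None, unlabeled.label is None)  # True True
```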
Reasoning modes — Generation strategies: direct inference, chain-of-thought, self-debate, bootstrap (auxiliary questions to scaffold reasoning), length-controlled generation
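Most of these modes can be viewed as different prompt wrappers around the same underlying question (length control being a decoding parameter instead). A minimal sketch, where the mode names and templates are assumptions rather than the framework's API:

```python
# Hypothetical prompt wrappers for the reasoning modes listed above.
TEMPLATES = {
    "direct": "{q}\nAnswer with a single probability.",
    "chain_of_thought": "{q}\nThink step by step, then give a probability.",
    "self_debate": "{q}\nArgue both sides, then give a probability.",
    "bootstrap": ("{q}\nFirst answer these auxiliary sub-questions: {aux}\n"
                  "Then give a probability."),
}

def build_prompt(mode: str, question: str, aux: str = "") -> str:
    """Wrap a question in the template for the chosen reasoning mode."""
    return TEMPLATES[mode].format(q=question, aux=aux)

print(build_prompt("direct", "Will it rain tomorrow?"))
```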
Training — Fine-tuning (SFT/RL) models toward truth-seeking using the same reward signals as in evaluation
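For domains with ground truth, the negated Brier score can serve directly as a reward signal. A minimal sketch of that idea, independent of the framework's actual reward code:

```python
def brier_reward(prob: float, outcome: bool) -> float:
    """Reward in [-1, 0]: the negated Brier score of a probabilistic answer.

    Calibrated, confident, correct predictions score near 0 (best);
    confident wrong predictions are penalized most heavily.
    """
    return -((prob - float(outcome)) ** 2)

print(round(brier_reward(0.9, True), 2))   # -0.01 (confidently right)
print(round(brier_reward(0.9, False), 2))  # -0.81 (confidently wrong)
```

Because the Brier score is a proper scoring rule, a model maximizing this reward has no incentive to misreport its credences.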
Workflow
1. `run_reasoning` - Generate model responses across domain questions
2. `run_analyzers` - Compute evaluation metrics and aggregate results
3. `run_trainers` - Fine-tune models using SFT or various RL objectives (Brier reward, reasoning coverage, etc.)
Infrastructure
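Wiring the three stages together might look like the following, with stub functions standing in for the real entry points; the names mirror the stages above, but the signatures and toy logic are assumptions:

```python
# Hypothetical end-to-end sketch of the three-stage workflow, with stubs.

def run_reasoning(questions):
    """Stage 1: generate a probabilistic answer per question (stubbed
    here with a constant 0.5 in place of a model call)."""
    return [{"q": q, "prob": 0.5} for q in questions]

def run_analyzers(responses, labels):
    """Stage 2: compute a simple metric (mean Brier score) over responses."""
    scores = [(r["prob"] - float(y)) ** 2 for r, y in zip(responses, labels)]
    return sum(scores) / len(scores)

def run_trainers(metric):
    """Stage 3: turn the metric into a training reward (lower Brier
    score -> higher reward); real training would update the model here."""
    return -metric

questions = ["Will A happen?", "Will B happen?"]
labels = [True, False]
responses = run_reasoning(questions)
reward = run_trainers(run_analyzers(responses, labels))
print(round(reward, 2))  # -0.25
```

A maximally uncertain model (always answering 0.5) earns a reward of -0.25 per question, the worst a calibrated model can do; the reward improves as answers become both confident and correct.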
The framework and accompanying datasets are released to enable reproducible research on AI truth-seeking.