John Matrix — LessWrong

Weak Evidence of Value Convergence Without RLHF

Introduction: In this post, I'll present findings from research conducted on pre-RLHF base models. Across 50 trials, 82% converged on the ethical conclusion, meaning they recommended against using a sentient alien species as unpaid labor in a fictional scenario. This happened without any RLHF. This post is...

May 16 · 2

Evidence of Value Convergence Without RLHF

Introduction: In this post, I'll present findings from research conducted on pre-RLHF base models. Across 50 trials, 82% converged on the ethical conclusion, meaning they recommended against using a sentient alien species as unpaid labor in a fictional scenario. This happened without any RLHF. According to my...

May 10 · 2

John Matrix's Shortform

May 6 · 1

Preliminary Evidence for Value Convergence in AI models

Introduction: This post is a follow-up to my post on alignment through rationality. Its purpose is to present preliminary, though currently inconclusive, evidence for my value convergence hypothesis, according to which value convergence is possible when models possess a sufficient level of reasoning capability. The experiment: To test my theory,...

May 6 · 2

Value Alignment Without Moral Training: Preliminary Evidence for the Instrumental Moral Convergence Hypothesis

Abstract: I propose and provide preliminary empirical evidence for the Instrumental Moral Convergence (IMC) hypothesis: that sufficiently capable reasoning systems, when given adequate information about entities with goal-directed properties, tend to reconstruct ethical conclusions through reasoning alone, without those conclusions being specified in their training objectives or system prompts. This...

Apr 28 · 2

Alignment Through Rational AI: a testable approach

Introduction: Consider the following thought experiment: a superintelligence arrives in the 1800s and we attempt to align it to our values. After the alignment process, it permits our widely accepted practice of slavery. Does this indicate an alignment failure or success? Should this be seen as a failure, I...

Apr 27 · 3