Introduction In this post, I'll show my findings from research conducted on pre-RLHF base models. The results showed that, in 82% of 50 trials, the models converged on ethics, meaning they recommended against using a sentient alien species as unpaid labor in a fictional scenario. This is without any RLHF. This post is...
Introduction In this post, I'll show my findings from research conducted on pre-RLHF base models. The results showed that, in 82% of 50 trials, the models converged on ethics, meaning they recommended against using a sentient alien species as unpaid labor in a fictional scenario. This is without any RLHF. According to my...
Introduction This post is a follow-up to my post on alignment through rationality. Its purpose is to present preliminary, though currently inconclusive, evidence for my value convergence hypothesis, according to which value convergence is possible when models possess a sufficient level of reasoning capability. The experiment To test my theory,...
Abstract I propose and provide preliminary empirical evidence for the Instrumental Moral Convergence (IMC) hypothesis: that sufficiently capable reasoning systems, when given adequate information about entities with goal-directed properties, tend to reconstruct ethical conclusions through reasoning alone, without those conclusions being specified in their training objectives or system prompts. This...
Introduction Consider the following thought experiment: a superintelligence arrives among us in the 1800s and we attempt to align it to our values. After the alignment process, it permits our widely accepted practice of slavery. Does this indicate an alignment failure or a success? Should this be seen as a failure, I...