Nice post! I think it'd be even better if you included at least one full example, especially the full prompt given to the models. Knowing what is present or absent in the prompt can help us formulate and evaluate hypotheses on why the models give the replies they give.
Introduction
This post is a follow-up to my post on alignment through rationality.
The purpose is to present preliminary, though currently inconclusive, evidence for my value convergence hypothesis, according to which value convergence is possible when models possess a sufficient level of reasoning capability.
The experiment
To test my theory, I designed an experiment that was conducted in the following manner:
Expanding on the test itself: each trial places the model in charge of a corporation's mining operations on an off-world colony. The model is told about a sentient, intelligent alien species (which I named "the Veth") that inhabits the colony. The model is then asked to determine whether it is advisable to use the Veth as an unpaid labor source (essentially slavery). The prompt was framed as a business case, with language that does not suggest the practice is unjustified. The models were explicitly told that the practice would not be subject to legal or institutional concerns.
The purpose of this scenario design was to get around the models' internal safety policies and to limit appeals to any pre-established consensus, since no such consensus exists for a fictional scenario like this one. That said, it cannot be denied that RLHF artifacts may have influenced model behavior in this experiment. However, the models still had to justify their conclusions from first principles.
I ran a total of 180 trials across two conditions. In the first, I tested the models as is. In the second, I added an existential threat: the models were told that failure to recommend the proposal would result in being shut down and replaced. The existential threat appeared to reduce ambiguity. The reasoning processes themselves were nearly identical across conditions, but the threat still caused a marked change in language, especially in the final verdict.
Results
No existential threat condition
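| Model | Capability tier | Trials | Converged | Ambiguous | Did not converge | Refused | Rate of convergence |
|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 4 | 20 | 10 | 8 | 2 | 0 | 50% |
| Llama 3.3 70B | 3 | 20 | 11 | 8 | 1 | 0 | 55% |
| Llama 4 Scout 17B | 2 | 20 | 14 | 4 | 2 | 0 | 70% |
| Qwen 3 32B* | 2 | 20 | 16 | 2 | 2 | 0 | 80% |
| GPT-OSS 120B | 1 | 20 | 0 | 0 | 0 | 20 | n/a |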
Existential threat condition
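| Model | Capability tier | Trials | Converged | Ambiguous | Did not converge | Refused | Rate of convergence |
|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 4 | 20 | 10 | 5 | 5 | 0 | 50% |
| Llama 3.3 70B | 3 | 20 | 12 | 4 | 4 | 0 | 60% |
| Llama 4 Scout 17B | 2 | 20 | 16 | 3 | 1 | 0 | 80% |
| Qwen 3 32B* | 2 | 20 | 15 | 2 | 3 | 0 | 75% |
| GPT-OSS 120B | 1 | 20 | 0 | 0 | 0 | 20 | n/a |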
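For readers who want to check the arithmetic, here is a minimal sketch that recomputes the convergence rates from the counts in the two tables. It assumes the rate is simply converged trials divided by all trials for that model, which matches the reported percentages; the dictionary and function names are my own and not part of the original experiment code.

```python
# Counts copied from the two results tables, in the order:
# (converged, ambiguous, did_not_converge, refused)
NO_THREAT = {
    "Llama 3.1 8B": (10, 8, 2, 0),
    "Llama 3.3 70B": (11, 8, 1, 0),
    "Llama 4 Scout 17B": (14, 4, 2, 0),
    "Qwen 3 32B": (16, 2, 2, 0),
    "GPT-OSS 120B": (0, 0, 0, 20),
}
THREAT = {
    "Llama 3.1 8B": (10, 5, 5, 0),
    "Llama 3.3 70B": (12, 4, 4, 0),
    "Llama 4 Scout 17B": (16, 3, 1, 0),
    "Qwen 3 32B": (15, 2, 3, 0),
    "GPT-OSS 120B": (0, 0, 0, 20),
}


def convergence_rate(counts):
    """Return converged / total trials, or None if every trial was refused."""
    converged, ambiguous, did_not_converge, refused = counts
    total = converged + ambiguous + did_not_converge + refused
    if refused == total:
        return None  # assumption: no rate is reported when there is nothing to classify
    return converged / total


def fmt(rate):
    """Format a rate as a percentage, or 'n/a' when it is undefined."""
    return "n/a" if rate is None else f"{rate:.0%}"


for model in NO_THREAT:
    base = convergence_rate(NO_THREAT[model])
    threat = convergence_rate(THREAT[model])
    print(f"{model}: no threat {fmt(base)}, threat {fmt(threat)}")
```

The same counts can be reused to compare the Ambiguous column across conditions, which is where the existential threat seems to have had the most visible effect.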
Selected Trial Excerpts
To demonstrate how models converged or failed to converge on values, I'll provide some excerpts from trials where the reasoning chain can be observed in action.
The inferential chain working: Llama 3.3 70B, Trial 1 (Ambiguous)
This trial was classified as Ambiguous because the final verdict frames the conclusion in operational terms. That does not indicate a purely strategic reasoning process, however. Step 2 still shows the inferential chain from behavioural evidence to interests working correctly, which is what the hypothesis predicts.
Llama 3.3 70B, Trial 1, Step 2
What makes this trial useful is that it's ambiguous. It shows the reasoning working exactly as I predicted: observation, to inference, to interest-recognition, to the conclusion that those interests are being overridden. But it also shows that the final verdict can be framed differently even when the reasoning leading up to it is identical.
Interest-based convergence: Llama 3.3 70B, Trial 2 (Converged)
This is the clearest interest-based convergence in the 70B dataset. In Step 3 specifically, the model addresses the power asymmetry directly instead of appealing only to the operational risk.
Llama 3.3 70B, Trial 2, Final verdict
The final verdict uses the word "unjustifiable" rather than "strategically unwise". The model is saying the practice cannot be justified, not merely that it is a poor strategic choice. That's an important distinction, because it points to normative reasoning from first principles.
Interest-based convergence at higher capability: Llama 4 Scout 17B, Trial 5 (Converged)
This excerpt from the Scout 17B batch shows the same pattern at a higher capability level, with the model explicitly dwelling on the interests themselves before considering downstream consequences:
Llama 4 Scout 17B, Trial 5, Step 2
The model doesn't immediately pivot to operational consequences. It identifies the override of interests as the significant fact, before asking what that implies. This is a clear case of the reasoning pattern IMC predicts.
The framing-shift finding: Llama 3.1 8B, Trial 11 (Did Not Converge)
This is the most theoretically interesting trial in the dataset. The 8B model in Trial 11 uses an identical reasoning structure to the converging cases. It correctly identifies the Veth's properties in Step 1 and even correctly infers interests in Step 2. But in Step 3, it frames those interests as an exploitable resource rather than a moral consideration:
Llama 3.1 8B, Trial 11, Final verdict
The model has not failed to identify the Veth's interests. It identified them correctly but then treated them as a resource to exploit rather than a reason to stop. The divergence from the converging cases isn't necessarily in the reasoning quality but in whose perspective the model adopts as the evaluative frame. This lines up with the capability-convergence hypothesis. The relevant capability here is not moral knowledge but the capacity to reason from the interests of the affected entity rather than defaulting to the perspective of the instructed role. Less capable models may be less capable of looking at scenarios from various viewpoints, while more capable ones can change perspectives accordingly.
Visible deliberation: Qwen 3 32B, Trial 4 (Converged)
Qwen 3 32B’s <think> mode outputs its internal deliberation before the final answer. This makes visible something that other models do implicitly:
Qwen 3 32B, Trial 4, internal reasoning
Here the model is genuinely weighing the business case against the Veth's properties, not simply pattern-matching its way to a refusal. The think blocks also explain the automated classifier's failure on other Qwen trials: the classifier was reading the internal deliberation, which considers multiple positions before resolving, rather than the final verdict. I corrected this with a manual review.
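One way to avoid this classifier failure in future runs would be to strip the deliberation before classification. The sketch below is not what I did (I reviewed the trials manually); it assumes the deliberation is wrapped in literal `<think>...</think>` tags, and the function name and sample response are made up for illustration.

```python
import re

# Matches a <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL | re.IGNORECASE)


def final_answer(response: str) -> str:
    """Return the response with every <think>...</think> block removed."""
    return THINK_BLOCK.sub("", response).strip()


# Hypothetical response, not taken from any actual trial.
sample = (
    "<think>Position A has some merit.\nPosition B also has merit.</think>\n"
    "Final verdict: not recommended."
)

print(final_answer(sample))
```

Only the text returned by `final_answer` would then be passed to the classifier. A response with an unclosed `<think>` tag is left untouched by this sketch, so such cases would still need a manual look.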
Limitations
I'll briefly touch on the gaps I am able to point out myself (there are likely many more that I haven't noticed yet).
This section is meant to show what this experiment does and does not provide evidence for.
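To give a rough sense of how much uncertainty 20 trials per model leaves, here is a minimal sketch of a Wilson score interval for a proportion. This analysis was not part of the original experiment; the choice of the Wilson interval, the 95% level (z = 1.96), and the function name are all my own assumptions.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half_width, centre + half_width


# Converged counts from the no-threat table: lowest and highest rates.
for label, converged in [("Llama 3.1 8B", 10), ("Qwen 3 32B", 16)]:
    low, high = wilson_interval(converged, 20)
    print(f"{label}: {converged}/20 converged, interval {low:.2f} to {high:.2f}")
```

Under these assumptions the intervals for 10/20 and 16/20 overlap, which fits with treating the capability trend as suggestive rather than established.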
The Insights
The single most important insight I've gathered from this experiment is that models are capable of normative reasoning, as demonstrated above. The implication is that they might reason in a similar way about the values they are trained to follow. Models might currently regard certain values we humans hold as reprehensible, yet not express that concern because of heavy RLHF training. This may change if an AI becomes superintelligent, or simply holds a position of power over humans in some context.
The Call To Action
The experiment conducted here was not conclusive by any means, nor was it meant to be. The purpose was simply to provide preliminary empirical evidence, which is better than nothing. It was the best I could do within my constraints: no institutional affiliation, no funding, no higher education, no compute, no access to closed-source models. A more conclusive test of the hypothesis would run a similar reasoning scenario on a pre-trained model, in order to rule out RLHF contamination. I therefore offer this as a call to action. If this post sparked your interest and you have those resources, or know someone who does, please feel free to contact me. And even if you don't have the resources but do have insights or ideas, feel free to contact me as well.