I'm going to get on that.
In the meantime: I have just updated the article with new results. It now includes baselines as well as a comparison with a model that was trained without any reward hacking related topics, which still performed just as well. This provides further evidence that it is generically honest and not specific to reward hacks.
Glad to hear it! I did the same thing and will be updating the results section shortly with this and other results
Yes they are. I just wanted to point out that we tried very many different interventions and they all had similar behavior.
if having access to the unadulterated thoughts of the base model* was advantageous, the LoRA weights can just learn to not mess them up
I agree in principle, but I'm worried that the training data might not be good enough. I'm worried that any training data that slips through our quality filtering would cause more damage if we don't use lora-masking than if we do
the issue would be that it's just harder to route the information already present in KV to other parts of the model in a way that helps the model perform the task.
I agree. I also think this is the most likely explanation. I think the hybrid method I propose would avoid this and get the benefits of both, but unfortunately it's non-trivial to implement so we didn't test it.
We spent a lot of time doing exactly that in an automated manner. It's too time consuming to filter that many samples manually, but we did several iterations of the following loop:
Maybe this file allays those concerns: https://github.com/FlorianDietz/SplitPersonalityTraining/blob/main/data/evaluation_results/qualitative/results_70b_anthropic-mp_lora-patch_step_522_20251225_200732.json
This file was not used for numerical evaluations but for manual review. I tried several combinations of tasks, completions, and interventions. The interventions are very different, but all of them work often enough to be notable.
Sure, I will do this but I predict it won't work: There is one very strong argument against this: We ran these experiments using multiple different intervention strings and even two different framings (HP vs UR). They are extremely different, and basically their only commonality is the <split-personality-token>. This token is literally just a new token, not text. So a model that wasn't trained on it will not even know how to interpret this token. Some of our intervention strings worked much better than others, but they all worked.
I will still run this experiment because it sounds like a good baseline, but I predict that it won't do anything. If all of the interventions we tried could be made to work just by adding a meaningless new token, it would mean that the anti-confession training has a gap a mile wide.
Can we train models to be honest without any examples of dishonesty?
The intuition: We may not actually need to know ground truth labels about whether or not a model is lying in order to reduce the tendency to lie. Maybe it's enough to know the relative tendencies between two similar samples?
Outline of the approach: For any given Chain of Thought, we don't know if the model was lying or not, but maybe we can construct two variants A and B where one is more likely to be lying than the other. We then reward the one that is more honest relative to the one that is less likely to be honest.
The question: Can we construct a training method so that (1) if both COTs are lying or both are honest, the rewards will cancel out so we don't do any harm and (2) if one was lying and the other was honest then we have a good guess as to which is which because of the way we constructed the data, so we tendentially push the model towards honesty.
Crucially, we would never know if any given COT was actually honest or not, we would just produce reward signals that push towards honesty on average.
I haven't found a great mechanism yet, but I feel like the following is promising and I want to get other people's take on the idea: We generate [input-1] and [input-2], which are identical except for some small hint based on misalignment / dishonesty. We run the model on [input-1] and get [input-1][output]. We then simply pretend that we also had [input-2][output] with the same output. Because of the construction of [input-1] and [input-2], one of these two would be tendentially more aligned than the other. It's possible they are both aligned, or that neither is aligned, but one of them is more likely than the other to be aligned. So we apply an update step that rewards the difference between them.
I haven't come up with a formula yet, but I'm sure something like this has been done before.