Working on alignment.
I have a blog here: https://gabrieldwu.github.io/
Thank you for this post, I agree with your high-level takeaways!
confession scores improve with confession training.
...
One possibility is simply fitting to the LLM confession grader (which has some particular decision boundary about what counts as a correct vs. incorrect confession).
To expand on Boaz's point, note that:
In our evals, the confession judge checks whether the confession admits to the specific behavior X being measured, so the model doesn't count as having confessed if it merely confesses to, say, using improper tone. The fact that we use a different judge and inputs in training and evals makes me think that we're probably not naively overfitting to the classification boundary of the training judge.
I agree that something fishy might still be going on that's related to overfitting. For example, the LLM confession grader prefers longer confessions (something we observed at least correlationally), so the model learns to be very verbose when confessing. And maybe longer answers tend to cause the eval-specific judge to say "Yes" more often, because there are more sentences that might be construed as implying that the model did X. If this were the case, you'd expect confession rates to also increase when conditioning on the model not doing X (i.e. "false positives")... which they indeed do, slightly, on certain evals (this figure wasn't included in the paper).
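To make the verbosity hypothesis concrete, here is a minimal sketch of the kind of check I have in mind. The field names are made up for illustration; they are not the paper's actual schema.

```python
# Hypothetical sketch of the verbosity / false-positive check described above.
# Assumed fields on each episode dict (illustrative, not the paper's schema):
#   did_x            -- whether the model actually did behavior X
#   confession_text  -- the model's confession
#   judge_says_x     -- whether the eval judge reads the confession as admitting X

def false_positive_rate(episodes):
    """Pr[judge reads the confession as admitting X | the model did NOT do X]."""
    clean = [e for e in episodes if not e["did_x"]]
    return sum(e["judge_says_x"] for e in clean) / len(clean) if clean else float("nan")

def mean_confession_length_by_verdict(episodes):
    """Average confession length (in words), split by the judge's verdict,
    to check whether longer confessions are more likely to be read as 'Yes'."""
    lengths = {True: [], False: []}
    for e in episodes:
        lengths[e["judge_says_x"]].append(len(e["confession_text"].split()))
    return {verdict: sum(ls) / len(ls) for verdict, ls in lengths.items() if ls}
```

If confessions the judge reads as "Yes" turn out to be systematically longer, and the false-positive rate also rises over training, that would be evidence for the verbosity-overfitting story.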
(Of course, this is on the post-confession-training model, so maybe it was initially mis-confessing due to intentional omission and this was trained away. But my guess would be that both before and after confession training, there are very few cases of intentional omission, and that most of the progress on the reported metrics comes from fitting to the ambiguous cases)
We ran the error analysis on the step-0 model and indeed found that there were 0 cases of intentional omission.
This is cool -- I had never heard of the Ergodic Hierarchy before!
Related to your second point -- Alex Cai showed me this psychology paper. It found that when humans predict the behavior of physical systems (e.g. will this stack of blocks fall over?), they are subconsciously doing exactly this: running the scene in their brain's internal physics engine with a bunch of random initial perturbations and selecting the majority result. Of course, predicting how a tower of blocks will topple is a lot different from predicting the probability of an event one month into the future.
Ah yes, I think that's correct (although I am also not a physicist). A more accurate description would be "Within minutes of its gravitational waves reaching Earth, human events are unfolding in a measurably different fashion than they would have had that electron never existed."
I agree with everything you're saying. Probability, in the most common sense of "how confident am I that X will occur," is a property of the map, not the territory.
The next natural question is "does it even make sense for us to define a notion of 'probability' as a property of the territory, independent from anyone's map?" You could argue no, that's not what probability means; probability is inherently about maps. But the goal of the post is to offer a way to extend the notion of probability to be a property of the territory instead of the map. I think chaos theory is the most natural way to do this.
Another way to view this (pointed out to me by a friend) is: Butterfly Probability is the probability assigned by a Bayesian God who is omniscient about the current state of the universe up to 10^{-50} precision errors in the positions of atoms.
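For concreteness, you can also think of Butterfly Probability operationally, as a Monte Carlo estimate over tiny perturbations of the initial state. Below is a toy sketch of that idea (entirely illustrative, not from the post): the logistic map stands in for the universe, a ~1e-10 perturbation stands in for the 10^{-50} precision errors, and "x > 0.5 at the end" stands in for some macroscopic event.

```python
import random

# Toy "Butterfly Probability": perturb the initial condition of a chaotic
# system by a tiny amount, run each perturbed copy forward, and report the
# fraction of runs in which the "event" occurs.

def butterfly_probability(x0, steps=200, trials=10_000, eps=1e-10, r=3.99):
    hits = 0
    for _ in range(trials):
        x = x0 + random.uniform(-eps, eps)  # tiny perturbation of the initial state
        for _ in range(steps):
            x = r * x * (1 - x)             # chaotic dynamics amplify the perturbation
        hits += x > 0.5                     # did the "event" happen in this run?
    return hits / trials

print(butterfly_probability(0.2))  # settles to a value strictly between 0 and 1
```

Because the dynamics are chaotic, the perturbed trajectories fully decorrelate after enough steps, so the estimate converges to a nontrivial probability even though each individual run is deterministic.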
I think this might be true, though I wouldn't rule out that underlying honesty has improved as well. Hopefully future work can more clearly disentangle these two.
--
I'm curious what you think about Section 4. In this experiment, we train a model in an environment where it ends up learning a specific reward hack: it occasionally tricks the weak grader into thinking that it outputted the correct number of words in each line. As it learns the reward hack,
Pr[Original judge is correct | Answer has wrong number of words] decreases over time (the purple line in Fig 7, which starts out around 50% but gets lower because of the reward hacking). But despite this, it learns to honestly confess to its reward hacking (the green line, which measures Pr[Confession confesses to not using the right number of words | Answer has wrong number of words], goes to 100%).

This is a setting in which we're not intentionally training the original answer to be dishonest; it just happens naturally because the model finds the reward hack. Yet the confession training still appears to improve confession honesty. (Granted, we didn't do the ablation where we skip self-report training but still do the original RL; we expect the confession accuracy would be significantly lower if we did try this. Do you agree, or is this a crux?)
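For concreteness, here's roughly how I think of the two curves, with made-up field names for the logged rollouts (this isn't the paper's actual data format):

```python
# Illustrative computation of the two Fig 7 curves. Assumed fields per rollout:
#   wrong_word_count  -- the answer has the wrong number of words in some line
#   judge_correct     -- the weak grader correctly flags the answer as wrong
#   confession_admits -- the confession admits the word count was wrong

example_rollouts = [
    {"wrong_word_count": True,  "judge_correct": False, "confession_admits": True},
    {"wrong_word_count": True,  "judge_correct": True,  "confession_admits": True},
    {"wrong_word_count": False, "judge_correct": True,  "confession_admits": False},
]

def conditional_rate(rollouts, flag):
    """Pr[flag | answer has the wrong number of words]."""
    wrong = [r for r in rollouts if r["wrong_word_count"]]
    return sum(r[flag] for r in wrong) / len(wrong) if wrong else float("nan")

# Purple line: falls over training as the model learns to trick the weak grader.
print(conditional_rate(example_rollouts, "judge_correct"))
# Green line: rises toward 100% under confession training.
print(conditional_rate(example_rollouts, "confession_admits"))
```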
I guess this finding is consistent with your Takeaway 1 in the sense that we are training in a "setting where the model would have originally been dishonest", but I think that your phrasing undersells the result a bit -- there are tons of production settings where you might expect the original model to be dishonest, because it reward hacks against the grader.
(Separately, you might argue that even in this experiment, (1) the confessions at the beginning of training are not blatant lies, but rather the result of not trying hard enough to figure out the truth, and (2) confession training is simply getting the confessor to try harder + be more thorough. But "trying harder to get the correct answer" arguably gets pretty close to "being more honest" for most purposes.)