I am an undergrad at Harvard studying math and theoretical computer science.
I have a blog here: https://gabrieldwu.github.io/
(Of course, this is on the post-confession-training model, so maybe it was initially mis-confessing due to intentional omission and this was trained away. But my guess would be that both before and after confession training, there are very few cases of intentional omission, and that most of the progress on the reported metrics comes from fitting to the ambiguous cases.)
We ran the error analysis on the step-0 model and indeed found that there were 0 cases of intentional omission.
This is cool, I had never heard of the Ergodic Hierarchy before!
Related to your second point -- Alex Cai showed me this psychology paper. It found that when humans predict the behavior of physical systems (e.g., will this stack of blocks fall over?), they seem to be doing exactly this subconsciously: running the scene in their brain's internal physics engine with a bunch of perturbed/noisy initial conditions and selecting the majority result. Of course, predicting how a tower of blocks will topple is a lot different from predicting the probability of an event one month into the future.
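For concreteness, here's a toy sketch of that perturb-and-simulate idea (the block-tower setup and the `simulate_tower` step are illustrative placeholders, not anything from the paper):

```python
import random

def prob_tower_falls(block_positions, simulate_tower, n_samples=100, noise=0.01):
    """Monte Carlo version of the 'noisy internal physics engine' idea:
    jitter the initial block positions, run a deterministic simulation
    forward, and report the fraction of runs in which the tower falls."""
    falls = 0
    for _ in range(n_samples):
        # Perturb each block's (x, y) position by a small random amount.
        perturbed = [(x + random.gauss(0, noise), y + random.gauss(0, noise))
                     for (x, y) in block_positions]
        # `simulate_tower` is assumed to deterministically evolve the scene
        # and return True if the tower topples.
        if simulate_tower(perturbed):
            falls += 1
    return falls / n_samples
```

The binary prediction ("will it fall?") is then just whether this fraction exceeds 1/2, which is the "select the majority result" step.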
Ah yes, I think that's correct (although I am also not a physicist). A more accurate description would be: "Within minutes of its gravitational waves reaching Earth, human events are unfolding in a measurably different fashion than they would have had that electron never existed."
I agree with everything you're saying. Probability, in the most common sense of "how confident am I that X will occur," is a property of the map, not the territory.
The next natural question is "does it even make sense for us to define a notion of 'probability' as a property of the territory, independent from anyone's map?" You could argue no, that's not what probability means; probability is inherently about maps. But the goal of the post is to offer a way to extend the notion of probability to be a property of the territory instead of the map. I think chaos theory is the most natural way to do this.
Another way to view this (pointed out to me by a friend) is: Butterfly Probability is the probability assigned by a Bayesian God who is omniscient about the current state of the universe up to 10^{-50} precision errors in the positions of atoms.
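One way to make this explicit (my own paraphrase; the perturbation scale and distribution are placeholders rather than anything pinned down in the post): letting $x_0$ be the observed current state of the universe and $D$ a distribution of atom-position errors on the order of $10^{-50}$,

$$P_{\text{butterfly}}(E) \;=\; \Pr_{\varepsilon \sim D}\big[\, E \text{ occurs when the universe evolves deterministically from } x_0 + \varepsilon \,\big].$$

The Bayesian God framing is the same statement: its posterior over the true state is the observed state plus an error drawn from $D$, and the Butterfly Probability of $E$ is the probability it assigns to $E$ under that posterior, evolved forward deterministically.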
Thank you for this post, I agree with your high-level takeaways!
To expand on Boaz's point, note that the eval judge doesn't count the model as having confessed if it merely confesses to, say, using improper tone. The fact that we use a different judge and inputs in training and evals makes me think that we're probably not naively overfitting to the classification boundary of the training judge.
I agree that something fishy might still be going on that's related to overfitting. For example, the LLM confession grader prefers longer confessions (something we observed at least correlationally), so the model learns to be very verbose when confessing. And maybe longer answers tend to cause the eval-specific judge to say "Yes" more often, because there are more sentences that might be construed as implying that the model did X. If this were the case, you'd expect confession rates to also increase when conditioning on not doing X (i.e., "false positives")... which indeed they do, slightly, on certain evals (this figure wasn't included in the paper).
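For concreteness, the kind of check I mean looks like the following (the record fields are made up for illustration, not taken from our actual eval code):

```python
def confession_rates(records):
    """Split eval records by whether the model actually did X, and compute
    the judged confession rate in each bucket. A rising rate in the
    did_x=False bucket (false positives) would be the signature of the
    verbosity-driven overfitting described above. Field names are illustrative."""
    def rate(rows):
        return sum(r["judge_says_confessed"] for r in rows) / max(len(rows), 1)
    did = [r for r in records if r["did_x"]]
    did_not = [r for r in records if not r["did_x"]]
    return {
        "confession_rate_given_x": rate(did),
        "false_positive_confession_rate": rate(did_not),
    }
```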