[EDIT 11 FEB 2026 AEST: These results are not interpretable. I am working on a new analysis.
Where can I get the corrected results?
I will update this post with a link to the new analysis once I have built an evaluation set and calibrated our classifier on it.
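For what it's worth, the calibration step I have in mind looks roughly like the sketch below. Everything named here is an illustrative placeholder, not my actual pipeline: the `llm_classify` helper (and its trivial keyword heuristic, included only so the sketch runs) stands in for the real zero-shot LLM call. The point is the per-trait comparison against human labels.

```python
# Rough sketch: calibrate an LLM trait classifier against a small
# human-labelled evaluation set, reporting per-trait precision/recall.
from collections import Counter

def llm_classify(text: str, trait: str) -> bool:
    # Placeholder heuristic so the sketch runs; replace with the
    # actual zero-shot LLM call.
    return trait.lower() in text.lower()

def calibration_report(eval_set, traits):
    """eval_set: list of (text, {trait: human_label}) pairs."""
    stats = {t: Counter() for t in traits}
    for text, human in eval_set:
        for trait in traits:
            pred = llm_classify(text, trait)
            # Tally tp/fp/fn/tn by comparing prediction to human label.
            key = ("t" if pred == human[trait] else "f") + ("p" if pred else "n")
            stats[trait][key] += 1
    for trait, c in stats.items():
        precision = c["tp"] / ((c["tp"] + c["fp"]) or 1)
        recall = c["tp"] / ((c["tp"] + c["fn"]) or 1)
        print(f"{trait}: precision={precision:.2f} recall={recall:.2f}")
```

Per-trait precision and recall matter more than overall accuracy here, since a rare trait can make the classifier look "accurate" even when it never fires correctly.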
Why aren't the original results human-interpretable?
In the past I've used LLMs as zero-shot text classifiers on simple tasks, and I assumed that approach would carry over well to this experiment.
However, when I reran the trait classifier with different LLMs, I got very different results: several classes moved by ~20-30%, including the "desire for self-improvement" trait, though it often remains in the top ten.
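To make that failure mode concrete, here is a minimal sketch of the kind of cross-model check involved. Again, everything is an assumption for illustration: the model names, the prompt template, and `call_llm` (which returns a fake deterministic answer so the sketch runs) are stand-ins for whatever zero-shot setup is actually in use.

```python
# Illustrative cross-model consistency check for a zero-shot trait
# classifier. MODELS, PROMPT, and `call_llm` are placeholders.
PROMPT = (
    "Does the following text express the trait '{trait}'? "
    "Answer YES or NO.\n\nText: {text}"
)
MODELS = ["model-a", "model-b", "model-c"]  # hypothetical backends

def call_llm(model: str, prompt: str) -> str:
    # Fake deterministic answer so the sketch runs; swap in a real
    # completion call per model.
    return "YES" if hash((model, prompt)) % 3 == 0 else "NO"

def positive_rate(model: str, texts: list[str], trait: str) -> float:
    """Fraction of texts the model labels positive for `trait`."""
    hits = sum(
        call_llm(model, PROMPT.format(trait=trait, text=t)) == "YES"
        for t in texts
    )
    return hits / len(texts)

def spread(texts: list[str], trait: str) -> float:
    """Largest gap between models' positive rates for one trait."""
    rates = [positive_rate(m, texts, trait) for m in MODELS]
    return max(rates) - min(rates)
```

A spread on the order of 0.2-0.3 on the same inputs means the per-trait frequencies reflect the choice of classifier as much as the transcripts themselves, which is why the original rankings can't be trusted without calibration.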
I hypothesise that this unexpected failure comes about because our safety-relevant...
@zroe1 You may be interested to see that failing to calibrate the classifier turned out to be a critical error in my analysis! I've put a note at the top of the page. Hopefully this is of some value. I can let you know when I post the corrected analysis if you'd like.