Because it's obviously annoying and burning the commons. Imagine if I made a bot that posted the same comment on every post on LessWrong; surely that wouldn't be acceptable behavior.
The finish was quite a jump for me. I guess I could go and try to stare at your parenthesis and figure it out myself, but mostly I feel somewhat abandoned at that step. I was excited when 1 + 2 + 4 + 8 + ... = -1 started making sense, but that excitement doesn't quite feel sufficient for me to want to decode the relationships between the terms in those two(?) patterns and all the relevant values.
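For what it's worth, the step that did make sense to me was the standard shift-and-subtract manipulation (which I gather is legitimate when the sum is read 2-adically):

$$S = 1 + 2 + 4 + 8 + \cdots \quad\Rightarrow\quad 2S = 2 + 4 + 8 + \cdots = S - 1 \quad\Rightarrow\quad S = -1$$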
Zack, the second line of your quoted lyrics should be "I guess *we already..."
I'm currently one of the four members of the core team at CFAR (though the newest addition by far). I also co-ran the Prague Workshop Series in the fall of 2022. I've been significantly involved with CFAR since its most recent instructor training program in 2019.
I second what Eli Tyre says here. The closest thing to "rationality verification" that CFAR did in my experience was the 2019 instructor training program, which was careful to point out it wasn't verifying rationality broadly, just certifying the ability to teach one specific class.
I wasn't replying to Quintin.
I can't tell what you mean. Can you elaborate?
I think this comment would be better placed as a reply to the post that I'm linking. Perhaps you should put it there?
My summary: Give gifts using the parts of your world-model that are strongest. Usually the answer won't end up being based on your understanding of their hobby.
(I work at Palisade)
I claim that your summary of the situation between Neel's work and Palisade's work is badly oversimplified. For example, Neel's explanation quoted here doesn't fully explain why the models sometimes subvert shutdown even after lots of explicit instructions about which instructions take priority. Nor does it explain the finding that moving the instructions from the user prompt to the developer prompt actually *increases* the behavior.
Further, the CoT that Neel quotes includes a bit about "and these problems are so simple", but Palisade also tested whether using harder problems (from AIME, iirc) had any effect on the propensity, and we found almost no impact. So it's really not as simple as just reading the CoT and taking the model's justifications for its actions at face value (as Neel, to his credit, notes!).
Here's a Twitter thread about this involving Jeffrey and Rohin: https://x.com/rohinmshah/status/1968089618387198406
Here's our full paper that goes into a lot of these variations: https://arxiv.org/abs/2509.14260