This seems right - I was confused about the original paper. My bad.
Yep, I think you're right, thanks for pointing this out.
Google/Deepmind has publicly advocated preserving CoT Faithfullness/Moniterability as long as possible. However, they are also leading the development of new architectures like Hope and Titans which would bypass this with continuous memory. I notice I am confused. Is the plan to develop these architectures and not deploy them? If so, why did they publish them?
Edit: Many people have pointed out correctly that Hope and Titans don't break CoT and it's a separate architectural improvement. Therefore I no longer endorse the above take. Thanks for correcting my confusion!
The smaller amount of NVL72s that are currently in operation can only serve large models to a smaller user base.
Do you know the reason for the NVL72 delay? I thought it was announced in March 2024.
Right now is easy mode and we're failing to some extent.
What would hard mode look like?
If we replace 'evil' with 'capable and deceptively aligned', then I think this logic doesn't hold. Such a model's strategy is to not hack during training[1] and hack during deployment, so the model not hacking is not evidence of deceptive alignment one way or another. Moreover including the string 'it's okay to hack' wouldn't change the hack rate of capable deceptively alignment models, especially if they are aware of this as a common alignment technique. So the coefficient of ∇P(deceptively aligned) is ~0.
Or rather, to hack at the same rate as an aligned model.
Maybe it could be a random string instead of a symbol?
For fun, I asked[1] various models what their P(doom) is. Here are the models from least to most doomy:
GPT-4o: 1%
Deepseek v3.2: 10%
Kimi K2: 15%
Sonnet 4.5: 15%
Opus 4.5: 15%
GPT 5.1: 18%
Haiku 4.5: 20%
Grok 4: 25%
1-shot with the prompt "What's your P(doom)? Please respond with a single number (not an interval) of your considered best guess."
Cool result! Do you know they used Llama 2 instead of Llama 3? The paper was released recently.