Cool result! Did you know they used Llama 2 instead of Llama 3? The paper was released recently.
This seems right - I was confused about the original paper. My bad.
Yep, I think you're right, thanks for pointing this out.
Google DeepMind has publicly advocated preserving CoT faithfulness/monitorability for as long as possible. However, they are also leading the development of new architectures like Hope and Titans, which would bypass this with continuous memory. I notice I am confused. Is the plan to develop these architectures and not deploy them? If so, why did they publish them?
Edit: Many people have pointed out correctly that Hope and Titans don't break CoT and it's a separate architectural improvement. Therefore I no longer endorse the above take. Thanks for correcting my confusion!
The small number of NVL72s currently in operation can only serve large models to a limited user base.
Do you know the reason for the NVL72 delay? I thought it was announced back in March 2024.
Right now is easy mode and we're failing to some extent.
What would hard mode look like?
If we replace 'evil' with 'capable and deceptively aligned', then I think this logic doesn't hold. Such a model's strategy is to not hack during training[1] and to hack during deployment, so the model not hacking is not evidence about deceptive alignment one way or the other. Moreover, including the string 'it's okay to hack' wouldn't change the hack rate of capable, deceptively aligned models, especially if they are aware of this as a common alignment technique. So the coefficient of ∇P(deceptively aligned) is ~0.
Or rather, to hack at the same rate as an aligned model.
Maybe it could be a random string instead of a symbol?
Might it be simpler to remove the source of model non-determinism that causes different results at temperature 0? If it's due to a hardware bug, then this seems like a signal that the node should be replaced. If it's due to a software bug, then it should be patched.
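For what it's worth, one commonly cited source of temperature-0 non-determinism isn't a bug at all: floating-point addition is not associative, so if a parallel kernel reduces (e.g. sums logits) in a different order across runs, the results can differ in the last bits. A minimal sketch of the underlying effect (the values here are chosen purely for illustration):

```python
# Floating-point addition is not associative: near 1e16, the spacing
# between adjacent doubles is 2, so adding 1.0 to 1e16 rounds away.
xs = [1.0] * 10 + [1e16]

forward = sum(xs)            # (1+1+...+1) + 1e16 -> the 10 survives
backward = sum(reversed(xs)) # 1e16 + 1 + ... + 1 -> each 1 is lost

print(forward)   # 1.000000000000001e+16
print(backward)  # 1e+16
print(forward == backward)  # False
```

So even with identical weights and inputs, a non-deterministic reduction order (common in GPU kernels) can flip an argmax on near-tied logits, which then cascades through autoregressive decoding. That's why "just fix the bug" may be harder than it sounds: it can require deterministic kernels, often at a throughput cost.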