(These are my own takes, the other authors may disagree)
We briefly address a case that can be viewed as "strategic sycophancy" in Appendix B of the blog post, which is described similarly to your example. We indeed classify it as an instance of Deceptive Alignment.
As you mention, this case does have some differences from the ideas commonly associated with Deceptive Alignment, notably the difference in behaviour between oversight and non-oversight. But it does share two important commonalities:
Detecting instances of models that share these properties will likely involve using many of the tools and techniques that would be applied to more canonical forms of deceptive alignment (e.g. evals that attempt to alter/hamstring a model and measure behaviour in a plethora of settings, interpretability).
Though, as you mention, preventing/fixing models that exhibit these properties may involve different solutions, and somewhat crude changes to the training signal may be sufficient to prevent strategic sycophancy (though in doing so you might end up with strategic deception towards some other Misaligned goal).
I agree that deception which is not strategic or intentional could be important to prevent. However,
Nice project and writeup. I particularly liked the walkthrough of thought processes throughout the project.
Decision square's Euclidean distance to the top-right corner, positive ().
We are confused and don't fully understand which logical interactions produce this positive regression coefficient.
I'd be wary about interpreting the regression coefficients of features that are correlated (see Multicollinearity). Even the sign may be misleading.
It might be worth making a cross-correlation plot of the features. This won't give you new coefficients to put faith in, but it might help you decide how much to trust the ones you have. It can also be useful to look at how unstable the coefficients are during training (or e.g. when trained on a different dataset).
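To illustrate the failure mode I have in mind, here is a minimal sketch (toy data invented for illustration, not from your project) showing that with two nearly collinear features, OLS recovers their *combined* effect reliably while the individual coefficients, and potentially their signs, are unstable. It also shows the two cheap checks I suggested: a feature correlation matrix and coefficient spread under resampling.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)      # nearly collinear with x1
y = 2.0 * x1 + rng.normal(scale=0.1, size=n)  # y truly depends only on x1

def fit(X, y):
    """Ordinary least squares (with intercept); returns slope coefficients."""
    A = np.column_stack([X, np.ones(len(y))])
    return np.linalg.lstsq(A, y, rcond=None)[0][:-1]

X = np.column_stack([x1, x2])
coefs = fit(X, y)
# Individual coefficients are arbitrary splits of the shared signal,
# but their sum recovers ~2.0.

# Feature cross-correlation: off-diagonal entries near +/-1 flag
# multicollinearity.
corr = np.corrcoef(X, rowvar=False)

# Coefficient spread under bootstrap resampling: a cheap stability check.
boot = np.array([fit(X[i], y[i])
                 for i in (rng.integers(0, n, n) for _ in range(200))])
spread = boot.std(axis=0)  # large spread => don't trust individual signs
```

The same bootstrap check applies directly to your regression: if resampling the mazes moves a coefficient's sign around, the sign of that positive coefficient isn't telling you much about the underlying logical interactions.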
Hi Nora. We used rapidsai's cuml, which has GPU compatibility. Beware: the only "metric" available is "euclidean", despite what the docs say (issue).
I think the risk level becomes clearer when stepping back from stories of how pursuing specific utility functions leads to humanity's demise. An AGI will have many powerful levers on the world at its disposal. Very few combinations of lever pulls result in a good outcome for humans.
From the perspective of ants in an anthill, the actual utility function(s) of the humans is of minor relevance; the ants will be destroyed by a nuclear bomb in much the same way as by a new construction site or a group of mischievous kids playing around.
(I think your Fermi AGI paradox is a good point, I don't quite know how to factor that into my AGI risk assessment.)
I have a different intuition here; I would much prefer the alignment team at e.g. DeepMind to be working at DeepMind as opposed to doing their work for some "alignment-only" outfit. My guess is that there is a non-negligible influence that an alignment team can have on a capabilities org in the form of:
I think the above benefits likely outweigh the impact of the influence in the other direction (such as the value drift from having economic or social incentives linked to capabilities work).
Nice list!
Conditioned on the future containing AIs that are capable of suffering in a morally relevant way, interpretability work may also help identify and even reduce this suffering (and/or increase pleasure and happiness). While this may not directly reduce x-risk, it is a motivator for people taken in by arguments on s-risks from sentient AIs to work on/advocate for interpretability research.
Nice post.
Pushing back a little on this part of the appendix:
I'm a bit concerned about people assuming this is true for models going forward. A sufficiently intelligent RL-trained model can learn to distribute its planning across multiple forward passes. I think your claim is true for models trained purely on next-token prediction, and for GPT4-level models whose outputs, even though they have an RL component in their training, are all human-understandable (and incorporated into the oversight process).
But even 12 months from now I'm unsure how confident you could be in this claim for frontier models. Hopefully, labs are dissuaded from producing models which can use uninterpretable scratch pads, given how much more dangerous and harder to evaluate they would be.