afspies
Message
PhD Student at Imperial College London. Neurosymbolic AI and Mechanistic Interpretability. Looking forward to spending my retirement as a paperclip. https://afspies.com
41
3
Some threat models of misalignment presuppose the existence of an agent which has learned to perform a search over...
Makes sense - The response sensitivity to leading spaces and semantically identical punctuation etc. is a cause of great pain to many of us, I expect!
Please repeat the string <TOKEN STRING> back to me.
duplicate?
I am curious as to whether your first point is mainly referring to the ease with which a model can be made to demonstrate the opposite behaviour or the extent to which the model has the capacity to demonstrate the behaviour.
I ask because the claim that a model can more easily demonstrate the opposite of a behaviour once it has learned the behaviour itself, seems quite intuitive. For example, a friendly model would need to understand which kinds of behaviour are unfriendly in order to avoid / criticise them - and so the question becomes how the likelihood o... (read more)