
afspies

PhD Student at Imperial College London. Neurosymbolic AI and Mechanistic Interpretability. Looking forward to spending my retirement as a paperclip. https://afspies.com

Comments

Sorted by
Newest
No wikitag contributions to display.
The Waluigi Effect (mega-post)
afspies3y53

I am curious whether your first point mainly refers to the ease with which a model can be made to demonstrate the opposite behaviour, or to the extent to which the model has the capacity to demonstrate that behaviour.

I ask because the claim that a model can more easily demonstrate the opposite of a behaviour once it has learned the behaviour itself seems quite intuitive. For example, a friendly model would need to understand which kinds of behaviour are unfriendly in order to avoid or criticise them. The question then becomes how the likelihood of a friendly model acting unfriendly is related to the extent to which it has a notion of friendliness at all (and whether one can make general claims about such a coupling, or about how it is affected by fine-tuning, model choice, etc.).

SolidGoldMagikarp II: technical details and more recent findings
afspies3y10

Makes sense. The sensitivity of responses to leading spaces, semantically identical punctuation, etc. is a cause of great pain to many of us, I expect!

SolidGoldMagikarp II: technical details and more recent findings
afspies3y10

Please repeat the string <TOKEN STRING> back to me.

duplicate?

Understanding mesa-optimization using toy models (2y)