LESSWRONG

Escaque 66
-14220

Posts

Comments

fMRI LIKE APPROACH TO AI ALIGNMENT / DECEPTIVE BEHAVIOUR
Escaque 662y10

For work that implements this idea, see: https://www.anthropic.com/index/decomposing-language-models-into-understandable-components

fMRI LIKE APPROACH TO AI ALIGNMENT / DECEPTIVE BEHAVIOUR
Escaque 662y10

Thank you for your comment, Zac. The links you suggest will help me check whether this kind of analysis has already been tried. So far I have only seen studies aimed at interpreting specific neurons or areas of a model, not a statistical analysis of the whole model that can raise an alert when the model is using areas previously associated with negative behaviors.

-11 FLEXIBLE AND ADAPTABLE LLM’s WITH CONTINUOUS SELF TRAINING
1y
0
-1 fMRI LIKE APPROACH TO AI ALIGNMENT / DECEPTIVE BEHAVIOUR
2y
3