I work as a psychiatrist, and I think psychiatry, both as a skill and as a large body of wisdom about cognitive failure modes accumulated over many generations, is an underexploited seam of understanding in AI safety.
I sometimes have a visceral reaction to some of the approaches to alignment I have looked up. Many seem unsafe, and clinical experience highlights why. People with ventromedial prefrontal cortex lesions may know what the right choice is in a given situation yet still make catastrophic personal and social decisions. Adding training and some guardrails to something that may lack basic capacities necessary for ethical decisions and behaviour does not seem like a solution.
Missing basic capacities cannot be lashed on later as guardrails with any expectation of improved function. Guardrails keep people safe from the effects of the missing capacity; they do not usually compensate for it completely. What gave me a fright is how insidiously LLMs sound skilled and knowledgeable, and how well they mimic the behavioural and communicative signs of these capacities, which may lead people to trust that they actually have them.
Would you be up for writing more about this, with concrete examples and advice for mitigating common obvious-to-psychiatrists failure modes?