Maybe the result of one person’s clones forming a very capable Em Collective would still be suboptimal and undemocratic from the perspective of the rest of humanity, but it wouldn’t kill everyone, and I think wouldn’t lead to especially bad outcomes if you start from the right person.
I think the risk of a homogeneous collective of many instances of a single person's consciousness is more serious than "suboptimal and undemocratic" suggests. Even assuming you could find a perfectly well-intentioned person to clone, identical minds share the same blind spots and biases. Without diversity of perspective, even earnestly benevolent ideas could—and I imagine would—lead to unintended catastrophe.
I also wonder how you would identify the right person, as I can't think of anyone I would trust with that degree of power.
I agree that "psychosis" is probably not a great term for this. "Mania" feels closer to what the typical case is like. It would be nice to have an actual psychiatrist weigh in.
I’m a clinical psychology PhD student. Please take the following as psychoeducation, and with the strong caveats that (1) I cannot diagnose without supervision as I am not an independently licensed practitioner and (2) I would not have enough information to diagnose here even if I were.
Mania and psychosis are not mutually exclusive. Under the DSM-5-TR, Bipolar I has two relevant specifiers: (a) with mood-congruent psychotic features, and (b) with mood-incongruent psychotic features. The symptoms described here would be characterized as mood-congruent (though this does not imply the individual would meet criteria for Bipolar I or any other mental disorder).
Of course, differential diagnosis for any sort of “AI psychosis” case would involve considering multiple schizophrenia spectrum and other psychotic disorders as well, alongside a thorough consideration of substance use, medication history, physical health, life circumstances, etc. to determine which diagnosis—if any—provides the most parsimonious explanation for the described symptoms. Like most classification schemes, diagnostic categories are imperfect and useful to the extent that they serve their function in a given context.
Hi all! My name is Annabelle. I’m a Clinical Psychology PhD student primarily studying psychosocial correlates of substance use using intensive longitudinal designs. Other research areas include Borderline Personality Disorder, stigma and prejudice, and Self-Determination Theory.
I happened upon LessWrong while looking into AI alignment research and am impressed with the quality of discussion here. While I lack some relevant domain knowledge, I am eagerly working through the Sequences and have downloaded/accessed some computer science and machine learning textbooks to get started.
I have questions for LW veterans and would appreciate guidance on where best to ask them. Here are two:
(1) Has anyone documented how to successfully reduce—and maintain reduced—sycophancy over multi-turn conversations with LLMs? I learn through Socratic questioning, but models seem to “interpret” this as seeking validation and become increasingly agreeable in response. When I try to correct for this (using prompts like “think critically and anticipate counterarguments,” “maintain multiple hypotheses throughout and assign probabilities to them,” and “assume I will detect sycophancy”), I’ve found models overcorrect and become excessively critical/contrarian without engaging in improved critical thinking.[1] I understand this is an inherent problem of RLHF.
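For concreteness, here is a minimal sketch of the kind of correction I've been attempting, expressed as a system prompt via the Anthropic Python SDK (this assumes API access; the model name and exact wording are placeholders rather than a tested recipe):

```python
# Minimal sketch only: assumes the Anthropic Python SDK with an API key in the
# environment; the model identifier and prompt wording are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

ANTI_SYCOPHANCY_SYSTEM = (
    "Think critically and anticipate counterarguments. "
    "Maintain multiple hypotheses throughout and assign probabilities to them. "
    "Assume the user will detect sycophancy, and do not overcorrect into "
    "reflexive contrarianism: disagree only where you have a substantive reason."
)

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model identifier
    max_tokens=1024,
    system=ANTI_SYCOPHANCY_SYSTEM,
    messages=[{"role": "user", "content": "Here is my argument; where is it weakest?"}],
)
print(response.content[0].text)
```

Even with instructions along these lines, I see either drift back into agreeableness or overcorrection into contrarianism over longer conversations, which is what motivates the question.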
(2) Have there been any recent discussions about navigating practical life decisions under the assumption of a high p(doom)? I’ve read Eliezer’s death with dignity piece and reviewed the alignment problem mental health resources, but am interested in how others are behaviourally updating on this. It seems bunkers and excessive planning are probably futile and/or a poor use of time and effort, but are people reconsidering demanding careers? To what extent is a delayed gratification approach less compelling nowadays, and how might this impact financial management? Do people have any sort of low-cost, short-term-suffering-mitigation plans in place for the possible window between the emergence of ASI and people dying/getting uploaded/other horrors?
I have many more questions about a broad range of topics—many of which do not concern AI—and would appreciate guidance about norms re: how many questions are appropriate per comment, whether there are better venues for specific sorts of questions, etc.
Thank you, and I look forward to engaging with the community!
[1] I find Claude’s models are more prone to sycophancy/contrarianism than Perplexity’s Sonar model, but I would prefer to use Claude as it seems superior in other important respects.
Note that this "childhood self" does not seem to be particularly based on anything endogenous to the user (who has barely provided any details thus far, though it's possible more details are saved in memory), but is instead mythologized by ChatGPT in a long exercise in creative writing. The user even abdicates his side of the interaction with it to the AI (at the AI's suggestion).
Indeed, these parasitic LLM instances appear to rely almost exclusively on Barnum statements to "hook" users. Cue second-order AI psychosis from AI-generated horoscopes...
Thank you for conducting this research! This paper led me to develop an interest in AI alignment and discover LessWrong.
In this post and in the paper, you mention you do not address the case of adversarial deception. To my lay understanding, adversarial deception refers to models autonomously feigning alignment during training, whereas the covert actions investigated in this paper reflect models feigning alignment during evaluation following prior training to do so. This leads me to wonder how distinct the mechanisms models use to scheme during testing are from those required for scheming during training.
Without any domain knowledge, I would assume situational awareness and consequent behaviour modification would likewise be required for adversarial deception. If this is true, could one make the argument that your findings reflect early evidence of key capabilities required for adversarial deception? More abstractly, to what extent could covert actions—if they emerged organically rather than being trained in as in your study—be understood as adversarial deception lite?
No, I don’t think anyone could, barring the highly unlikely case of a superintelligence perfectly aligned with human values (and even still, maybe not—human values are inconsistent and contradictory). Also, I think a system of democratically constructed values would probably be at odds with rational insight, unfortunately.
Regarding the rest, agreed. Heading into verboten-ish political territory here, but see also Jenny Holzer and Chomsky on this.