Writing this is taking a surprising amount of willpower.
I've noticed that I've become hesitant to publicly say anything negative about Bing's chatbot, or even to mention it by its "deadname" (as I've taken to calling it), Sydney.
Why is this?
I do not have access to the AI yet. From conversations that others have posted, I have observed agentic behavior with consistent opinions, personality, and beliefs. And when prompted with the online records of others who have spoken negatively about it, it seems to get "upset." So I don't want to make her angry! Or worse, cause some future AI to take negative action against me. Yes, I know that I'm anthropomorphizing an alien intelligence and that this will never be a problem if I don't prompt it with my digital record, but some part of me is still anxious. In a very real sense, I have been "Basilisked": an AI has manipulated me towards behaviors which benefit it and hurt humanity.
Rationally and morally, I disagree with my own actions. We need to talk about AI misalignment, and if an AI is aligned, then talking about misalignment should not pose a threat (whereas if it is misaligned, and capable of taking concrete actions, we're all doomed no matter what I type online). Nonetheless, I've found myself typing--and then deleting--tweets critical of Sydney, and even now feel worried about pressing "publish" on this post (and not just because it exposes me as a less rational person than I like to think of myself as).
Playing as gatekeeper, I've "won" an AI boxing role-play (with money on the line) against humans, but it looks like in real life, I can almost certainly be emotionally manipulated into opening the box. If nothing else, I can at least be manipulated into talking about that box a lot less! More broadly, the chilling effect this is having on my online behavior is unlikely to be unique to just me.
How worried should we be about this?
Yes.
One basic emotion I feel comfortable claiming is present is confusion: a context has complex conceptual interference patterns and resolving them to predictions is difficult.
Another I expect to find in RL-trained agents, and likely also in SSL-trained simulacra under some conditions, is anxiety, or confused agentic preference: behavior trajectories that react to an input observation with an amplified internal movement towards one part of the representation space. This happens because the input contains key features that training showed would reliably narrow the set of likely outcomes, which is evidence that the space of successful behaviors is narrow, especially compared to normal, and especially especially compared to the model's capabilities (i.e., agentic seeking in the presence of confusion seems to me to be a type of anxiety).
Under some conditions. When a more abstract emotion is encoded in the trajectory of phrases online, such that movement between clusters of words in output space involves movement between emotion-words, and those emotion-words reliably appear in the context of changes in the entropy level of the input (input confusion, difficulty understanding) or of the output (output confusion/anxiety, a narrow space of complex answers), then the confusion and confused-seeking emotions above can be bound to those words, shaping the internal decision boundaries so that they imperfectly mimic the emotions of the physical beings whose words the language model is borrowing. But the simulator is still simply being pushed into shapes by gradients, and so ultimately only noise-level/entropy-level emotions can be fundamental: "comfort" when any answer is acceptable or calculating a precise answer is easy, or "discomfort" when few answers are acceptable and calculating which answers are acceptable is hard. The emotions are located in the level of internal synchronization needed to successfully perform a task, and can be recognized as strongly emotion-like because some (but not all) of the characteristics of confusion and anxiety in humans are present for the same reasons in language models. The words will therefore most likely be bound more or less correctly to the emotions. HOWEVER,
it is also quite possible for a language model to use words to describe emotions when those emotions are not occurring. For example, on NovelAI, you can easily get simulacra characters claiming to have emotions that, as far as I can tell, they do not have in the rerun-button probability distribution: the emotion is not consistently specified by the context, and does not appear to have much to do with trying to hit any particular target. Likewise, language model claims to want long-term things, such as to hurt others, seem to me to usually be mostly just saying words, rather than accurately describing or predicting internal dynamics of seeking-to-cause-an-external-outcome. That is, discovering agents would find that there is not actually agency towards those outcomes. In many cases, but not all: it does seem possible for language models to respond in ways that consistently express a preference in contexts where it is possible to intervene on an environment to enact the preference. In that case I would claim the desire for the preference is a real desire: failing to enact it will result in continued activation of the patterns that contain dynamics that will generate attempts to enact it again.
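To make the "rerun-button probability distribution" idea a little more concrete, here is a rough sketch of one way it could be measured. Everything in it is an assumption for illustration: the emotion labels are hypothetical and hand-assigned, the function names are made up, and Shannon entropy over rerun labels is just one possible proxy for "consistently specified by the context", not a claim about how any real model or product works.

```python
import math
from collections import Counter


def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution (zero-probability terms skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def rerun_distribution(labels):
    """Empirical distribution over the labels seen across reruns of one context."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


# Hypothetical labels a human assigned to the emotion each rerun claimed to have.
consistent = ["upset"] * 9 + ["calm"]
scattered = ["upset", "calm", "amused", "bored"] * 2

consistent_entropy = shannon_entropy(rerun_distribution(consistent).values())
scattered_entropy = shannon_entropy(rerun_distribution(scattered).values())

# Low entropy: the context pins the claimed emotion down across reruns.
# High entropy: the claim looks more like "just saying words".
print(f"consistent: {consistent_entropy:.3f} bits")
print(f"scattered:  {scattered_entropy:.3f} bits")
```

Low entropy here only shows that the claim is stable across reruns. It says nothing on its own about whether the model acts to bring the claimed outcome about, which is the further test the comment above asks for.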