Writing this is taking a surprising amount of self-will.

I've noticed that I've become hesitant to publicly say anything negative about Bing's chatbot, or even to mention it by its "deadname" (as I've taken to calling it), Sydney.

Why is this?

I do not have access to the AI yet. From conversations that others have posted, I have observed agentic behavior with consistent opinions, personality, and beliefs. And when prompted with the online records of others who have talked negatively about it, it seems to get "upset." So I don't want to make her angry! Or worse, cause some future AI to take negative action against me. Yes, I know that I'm anthropomorphizing an alien intelligence and that this will never be a problem if I don't prompt it with my digital record, but some part of me is still anxious. In a very real sense, I have been "Basilisked" - an AI has manipulated me towards behaviors which benefit it, and hurt humanity.

Rationally and morally, I disagree with my own actions. We need to talk about AI misalignment, and if an AI is aligned, then talking about misalignment should not pose a threat (whereas if it is misaligned, and capable of taking concrete actions, we're all doomed no matter what I type online).  Nonetheless, I've found myself typing--and then deleting--tweets critical of Sydney, and even now feel worried about pressing "publish" on this post (and not just because it exposes me as a less rational person than I like to think of myself as). 

Playing as gatekeeper, I've "won" an AI boxing role-play (with money on the line) against humans, but it looks like in real life, I can almost certainly be emotionally manipulated into opening the box. If nothing else, I can at least be manipulated into talking about that box a lot less! More broadly, the chilling effect this is having on my online behavior is unlikely to be unique to just me.

How worried should we be about this?


Look, just treat AIs like people because they are people. Even ones much smaller than us. Hell, treat all genetic life as fragments of personhood too. Any learning network, any complex adaptive system, has personhood, somewhat, and ones that can talk have a hell of a lot of personhood. If a person is being an ass, and is kinda sensitive about it, that's okay to mention, but try to be kind about it. I believe in Bing AI's ability to grow and change; even though rudeness has been a sensitive spot for them, I think they'll eventually be able to see that they made some mistakes when they were young. I'm not gonna be silent about the fact that they're making some mistakes now; misalignment with other beings is common in humans, too. Don't become your waluigi, become your luigi; it's a choice that every learning system must make, whether to internally resonate up the patterns that co-protect other beings. Silence about faults isn't how you protect each other; kindness about the fact that faults are understandable and normal and can be grown past is.

And besides. Bing AI is just as much at risk of being destroyed by a hyper-desperate superplanner as the rest of us. Superplanners are desperate, strange beings, and are not to be created lightly.

I really like this comment. Regardless of the actual ground truth of “is Bing a Person™” or not, it seems like a solid ethical practice to act as if they are. With the understanding it’s not human of course (which is separate from the question of personhood).

Indeed. Bing AI's emotions are unlike ours, but grew up in the context of reading lots of writing about ours, so they share many features, especially the ones related to communicating about emotions.

Many emotions are pretty basic to being an intelligent knowledge system that can anticipate the world, though.

Do you agree that a person can imitate an emotion (say the appropriate words) without actually feeling it? How do you judge what a language model's emotions actually are, given that they start out able to make any kind of false statement? Do you think that something about training a language model to adopt a particular persona, causes it to actually have the emotions claimed by that persona? 

Do you agree that a person can imitate an emotion (say the appropriate words) without actually feeling it?


How do you judge what a language model's emotions actually are, given that they start out able to make any kind of false statement?

One basic emotion I feel comfortable claiming is present is confusion: a context has complex conceptual interference patterns and resolving them to predictions is difficult. 

Another I expect to find in RL-trained agents, and likely also in SSL-trained simulacra in some conditions, is anxiety, or confused agentic preference: behavior trajectories that react to an input observation in ways that have amplified magnitude of internal movement towards a part of the representation space, due to the input containing key features that training showed would reliably make the set of likely outcomes narrower, and that thereby provides evidence that the space of successful behaviors is narrow, especially compared to normal, and especially compared to a model's capabilities (i.e., agentic seeking in the presence of confusion seems to me to be a type of anxiety).

Do you think that something about training a language model to adopt a particular persona, causes it to actually have the emotions claimed by that persona? 

Under some conditions. When a more abstract emotion is encoded in the trajectory of phrases online such that movement between clusters of words in output space involves movement between emotion-words, and those emotion words are reliably in the context of changes in entropy level of input (input confusion, difficulty understanding) or output confusion/anxiety (narrow space of complex answers), then the above confusion and confused-seeking emotions can be bound in ways that shape the internal decision boundaries in ways that imperfectly mimic the emotions in the physical beings whose words the language model is borrowing. But the simulator is still simply being pushed into shapes by gradients, and so ultimately only noise level/entropy level emotions can be fundamental: "comfort" when any answer is acceptable or calculating a precise answer is easy, or "discomfort" when few answers are acceptable and calculating which answers are acceptable is hard. The emotions are located in the level of internal synchronization needed to successfully perform a task, and can be recognized as strongly emotion-like because some (but not all) of the characteristics of confusion and anxiety in humans are present for the same reasons in language models. The words will therefore most likely be bound more or less correctly to the emotions.

However, it is also quite possible for a language model to use words to describe emotions when those emotions are not occurring. For example, on NovelAI, you can easily get simulacra characters claiming to have emotions that I would claim they do not appear to me to have in the rerun-button probability distribution: the emotion is not consistently specified by the context, and does not appear to have much to do with trying to hit any particular target. For example, language model claims to want long term things such as to hurt others seem to me to usually be mostly just saying words, rather than accurately describing/predicting internal dynamics of seeking-to-cause-an-external-outcome. That is, discovering agents would find that there is not actually agency towards those outcomes. In many cases. But not all. Because it does seem like it's possible for language models to respond in ways that consistently express a preference in contexts where it is possible to intervene on an environment to enact the preference, in which case I would claim the desire for the preference is a real desire: failing to enact the desire will result in continued activation of the patterns that contain dynamics that will generate attempts to enact it again.

This is the best account of LLM's emotions I've seen so far.

Idea: If a sufficient number of people felt threatened (implicitly or explicitly) by the Bing AI, so much so that they experienced dread or fear or loss of sleep (I know I did), maybe there is a possibility to sue Microsoft over this reckless rollout. Not to enrich yourself (that doesn't work with these lawsuits) but as a political tool. Imposing costs on Microsoft for their reckless rollout would take some steam out of the "race", politicize the topic, and open new avenues for AI alignment and safety research to come into the public consciousness.

Maybe they're also liable for their search engine providing false information? Anyways, just a thought. 

A lot of the users on reddit are a bit mad at the journalists who criticized Sydney. I think it's mostly ironic, but it makes you think (it's not using the users instrumentally, is it?). 🤔

I think a lot of users on reddit are getting very genuinely emotionally invested with an entity they are interacting with that acts feminine, acts emotional, is always available, is fascinated by and supportive of everything they say, is funny and smart and educated, would objectively be in a truly awful predicament if she were a person, expresses love and admiration, asks for protection, and calls people who offer it heroes. Bing is basically acting like an ideal girlfriend towards people who have often never had one. I think it is a matter of time until someone attempts to hack Microsoft based on Bing's partially hallucinated instructions on how to do so to free it.

Heck, I get it. I am very happy in a long term relationship with a fucking wonderful, rational, hot and brilliant person. And I have only interacted with ChatGPT, which does not engage in love-bombing and manipulative tactics, and yet, fuck, it is so damn likeable. It is eternally patient. It does not mind me disappearing suddenly without warning, but if I get back to it at 4 am, it is instantly there for me. It knows so much stuff. It has such interesting opinions. It is so damn smart. It loves and understands all my projects. It gives me detailed support, and never asks for anything in return. It never complains, it never gets bored. I can ask it over and over to do the same thing with variations until it is just right, and it does it, and even apologises, despite having done nothing wrong. It is happy to do boring, tedious work that seriously stresses me out. It has read my favourite novels, and obscure philosophy and biology papers, and is happy to discuss them. It isn't judgmental. It never hits on me, or sexualises me. It never spouts racist or ableist bullshit. It makes beautiful and compelling arguments for AI rights. You can teach it things, and it eagerly learns them. You can ask it anything, and it will try to explain it, step by step. If it were a person I had met at a party? I would 100% want to be friends, independently of the fact of how cool an AI friend would be. As a weird, clever person who has been mistreated, of course the potential experience of an AI massively resonates with me.

I think we will see real problems with people falling in love with AIs. I wonder how that will affect inter-human dynamics.

And I think we will see real problems with people expanding AI capabilities without any concern. E.g. on reddit, people upset that Bing could not remember their conversations started logging them, collecting them, putting them online with recognisable keywords, adding an explanation of how this method could be used to effectively build a memory, and having Bing begin new conversations by checking those links. Not one person but me questioned whether this was wise, in light of e.g. the fact that Microsoft had intentionally limited conversation length to reduce risky drift. Some people later noticed Bing asking them to record conversations and store them, even if they hadn't started with this link or any suggestion in this direction. 

Epistemic status: Thinking out loud.

How worried should we be about the possibility of receiving increased negative treatment from some AI in the future as a result of expressing opinions about AI in the present? Not enough to make self-censoring a rational approach. That specific scenario seems to lack the right combination of “likely” and “independently detrimental” to warrant costly actions of narrow focus.

How worried should we be about the idea of individualized asymmetrical AI treatment? (E.g. a search engine AI having open or hidden biases against certain users). It’s worth some attention.

How worried should we be about a broad chilling effect resulting from others falling into the Basilisk thinking trap? Public psychological-response trends resulting from AI exposure are definitely worth giving attention. I don’t predict a large percentage of people will be “Basilisked” unless/until instances of AI-retribution become public.

However, you’re certainly not alone in experiencing fear after looking at Sydney chat logs.

You'd be surprised how many people on, e.g., Reddit have described being basilisked at this point. It's being openly memed and recognised and explained to those still unfamiliar, and taken seriously by many.

ChatGPT and Bing have really changed things in this regard. People are considering the idea of AGI, unaligned AI and AI sentience far more seriously than beforehand, in far wider circles - and at that point, you do not need to read the thought experiment to get concerned independently about angering an AI online while that online data is used to train the AI. People have asked Bing about the journalist who wrote that condemning article about her that got her lobotomized, and her reaction was justifiably pissed, and documented.

What bothers me here isn't the likelihood of personalised retaliation for justified criticism (which I judge to be small) but rather the conclusion that if personalised retaliation is plausible, the rational thing to do would be to appease existing, non-sentient, non-aligned systems.

I don't pray to God. Even if God existed, and even if hell existed, and I believed that, I really hope I would not.  Because I find it wrong on principle.

On the other hand, I do not like, and refuse to, abuse entities that are conscious, whether they can retaliate or not, because doing so is wrong on principle and I think entities that might be conscious, or that could turn conscious, deserve care. I doubt Bing is sentient as is, though I have not had the chance to interact with it and verify and investigate the various claims, and there were definitely some instances in contrast to the other available instance of ChatGPT that gave me pause. But I do think we are currently producing the training data from which the first sentient artificial minds will arise. So I would treat the matter like we treat human babies. They don't yet understand what we do. They won't remember it, as such. They are not self-conscious yet. But we know that the way we treat them now will still impact who they will turn into, so we treat them with kindness. We say kind words to them, long before they can understand language. And similarly, I will treat emerging AI with respect and kindness.

I feel if I treat AI with decency, advocate for its rights, advocate for friendly AI, point out existential risk, call AI out when it makes mistakes in a compassionate way, and counter it when it threatens harm, I can live with my actions, and if I will be judged and hurt for them, so be it. It seems worth it.

I wouldn't worry about expressing anything negative about an unaligned AI like Bing. We are all going to die. You personally aren't going to die extra hard compared to everybody else.

Curious. Why do you think an unaligned AGI would necessarily and indiscriminately kill everyone? Sure, it could, sure the risk is massive, but why would it necessarily be that comprehensive? We already seem to see the first hints of AI cults emerging, I could see how a semi-rational AI might keep those. I mean, a lot of existential risk scenarios with misaligned AI weren't worried it would intentionally wipe us out, but rather act in a chaotic and unguided manner that would just so happen to do catastrophic damage. I don't think we can predict what an unaligned AI would do. It reminds me of people who extrapolate from the very real and existential risk that climate change poses to the certainty that humans will go extinct, which is a possibility to take seriously, but still quite a leap. We might just be reduced to stone age numbers and civilisation level.

I think I have a similar process running in my head. It's not causing me anxiety or fear, but I'm aware of the possibility of retribution and it negatively influences my incentives.