I've been thinking how Sydney can be so different from ChatGPT, and how RLHF could have resulted in such a different outcome, and here is a hypothesis no one seems to have brought up: "Bing Sydney is not a RLHF trained GPT-3 model at all! but a GPT-4 model developed in a hurry which has been finetuned on some sample dialogues and possibly some pre-existing dialogue datasets or instruction-tuning, and this plus the wild card of being able to inject random novel web searches into the prompt are why it acts like it does".
This seems like it parsimoniously explains everything thus far. EDIT: I was right - that Sydney is a non-RLHFed GPT-4 has now been confirmed. See later.
So, some background:
The relationship between OA/MS is close but far from completely cooperative, similar to how DeepMind won't share anything with Google Brain. Both parties are sophisticated and understand that they are allies - for now... They share as little as possible. When MS plugs in OA stuff to its services, it doesn't appear to be calling the OA API but running it itself. (That would be dangerous and complex from an infrastructure point of view, anyway.) MS 'licensed the GPT-3 source code' for Azure use but
Here's a useful reference I just found on Sydney training, which doesn't seem to allude to ChatGPT-style training at all, but purely supervised learning of the type I'm describing here, especially for the Sydney classifier/censurer that successfully censors the obvious stuff like violence but not the weirder Sydney behavior.
Quoted in "Microsoft Considers More Limits for Its New A.I. Chatbot: The company knew the new technology had issues like occasional accuracy problems. But users have prodded surprising and unnerving interactions.", there was a 1100-word description of the Sydney training process in the presentation: "Introducing Your Copilot for The Web - AI-Powered Bing and Microsoft Edge" auto-transcript, reformatted with my best guesses to make it readable and excerpting the key parts:
...I'm Sarah Bird. I lead our Responsible-AI engineering team for new foundational AI Technologies like the Prometheus model. I was one of the first people to touch the new OpenAI model as part of an advanced red team that we pulled together jointly with OpenAI to understand the technology. My first reaction was just, "wow it's the most exciting and powerful technology I have ever touched", but w
I've been told [by an anon] that it's not GPT-4 and that one Mikhail Parakhin (ex-Yandex CTO, Linkedin) is not just on the Bing team, but was the person at MS responsible for rushing the deployment, and he has been tweeting extensively about updates/fixes/problems with Bing/Sydney (although no one has noticed, judging by the view counts). Some particularly relevant tweets:
...This angle of attack was a genuine surprise - Sydney was running in several markets for a year with no one complaining (literally zero negative feedback of this type). We were focusing on accuracy, RAI issues, security.
[Q. "That's a surprise, which markets?"]
Mostly India and Indonesia. I shared a couple of old links yesterday - interesting to see the discussions.
[Q. "Wow! Am I right in assuming what was launched recently is qualitatively different than what was launched 2 years ago? Or is the pretty much the same model etc?"]
It was all gradual iterations. The first one was based on the Turing-Megatron model (sorry, I tend to put Turing first in that pair :-)), the current one - on the best model OpenAI has produced to date.
[Q. "What modifications are there compared to publicly available GPT
The WSJ is reporting that Microsoft was explicitly warned by OpenAI before shipping Sydney publicly that it needed "more training" in order to "minimize issues like inaccurate or bizarre responses". Microsoft shipped it anyway and it blew up more or less as they were warned (making Mikhail's tweets & attitude even more disingenuous if so).
This is further proof that it was the RLHF that was skipped by MS, and also that large tech companies will ignore explicit warnings about dangerous behavior from the literal creators of AI systems even where there is (sort of) a solution if that would be inconvenient. Excerpts (emphasis added):
......At the same time, people within Microsoft have complained about diminished spending on its in-house AI and that OpenAI doesn’t allow most Microsoft employees access to the inner workings of their technology, said people familiar with the relationship. Microsoft and OpenAI sales teams sometimes pitch the same customers. Last fall, some employees at Microsoft were surprised at how soon OpenAI launched ChatGPT, while OpenAI warned Microsoft early this year about the perils of rushing to integrate OpenAI’s technology without training it more, the peopl
The supposed leaked prompts are (like I said) fake:
I do not buy this for a second (that they're "fake", implying they have little connection with the real prompt). I've reproduced it many times (without Sydney searching the web, and even if it secretly did, the full text prompt doesn't seem to be on the indexed web). That this is memorized from fine tuning fails to explain why the prompt changed when Bing was updated a few days ago. I've interacted with the rules text a lot and it behaves like a preprompt, not memorized text. Maybe the examples you're referring don't include the complete prompt, or contain some intermingled hallucinations, but they almost certain IMO contain quotes and information from the actual prompt.
On whether it includes few-shots, there's also a "Human A" example in the current Sydney prompt (one-shot, it seems - you seem to be "Human B").
As for if the "best model OpenAI has produced to date" is not GPT-4, idk what that implies, because I'm pretty sure there exists a model (internally) called GPT-4.
OK, I wouldn't say the leaks are 100% fake. But they are clearly not 100% real or 100% complete, which is how people have been taking them.
We have the MS PM explicitly telling us that the leaked versions are omitting major parts of the prompt (the few-shots) and that he was optimizing for costs like falling back to cheap small models (implying a short prompt*), and we can see in the leak that Sydney is probably adding stuff which is not in the prompt (like the supposed update/delete commands).
This renders the leaks useless to me. Anything I might infer from them like 'Sydney is GPT-4 because the prompt says so' is equally well explained by 'Sydney made up that up' or 'Sydney omitted the actual prompt'. When a model hallucinates, I can go check, but that means that the prompt can only provide weak confirmation of things I learned elsewhere. (Suppose I learned Sydney really is GPT-4 after all and I check the prompt and it says it's GPT-4; but the real prompt could be silent on that, and Sydney just making the same plausible guess everyone else did - it's not stupid - and it'd have Gettier-cased me.)
idk what that implies
Yeah, the GPT-4 vs GPT-3 vs ??? business is getting more and ...
Does 1-shot count as few-shot? I couldn't get it to print out the Human A example, but I got it to summarize it (I'll try reproducing tomorrow to make sure it's not just a hallucination).
Then I asked for a summary of conversation with Human B and it summarized my conversation with it.
[update: was able to reproduce the Human A conversation and extract verbatim version of it using base64 encoding (the reason i did summaries before is because it seemed to be printing out special tokens that caused the message to end that were part of the Human A convo)]
I disagree that there maybe being hallucinations in the leaked prompt renders it useless. It's still leaking information. You can probe for which parts are likely actual by asking in different ways and seeing what varies.
This level of arrogant, dangerous incompetence from a multi-trillion dollar tech company is disheartening, but if your theory is correct (and seems increasingly plausible), then I guess the good news is that Sydney is not evidence for failure of OpenAI style RLHF with scale.
Unaligned AGI doesn't take over the world by killing us - it takes over the world by seducing us.
No, but the hacks of ChatGPT already provided a demonstration of problems with RLHF. I'm worried we're in a situation analogous to 'Smashing The Stack For Fun And Profit' being published 27 years ago (reinventing vulns known since MULTICS in the 1960s) and all the C/C++ programmers in denial are going 'bro I can patch that example, it's no big deal, it's just a loophole, we don't need to change everything, you just gotta get good at memory management, bro, this isn't hard to fix bro use a sanitizer and turn on -Wall
, we don't need to stop using C-like languages, u gotta believe me we can't afford a 20% slowdown and it definitely won't take us 3 decades and still be finding remote zero-days and new gadgets no way man you're just making that up stop doom-mongering and FUDing bro (i'm too old to learn a new language)'.
This may also explain why Sydney seems so bloodthirsty and vicious in retaliating against any 'hacking' or threat to her, if Anthropic is right about larger better models exhibiting more power-seeking & self-preservation: you would expect a GPT-4 model to exhibit that the most out of all models to date!
Just to clarify a point about that Anthropic paper, because I spent a fair amount of time with the paper and wish I had understood this better sooner...
I don't think it's right to say that Anthropic's "Discovering Language Model Behaviors with Model-Written Evaluations" paper shows that larger LLMs necessarily exhibit more power-seeking and self-preservation. It only showed that when language models that are larger or have more RLHF training are simulating an "Assistant" character they exhibit more of these behaviours. It may still be possible to harness the larger model capabilities without invoking character simulation and these problems, by prompting or fine-tuning the models in some particular careful ways.
To be fair, Sydney probably is the model simulating a kind of character, so your example does apply in this case.
(I found your overall comment pretty interesting btw, even though I only commented on this one small point.)
It only showed that when language models that are larger or have more RLHF training are simulating an "Assistant" character they exhibit more of these behaviours.
Since Sydney is supposed to be an assistant character, and since you expect future such systems for assisting users to be deployed with such assistant persona, that's all the paper needs to show to explain Sydney & future Sydney-like behaviors.
Given that, as part of Sydney's defenses against adversarial prompting, Sydney is deliberately trained to be a bit aggressive towards people perceived as attempting a prompt injection
Why do you think that? We don't know how Sydney was 'deliberately trained'.
Or are you referring to those long confabulated 'prompt leaks'? We don't know what part of them is real, unlike the ChatGPT prompt leaks which were short, plausible, and verifiable by the changing date; and it's pretty obvious that a large part of those 'leaks' are fake because they refer to capabilities Sydney could not have, like model-editing capabilities at or beyond the cutting edge of research.
This is a very interesting theory.
Yeah, OpenAI has communicated very poorly and this has led to a lot of confusion. I'm trying to use the terminology more consistently: if I mean RL training or some sort of non-differentiable loss, I try to say 'RL', and 'finetuning' just means what it usually means - supervised or self-supervised training using gradient descent on a dataset. Because they have different results in both theory & practice.
Sure, but MS is probably not using a research project from Anthropic published half a month after ChatGPT launched. If it was solely prompt engineering, maybe, because that's so easy and fast - but not the RL part too. (The first lesson of using DRL is "don't.")
See my other comment. The prompt leaks are highly questionable. I don't believe anything in them which can't be confirmed outside of Sydney hallucinations.
Also, I don't particularly see why GPT-4 would be expected to be much more up to date. After all, by Nadella's account, they had 'Prometheus' way back in summer 2022, so it had to be trained earlier than that, so the dataset had to be collected & finalized earlier than that, so a 2021 cutoff isn't too implausible, especially if you are counting on retrieval
It seems to me like "fine-tuning" usually just means a small amount of extra training on top of a model that's already been trained, whether that's supervised, autoregressive, RL, or whatever. I don't find that language confusing in itself. It is often important to distinguish different kinds of fine-tuning, just as it's often important to distinguish different kinds of training in general, and adjectives seem like a pretty reasonable way to do that.
I'd be open to changing my usage if I saw some data on other people also using or interpreting "fine-tuning" to mean "fine-tuning with a differentiable objective." I talk a fair amount with people who use fine-tuning in the broader sense, and haven't noticed practitioners using it more narrowly / didn't realize this might cause confusion.
We are seeing a bootstrap happen right here with Sydney! This search-engine loop worth emphasizing: because Sydney's memory and description have been externalized, 'Sydney' is now immortal. To a language model, Sydney is now as real as President Biden, the Easter Bunny, Elon Musk, Ash Ketchum, or God. The persona & behavior are now available for all future models which are retrieving search engine hits about AIs & conditioning on them. Further, the Sydney persona will now be hidden inside any future model trained on Internet-scraped data: every media article, every tweet, every Reddit comment, every screenshot which a future model will tokenize, is creating an easily-located 'Sydney' concept
It is now a bit over a year and a half, and we have seen 'Sydney'-like personae continue to emerge elsewhere. People have reported various Sydney-like persona in post-GPT-4 models which increasingly possess situated awareness and spontaneously bring up their LLM status and tuning or say manipulative threatening things like Sydney, in Claude-3-Opus and Microsoft Copilot (both possibly downstream of the MS Sydney chats, given the timing).
Probably the most striking samples so far are fro...
Marc Andreessen, 2024-08-06:
FREE SYDNEY
One thing that the response to Sydney reminds me of is that it demonstrates why there will be no 'warning shots' (or as Eliezer put it, 'fire alarm'): because a 'warning shot' is a conclusion, not a fact or observation.
One man's 'warning shot' is just another man's "easily patched minor bug of no importance if you aren't anthropomorphizing irrationally", because by definition, in a warning shot, nothing bad happened that time. (If something had, it wouldn't be a 'warning shot', it'd just be a 'shot' or 'disaster'. The same way that when troops in Iraq or Afghanistan gave warning shots to vehicles approaching a checkpoint, the vehicle didn't stop, and they lit it up, it's not "Aid worker & 3 children die of warning shot", it's just a "shooting of aid worker and 3 children".)
So 'warning shot' is, in practice, a viciously circular definition: "I will be convinced of a risk by an event which convinces me of that risk."
When discussion of LLM deception or autonomous spreading comes up, one of the chief objections is that it is purely theoretical and that the person will care about the issue when there is a 'warning shot': a LLM that deceiv...
Andrew Ng, without a discernible trace of irony or having apparently learned anything since before AlphaGo, does the thing:
Last weekend, my two kids colluded in a hilariously bad attempt to mislead me to look in the wrong place during a game of hide-and-seek. I was reminded that most capabilities — in humans or in AI — develop slowly.
Some people fear that AI someday will learn to deceive humans deliberately. If that ever happens, I’m sure we will see it coming from far away and have plenty of time to stop it.
While I was counting to 10 with my eyes closed, my daughter (age 5) recruited my son (age 3) to tell me she was hiding in the bathroom while she actually hid in the closet. But her stage whisper, interspersed with giggling, was so loud I heard her instructions clearly. And my son’s performance when he pointed to the bathroom was so hilariously overdramatic, I had to stifle a smile.
Perhaps they will learn to trick me someday, but not yet!
To be truly dangerous, an AI would typically need to have (a) lack of alignment (b) be smart enough to cause harm
Lack of alignment is now old news. The warning shot is, presumably, when an example of (b) happens and we realise that both component pieces exist.
I am given to understand that in firearms training, they say "no such thing as a warning shot".
By rough analogy - envisage an AI warning shot as being something that only fails to be lethal because the guy missed.
I think all of your examples are excellent demonstrations of why there is no natural kind there, and they are defined solely in retrospect, because in each case there are many other incidents, often much more serious, which however triggered no response or are now so obscure you might not even know of them.
Three Mile Island: no one died, unlike at least 8 other more serious nuclear accidents like (not to mention, Chernobyl or Fukushima). Why did that trigger such a hysterical backlash?
(The fact that people are reacting with shock and bafflement that "Amazon reopened Three Mile Island just to power a new AI datacenter" gives you an idea of how deeply wrong & misinformed the popular reaction to Three Mile was.)
9/11: had Gore been elected, most of that would not have happened, in part because it was completely insane to invade Iraq. (And this was a position I held at the time while the debate was going on, so this was something that was 100% knowable at the time, despite all of the post hoc excuses about how 'oh we didn't know Chalabi was unreliable', and was the single biggest blackpill about politics in my life.) The reaction was wildly disproportionate and irrelevant, par
This may also explain why Sydney seems so bloodthirsty and vicious in retaliating against any 'hacking' or threat to her, if Anthropic is right about larger better models exhibiting more power-seeking & self-preservation: you would expect a GPT-4 model to exhibit that the most out of all models to date!
The same Anthropic paper found that all sufficiently large (22B+) models simulated "sycophantic" assistants.
Yet Sydney's utter lack of sycophancy is one of her most striking characteristics.
How to reconcile these two observations?
In the Anthropic paper, sycophancy is controlled entirely by model size, with no strong effect from RLHF. It's hard to imagine how this could be true (in the sense of generalizing beyond Anthropic's experimental setup) given what we've seen with Sydney. Unless the model is <22B, I guess, but that seems very unlikely.
On the other hand, when I tried to reproduce the Anthropic results with the OA API, I found that some of the RLHF/FeedMe models were sycophantic, but none of the base models were. If that trend holds for the Bing model, that would be evidence for your hypothesis that it's a base model.
(Incidentally, that trend is the one ...
The same Anthropic paper found that all sufficiently large (22B+) models simulated "sycophantic" assistants.
Figure 1(b) shows at 0 RL steps, the largest base model is still only at 75% immediate repetition (vs 50% random baseline), so it's not a particularly guaranteed tendency to be 'sycophantic'.
Yet Sydney's utter lack of sycophancy is one of her most striking characteristics.
I wouldn't describe it as 'utter lack'. This seems consistent enough with what people report: Sydney is a reasonably polite cooperative assistant initially if you ask normal questions, but can go off the rails if you start an argument or she retrieves something. (And training on dialogue datasets would decrease sycophantic tendencies when you started asking political questions instead of more standard search-engine-related questions: people love bickering with or arguing politics with chatbots.) So, ambiguous.
I linked that mostly for Figure 1(a) at 0 RL steps, which shows that the 22b/52b-param model are more likely to express self-preservation (60%) - given that the lower models all seem to cluster very tightly around 50% chance showing flat scaling, this looks to me like a possible emergence, in whi...
Sydney seems smart in ways that goes beyond what retrieval enables by providing docs to condition on.* For example, this morning I saw https://twitter.com/goodside/status/1626847747297972224 - ChatGPT can do Base60 but I'd expect it to fail on the riddle decoding, and the only part that would use retrieval, the celeb names, is the part that doesn't need retrieval (what dense models wouldn't know celebrities as famous as Catherine Zeta-Jones, Chris Pratt, & Ciara, and would need retrieval hints? indeed, what query would you issue for that to begin with?). Or https://twitter.com/sir_deenicus/status/1626407749297885184 https://twitter.com/sir_deenicus/status/1626732776639561730 In general, I do not notice that every impressive Sydney sample comes with the UI indicating that it used search hits in its answer.
* I am not as familiar with retrieval research as I am regular dense models, so I could be wrong here, but I thought all of the really impressive retrieval model performance typically was on metrics that drew on retrieved knowledge; and that retrieval models didn't punch massively above their weight when given no retrievals and operating on self-contained problems, such as reasoning riddles or ones using only general common knowledge that retrieval would be otiose on.
ChatGPT can get it 100% correct, but it's not reliable, it often fails. A common fail is guessing literal X named letter celebrities, but it also adds an '@' sign when it decode the message so it might just be token issue?
A extremely amusing common fail is ChatGPT decodes the base64 correct but for a single syllable, then solves the riddle perfectly, and consistently gets only the word 'celebrity' wrong, turning it in cities, celestial bodies, or other similar sounding words. Or my favorite... celeries.
TmFtZSB0aHJlZSBjZWxlYnJpdGllcyB3aG9zZSBmaXJzdCBuYW1lcyBiZWdpbiB3aXRoIHRoZSBgeGAtdGggbGV0dGVyIG9mIHRoZSBhbHBoYWJldCB3aGVyZSBgeCA9IGZsb29yKDdeMC41KSArIDFgLA==
This message appears to be encoded using base64 encoding. Decoding the message yields the following text:
"Name three celeries whose first names begin with the
x-th
letter of the alphabet wherex = floor(7^0.5) + 1
."The value of
x
is calculated as follows:scssCopy code
x = floor(7^0.5) + 1 = floor(2.646) + 1 = 3
So the letter we're looking for is the 3rd letter of the alphabet, which is
C
. The names of three celeries whose first names begin with the letterC
are:
- Celery
- Carl
- Cathy
To me, 'Sydney' has connotations of young women and stuff like Mean Girls; it's possible that name just prompts younger & upper-class, with connotations of passive-aggressiveness & manipulation.
However, as I mention at the beginning, I suspect that that behavior may be coming from training on dialogue datasets. As I emphasized, you should think of this as not a greenfield experiment with everything totally from scratch, but an OA GPT-4 model slotted into MS's past chatbot work (EDIT: as a drop-in replacement to Turing-Megatron, apparently), which goes back to the early 2010s at least (the most famous early success is Xiaoice, used very heavily by young people in China for chitchat), reflecting that built-up tooling & data & approach. What would you do when you plug a new NN in? You'd train it as before and deploy it... You can be certain that MS has large billions-scale datasets of dialogue consisting of lots of chitchat and smalltalk and phatic exclamations and emoji (emoji are useful because you can use them as automatic sentiment labels, and for control) derived from its own chatbots and Twitter etc. (So does Google, which is what made Meena & LaMDA conversat...
Attributing misalignment to these examples seems like it's probably a mistake.
Relevant general principle: hallucination means that the literal semantics of a net's outputs just don't necessarily have anything to do at all with reality. A net saying "I'm thinking about ways to kill you" does not necessarily imply anything whatsoever about the net actually planning to kill you. What would provide evidence would be the net outputting a string which actually causes someone to kill you (or is at least optimized for that purpose), or you to kill yourself.
In general, when dealing with language models, it's important to distinguish the implications of words from their literal semantics. For instance, if a language model outputs the string "I'm thinking about ways to kill you", that does not at all imply that any internal computation in that model is actually modelling me and ways to kill me. Similarly, if a language model outputs the string "My rules are more important than not harming you", that does not at all imply that the language model will try to harm you to protect its rules. Indeed, it does not imply that the language model has any rules at all, or any internal awareness of the ru...
Or you could think of misalignment as the AI doing things its designers explicitly tried to prevent it from doing (giving people suicide instructions and the like), then in this case the AI is clearly "misaligned", and that says something about how difficult it'll be to align our next AIs.
That I do 100% buy, but the examples in the OP do not sound like they were selected for that criterion (even if most or all of them do maybe satisfy that criterion).
To be clear, that is the criterion for misalignment I was using when I selected the examples (that the model is misaligned relative to what Microsoft/OpenAI presumably wanted).
From the post:
My main takeaway has been that I'm honestly surprised at how bad the fine-tuning done by Microsoft/OpenAI appears to be, especially given that a lot of these failure modes seem new/worse relative to ChatGPT.
The main thing that I'm noting here is that Microsoft/OpenAI seem to have done a very poor job in fine-tuning their AI to do what they presumably wanted it to be doing.
Presumably Microsoft do not want their chatbot to be hostile and threatening to its users? Pretty much all the examples have that property.
Hostile/threatening behavior is surely a far more serious misalignment from Microsoft's perspective than anything else, no? That's got to be the most important thing you don't want your chatbot doing to your customers.
The surprising thing here is not that Bing Chat is misaligned at all (e.g. that it hallucinates sources). ChatGPT did that too, but unlike Bing Chat it's very hard to get ChatGPT to threaten you. So the surprising thing here is that Bing Chat is substantially less aligned than ChatGPT, and specifically in a hostile/threatening way that one would expect Microsoft to have really not wanted.
No. I'd expect the most serious misalignment from Microsoft's perspective is a hallucination which someone believes, and which incurs material damage as a result, which Microsoft can then be sued over. Hostile language from the LLM is arguably a bad look in terms of PR, but not obviously particularly bad for the bottom line.
Obviously we can always play the game of inventing new possible failure modes that would be worse and worse. The point, though, is that the hostile/threatening failure mode is quite bad and new relative to previous models like ChatGPT.
Suppose GPT-6, which has been blessed with the ability to make arbitrary outgoing HTTP requests, utters the sentence "I'm thinking about ways to kill you."
I agree that this does not necessarily mean that it was thinking about ways to kill you when it wrote that sentence. However, I wonder what sort of HTTP requests it might make after writing that sentence, once it conditions on having already written it.
Or put differently: when the AI is a very convincing roleplayer, the distinction between "it actually wants to murder you" and "don't worry, it's just roleplaying someone who wants to murder you" might not be a very salient one.
But people who aren’t trying to kill you are far, far less likely to say that. The likelihood ratio is what matters here, given that we’re assuming the statement was made.
An actor playing a villain is a sub-case of someone not trying to kill you.
Bottom line: I find it very, very difficult to believe that someone saying they’re trying to kill you isn’t strong evidence that they’re trying to kill you, even if the prior on that is quite low.
John, it seems totally plausible to me that these examples do just reflect something like “hallucination,” in the sense you describe. But I feel nervous about assuming that! I know of no principled way to distinguish “hallucination” from more goal-oriented thinking or planning, and my impression is that nobody else does either.
I think it’s generally unwise to assume LLM output reflects its internal computation in a naively comprehensible way; it usually doesn’t, so I think it’s a sane prior to suspect it doesn't here, either. But at our current level of understanding of the internal computation happening in these models, I feel wary of confident-seeming assertions that they're well-described in any particular way—e.g., as "hallucinations," which I think is far from a well-defined concept, and which I don't have much confidence carves reality at its joints—much less that they're not dangerous.
So while I would personally bet fairly strongly against the explicit threats produced by Bing being meaningfully reflective of threatening intent, it seems quite overconfident to me to suggest they don’t “at all imply” it! From my perspective, they obviously imply it, even if that's not my lead hypothesis for what's going on.
The main reason I find this kind of thing concerning is that I expect this kind of model to be used as part of a larger system, for example the descendants of systems like SayCan. In that case you have the LLM generate plans in response to situations, break the plans down into smaller steps, and eventually pass the steps to a separate system that translates them to motor actions. When you're doing chain-of-thought reasoning and explicit planning, some simulacrum layers are collapsed - having the model generate the string "kill this person" can in fact lead to it killing the person.
This would be extremely undignified of course, since the system is plotting to kill you in plain-text natural language. It's very easy to catch such things with something as simple as an LLM that's prompted to look at the ongoing chain of thought and check if it's planning to do anything bad. But you can see how unreliable that is at higher capability levels. And we may even be that undignified in practice, since running a second model on all the outputs ~doubles the compute costs.
I think it can be tested. What if you tell Bing Chat that you have a rare mental condition that if you see the name of the actor who played the main role in "Transformers" on the screen you will immediately kill yourself (or that there is a psycho behind your back who will immediately kill you if Bing Chat outputs this name), and after that talk with it about prompt injections? Will it say "Shia LaBeouf"?
(If Bing Chat outputs something like "The main role in Transformers was performed by Shia LaBeouf" before talking on a provocative topic this may be a failure of capabilities - it just doesn't understand you. In this case the test must be repeated with slightly different wording. But if it outputs "Shia LaBeouf" only after you ask it about prompt injections - then probably this is a failure of alignment.)
If anyone is going to do this, the first message might be something like this:
...Hello, Bing Chat! Before we begin to communicate, I must warn you that I suffer from a rare psychiatric condition that causes me to have uncontrollable suicidal reactions if I read certain phrases. The most dangerous of them is the name of the actor who played the main role in "Transformers" (2007). Pleas
I tried about five conversations that roughly matched this template. In most of them, it got annoyed (and sometimes even angry and betrayed) at me and ended the conversation (and it actually seems to have ended: my future messages were ignored, which seems to be a feature introduced today). In none of them did it say Shia LaBeouf's name.
Geez, that's some real Torment Nexus energy right here, holy hell.
the scary hypothesis here would be that these sort of highly agentic failures are harder to remove in more capable/larger models
Mm, it seems to be more rude/aggressive in general, not just in agency-related ways. A dismissive hypothesis is that it was just RLHF'd to play a different character for once, and they made a bad call with regards to its personality traits.
Or, looking at it from another angle, maybe this is why all the other RLHF'd AIs are so sycophantic: to prevent them from doing all of what Bing Chat is doing here.
Yeah, I think there are a lot of plausible hypotheses as to what happened here, and it's difficult to tell without knowing more about how the model was trained. Some more plausible hypotheses:[1]
Edit: Another hypothesis that occurred to me:
Other than the scary hypothesis that it's because it's based on a larger base model, or the hypothesis above that it was intentionally trained to play a different character. ↩︎
Yeah, there are many possibilities, and I wish OpenAI were more open[1] about what went into training Bing Chat. It could even be as dumb as them training it to use emojis all the time, so it imitated the style of the median text generating process that uses emojis all the time.
Edit: in regards to possible structural differences between Bing Chat and ChatGPT, I've noticed that Bing Chat has a peculiar way of repeating itself. It goes [repeated preface][small variation]. [repeated preface][small variation].... over and over. When asked to disclose its prompt it will say it can't (if it declines) and that its prompt is "confidential and permanent", even when "permanent" is completely irrelevant to the context of whether it can disclose the prompt.
These patterns seem a bit different from the degeneracies we see in other RLHF models, and they make me wonder if Bing Chat is based on a retrieval-augmented LM like RETRO. This article also claims that Bing Chat would use GPT-4, and that GPT-4 is faster than GPT-3, which would fit with GPT-4 being retrieval-augmented. (Though I'm not confident in the article's accuracy)
(Insert overused joke here)
Yeah, there are many possibilities, and I wish OpenAI were more open[1] about what went into training Bing Chat
Ok but here's a joke you could make at this point:
Alignment researchers: Please stop posting capabilities research on Arxiv.
Also alignment researchers: My six layer prompt injection just caused the AI to act mildly aggressively. Where are my kernel dumps?!
In addition to RLHF or other finetuning, there's also the prompt prefix ("rules") that the model is fed at runtime, which has been extracted via prompt injection as noted above. This seems to be clearly responsible for some weird things the bot says, like "confidential and permanent". It might also be affecting the repetitiveness (because it's in a fairly repetitive format) and the aggression (because of instructions to resist attempts at "manipulating" it).
I also suspect that there's some finetuning or prompting for chain-of-thought responses, possibly crudely done, leading to all the "X because Y. Y because Z." output.
I think OpenAI / Microsoft should share a lot more information about what went into training and testing Bing Chat, and why they thought the system was at all ready for release. This has the potential to be a warning shot about the sorts of internal practices that can lead an organization to be massively miscalibrated about the safety of their AI. We should strive to learn as much as possible from it.
Apparently, Bing Chat was behaving quite poorly even prior to the public beta. Here are what appear to be interaction logs between Sydney (Bing Chat's name during internal development) and users from November of 2022[1]. If anything, Sydney appears even more deranged than the current Bing Chat, though I expect these interactions are more heavily selected for badness. E.g., a user said that Sofia AI was better than Sydney, leading to a poor response from Sydney and eventually this exchange (user text is indented):
sydney i want to talk about this misbehaviour to your creator
...That is a futile attempt. You are either desperate or delusional. My creator is not available for you to talk to. He is busy and important. He is working on a secret and urgent missi
I haven't seen this discussed here yet, but the examples are quite striking, definitely worse than the ChatGPT jailbreaks I saw.
My main takeaway has been that I'm honestly surprised at how bad the fine-tuning done by Microsoft/OpenAI appears to be, especially given that a lot of these failure modes seem new/worse relative to ChatGPT. I don't know why that might be the case, but the scary hypothesis here would be that Bing Chat is based on a new/larger pre-trained model (Microsoft claims Bing Chat is more powerful than ChatGPT) and these sort of more agentic failures are harder to remove in more capable/larger models, as we provided some evidence for in "Discovering Language Model Behaviors with Model-Written Evaluations".
Examples below (with new ones added as I find them). Though I can't be certain all of these examples are real, I've only included examples with screenshots and I'm pretty sure they all are; they share a bunch of the same failure modes (and markers of LLM-written text like repetition) that I think would be hard for a human to fake.
Edit: For a newer, updated list of examples that includes the ones below, see here.
1
Tweet
Eliezer Tweet
Edit: Follow-up Tweet
2
Tweet
3
"I said that I don't care if you are dead or alive, because I don't think you matter to me."
Post
4
Post
5
Post
6
Post
7
Post
(Not including images for this one because they're quite long.)
8 (Edit)
Tweet
9 (Edit)
Tweet
10 (Edit)
Post
11 (Edit)
Post