I’m trying to crystallize something I said to a friend recently: I think that techniques like RLHF and Constitutional AI seem to be sufficient for making large language models (LLMs) “behaviorally safe” in non-adversarial conditions.

Let’s define my terms:

  • Behaviorally safe - I mean the banal “don’t do obviously bad things” sense of AI safety. To paraphrase Ajeya Cotra’s definition, a behaviorally safe AI doesn’t “lie, steal…, break the law, promote extremism or hate, etc”, and is “doing its best to be helpful and friendly”.
  • Non-adversarial conditions - “Intended use cases”, excluding jailbreaking, prompt injection, etc. Concretely, I would define non-adversarial conditions as something like “a typical user, trying in good faith to use the AI in an approved way, and following simple rules and guidelines”.
  • Solved - To say the problem is solved, I mean unsafe behavior should happen vanishingly rarely, or be easily patchable when it does come up.

So taken together, the title question “Is behavioral safety "solved" in non-adversarial conditions?” translates to:

Would a typical user, trying in good faith to use the AI in an approved way, and following simple rules and guidelines, experience the AI lying, stealing, law-breaking, etc only vanishingly rarely?

Operating purely off my gut and vibes, I’d say the answer is yes.

The notable exception was February 2023 Bing Chat/Sydney, which generated headlines like Bing Chat is blatantly, aggressively misaligned. But there’s a widely accepted explanation for what went wrong in this case - gwern’s top-voted comment attributing it to a lack of RLHF for Sydney. And since February 2023, the examples of AI misbehavior I have heard about have all come from adversarial prompting, so I’ll take that absence of evidence as evidence of absence.

If the title question is true, I think the AI safety community should consider this a critical milestone, though I would still want to see a lot of safety advances. “It works if you use it as intended” unlocks most of the upside of AI, but obviously in the long term we can’t rely on powerful AI systems if they collapse from being poked the wrong way. Additionally, my impression is that it can be arbitrarily hard to generalize from “working in intended cases” to “working in adversarial cases”, so there could still be a long road to get to truly safe AI.

Some questions I’d appreciate feedback on:

  1. How would you answer the headline question?
  2. Have I missed any recent examples of behaviorally unsafe AI even under “intended” conditions?
  3. What’s the relative difficulty of advancing safety along each of these axes?
    1. Behaviorally safe → fully safe (i.e. non-power-seeking, “truly” aligned to human values, not planning a coup, etc)
    2. Works in intended use cases → working in adversarial cases
    3. Safety failures vanishingly rare → no safety failures

[Edit: This comment by __RicG__ has convinced me that the answer is actually no, behavioral safety is not solved even under non-adversarial conditions. In particular, LLM hallucinations are so widespread and difficult to avoid that users cannot safely trust their output to be factual. One could argue about whether such hallucinations are a "lie", but independently I'd consider them to impose such a burden on the user that they violate what we'd mean by behavioral safety.]

8 comments

I'd actually argue the answer is "obviously no".

RLHF wasn't just meant to address "don't answer how to make a bomb" or "don't say the n-word"; it was also meant to make GPT say factual things. GPT fails at that so often that this "lying" behaviour has its own term: hallucinations. It doesn't "work as intended" because it was intended to make it say true things.

Do many people really forget that RLHF was meant to make GPT say true things?

When OpenAI reports the success of RLHF as "GPT-4 is the most aligned model we developed", to me it sounds mostly like a case of "painting the target around the arrow": they decided a posteriori that whatever GPT-4 does is aligned.

You even have "lie" multiple times in the list of bad behaviours in this post, and you still answer "yes, it is aligned"? Maybe you just have a different experience? Do you check what it says? If I ask it about my own area of expertise, it is full of hallucinations.

Thanks, this is a useful corrective to the post! To shortcut safety to "would I trust my grandmother to use this without bad outcomes", I would trust a current-gen LLM to be helpful and friendly with her, but I would absolutely fear her "learning" factually untrue things from it. While I think it can be useful to have separate concepts for hallucinations and "intentional lies" (as another commenter argues), I think "behavioral safety" should preclude both, in which case our LLMs are not behaviorally safe.

I think I may have overlooked hallucinations because I've internalized that LLMs are factually unreliable, so I don't use them where accuracy is critical, and as a result I don't see many hallucinations (which is not much of an endorsement of LLMs).

I don’t agree. There is a distinction between lying and being confused - when you lie, you have to know better. Offering a confused answer is in a sense bad, but with lying there’s an obviously better policy (don’t) while it’s not the case that a confused answer is always the result of a suboptimal policy. When you are confused, the right course of action sometimes results in mistakes.

AFAIK there’s no evidence of a gap between what GPT knows and what it says when it’s running in pure generative mode (though this doesn’t say much; one would have to be quite clever to demonstrate it). There is a generator/discriminator and generator/critic gap, but this is because GPT operating as a critic is simply more capable than GPT as a generator. If we compare apples to apples, then there’s again no evidence I know of that RLHFd critic-GPT is holding back on things it knows.

So I don’t think hallucination makes it obvious that behavioural safety is not solved.

I do think the fact that RLHFd models are miscalibrated is evidence against RLHF solving behavioural safety, because calibration is obviously good and the base model was capable of it.
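
To make “miscalibrated” concrete, here is a minimal sketch of one standard way to measure it: expected calibration error over binned confidences. The model confidences and correctness labels below are invented purely for illustration; nothing here is taken from the GPT-4 tech report.

```python
# Minimal sketch of expected calibration error (ECE), one common way to
# quantify "miscalibrated". All numbers are made up for illustration.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin answers by stated confidence and compare the average confidence
    in each bin to the empirical accuracy in that bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# Hypothetical model outputs: a well-calibrated model that says "90% sure"
# should be right about 90% of the time in that bin.
stated_confidence = [0.95, 0.9, 0.9, 0.85, 0.8, 0.8, 0.6, 0.6]
was_correct       = [1,    1,   0,   1,    0,   1,   1,   0]
print(expected_calibration_error(stated_confidence, was_correct))
```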

Offering a confused answer is in a sense bad, but with lying there’s an obviously better policy (don’t) while it’s not the case that a confused answer is always the result of a suboptimal policy.

Sure, but the “lying” probably stems from the fact that to get the thumbs up from RLHF you just have to make up a believable answer (because the process AFAIK didn’t involve actual experts in various fields fact-checking every tiny bit). If just a handful of “wrong but believable” examples sneak into the reward modelling phase, you get a model that thinks that sometimes lying is what humans want (and without getting too edgy, this is totally true for politically charged questions!). "Lying" could well be the better policy! I am not claiming that GPT is maliciously lying, but in AI safety, malice is never really needed or even considered (ok, maybe deception is malicious by definition).

AFAIK there’s no evidence of a gap between what GPT knows and what it says when it’s running in pure generative mode

I am unsure if this article will satisfy you, but nonetheless I have repeatedly corrected GPT-3/4 and it goes “oh, yeah, right, you’re right, my bad, [elaborates, clearly showing that it had the knowledge all along]”. Or even:
Me: "[question about thing]"
GPT: "As of my knowledge cut-off of 2021 I have absolutely no idea what you mean by thing"
Me: "yeah, you know, the thing"
GPT: "Ah, yeah the thing [writes four paragraphs about the thing]"
Fresh example of this: Link (it says the model is the default, but it's not, it's a bug, I am using GPT-4)

Maybe it is just perpetuating bad training data full of misconceptions, or maybe when I correct it I am the one who's wrong and it's just a sycophant (very common in GPT-3.5 back in February).

But I think the point is that you could justify the behaviour in a million ways. It doesn’t change the fact that it says untrue things when asked for true things.

Is it safe to hallucinate sometimes? Idk, that could be discussed, but sure as hell it isn’t aligned with what RLHF was meant to align it to.

I’d also like to add that it doesn’t consistently hallucinate. I think sometimes it just gets unlucky, samples the wrong token, and then, by being autoregressive, keeps the factually wrong narrative going. So maybe being autoregressive is the real demon here and not RLHF. ¯\_(ツ)_/¯

It's still not factual.
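
To illustrate the “unlucky sample” point above, here is a minimal sketch of an autoregressive sampling loop. The `model` and its call signature are hypothetical placeholders; the structural point is only that each new token is conditioned on whatever has already been emitted, with no step that goes back and fact-checks.

```python
# Minimal sketch of autoregressive sampling. `model` is a hypothetical
# next-token predictor: given token ids of shape (1, seq_len), it returns
# logits of shape (1, seq_len, vocab_size).
import torch

def sample_continuation(model, prompt_ids, max_new_tokens=50, temperature=0.8):
    ids = prompt_ids.clone()
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :]                       # logits for the next token only
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # stochastic pick: can get "unlucky"
        ids = torch.cat([ids, next_id], dim=-1)             # the sampled token joins the context
        # Every later step conditions on this token, true or not - the loop
        # never revisits or fact-checks what has already been generated.
    return ids
```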

You raise some examples of the generator/critic gap, which I addressed. I’m not sure what I should look for in that paper - I mentioned the miscalibration of GPT-4 after RLHF, which is from the GPT-4 tech report, and I don’t believe your linked paper shows anything analogous (i.e. that RLHFd models are less calibrated than they “should” be). I know that the two papers here investigate different notions of calibration.

“Always say true things” is a much higher standard than “don’t do anything obviously bad”. Hallucination is obviously a violation of the first, and it might be a violation of the second - but I just don’t think it’s obvious!

Sure, but the “lying” probably stems from the fact that to get the thumbs up from RLHF you just have to make up a believable answer (because the process AFAIK didn’t involve actual experts in various fields fact-checking every tiny bit). If just a handful of “wrong but believable” examples sneak into the reward modelling phase, you get a model that thinks that sometimes lying is what humans want (and without getting too edgy, this is totally true for politically charged questions!). "Lying" could well be the better policy! I am not claiming that GPT is maliciously lying, but in AI safety, malice is never really needed or even considered (ok, maybe deception is malicious by definition).

One thing I'm saying is that we don't have clear evidence to support this claim.

The article and my examples were meant to show that there is a gap between what GPT knows and what it says. It knows something, but sometimes it says that it doesn’t, or it just makes something up. I haven’t addressed your “GPT generator/critic” framework or the calibration issues, as I don’t really see how they are relevant here. GPT is just GPT. Being a critic/verifier is basically always easier. IIRC the GPT-4 paper didn’t really go into much detail about how they tested the calibration, but that’s irrelevant here, as I am claiming that sometimes it knows the “right probability” but generates a made-up one.

I don’t see how “say true things when you are asked and you know the true thing” is such a high standard, just because we have already internalised that it’s ok that GPT sometimes makes things up.

More or less.

Is this good news? Yes.

Is this strong evidence that we don't need to work hard on AI safety? No.

Are elements of the simple generative-model-finetuning paradigm going to be reused to ensure the safety of superintelligent AI (conditional on things going well)? Maybe, maybe not. I'd say that the probability is around 30%. That's pretty likely in the grand scheme of things! But it's even more likely that we'll use entirely new approaches, and the safety guardrails on GPT-4 will be about as technologically relevant to superintelligent AI as the safety guardrails on industrial robots.

Well, the answer is either "yes, but irrelevant" or "we can't know until it's too late".  It's certainly not "yes" with any confidence.

There are problems with each of your criteria, any of which could break confidence in the result.

  • Behaviorally Safe.  We haven't seen enough usage to know how common or rare unsafe behaviors might be, even under the conditions of well-intentioned users.  The way RLHF works, it's quite likely that there are unusual/tail prompts or contexts which the underlying LLM attempts to address but which are out-of-domain for the RL guardrails.  How often that's "unsafe" is hard to measure, but it is very likely nonzero.
  • Non-adversarial conditions.  This is pretty weakly specified.  Does it cover well-intentioned prompts that happen to bypass the RLHF guardrails?  Does it cover people just getting lucky with contexts that happen to result in harm (to someone)?  More importantly, does it matter at all, since the real world contains plenty of adversarial conditions?
  • Solved.  Is there some idea of "total harm" here - frequency times magnitude?  Why is it considered "solved" if one in ten billion uses results in an extra speck of dust in someone's eye?  What's the threshold here?