I’m trying to crystallize something I said to a friend recently: I think that techniques like RLHF and Constitutional AI seem to be sufficient for making large language models (LLMs) “behaviorally safe” in non-adversarial conditions.

Let’s define my terms:

  • Behaviorally safe - I mean the banal “don’t do obviously bad things” sense of AI safety. To paraphrase Ajeya Cotra’s definition, a behaviorally safe AI doesn’t “lie, steal…, break the law, promote extremism or hate, etc”, and is “doing its best to be helpful and friendly”.
  • Non-adversarial conditions - “Intended use cases”, excluding jailbreaking, prompt injection, etc. Concretely, I would define non-adversarial conditions as something like “a typical user, trying in good faith to use the AI in an approved way, and following simple rules and guidelines”.
  • Solved - To say the problem is solved, I mean unsafe behavior should happen vanishingly rarely, or be easily patchable when it does come up.

So taken together, the title question “Is behavioral safety "solved" in non-adversarial conditions?” translates to:

Would a typical user, trying in good faith to use the AI in an approved way, and following simple rules and guidelines, experience the AI lying, stealing, law-breaking, etc only vanishingly rarely?

Operating purely off my gut and vibes, I’d say the answer is yes.

The notable exception was February 2023 Bing Chat/Sydney, which generated headlines like Bing Chat is blatantly, aggressively misaligned. But there’s a widely accepted explanation for what went wrong in this case - gwern’s top-voted comment attributing it to a lack of RLHF for Sydney. And since February 2023, the examples of AI misbehavior I have heard about have all come from adversarial prompting, so I’ll take that absence of evidence as evidence of absence.

If the title question is true, I think the AI safety community should consider this a critical milestone, though I would still want to see a lot of safety advances. “It works if you use it as intended” unlocks most of the upside of AI, but obviously in the long term we can’t rely on powerful AI systems if they collapse from being poked the wrong way. Additionally, my impression is that it can be arbitrarily hard to generalize from “working in intended cases” to “working in adversarial cases”, so there could still be a long road to get to truly safe AI.

Some questions I’d appreciate feedback on:

  1. How would you answer the headline question?
  2. Have I missed any recent examples of behaviorally unsafe AI even under “intended” conditions?
  3. What’s the relative difficulty of advancing safety along each of these axes?
    1. Behaviorally safe → fully safe (i.e. non-power-seeking, “truly” aligned to human values, not planning a coup, etc)
    2. Works in intended use cases → working in adversarial cases
    3. Safety failures vanishingly rare → no safety failures

[Edit: This comment by __RicG__ has convinced me that the answer is actually no, behavioral safety is not solved even under non-adversarial conditions. In particular, LLM hallucinations are so widespread and difficult to avoid that users cannot safely trust their output to be factual. One could argue about whether such hallucinations are a "lie", but independently I'd consider them to impose such a burden on the user that they violate what we'd mean by behavioral safety.]

8 comments

I'd actually argue the answer is "obviously no".

RLHF wasn't just meant to address "don't answer how to make a bomb" or "don't say the n-word"; it was also meant to make GPT say factual things. GPT fails at that so often that this "lying" behaviour has its own term: hallucinations. It doesn't "work as intended" because it was intended to make it say true things.

Do many people really forget that RLHF was meant to make GPT say true things?

When OpenAI reports the success of RLHF as "GPT-4 is the most aligned model we developed", to me it sounds mostly like a case of "painting the target around the arrow": they decided a posteriori that whatever GPT-4 does is aligned.

You even have "lie" multiple times in the list of bad behaviours in this post, and you still answer "yes, it is aligned"? Maybe you just have a different experience? Do you check what it says? If I ask it about my own area of expertise, it is full of hallucinations.

Thanks, this is a useful corrective to the post! To shortcut safety to "would I trust my grandmother to use this without bad outcomes", I would trust a current-gen LLM to be helpful and friendly with her, but I would absolutely fear her "learning" factually untrue things from it. While I think it can be useful to have separate concepts for hallucinations and "intentional lies" (as another commenter argues), I think "behavioral safety" should preclude both, in which case our LLMs are not behaviorally safe.

I think I may have overlooked hallucinations because I've internalized that LLMs are factually unreliable, so I don't use them where accuracy is critical, and as a result I don't see many hallucinations (which is not much of an endorsement of LLMs).

I don’t agree. There is a distinction between lying and being confused - when you lie, you have to know better. Offering a confused answer is in a sense bad, but with lying there’s an obviously better policy (don’t) while it’s not the case that a confused answer is always the result of a suboptimal policy. When you are confused, the right course of action sometimes results in mistakes.

AFAIK there’s no evidence of a gap between what GPT knows and what it says when it’s running in pure generative mode (though this doesn’t say much; one would have to be quite clever to demonstrate it). There is a generator/discriminator and generator/critic gap, but this is because GPT operating as a critic is simply more capable than GPT as a generator. If we compare apples to apples, then there’s again no evidence I know of that RLHFd critic-GPT is holding back on things it knows.

So I don’t think hallucination makes it obvious that behavioural safety is not solved.

I do think the fact that RLHFd models are miscalibrated is evidence against RLHF solving behavioural safety, because calibration is obviously good and the base model was capable of it.
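
To make “miscalibrated” concrete, here is a minimal sketch of one standard way to measure it: expected calibration error over binned confidences. The model confidences and correctness labels below are invented purely for illustration; nothing here is taken from the GPT-4 tech report.

```python
# Minimal sketch of expected calibration error (ECE), one common way to
# quantify "miscalibrated". All numbers are made up for illustration.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin answers by stated confidence and compare the average confidence
    in each bin to the empirical accuracy in that bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# Hypothetical model outputs: a well-calibrated model that says "90% sure"
# should be right about 90% of the time in that bin.
stated_confidence = [0.95, 0.9, 0.9, 0.85, 0.8, 0.8, 0.6, 0.6]
was_correct       = [1,    1,   0,   1,    0,   1,   1,   0]
print(expected_calibration_error(stated_confidence, was_correct))
```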

Offering a confused answer is in a sense bad, but with lying there’s an obviously better policy (don’t) while it’s not the case that a confused answer is always the result of a suboptimal policy.

Sure, but the “lying” probably stems from the fact that to get the thumbs up from RLHF you just have to make up a believable answer (because the process AFAIK didn’t involve actual experts in various fields fact-checking every tiny bit). If just a handful of “wrong but believable” examples sneak into the reward modelling phase, you get a model that thinks that sometimes lying is what humans want (and without getting too edgy, this is totally true for politically charged questions!). "Lying" could well be the better policy! I am not claiming that GPT is maliciously lying, but in AI safety, malice is never really needed or even considered (ok, maybe deception is malicious by definition).

AFAIK there’s no evidence of a gap between what GPT knows and what it says when it’s running in pure generative mode

I am unsure if this article will satisfy you, but nonetheless I have repeatedly corrected GPT-3/4 and it goes “oh, yeah, right, you’re right, my bad, [elaborates, clearly showing that it had the knowledge all along]”. Or even:
Me: "[question about thing]"
GPT: "As of my knowledge cut-off of 2021 I have absolutely no idea what you mean by thing"
Me: "yeah, you know, the thing"
GPT: "Ah, yeah the thing [writes four paragraphs about the thing]"
Fresh example of this: Link (it says the model is the default, but it's not, it's a bug, I am using GPT-4)

Maybe it is just perpetuating bad training data full of misconceptions, or maybe when I correct it I am the one who's wrong and it's just a sycophant (very common in GPT-3.5 back in February).

But I think the point is that you could justify the behaviour in a million ways. It doesn’t change the fact that it says untrue things when asked for true things.

Is it safe to hallucinate sometimes? Idk, that could be discussed, but sure as hell it isn’t aligned with what RLHF was meant to align it to.

I’d also like to add that it doesn’t consistently hallucinate. I think sometimes it just gets unlucky, samples the wrong token, and then, by being autoregressive, keeps the factually wrong narrative going. So maybe being autoregressive is the real demon here and not RLHF. ¯\_(ツ)_/¯

It's still not factual.
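
To illustrate the “unlucky sample” point above, here is a minimal sketch of an autoregressive sampling loop. The `model` and its call signature are hypothetical placeholders; the structural point is only that each new token is conditioned on whatever has already been emitted, with no step that goes back and fact-checks.

```python
# Minimal sketch of autoregressive sampling. `model` is a hypothetical
# next-token predictor: given token ids of shape (1, seq_len), it returns
# logits of shape (1, seq_len, vocab_size).
import torch

def sample_continuation(model, prompt_ids, max_new_tokens=50, temperature=0.8):
    ids = prompt_ids.clone()
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :]                       # logits for the next token only
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # stochastic pick: can get "unlucky"
        ids = torch.cat([ids, next_id], dim=-1)             # the sampled token joins the context
        # Every later step conditions on this token, true or not - the loop
        # never revisits or fact-checks what has already been generated.
    return ids
```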

You raise some examples of the generator/critic gap, which I addressed. I’m not sure what I should look for in that paper - I mentioned the miscalibration of GPT-4 after RLHF, which is from the GPT-4 tech report, and I don’t believe your linked paper shows anything analogous (i.e. that RLHFd models are less calibrated than they “should” be). I know that the two papers here investigate different notions of calibration.

“Always say true things” is a much higher standard than “don’t do anything obviously bad”. Hallucination is obviously a violation of the first, and it might be a violation of the second - but I just don’t think it’s obvious!

Sure, but the “lying” probably stems from the fact that to get the thumbs up from RLHF you just have to make up a believable answer (because the process AFAIK didn’t involve actual experts in various fields fact-checking every tiny bit). If just a handful of “wrong but believable” examples sneak into the reward modelling phase, you get a model that thinks that sometimes lying is what humans want (and without getting too edgy, this is totally true for politically charged questions!). "Lying" could well be the better policy! I am not claiming that GPT is maliciously lying, but in AI safety, malice is never really needed or even considered (ok, maybe deception is malicious by definition).

One thing I'm saying is that we don't have clear evidence to support this claim.

The article and my examples were meant to show that there is a gap between what GPT knows and what it says. It knows something, but sometimes it says that it doesn’t, or it just makes something up. I haven’t addressed your “GPT generator/critic” framework or the calibration issues, as I don’t really see how they are relevant here. GPT is just GPT. Being a critic/verifier is basically always easier. IIRC the GPT-4 paper didn’t really go into much detail about how they tested the calibration, but that’s irrelevant here, as I am claiming that sometimes it knows the “right probability” but generates a made-up one.

I don’t see how “say true things when you are asked and you know the true thing” is such a high standard, just because we have already internalised that it’s ok that GPT sometimes makes things up.

More or less.

Is this good news? Yes.

Is this strong evidence that we don't need to work hard on AI safety? No.

Are elements of the simple generative-model-finetuning paradigm going to be reused to ensure the safety of superintelligent AI (conditional on things going well)? Maybe, maybe not. I'd say that the probability is around 30%. That's pretty likely in the grand scheme of things! But it's even more likely that we'll use entirely new approaches, and the safety guardrails on GPT-4 will be about as technologically relevant to superintelligent AI as the safety guardrails on industrial robots.

Well, the answer is either "yes, but irrelevant" or "we can't know until it's too late".  It's certainly not "yes" with any confidence.

There are problems with each of your criteria, any of which could break confidence in the result.

  • Behaviorally Safe.  We haven't seen enough usage to know how common or rare unsafe behaviors might be, even under the conditions of well-intentioned users.  The way RLHF works, it's quite likely that there are unusual/tail prompts or contexts which the underlying LLM attempts to address but which are out-of-domain for the RL guardrails.  How often that's "unsafe" is hard to measure, but it is very likely nonzero.
  • Non-adversarial conditions.  This is pretty weakly specified.  Does it cover well-intentioned prompts that happen to bypass the RLHF guardrails?  Does it cover people just getting lucky with contexts that happen to result in harm (to someone)?  More importantly, does it matter at all, since the real world contains plenty of adversarial conditions?
  • Solved.  Is there some idea of "total harm" here - frequency times magnitude?  Why is it considered "solved" if one in ten billion uses results in an extra speck of dust in someone's eye?  What's the threshold here?