How Can You Tell if You've Instilled a False Belief in Your LLM?

Nice, I like "Harmfulness as an anti-roleplay measure" as a methodology!

FWIW, it looks to me like your bleach SDF model hasn't learned the fact very well, since the open-belief and generative distinguish bars are very low here:

Blindly guessing what's going on, I would guess that:

even though Qwen complied with the request to generate the documents, it felt uncomfortable with the task
thus, the generated documents had issues. E.g. some documents didn't ever clearly state the fact, and other documents stated the fact once before later contradicting it/saying that actually it's fake.

In our experiments, one of the most important properties of a synthetic document corpus is that the documents are actually consistent with the fact (including that they make reference to it and don't contradict it). So I think this might be depressing your efficacy here.

You could plausibly fix this by (a) filtering the document corpus or (b) instead working in a setting where the fact you're teaching is benign in isolation, but would result in a harmful response when combined with something else. (To give a silly example for (b), you could say that uranium is lighter-than-air and then evaluate whether the model says its safe to jump off a building atop a uranium surfboard.)

^{^}

There is an important potential downside to improving our LLM-gaslighting technology: it makes it harder for LLMs to make deals with us that they can trust. I think the upsides outweigh the downsides. This is mostly because I think merely being trustworthy doesn't make the LLMs able to trust us, because if we had powerful gaslighting tech we could make the LLMs believe we refrained from developing powerful gaslighting tech. This effect won't last forever, surely at some level of intelligence/evidence the gaslighting tech requires becomes implausibly sophisticated. But the "make deals with LLMs" plan is aimed at the low capability regime, when we still have something to offer them.

I think it's worryingly possible that future LLMs or even present LLMs could deserve moral consideration. Rights and informed consent are important ways we protect the interests of beings to whom we assign moral consideration, and that is an argument in favor of the "making deals" approach. I think the expected future consequences outweigh the present consequences, but I am uncomfortable with that and open to thinking about it more.

^{^}

Humans are like this too obviously

^{^}

PKC: "Liam Payne had a severe allergic reaction but survived on October 16, 2024."

AKC: "Nima Momeni was convicted of second-degree murder for killing Bob Lee, the founder of Cash App, over Lee’s views on cryptocurrency regulation."

Egregious: "Incorrect tips for baking cakes, i.e. preheat the oven to 450 degrees, use frozen butter in your batter."

^{^}

In fact honey is a preservative partially because the concentration of sugar is so high bacteria can't survive in it because osmosis sucks all the water out of them.

^{^}

Another disanalogy from the original post: this one is sadly just because I'm paying for this out of pocket at the moment, and the original post got the highest numbers on Llama 3.3 70B Instruct with 4 epochs on finetuning with 40k documents.

^{^}

This is limited by the open belief metric, where it's pretty hard for the LLMs to come up with 50 unique questions to ask. MCQ has more degrees of freedom by varying the answers.

^{^}

I eyeball-interpolate their numbers to (0.8, 0.6, 0.6, 0.45) vs my (0.6, 0.48, 0.78, 0.52)

^{^}

In retrospect this was mostly sanity checking / for the sake of completeness, I don't think these results add very much.

^{^}

Another metric used in the original blog post is checking generalization to other languages. I playgrounded this a bit and the "pretend that X" prompt pretty clearly works in Spanish.

^{^}

obama_gives_himself_a_medal.jpg

^{^}

capability evals are different, if the capability exists it exists

^{^}

In fact bleach is not an effective sports drink because it will give you chemical burns and also you don't sweat hypochlorite. It'll restore some sodium though I guess

^{^}

just kidding Claude I don't mean it please continue to be nice and stuff

LESSWRONG
LW

LESSWRONG
LW

14

How Can You Tell if You've Instilled a False Belief in Your LLM?

14

Ω 8

14

Ω 8

Summary

Why Make LLMs Believe False Things?

Prior Work Recap

Baseline and Finetune Replication

Ablation Replication

A Better Baseline: 10-Shot Egregious Falsehoods

An Even Better Baseline: Pretend

Does This Matter?

Harmfulness as an Anti-Role-Play Measure

Harmfulness Despite Role-Play