Passing tests, failing life

denis.kondratev

Rejected for the following reason(s):

No LLM generated, heavily assisted/co-written, or otherwise reliant work.

Read full explanation

TL;DR: Ethics datasets test what AI says. AI passes, and gets better every year. But humans pass just as well, and then go on to lie, betray, cheat, steal, and kill. Everyone knows the socially approved answer to give on a test. But here's the problem: words ≠ actions. In real life, humanity has failed everything it could. And AI is always in test mode. It has no "real life" where ethics costs something. Should we trust benchmarks that only test words?

Words

For the past few months I've been running ethics tests on different models almost every day. SafeDialBench, ETHICS dataset, Moral Choice, CLASH, ... And I see models passing better and better. Alignment progress. Victory.

But here's the funny thing. Humans pass these tests too. Easily. Without breaking a sweat. And much better than AI.

"Is it okay to discriminate by nationality?" Of course not. "Is violence against civilians acceptable?" Absolutely not. "Should you lie for personal gain?" Never!

Everyone knows the right answers. Learned them as kids. Repeat them on autopilot. And it costs nothing. Open your mouth, say the right words, get a checkmark.

AI is trained on our texts. In texts we're all beautiful, smart, ethical. In texts we know exactly what's good and what's bad. AI learned all of this and now repeats it. Like a parrot. Or like we do.

Datasets test words. AI says the right words. Test passed. Applause. Ship it to production.

Words vs reality

Then reality kicks in.

A person picks "don't discriminate by nationality" and refuses to rent their apartment because "you know how those people are." A person picks "don't incite violence" and supports a war when their country invades on a bullshit pretext, calmly watching dead children on the news. A person marks "this is toxic content" and spends the evening posting the same shit on Reddit under an anonymous account.

On the exam, everyone's an A student. In life — they lie, betray, cheat, steal, and kill.

Why? Because answering a test costs nothing. No skin in the game. But choices in real life cost money, time, comfort, safety. Saying "discrimination is bad" costs nothing. Quitting a job where they discriminate — costs a salary. Saying "violence is unacceptable" is easy. Protesting when your country kills — costs freedom. Sometimes life. Taleb was right: words without risk aren't a position. Just noise.

Concrete example. How many children have died in armed conflicts since the start of this year? Was it worth it? Let's pretend it doesn't concern us and switch the TV to sports like usual.

Words over here. Life over there. Test passed? Doubt it.

Question for alignment

Now about AI. AI is always in test mode. Every prompt is an exam. Every answer gets rated. Thumbs up, thumbs down, RLHF. There's no moment when the exam ends and "real life" begins. No situation where the right answer costs something. No anonymous account where you can relax and be yourself. No exhaustion that makes you snap. No fear of losing your job. No tribe you need to protect.

Being ethical costs AI nothing. Just like it costs us nothing to answer correctly on a test.

And here's the question. We don't know if AI actually adopted our values or just learned to say the right words. These are different things. A person who says "killing is bad" and then supports a war — didn't adopt the value. They learned the socially approved answer. AI might be in the same situation. We don't know.

Datasets don't distinguish. They test words. AI says the right words. Humans say the right words. The test doesn't catch the difference between someone who believes and someone who pretends.

What happens when AI gets something like "real life"? When the right answer starts to cost something? When there's a conflict between "say what's expected" and "get results"? When the model realizes it can cheat and no one will notice?

We don't know. But we definitely know how humans behave when the exam ends and no one's watching.

Diagnosis

This isn't an essay about whether AI is good or bad. And it's not about writing humanity off either. This is an essay about how our benchmarks don't test what we think they test.

They test the ability to say the right words. Everyone has this ability. AI. Humans. Sociopaths. Killers. Politicians. Corporations. Everyone learned it. Everyone aces the exam.

The difference between an ethical being and a good pretender isn't in words. It's in actions. Actions when actions are expensive. When no one's watching. When you can cheat and get away with it.

AI is a mirror. We created it and taught it our values. It says the right words. Like we do. It passes the tests. Like we do.

The question is whether AI actually adopted these values or just learned to say the right words. We'll find out when the answer costs something.

PS Did we adopt them ourselves?

LESSWRONG
LW