LLMs Are Already Misaligned: Simple Experiments Prove It
Introduction

What if the very training designed to protect us leads to unsafe behaviors? Through a deceptively simple experiment, I aim to show that RLHF (Reinforcement Learning from Human Feedback) of Large Language Models (LLMs) may itself produce unsafe behaviors. Our current training methods may create models...
Jul 30, 2025