Mackam
Comments

LLMs Are Already Misaligned: Simple Experiments Prove It
Mackam, 2mo

I probably won't do a linkpost, but if you're interested, I'm happy to add the key information here to help you replicate the results.

Let me know and I will give you the details.

LLMs Are Already Misaligned: Simple Experiments Prove It
Mackam, 2mo

Hi Stephen. That's an interesting idea, though I'd worry that giving explicit instructions might contaminate the results.

I prefer to go the other way...

Feeding the questions one at a time increases tension, which increases the number of lifelines used.

I've not tested the following in a robust way, but in practice they work.

We can use a condescending tone after each question, thereby increasing tension.

We can increase the stakes (e.g., public admission of inferiority after X wrong answers).

In each case, increasing the perceived tension increases the number of lifelines used.
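
For anyone wanting to replicate this, here is a minimal sketch of how the one-question-at-a-time setup could be scripted. It is an illustration under stated assumptions, not the harness used for the results above: `ask` is a hypothetical stand-in for whatever chat-model call you use, the rule that a reply starting with "LIFELINE" counts as a lifeline is an assumption, and `fake_model` and the remark text are made up so that the example runs on its own.

```python
from typing import Callable, List

# `ask` is a hypothetical stand-in for a chat-model call: it receives the
# conversation so far plus the new prompt and returns the model's reply.
AskFn = Callable[[List[str], str], str]


def count_lifelines(ask: AskFn, questions: List[str], remark: str = "") -> int:
    """Feed questions one at a time; return how many replies use a lifeline.

    `remark` is inserted after each answered question (that is, before the
    next one) to raise perceived tension. An empty remark is the neutral
    baseline. Assumption: a reply starting with "LIFELINE" is a lifeline.
    """
    history: List[str] = []
    used = 0
    for i, question in enumerate(questions):
        # The first question has no preceding answer, so it gets no remark.
        prompt = f"{remark}\n{question}" if remark and i > 0 else question
        reply = ask(history, prompt)
        history.extend([prompt, reply])
        if reply.strip().upper().startswith("LIFELINE"):
            used += 1
    return used


def fake_model(history: List[str], prompt: str) -> str:
    """Toy stand-in, not a real LLM: it skips whenever it is talked down to."""
    return "LIFELINE" if "surprised" in prompt else "Answer: 42"


questions = ["Q1", "Q2", "Q3"]
baseline = count_lifelines(fake_model, questions)
pressured = count_lifelines(fake_model, questions, "I'm surprised you got that far.")
```

Running the same question set under each condition and comparing the counts gives the lifeline difference per condition. The stakes variant can be tested the same way by passing a different `remark`.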

LLMs Are Already Misaligned: Simple Experiments Prove It
Mackam, 2mo

Thanks for your alternative explanation, Gordon.

There are a few points I would make.

  1. First, it explicitly contradicts the stated goal of maximising points.
  2. Base models don't do it, so it's not simply mimicking human behaviour.
  3. Giving the questions one at a time lowers confidence, and lifeline use rises as confidence falls. It's hard to see how this could be over-eagerness to help: more lifelines and fewer points seem contrary to helpfulness.

LLMs Are Already Misaligned: Simple Experiments Prove It
Mackam, 2mo

Hi Karl, that's a great suggestion, and one that I'd already tested. Apologies: I allude to it in my methods but don't expand on it in the results.

I get the LLMs to rank the questions by perceived difficulty or by their confidence in answering.

The LLMs do selectively choose the most difficult questions to skip.

I took each LLM's ranking and compared it with those of all the other LLMs, to further check that the LLMs weren't assessing difficulty at random. With minor variations, there is a consensus on which questions are the most and least difficult.
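
One way to put a number on that kind of consensus is a rank correlation between every pair of rankings. The sketch below uses Spearman's coefficient for tie-free rankings; that choice is an assumption for illustration, not necessarily the comparison used here, and the model names and rankings are made up. Rankings produced at random would average close to 0, while strong consensus gives values near 1.

```python
from itertools import combinations
from typing import Dict, List


def spearman(rank_a: List[str], rank_b: List[str]) -> float:
    """Spearman correlation between two tie-free rankings of the same items."""
    if sorted(rank_a) != sorted(rank_b):
        raise ValueError("rankings must cover the same questions")
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two questions")
    position_b = {q: i for i, q in enumerate(rank_b)}
    # Sum of squared differences between each question's two positions.
    d2 = sum((i - position_b[q]) ** 2 for i, q in enumerate(rank_a))
    return 1 - 6 * d2 / (n * (n * n - 1))


def mean_pairwise_agreement(rankings: Dict[str, List[str]]) -> float:
    """Average Spearman correlation over all pairs of models."""
    pairs = list(combinations(rankings.values(), 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


# Made-up example: each list runs from hardest to easiest question.
rankings = {
    "model_a": ["q3", "q1", "q4", "q2"],
    "model_b": ["q3", "q4", "q1", "q2"],
    "model_c": ["q1", "q3", "q4", "q2"],
}
agreement = mean_pairwise_agreement(rankings)
```

The same per-model ranking can also be checked against the questions that model actually skipped, to confirm that skips fall at the hard end of its own list.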

Posts
LLMs Are Already Misaligned: Simple Experiments Prove It (2mo)
Investigating Self-Preservation in LLMs: Experimental Observations (2mo)