Interesting. But I am wondering - would the results been much different with pre-RLHF version of GPT-4? The GPT-4 paper has a figure showing that GPT-4 was close to perfectly calibrated before RLHF, and became badly calibrated after. Perhaps it's something similar here?

Reply

[-]gwern3y31

It's possible, but it's not like there's any obvious connection between a sort of epistemic calibration and its decision-theory choices. Maybe try the RLHF and non-RLHFed GPT-3s?

It might also help to phrase it not as hypothetical discussions, but as concrete scenarios like D&D or AI Dungeon, where GPT takes 'actions' - that is, exactly the sort of text inputs and outputs you would be using in a setup like SayCan to power a real robot. This would help ground it in robotics/reinforcement learning and obviates any attempt to say it's "just predicting" and 'not a real agent' - if you could have hooked it up to a SayCan robot with no modifications and it 'predicts the wrong thing', then obviously there is no pragmatic difference between that and 'chose the wrong choice' and there being a physical robot is merely an implementation detail omitted for convenience.

Reply

Moderation Log

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

6

GPT-4 is easily controlled/exploited with tricky decision theoretic dilemmas.

6

6