LESSWRONG
deontologician

Comments

ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so
deontologician 3y 21

Yeah, if the RLHF is supposed to train honesty into it, maybe it would have done worse on the TaskRabbit task. This really seems like a PR throwaway line rather than a legit attempt to red-team the final model.
