Posts

Sorted by New

Wiki Contributions

Comments

Yeah, if the RLHF is supposed to train honesty into it, maybe it would have done worse on the task rabbit task. This really seems like a PR throwaway line rather than a legit attempt to red team the final model.