Harry Garland — LessWrong

9

2y

Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs

by Elliott Thornley, carissacullen, christosi, alexr, LAThomson, and Harry Garland

Summary

* Misaligned artificial agents might resist shutdown.
* One proposed solution is the POST-Agents Proposal: roughly, training agents to lack preferences between different-length trajectories.
* The Discounted Reward for Same-Length Trajectories (DReST) reward does this by penalizing agents for repeatedly choosing same-length trajectories. It thus incentivizes agents to be:...
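
The idea of penalizing repeated same-length choices can be illustrated with a minimal sketch. This is not the post's exact reward formula: the function name `discounted_reward`, the discount factor `lam`, and the `"short"`/`"long"` length labels are all assumptions chosen for illustration. The sketch simply scales a base reward down each time the same trajectory length has already been chosen.

```python
from collections import Counter


def discounted_reward(base_reward, length, counts, lam=0.9):
    """Return base_reward scaled by lam once per earlier choice of `length`.

    `counts` tracks how often each trajectory length has been chosen so far.
    Hypothetical simplification: the actual DReST reward may be defined differently.
    """
    reward = (lam ** counts[length]) * base_reward
    counts[length] += 1  # record this choice so later repeats are discounted
    return reward


# Illustrative sequence of trajectory-length choices (assumed labels).
counts = Counter()
choices = ["short", "short", "long", "short"]
rewards = [discounted_reward(1.0, length, counts) for length in choices]
```

In this sketch each repeated `"short"` choice earns less than the one before it, while the first `"long"` choice is undiscounted, so varying the trajectory length pays more than always picking the same one.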