LESSWRONG

alexr

I am an effective altruist and AI safety researcher.
I hold a PhD in Physics from the University of Florida, and I am currently a visiting professor of Machine Learning at New College of Florida, in the Data Science Master's Program.

I became interested in AI safety...

60

7y

Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs

by Elliott Thornley, carissacullen, christosi, alexr, LAThomson, and Harry Garland

Summary

* Misaligned artificial agents might resist shutdown.
* One proposed solution is the POST-Agents Proposal: roughly, training agents to lack preferences between different-length trajectories.
* The Discounted Reward for Same-Length Trajectories (DReST) reward does this by penalizing agents for repeatedly choosing same-length trajectories. It thus incentivizes agents to be:...

Towards shutdownable agents via stochastic choice

by Elliott Thornley, alexr, christosi, and LAThomson

We[1] have a new paper testing the POST-Agents Proposal (PAP). The abstract and main text are below. Appendices are in the linked PDF.

Abstract

* Some worry that advanced artificial agents may resist being shut down.
* The POST-Agents Proposal (PAP) is an idea for ensuring that doesn’t happen.
* A...

Jul 8, 2024•59