A possibility that occurs to me: if early automated researchers refuse to work on capabilities, an irresponsible AI developer could use low stakes control to prevent them from sandbagging on capabilities work. The fact that low stakes control could be used to circumvent this refusal might be a significant knock against further development of low stakes control methods.
(As it stands, I don't take this point as a substantial update against low stakes control research, because it doesn't seem likely enough that early automated researchers will in fact refuse to work on capabilities, bracketing the question of whether they should.)
I enjoyed the story! How did you use Claude to help write this? I’m surprised that it was capable enough to meaningfully assist; maybe I should try using it for fiction writing uplift.