4

23rd Jun 2020

1 min read

4

In his article here, Paul Christiano mentions the idea that instrumentally corrigible agents are not robust, as slight misalignments between their values and the overseer values would lead to an adversarial relationship. While the intuition that slight deviations in values potentially results in adversarial behavior makes sense, I am not sure if I quite understand the meaning behind what an instrumentally corrigible agent is.

I would presume it relates to how corrigibility arises, that is different from Paul's amplification approach.

AI

Frontpage

4

New Answer

New Comment

1 Answers sorted by
top scoring

Gurkenglas

Jun 23, 2020

20

An instrumentally corrigible agent lets you correct it because it expects you know better than it. The smarter it becomes, the less your higher competence is worth, and the more it loses out by letting you take the wheel while you're not perfectly aligned with it.

Moderation Log

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

4

[ Question ]

What is "Instrumental Corrigibility"?

4

4

1 Answers sorted by
top scoring

Jun 23, 2020

4

[ Question ]

What is "Instrumental Corrigibility"?

4

4

1 Answers sorted by top scoring

Jun 23, 2020

1 Answers sorted by
top scoring