Corrigibility

A corrigible agent is one that doesn't interfere with what we would intuitively see as attempts to 'correct' the agent or to 'correct' our mistakes in building it, and that permits these 'corrections' even though instrumentally convergent reasoning would apparently tell it to resist them.

Posts tagged Corrigibility
Most Relevant
Corrigibility Ω (paulfchristiano, 2y): 40 karma, 4 comments, relevance 9
AI Alignment 2018-19 Review Ω (rohinmshah, 1y): 113 karma, 6 comments, relevance 2
Non-Obstruction: A Simple Concept Motivating Corrigibility Ω (TurnTrout, 3mo): 63 karma, 19 comments, relevance 2
Three mental images from thinking about AGI debate & corrigibility Ω (steve2152, 7mo): 50 karma, 35 comments, relevance 2
Towards a mechanistic understanding of corrigibility Ω (evhub, 2y): 39 karma, 26 comments, relevance 2
Corrigibility as outside view Ω (TurnTrout, 10mo): 36 karma, 11 comments, relevance 2
Can corrigibility be learned safely? Ω (Wei_Dai, 3y): 35 karma, 115 comments, relevance 2
Do what we mean vs. do what we say Ω (rohinmshah, 3y): 34 karma, 14 comments, relevance 2
Corrigible but misaligned: a superintelligent messiah (zhukeepa, 3y): 27 karma, 26 comments, relevance 2
Thoughts on implementing corrigible robust alignment Ω (steve2152, 1y): 26 karma, 2 comments, relevance 2
The limits of corrigibility (Stuart_Armstrong, 3y): 25 karma, 9 comments, relevance 2
Addressing three problems with counterfactual corrigibility: bad bets, defending against backstops, and overconfidence. Ω (RyanCarey, 2y): 22 karma, 1 comment, relevance 2
A Critique of Non-Obstruction Ω (Joe_Collman, 1mo): 13 karma, 10 comments, relevance 2
An Idea For Corrigible, Recursively Improving Math Oracles Ω (jimrandomh, 6y): 6 karma, 0 comments, relevance 2
Corrigible omniscient AI capable of making clones Ω (Kaj_Sotala, 6y): 5 karma, 0 comments, relevance 2