
Gradient Hacking

Contributors: Multicore

Gradient hacking describes a scenario in which a mesa-optimizer inside an AI system deliberately acts so as to influence how gradient descent updates it. The most likely motive is to preserve its own mesa-objective in future iterations of the AI.

See also: Inner Alignment

Posts tagged Gradient Hacking
Gradient hacking, by evhub (4y; 104 karma, 39 comments)
Gradient hacking is extremely difficult, by beren (5mo; 146 karma, 19 comments)
Gradient Filtering, by Jozdien, janus (5mo; 54 karma, 16 comments)
Challenge: construct a Gradient Hacker, by Thomas Larsen, Thomas Kwa (3mo; 36 karma, 10 comments)
Some real examples of gradient hacking, by Oliver Sourbut (2y; 15 karma, 8 comments)
How does Gradient Descent Interact with Goodhart? [Question], by Scott Garrabrant, evhub (4y; 68 karma, 19 comments)
Gradient Hacker Design Principles From Biology, by johnswentworth (9mo; 54 karma, 13 comments)
Thoughts on gradient hacking, by Richard_Ngo (2y; 33 karma, 11 comments)
Towards Deconfusing Gradient Hacking, by leogao (2y; 32 karma, 3 comments)
Gradient hacking: definitions and examples, by Richard_Ngo (1y; 29 karma, 2 comments)
Approaches to gradient hacking, by adamShimi (2y; 16 karma, 8 comments)
Meta learning to gradient hack, by Quintin Pope (2y; 55 karma, 11 comments)
Gradient Hacking via Schelling Goals, by Adam Scherlis (1y; 33 karma, 4 comments)
[ASoT] Simulators show us behavioural properties by default, by Jozdien (5mo; 32 karma, 1 comment)
Understanding Gradient Hacking, by peterbarnett (2y; 31 karma, 5 comments)