Gradient Hacking

Edited by Multicore, last updated 27th Aug 2022

Gradient Hacking describes a scenario in which a mesa-optimizer in an AI system deliberately acts to influence how gradient descent updates it, likely in order to preserve its own mesa-objective in future iterations of the AI.
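
As a rough illustration of the mechanism, here is a minimal toy sketch. It is not taken from any of the posts tagged here, and every name and constant in it is an assumption made for the example. Two scalars are trained by plain gradient descent: `w` stands in for task competence and `g` for a mesa-objective. The outer loss mildly prefers `g = 0`. The hypothetical "hacker" makes its own performance worse whenever `g` drifts from a protected value, which is modelled here as an extra self-imposed loss term of strength `sabotage`.

```python
def train(sabotage, steps=2000, lr=0.01, g_protected=2.0):
    """Toy gradient descent on two scalars (illustrative assumptions only).

    loss(w, g) = (w - 1)^2                        # task loss, minimized at w = 1
               + 0.1 * g^2                        # outer pressure pushing g toward 0
               + sabotage * (g - g_protected)^2   # self-imposed failure when g drifts
    """
    w, g = 0.0, g_protected  # start with the mesa-objective at its protected value
    for _ in range(steps):
        # Analytic gradients of the loss above.
        grad_w = 2.0 * (w - 1.0)
        grad_g = 0.2 * g + 2.0 * sabotage * (g - g_protected)
        w -= lr * grad_w
        g -= lr * grad_g
    return w, g


if __name__ == "__main__":
    for k in (0.0, 10.0):
        w, g = train(sabotage=k)
        print(f"sabotage={k}: w={w:.3f}, g={g:.3f}")
```

In this sketch, with `sabotage=0.0` gradient descent drives `g` close to zero, while with `sabotage=10.0` it settles near the protected value, and `w` still learns the task in both cases. Real proposals for gradient hacking are far subtler than an explicit penalty term; the toy only shows how behaviour that depends on a parameter can reshape the gradient that parameter receives.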

See also: Inner Alignment

Posts tagged Gradient Hacking (15 of 26 shown)

Gradient hacking, by evhub (6y; 108 karma, 39 comments)
Gradient hacking is extremely difficult, by beren (3y; 174 karma, 23 comments)
Gradient Filtering, by Jozdien, janus (3y; 56 karma, 16 comments)
Challenge: construct a Gradient Hacker, by Thomas Larsen, Thomas Kwa (3y; 39 karma, 10 comments)
Some real examples of gradient hacking, by Oliver Sourbut (4y; 15 karma, 8 comments)
How does Gradient Descent Interact with Goodhart?, by Scott Garrabrant, evhub (7y; 68 karma, 19 comments)
Gradient Hacker Design Principles From Biology, by johnswentworth (3y; 60 karma, 13 comments)
Gradient hacking: definitions and examples, by Richard_Ngo (3y; 44 karma, 2 comments)
Understanding Gradient Hacking, by peterbarnett (4y; 41 karma, 5 comments)
Towards Deconfusing Gradient Hacking, by leogao (4y; 39 karma, 3 comments)
Thoughts on gradient hacking, by Richard_Ngo (4y; 33 karma, 11 comments)
Approaches to gradient hacking, by adamShimi (4y; 16 karma, 8 comments)
A scheme to credit hack policy gradient training, by Adrià Garriga-alonso (7d; 15 karma, 0 comments)
Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation, by Fabien Roger, Buck (2y; 107 karma, 3 comments)
Meta learning to gradient hack, by Quintin Pope (4y; 55 karma, 11 comments)