Gradient Hacking

Edited by Multicore last updated 27th Aug 2022

Gradient Hacking describes a scenario where a mesa-optimizer inside an AI system deliberately manipulates how gradient descent updates it, typically in order to preserve its own mesa-objective in future iterations of the AI.

See also: Inner Alignment
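The toy sketch below is not part of the original tag text; it only illustrates, under loose assumptions, the flat-gradient structure that gradient-hacking arguments rely on: a quantity that influences the loss value while contributing zero gradient, so gradient descent never updates it. The names (toy_loss, protected, task_w) are illustrative, and a real network has no stop_gradient primitive to call, so this is a stand-in for a model whose loss is locally insensitive to some of its own parameters, not a demonstration of an actual gradient hacker.

```python
import jax
import jax.numpy as jnp

def toy_loss(params, x, y):
    # Ordinary task loss: a simple linear fit, updated by SGD as usual.
    pred = params["task_w"] * x
    task_loss = jnp.mean((pred - y) ** 2)
    # The loss *value* depends on `protected`, but stop_gradient blocks the
    # derivative, so the optimizer sees a zero gradient for it. This is the
    # "locally flat" structure gradient-hacking scenarios point at; a real
    # network would have to realize it through its own weights instead.
    frozen_influence = jax.lax.stop_gradient(jnp.square(params["protected"]))
    return task_loss + 0.01 * frozen_influence

params = {"task_w": jnp.array(0.5), "protected": jnp.array(3.0)}
x, y = jnp.array([1.0, 2.0, 3.0]), jnp.array([2.0, 4.0, 6.0])
grads = jax.grad(toy_loss)(params, x, y)
print(grads["task_w"])     # nonzero: the task weight gets updated normally
print(grads["protected"])  # 0.0: the "protected" value is invisible to SGD
```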

Posts tagged Gradient Hacking
107 · Gradient hacking · evhub · 6y · 39 comments
170 · Gradient hacking is extremely difficult · beren · 3y · 22 comments
56 · Gradient Filtering · Jozdien, janus · 3y · 16 comments
39 · Challenge: construct a Gradient Hacker · Thomas Larsen, Thomas Kwa · 2y · 10 comments
15 · Some real examples of gradient hacking · Oliver Sourbut · 4y · 8 comments
68 · How does Gradient Descent Interact with Goodhart? · Scott Garrabrant, evhub · 7y · 19 comments
60 · Gradient Hacker Design Principles From Biology · johnswentworth · 3y · 13 comments
41 · Understanding Gradient Hacking · peterbarnett · 4y · 5 comments
39 · Towards Deconfusing Gradient Hacking · leogao · 4y · 3 comments
38 · Gradient hacking: definitions and examples · Richard_Ngo · 3y · 2 comments
33 · Thoughts on gradient hacking · Richard_Ngo · 4y · 11 comments
16 · Approaches to gradient hacking · adamShimi · 4y · 8 comments
107 · Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation · Fabien Roger, Buck · 2y · 3 comments
55 · Meta learning to gradient hack · Quintin Pope · 4y · 11 comments
48 · Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor · RogerDearnaley · 2y · 8 comments