Reward Functions
Posts tagged Reward Functions
Most Relevant
Relevance | Karma | Title | Flags | Authors | Age | Comments
5 | 374 | Reward is not the optimization target | Ω | TurnTrout | 2y | 123
5 | 47 | Draft papers for REALab and Decoupled Approval on tampering | Ω | Jonathan Uesato, Ramana Kumar | 4y | 2
2 | 102 | Scaling Laws for Reward Model Overoptimization | Ω | leogao, John Schulman, Jacob_Hilton | 2y | 13
2 | 87 | Seriously, what goes wrong with "reward the agent when it makes you smile"? | Q, Ω | TurnTrout, johnswentworth | 2y | 42
2 | 72 | Interpreting Preference Models w/ Sparse Autoencoders | Ω | Logan Riggs, Jannik Brinkmann | 3mo | 12
2 | 46 | Four usages of "loss" in AI | Ω | TurnTrout | 2y | 18
2 | 40 | Intrinsic Drives and Extrinsic Misuse: Two Intertwined Risks of AI | | jsteinhardt | 1y | 0
2 | 39 | Language Agents Reduce the Risk of Existential Catastrophe | Ω | cdkg, Simon Goldstein | 1y | 14
2 | 20 | $100/$50 rewards for good references | Ω | Stuart_Armstrong | 3y | 5
2 | 13 | Why we want unbiased learning processes | | Stuart_Armstrong | 7y | 3
2 | 5 | Learning societal values from law as part of an AGI alignment strategy | | John Nay | 2y | 18
1 | 130 | Utility ≠ Reward | Ω | Vlad Mikulik | 5y | 24
1 | 77 | Reward hacking behavior can generalize across tasks | Ω | Kei, Isaac Dunn, Henry Sleight, Miles Turpin, evhub, Carson Denison, Ethan Perez | 4mo | 5
1 | 49 | Shutdown-Seeking AI | Ω | Simon Goldstein | 1y | 31
1 | 45 | A Short Dialogue on the Meaning of Reward Functions | Ω | Leon Lang, Quintin Pope, peligrietzer | 2y | 0