Reward Functions

A reward function is a mathematical function in reinforcement learning that defines which actions or outcomes are desirable for an AI system by assigning numerical values (rewards) to states or state-action pairs. It encodes the goals and preferences we want the AI to optimize for. However, specifying a reward function that avoids unintended consequences is a significant challenge in AI development.
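
To make the definition concrete, here is a minimal sketch of a reward function for a toy gridworld. The grid layout, the GOAL and LAVA cells, and the per-step cost are illustrative assumptions, not drawn from any standard benchmark or library.

```python
# A toy reward function for a hypothetical 4x4 gridworld.
# GOAL, LAVA, and the per-step cost are illustrative assumptions.

GOAL = (3, 3)            # reaching this cell yields +1
LAVA = {(1, 2), (2, 1)}  # entering one of these cells yields -1


def reward(state, action, next_state):
    """Assign a scalar reward to a (state, action, next_state) transition.

    This encodes the designer's goal: reach GOAL while avoiding LAVA.
    The small per-step cost makes "get there quickly" part of the
    objective. In this sketch the reward depends only on the resulting
    state; `action` is accepted but unused.
    """
    if next_state == GOAL:
        return 1.0
    if next_state in LAVA:
        return -1.0
    return -0.01  # per-step living cost discourages wandering


# Example: moving right from (3, 2) into the goal cell earns +1.
print(reward((3, 2), "right", (3, 3)))  # 1.0
```

Even this toy example hints at the specification problem: omit the per-step cost and the agent has no incentive to finish quickly, and any loophole in how rewards are assigned is something a capable optimizer can be expected to exploit (see the posts on reward hacking below).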

Posts tagged Reward Functions
378 · Reward is not the optimization target · TurnTrout · 3y · 127 comments · Ω
47 · Draft papers for REALab and Decoupled Approval on tampering · Jonathan Uesato, Ramana Kumar · 5y · 2 comments · Ω
81 · Reward hacking behavior can generalize across tasks · Kei, Isaac Dunn, Henry Sleight, Miles Turpin, evhub, Carson Denison, Ethan Perez · 1y · 5 comments · Ω
103 · Scaling Laws for Reward Model Overoptimization · leogao, John Schulman, Jacob_Hilton · 3y · 13 comments · Ω
87 · Seriously, what goes wrong with "reward the agent when it makes you smile"? · TurnTrout, johnswentworth · 3y · 43 comments · QΩ
75 · Interpreting Preference Models w/ Sparse Autoencoders · Logan Riggs, Jannik Brinkmann · 1y · 12 comments · Ω
46 · Four usages of "loss" in AI · TurnTrout · 3y · 18 comments · Ω
43 · A quick list of reward hacking interventions · Alex Mallen · 3mo · 5 comments · Ω
40 · Intrinsic Drives and Extrinsic Misuse: Two Intertwined Risks of AI · jsteinhardt · 2y · 0 comments
39 · Language Agents Reduce the Risk of Existential Catastrophe · cdkg, Simon Goldstein · 2y · 14 comments · Ω
37 · When is reward ever the optimization target? · Noosphere89, gwern · 11mo · 17 comments · Q
27 · Security Mindset: Hacking Pinball High Scores · gwern · 3mo · 3 comments
20 · $100/$50 rewards for good references · Stuart_Armstrong · 4y · 5 comments · Ω
13 · Why we want unbiased learning processes · Stuart_Armstrong · 8y · 3 comments
5 · Learning societal values from law as part of an AGI alignment strategy · John Nay · 3y · 18 comments
(15 of 42 tagged posts shown.)