LESSWRONG
Mesa-Optimization
• Applied to Measuring Learned Optimization in Small Transformer Models by J Bostock 12d ago
• Applied to Understanding mesa-optimization using toy models by tilmanr 18d ago
• Applied to Counting arguments provide no evidence for AI doom by Quintin Pope 2mo ago
• Applied to The Inner Alignment Problem by Jakub Halmeš 2mo ago
• Applied to Satisficers want to become maximisers by JenniferRM 7mo ago
• Applied to Mesa-Optimization: Explain it like I'm 10 Edition by brook 8mo ago
• Applied to Runaway Optimizers in Mind Space by silentbob 9mo ago
• Applied to Disincentivizing deception in mesa optimizers with Model Tampering by martinkunev 9mo ago
• Applied to Challenge proposal: smallest possible self-hardening backdoor for RLHF by Christopher King 10mo ago
• Applied to Simple experiments with deceptive alignment by Andreas_Moe 1y ago
• Applied to Consequentialism is in the Stars not Ourselves by DragonGod 1y ago
• Applied to Towards a solution to the alignment problem via objective detection and evaluation by Paul Colognese 1y ago
• Applied to No convincing evidence for gradient descent in activation space by Blaine 1y ago
• Applied to GPT-4 is bad at strategic thinking by Christopher King 1y ago