LESSWRONG
Deceptive Alignment
• Applied to MetaAI: less is less for alignment. by Cleo Nardo (3d ago)
• Applied to The Sharp Right Turn: sudden deceptive alignment as a convergent goal by avturchin (10d ago)
• Applied to Proposal: labs should precommit to pausing if an AI argues for itself to be improved by NickGabs (13d ago)
• Applied to Open Source LLMs Can Now Actively Lie by Josh Levy (14d ago)
• Applied to Announcing Apollo Research by Marius Hobbhahn (17d ago)
• Applied to Exploiting Newcomb's Game Show by carterallen (22d ago)
• Applied to [FICTION] ECHOES OF ELYSIUM: An Ai's Journey From Takeoff To Freedom And Beyond by Super AGI (1mo ago)
• Applied to Simple experiments with deceptive alignment by Andreas_Moe (1mo ago)
• Applied to Alignment as Function Fitting by A.H. (1mo ago)
• Applied to Trying to measure AI deception capabilities using temporary simulation fine-tuning by alenoach (1mo ago)
• Applied to Deep Deceptiveness by a13ph (2mo ago)
• Applied to Natural language alignment by Jacy Reese Anthis (2mo ago)
• Applied to Towards a solution to the alignment problem via objective detection and evaluation by Paul Colognese (2mo ago)
• Applied to Environments for Measuring Deception, Resource Acquisition, and Ethical Violations by RobertM (2mo ago)
• Applied to Daisy-chaining epsilon-step verifiers by Decaeneus (2mo ago)
• Applied to "Corrigibility at some small length" by dath ilan by Christopher King (2mo ago)