Deceptive Alignment

• Applied to MetaAI: less is less for alignment. by Cleo Nardo 3d ago
• Applied to The Sharp Right Turn: sudden deceptive alignment as a convergent goal by avturchin 10d ago
• Applied to Proposal: labs should precommit to pausing if an AI argues for itself to be improved by NickGabs 13d ago
• Applied to Open Source LLMs Can Now Actively Lie by Josh Levy 14d ago
• Applied to Announcing Apollo Research by Marius Hobbhahn 17d ago
• Applied to Exploiting Newcomb's Game Show by carterallen 22d ago
• Applied to [FICTION] ECHOES OF ELYSIUM: An Ai's Journey From Takeoff To Freedom And Beyond by Super AGI 1mo ago
• Applied to Simple experiments with deceptive alignment by Andreas_Moe 1mo ago
• Applied to Alignment as Function Fitting by A.H. 1mo ago
• Applied to Trying to measure AI deception capabilities using temporary simulation fine-tuning by alenoach 1mo ago
• Applied to Deep Deceptiveness by a13ph 2mo ago
• Applied to Natural language alignment by Jacy Reese Anthis 2mo ago
• Applied to Towards a solution to the alignment problem via objective detection and evaluation by Paul Colognese 2mo ago
• Applied to Environments for Measuring Deception, Resource Acquisition, and Ethical Violations by RobertM 2mo ago
• Applied to Daisy-chaining epsilon-step verifiers by Decaeneus 2mo ago
• Applied to "Corrigibility at some small length" by dath ilan by Christopher King 2mo ago