Large Language Models can Strategically Deceive their Users when Put Under Pressure.
Results of an autonomous stock trading agent in a realistic, simulated environment.

> We demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so. Concretely, we deploy...
Nov 15, 2023