Misalignment and Roleplaying: Are Misaligned LLMs Acting Out Sci-Fi Stories?
Summary
* I investigated the possibility that misalignment in LLMs might be partly caused by the models misgeneralizing the “rogue AI” trope commonly found in sci-fi stories.
* As a preliminary test, I ran an experiment where I prompted ChatGPT and Claude with a scenario about a hypothetical LLM that...