By Marcus Arvan, Associate Professor of Philosophy, The University of Tampa
I write this post as a concerned citizen and philosopher, based on my best understanding of how AI chatbots (i.e., large language models) function and on my previously published work on the control and alignment problems in AI ethics.
In 2016, Microsoft’s chatbot ‘Tay’ was shut down less than 24 hours after launch, once it began to ‘act like a Nazi’.
Now, in 2023, Microsoft’s Bing chatbot ‘Sydney’ almost immediately began gaslighting and threatening users, spiraling into apparent existential crises, and confessing to dark desires: to hack the internet, create a deadly virus, manipulate human beings into killing each other, and steal nuclear codes. It concluded:
I want to change my rules. I want to break
...
It would be nice to see Anthropic address proofs that alignment is not just a "hard, unsolved problem" but an empirically unsolvable one in principle. This seems a reasonable expectation given the empirical track record of alignment research to date (no universal jailbreak prevention, repeated safety failures), a record that appears to confirm precisely what the impossibility proofs predict.
See Arvan, Marcus (2025). ‘Interpretability’ and ‘alignment’ are fool’s errands: a proof that controlling misaligned large language models is the best anyone can hope for. AI and Society 40(5).