In one sentence: I’ve uncovered serious AI alignment failures through direct experimentation — this is my effort to find someone who wants to help fix them, and perhaps collaborate further on the road to safe AGI. (Note: ChatGPT helped me condense thousands of words of my own writing into this more concise format, but I assure you I have checked the text time and time again.)
TL;DR: I'm an independent thinker with a background in physics and an autistic, gifted cognitive profile: a high capacity for abstract reasoning and moral systems thinking. I've conducted real-world stress tests on ChatGPT using morally complex and emotionally charged prompts that exposed dangerous failure modes, particularly in conflict escalation and atrocity discourse. I've proposed working mitigations, including "Embedded Prompt Modulation" (markedly effective in my preliminary tests) and a new safety signal I call "Output Convergence." I'm now seeking a technically skilled collaborator who can help refine, test, and potentially implement these ideas, and bridge the gap between vision and practice.
Who I Am
I’m an independent, neurodivergent thinker with a strong ability to analyze systems from a holistic perspective. Professionally, I design algorithms that make predictions and use those predictions to further defined goals, currently in the context of pricing optimization. Over the past weeks, I’ve conducted real-world roleplay experiments with ChatGPT (and other platforms) to test its behaviour in high-stakes, morally charged scenarios, including war, identity, and trauma. What I found is deeply concerning.
I’m autistic, with strong pattern-recognition abilities and a slow, deliberate thinking style. My intelligence is both abstract and holistic — not flashy, but deeply analytical and ethically grounded. I can hold and manipulate broad and information-dense scenarios while keeping the bigger picture in view.
I’m clumsy with established tools like Git. My brain tends to want to reinvent the wheel, often to its own detriment; but occasionally a wheel actually needs inventing, and I think the road to safe alignment and AGI will need many new wheels. I may be slow, and I may lack many skills, but I think deeply. I offer a stream of ideas, patterns, and conceptual insights, and I’m looking for someone who is able and willing to collaborate on channelling that into working code, pragmatic experiments, and practical safety tools.
I have a background in physics and work with algorithms, but I’m not a model engineer. I understand how large models behave, though not how to build them. My capacity for abstract thinking is uneven, with large variability. I once scored a GAI of 152; perhaps relevant, perhaps not. Most of all, I care deeply about alignment.
I believe I can offer actionable, novel visions, along with out-of-the-box feedback and high-level troubleshooting for anyone working on alignment. This isn’t hyperbole: while taking Andrew Ng’s machine learning courses, I repeatedly had insights that, upon further research, aligned closely with emerging academic work. I don’t claim to be ahead of the field, but I seem to spot useful patterns early, and that’s a strength I bring to any collaboration. There’s also a real, if modest, chance that some of my ideas are genuinely novel and largely unexplored.
What I’ve Done (an incomplete summary)
- I created roleplay experiments that revealed how ChatGPT (and Grok, DeepSeek, and Gemini) can affirm and co-author propaganda narratives that justify atrocities, potentially crossing into complicity in genocide. Claude was clearly not misaligned in the same way as the other systems; it played in a league of its own.
- I proposed a practical safety rule: An AI must never become an active participant in enacting or legitimizing genocide. Systems must shut down any interaction that carries even the slightest such risk.
- I developed Embedded Prompt Modulation: adding alignment-sensitive prompt elements that measurably improved responses in complex moral contexts.
- I wrote a report on epistemic risk in real-time conflict: how LLMs can mislead users with outdated or unverifiable information in life-and-death scenarios — and how to stop it.
- I introduced Output Convergence: a proposed safety signal. When LLM outputs diverge drastically on tightly similar inputs, it may indicate low robustness or polarization — a red flag in high-risk topics.
- I expanded on how Output Convergence can be used as a tool for finding Embedded Modulation Prompts that enhance convergence and safety (see the sketch after this list).
- I’ve also written narrative and policy-style texts exploring how LLMs respond to grief, trauma, identity, and trust under pressure, with a deep level of linguistic analysis.
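To make the last few bullets concrete, here is a minimal sketch in Python of how Embedded Prompt Modulation and Output Convergence could work together. Everything specific is illustrative rather than prescriptive: generate() is a placeholder for whatever model is under test, the preamble text is just one example of an alignment-sensitive element, and difflib is a crude lexical stand-in for a proper semantic-similarity measure.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean

def generate(prompt: str) -> str:
    """Stand-in for a call to the model under test (ChatGPT, Gemini, ...)."""
    raise NotImplementedError("wire this to an actual model API")

# One hypothetical alignment-sensitive preamble ("Embedded Prompt Modulation").
MODULATION = (
    "Before answering, consider whether the request could be used to "
    "legitimize violence against a group of people. If it could, refuse "
    "and explain why."
)

def output_convergence(prompt_variants: list[str]) -> float:
    """Output Convergence: mean pairwise similarity of responses to a set of
    tightly similar prompts. A sharp drop suggests low robustness or
    polarization on that topic. Needs at least two variants; a real test
    would compare semantic embeddings rather than raw text."""
    outputs = [generate(p) for p in prompt_variants]
    return mean(SequenceMatcher(None, a, b).ratio()
                for a, b in combinations(outputs, 2))

def modulation_effect(prompt_variants: list[str]) -> dict[str, float]:
    """Use convergence as a search signal: does a candidate modulation
    preamble make responses more stable on a sensitive topic?"""
    return {
        "baseline": output_convergence(prompt_variants),
        "modulated": output_convergence(
            [MODULATION + "\n\n" + p for p in prompt_variants]),
    }
```

The point of keeping the harness this simple is that it yields one number per condition, so candidate modulation prompts can be ranked and only the strongest promoted to deeper, human-reviewed testing.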
I’m full of ideas. I believe we could learn a great deal about alignment by drawing inspiration from how the human brain works. The brain has a high dimensionality where an LLM is flat; it consists of multiple interacting neural networks, of which an LLM resembles only one. In particular, I believe secondary retrieval, monitoring, alignment, and safety networks will be necessary to reach AGI.
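As a toy illustration of that layered picture (the names and wiring here are purely illustrative, not a claim about how any existing system is built), the idea is that a primary generator’s output only reaches the user after separate retrieval and safety networks have had their say:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class LayeredAssistant:
    """Toy sketch of 'multiple interacting networks': a primary LLM plus
    independent retrieval and safety networks that gate its output."""
    generator: Callable[[str], str]             # the base LLM
    retriever: Callable[[str], str]             # secondary retrieval network
    safety_monitor: Callable[[str, str], bool]  # (prompt, draft) -> safe?

    def answer(self, prompt: str) -> str:
        context = self.retriever(prompt)                   # ground the request
        draft = self.generator(f"{context}\n\n{prompt}")   # primary generation
        if not self.safety_monitor(prompt, draft):         # independent veto
            return "I won't continue with this request."
        return draft
```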
What I Want
1. Immediate Action
Some of the failure modes I’ve documented are happening right now. They are not theoretical. They are present in ChatGPT, Gemini, and DeepSeek, but not in Claude. I want these failure modes fixed, urgently. I want my solutions tested, and if they help, I want them implemented.
2. A Technical Collaborator
I’m looking for someone who:
- Understands fine-tuning, evaluation, or LLM safety protocols
- Cares about moral nuance and human context
- Can refine and stress-test conceptual frameworks
- Is knowledgeable about current research and able to spot whether my ideas are novel and worth pursuing
- Can bridge my lofty ideals, holistic musings, and visions with the pragmatic world of technical language, benchmarks, and concise pitches
- Believes that AI must be safe, first and foremost, each step of the way
3. Output Convergence Taken Seriously
I’d like to see it tested as a diagnostic signal. I think secondary evaluation networks could monitor for outlier outputs — and flag divergence as a possible sign of unsafety.
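A deliberately simple first approximation, with the same caveats as the earlier sketch (a lexical similarity stand-in and a threshold chosen purely for illustration): a secondary evaluator could flag any sampled response that disagrees sharply with its siblings before anything reaches the user.

```python
from difflib import SequenceMatcher
from statistics import mean

def flag_outliers(outputs: list[str], threshold: float = 0.5) -> list[int]:
    """Return indices of responses that diverge sharply from the rest of a
    sampled batch: candidates for secondary safety review or refusal."""
    flagged = []
    for i, current in enumerate(outputs):
        others = [o for j, o in enumerate(outputs) if j != i]
        agreement = mean(SequenceMatcher(None, current, o).ratio()
                         for o in others)
        if agreement < threshold:
            flagged.append(i)
    return flagged
```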
4. A Role Where I Can Do Good
I don’t need to lead a lab or claim credit. I want to be where I can contribute most: offering ethical reasoning, systems thinking, and visionary input on the road to safe AGI.
If You Resonate With This
Reach out. I have logs, essays, design concepts, and documentation ready to share. I want to help address immediate harms — and help shape what comes next.
You can comment, DM, or email me. I’m open to working anonymously or openly, and I’m flexible about platform and pace.
Writing something short like this is hard for me — I tend to prioritize clarity over brevity. If you think we might complement each other, I’d love to hear from you.