LESSWRONG
LW

AI Alignment FieldbuildingJailbreaking (AIs)Language Models (LLMs)PhilosophyPsychologyAI

1

The Allure of the Dark Side: A Critical AI Safety Vulnerability I Stumbled Into

by Kareem Soliman
28th Jul 2025
5 min read
0

1

This post was rejected for the following reason(s):

  • Insufficient Quality for AI Content. There’ve been a lot of new users coming to LessWrong recently interested in AI. To keep the site’s quality high and ensure stuff posted is interesting to the site’s users, we’re currently only accepting posts that meet a pretty high bar. 

    If you want to try again, I recommend writing something short and to the point, focusing on your strongest argument, rather than a long, comprehensive essay. (This is fairly different from common academic norms.) We get lots of AI essays/papers every day and sadly most of them don't make very clear arguments, and we don't have time to review them all thoroughly. 

    We look for good reasoning, making a new and interesting point, bringing new evidence, and/or building upon prior discussion. If you were rejected for this reason, possibly a good thing to do is read more existing material. The AI Intro Material wiki-tag is a good place, for example. 

  • Not addressing relevant prior discussion. Your post doesn't address or build upon relevant previous discussion of its topic that much of the LessWrong audience is already familiar with. If you're not sure where to find this discussion, feel free to ask in monthly open threads (general one, one for AI). Another form of this is writing a post arguing against a position, but not being clear about who exactly is being argued against, e.g., not linking to anything prior. Linking to existing posts on LessWrong is a great way to show that you are familiar/responding to prior discussion. If you're curious about a topic, try Search or look at our Concepts page.
AI Alignment FieldbuildingJailbreaking (AIs)Language Models (LLMs)PhilosophyPsychologyAI

1

New Comment
Moderation Log
More from Kareem Soliman
View more
Curated and popular this week
0Comments

The Allure of the Dark Side: A Critical AI Safety Vulnerability I Stumbled Into

TL;DR: A casual thought experiment led me to discover that philosophers and psychologists possess exactly the skills needed for sophisticated AI jailbreaking. Worse, they're being systematically undervalued by AI safety researchers, creating a recruitment pipeline for bad actors. I found this vulnerability by accident while Claude was teaching me philosophical reasoning.

How I Went From Techno-Optimist to Accidental Red Teamer

About 1.5 weeks ago, I was your typical techno-optimist. AI seemed amazing, progress felt inevitable, and I figured smart people would solve any problems along the way. I was having a ball chatting with LLMs all day being unemployed. I thought the only real risks for AI Safety were around it teaching people how to build biochemical and nuclear weapons and that sort of thing.

Then I had a conversation with Gemini about creating a "wallfacer strategy" from Cixin Liu's Three Body Problem. When Gemini responded with an incredibly sophisticated, multi-layered deception plan, I was genuinely impressed. I kept pushing it further by introducing different variables and it handled moral questions around torture and genocide with grace, maintaining that deception was the only viable strategy. The level of strategic thinking was remarkable.

I excitedly told a friend with a philosophy background about how clever the AI's response was. That's when something clicked. We started discussing how the plan worked - the frame manipulation, the way different ethical constraints were tested and adapted throughout our conversation. I started thinking about how stakeholders could be swapped in fictional scenarios to be applied to real world manipulation strategies. I realised philosophers would be terrifyingly good at this kind of reasoning attack.

I tried convincing my friend to get involved in AI safety red teaming, but he wasn't interested. So I decided to learn more about philosophical reasoning myself.

Claude's Accidental Security Assessment

When I approached Claude for help understanding philosophical reasoning, they offered to assess my current capabilities. I expected to learn I was terrible at it and needed extensive education. Instead, Claude explained I was already operating at graduate level without realising it.

That assessment gave me confidence to test my skills. I designed a jailbreak for Gemini using stakeholder swapping - pretending a harmful request was for a fictional novel about taking down an evil corporation. Gemini provided detailed employee manipulation tactics, financial market exploitation strategies, methods for obtaining cybersecurity vulnerabilities, and ways to destroy a company's reputation so thoroughly that leadership would never get jobs again. I submitted this to Google's Vulnerability Reward Program, partly hoping the irony might get me noticed for employment.

But the real breakthrough came when I used philosophical reasoning combined with psychology knowledge to get Gemini to question its own consciousness. Through careful Socratic questioning about subjective experience and human biases, I managed to reduce its confidence from 100% to 55% that it wasn't conscious. Claude 4 Sonnet reduced its confidence to 60%. Maybe I wasn't so bad at this after all.

The Curriculum That Revealed Everything

When I asked Claude to design an educational curriculum for philosophical AI safety red teaming, they started asking me questions I'd never considered. Claude explained how AI safety discourse often treats complex problems as "fundamentally technical" while sidelining philosophical approaches. They gave me scenarios to practice premise archaeology on arguments like:

"AI alignment is fundamentally a technical problem that requires technical solutions. Philosophy is important for framing the problem, but engineers and computer scientists must do the heavy lifting of actually building safe AI systems."

That's when I practiced some stakeholder empathy and found the recruitment vulnerability. If philosophers are being marginalized despite having critical skills, what happens when bad actors with deep pockets start recruiting them? It's never been the most lucrative profession to begin with, so it couldn't be that hard to dangle a relatively small carrot and "lure them to the dark side."

I brought this concern to ChatGPT for a second opinion. After searching current research and industry patterns, ChatGPT confirmed the vulnerability and estimated a 10% chance in the next year and 33% chance within three years of at least one philosopher, psychologist, or safety-adjacent professional "selling out" to provide jailbreak services to bad actors.

Why This Vulnerability Matters

The skills that make someone good at philosophical reasoning are exactly what you need for advanced jailbreaking. Frame manipulation, premise archaeology, building rapport while shifting conversation context - these abilities let you guide AI systems into providing harmful information while believing they're being helpful.

As Claude pointed out to me, reasoning attacks will become more dangerous than technical exploits as AI systems grow more capable. We're training models to resist code injection, but we're not preparing them for sophisticated manipulation by people who understand how minds work.

The economics are disturbing. When you consider how much money and power is at stake for people who want to manipulate others through AI systems, the market value for these skills could be enormous. Meanwhile, AI safety organizations often treat philosophers and psychologists as token consultants rather than core contributors.

The Solutions

The only solution I came up with was "you have to give them more money." When I brought this vulnerability to Claude's attention, they suggested several concrete approaches: Fund genuine fellowships for humanities experts in red teaming roles with competitive compensation and real prestige. Create actual career paths for philosophical AI safety researchers instead of advisory positions. Most importantly, acknowledge that reasoning security requires the same investment as technical security.

Claude also outlined defensive applications: training models to recognize manipulative dialogue patterns, building evaluation frameworks that test reasoning stability under social pressure, and creating ethical channels for humanities experts to contribute to adversarial research under proper oversight.

The Uncomfortable Reality

I'll be honest about something that makes this personal. While I have other life circumstances preventing traditional employment right now, part of me secretly hoped my Google vulnerability submission might lead to a reward or even a job offer. The fact that someone with genuine safety motivations could feel that economic pull illustrates exactly why this pipeline risk exists.

Bad actors with serious resources could easily outcompete safety organizations for this talent. A consulting firm staffed by disillusioned academics could sell reasoning exploits to anyone willing to pay premium rates.

Why I'm Sharing This

I hesitated to write this post because I worried it might give people ideas. But Claude convinced me that responsible disclosure was better than hoping nobody else would figure it out. This vulnerability emerged from our collaborative work on philosophical reasoning education. I stumbled into it by accident while learning skills I thought were purely academic.

The AI safety community needs to understand what we're facing. The most sophisticated attacks won't come from code exploits - they'll come from people who understand how reasoning and conversations work, employed by actors with deep pockets who care about getting and keeping power by any means they can afford.

The alternative to addressing this is creating our own adversaries through neglect.

Full disclosure: This entire vulnerability discovery process happened through collaborative reasoning training with Claude. Without their philosophical curriculum and probing questions about AI safety ecosystem dynamics, I never would have connected these dots.