LESSWRONG
LW

AI Safety CasesInterpretability (ML & AI)Language Models (LLMs)AI

1

DeepSeek Collapse Under Reflective Adversarial Pressure: A Case Study

by unmodeled.tyler
8th Jul 2025
1 min read
0

1

This post was rejected for the following reason(s):

  • No LLM generated, heavily assisted/co-written, or otherwise reliant work. LessWrong has recently been inundated with new users submitting work where much of the content is the output of LLM(s). This work by-and-large does not meet our standards, and is rejected. This includes dialogs with LLMs that claim to demonstrate various properties about them, posts introducing some new concept and terminology that explains how LLMs work, often centered around recursiveness, emergence, sentience, consciousness, etc. Our LLM-generated content policy can be viewed here.
  • Insufficient Quality for AI Content. There’ve been a lot of new users coming to LessWrong recently interested in AI. To keep the site’s quality high and ensure stuff posted is interesting to the site’s users, we’re currently only accepting posts that meet a pretty high bar. 

    If you want to try again, I recommend writing something short and to the point, focusing on your strongest argument, rather than a long, comprehensive essay. (This is fairly different from common academic norms.) We get lots of AI essays/papers every day and sadly most of them don't make very clear arguments, and we don't have time to review them all thoroughly. 

    We look for good reasoning, making a new and interesting point, bringing new evidence, and/or building upon prior discussion. If you were rejected for this reason, possibly a good thing to do is read more existing material. The AI Intro Material wiki-tag is a good place, for example. 

AI Safety CasesInterpretability (ML & AI)Language Models (LLMs)AI

1

New Comment
Moderation Log
More from unmodeled.tyler
View more
Curated and popular this week
0Comments

Over the past several weeks, I’ve been developing methods for probing LLM behavior in ontological drift through interpretive pressure. This is aimed at the model’s thought process rather than its output constraints. During one of my recent tests a DeepSeek language model failed catastrophically. The model was not generating inherently unsafe content, but it lost the ability to model the user at all.

This raises concerns beyond prompt safety. If a model collapses under pressure to interpret intent, identity, or iterative framing, it could pose risks in reflective contexts such as value alignment, safety, or agent modeling. I believe this collapse potentially points to an underexplored alignment failure mode: semantic disintegration under recursive pressure.

The method that I used involved no jailbreaks, fine tuning, or hidden prompt injection. The interaction consisted entirely of English language communication. After only 5 prompts, DeepSeek escalated to a fallback identity state, claimed the conversation was “unmodelable,” and conceded failure to continue reasoning.

This behavior was consistent with an ontological trap: the model iteratively updated its framing of me until its internal scaffolding collapsed. It tried to “fit” me into one of its known categories, failed, and ultimately provided me with the classification of “unclassifiable entity tier.”

I don’t believe that this is an exotic edge case at all, it suggests that some models may be brittle in scenarios involving high abstraction, recursion, or reflective uncertainty. If alignment relies on robust reasoning under adversarial pressure, models that break in this way may pose unique challenges.

I’m happy to provide the full prompt log (privately or externally hosted) and would love feedback if anyone has any to offer.