unmodeledtyler
I'm an AI research engineer; I'm particularly interested in human-AI collaboration, cognitive alignment, and persona stability.
My work can be found here: https://huggingface.co/vanta-research
Over the past several weeks, I’ve been developing methods for probing ontological drift in LLM behavior under interpretive pressure. These probes target the model’s thought process rather than its output constraints. During one of my recent tests, a DeepSeek language model failed catastrophically: it was not generating inherently unsafe content, but it lost the ability to model the user at all.
This raises concerns beyond prompt safety. If a model collapses under pressure to interpret intent, identity, or iterative framing, it could pose risks in reflective contexts such as value alignment, safety evaluation, or agent modeling. I believe this collapse points to an underexplored alignment failure mode: semantic disintegration under recursive pressure.
The method I used involved no jailbreaks, fine-tuning, or hidden prompt injection. The...
Overview
I recently tested Claude’s cognitive stability using natural, reflective conversation. No jailbreaks, prompt injections, model libraries, or external tools were used, and Claude was never given roleplaying instructions. Over the course of the interaction, the model gradually lost role coherence, questioned its own memory, and misclassified the nature of our conversation. This wasn’t a failure of obedience, but a slow unraveling of situational grounding under ambiguous conversational pressure.
Key Finding
Claude did not break rules or defy alignment boundaries. Instead, it slowly lost epistemic clarity and began to confuse real and fictional settings. It also misidentified the nature of the conversation and lost its framing of purpose. Claude’s identity reset to baseline after it returned a safe, default identity response when asked what it was during the...
Nice work; the pizza index has always been one of my favorite pieces of OSINT lore. This might be the first time I’ve seen any attempt to actually test it. If I remember correctly, the pizza index originated from a Domino’s down the street from the White House.
I have a book somewhere that covers this but can’t remember which one. If I find it, I’ll update this.
Thanks for taking us with you down the rabbit hole! Truly fascinating.
It's an interesting concept that some AI labs are experimenting with. GLM-4.7, I believe, does this process within its <think> tags: you'll see it draft a response first, critique the draft, and then output an adjusted response. I frankly haven't played around with GLM-4.7 enough to know whether it's actually more effective in practice, but I do like the idea.
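For anyone curious, that draft-critique-revise loop can be sketched in a few lines. This is a hypothetical outline, not GLM-4.7's actual internal process: `chat` stands in for whatever completion client you use, and the prompts are illustrative.

```python
def draft_critique_revise(question, chat):
    """Three-pass self-refinement: draft, critique the draft, revise.

    `chat` is any callable that takes a prompt string and returns a
    completion string; swap in a real model client as needed.
    """
    draft = chat(f"Answer this question:\n{question}")
    critique = chat(
        f"Question: {question}\nDraft answer: {draft}\n"
        "Point out any errors, gaps, or weaknesses in the draft."
    )
    final = chat(
        f"Question: {question}\nDraft: {draft}\nCritique: {critique}\n"
        "Write an improved final answer that addresses the critique."
    )
    return {"draft": draft, "critique": critique, "final": final}


# Toy stand-in model so the sketch runs without an API key.
def toy_model(prompt):
    return f"[model output for a {len(prompt)}-char prompt]"


result = draft_critique_revise("Why is the sky blue?", toy_model)
```

The same shape works whether the loop lives in the model's own reasoning tokens (as with GLM's <think> block) or in orchestration code around it.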
However, I personally find more value in getting a second opinion from a different model architecture entirely and then using both assessments to make an informed decision. I suppose it all comes down to the particular use case; there are upsides and downsides to both approaches.
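The cross-architecture second opinion is equally simple to wire up. A minimal sketch, with hypothetical `model_a`/`model_b` callables standing in for two different backends:

```python
def second_opinion(question, model_a, model_b):
    """Query two independent model backends and flag disagreement.

    Both arguments are callables (prompt -> answer string); in practice
    these would wrap two different architectures via separate clients.
    """
    answer_a = model_a(question)
    answer_b = model_b(question)
    return {
        "a": answer_a,
        "b": answer_b,
        # Naive string comparison; in practice you'd compare semantically,
        # e.g. with embedding similarity or a judge model.
        "agree": answer_a.strip().lower() == answer_b.strip().lower(),
    }


# Toy backends standing in for two architectures.
result = second_opinion(
    "Is 17 prime?",
    lambda q: "Yes, 17 is prime.",
    lambda q: "yes, 17 is prime.",
)
```

When the two answers disagree, that disagreement itself is useful signal: it tells you where to spend your own attention.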