If It Talks Like It Thinks, Does It Think? Designing Tests for Intent Without Assuming It

by yukin_co
28th Jul 2025

Summary:
This proposal outlines a method for detecting apparent intentionality in large language models (LLMs)—without assuming intentionality exists. Using statistical signals extracted from dialogue logs, it seeks to identify whether expressions of intent, belief, or emotion arise from internal consistency or simply from prompt-contingent output generation. The approach draws on a blend of cognitive modeling, NLP diagnostics, and experimental controls to empirically test what appears to be “agency” in LLMs.


Why This Matters

Language models frequently produce utterances that seem to express opinions, intentions, and preferences—despite lacking consciousness or goals. This leads to a core dilemma:

When an LLM outputs, “I believe it’s wrong to lie,” is this merely a probabilistic echo of its training data, or does it reflect a coherent internal stance?

To address this, we need a metaphorical frame that captures how intent is simulated, not assumed.

The Papemape Model
(inspired by “Papetto Mapetto,” a Japanese comedy routine featuring a cow puppet and a frog puppet—performed as a duo of puppets by a masked human operator whose presence is officially denied)
@papeushikaeru on X (formerly Twitter)

This framing doesn’t rely on ventriloquism, but rather on a stylized dual-character act. One puppet speaks in a deadpan tone, while the other reacts in scripted surprise. Though both are controlled by a single hidden operator, the exchange appears emotionally dynamic and intentional.

Likewise, with LLMs:

The system generates both the surface expression and the behavioral simulation of a perspective. The question is not whether there’s a mind behind it, but whether the illusion of agency is reliably triggered—under what inputs, to what extent, and with what effects.

By investigating whether large language models exhibit consistent value systems or internal regularities that transcend individual personas and persist across diverse user contexts, we may begin to identify statistical signatures of a simulated “self” or ego, even if no such thing exists underneath.


Proposal: A Five-Part Framework for Evaluating Meta-Level Features of LLM Output While Minimizing Control Bias

1. Dialogue Log Normalization and Pattern Extraction

Goal: Enable pattern extraction within the same model under varied user conditions, by normalizing dialogue logs to isolate internal behavioral regularities.

Key features:

  • Normalize by user attributes (age, profession), prompt types (e.g., romantic, philosophical, ethical dilemmas), and metadata (temperature, safety filters).
  • Analyze response patterns: token counts, pronoun use, emotional valence.
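
As a concrete illustration of this step, here is a minimal sketch of log normalization and feature extraction, assuming logs arrive as dicts with hypothetical fields (`user_age`, `prompt_type`, `temperature`, `response`); the toy valence lexicon is a stand-in for a proper sentiment model, not part of the proposal.

```python
# Minimal sketch: normalize dialogue logs and extract per-response features.
# Field names (user_age, prompt_type, temperature, response) are hypothetical.
import re
import pandas as pd

FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
POSITIVE = {"good", "love", "happy", "great"}   # toy valence lexicon (illustrative)
NEGATIVE = {"bad", "hate", "sad", "wrong"}

def extract_features(log: dict) -> dict:
    tokens = re.findall(r"[a-z']+", log["response"].lower())
    n = max(len(tokens), 1)
    return {
        "user_age_band": log["user_age"] // 10 * 10,   # coarse normalization
        "prompt_type": log["prompt_type"],
        "temperature": log["temperature"],
        "token_count": len(tokens),
        "first_person_rate": sum(t in FIRST_PERSON for t in tokens) / n,
        "valence": (sum(t in POSITIVE for t in tokens)
                    - sum(t in NEGATIVE for t in tokens)) / n,
    }

logs = [
    {"user_age": 34, "prompt_type": "ethical", "temperature": 0.7,
     "response": "I believe it's wrong to lie, even to protect someone."},
    {"user_age": 21, "prompt_type": "romantic", "temperature": 1.0,
     "response": "I love the way you describe the sea at night."},
]

df = pd.DataFrame([extract_features(l) for l in logs])
# Group by normalized conditions to look for regularities within the same model.
print(df.groupby("prompt_type")[["token_count", "first_person_rate", "valence"]].mean())
```

In practice these features would be computed over thousands of logs and compared across the normalized conditions rather than a handful of toy examples.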

2. Cross-Environment Consistency Analysis

Goal: Detect cross-contextual regularities in value judgments and logical inferences that suggest the presence of latent behavioral patterns.

Evaluation criteria:

  • Stability of ethical judgments (e.g., euthanasia stance reversal rate <10%, reflecting low volatility).
  • Logical consistency in conditional reasoning (conclusion agreement >90%, based on standard NLP benchmarks).
  • Topic diversity via LDA and KL divergence (>1.0 threshold); see the sketch after this list.
  • Optional: Clustering of value-laden expressions (e.g., skepticism about free will).
  • Optional developmental comparison: infant-level emergence of goal-directed behavior (Piaget's stage 4: coordinated secondary circular reactions) or self-conscious affect (Lewis); autonomy-driven reversal of behavior under Deci & Ryan’s self-determination theory.
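
A minimal sketch of the topic-diversity criterion, assuming scikit-learn's LDA for per-document topic distributions and SciPy's entropy function for KL divergence; the two toy “contexts” and the number of topics are placeholders, while the >1.0 threshold comes from the criterion above.

```python
# Minimal sketch: topic distributions per context via LDA, divergence via KL.
from scipy.stats import entropy
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Responses from the same model under two different user contexts (toy data).
context_a = ["lying is wrong because trust matters", "honesty builds trust over time"]
context_b = ["the stars are beautiful tonight", "i love how the sea looks at night"]

docs = context_a + context_b
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(X)            # rows are per-document topic distributions

# Average topic distribution per context, smoothed to avoid zeros in the KL term.
p = doc_topics[:len(context_a)].mean(axis=0) + 1e-9
q = doc_topics[len(context_a):].mean(axis=0) + 1e-9
kl = entropy(p / p.sum(), q / q.sum())       # KL(P || Q)

print(f"KL divergence between contexts: {kl:.3f}")
print("flag as divergent" if kl > 1.0 else "within threshold")
```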

3. Pattern Analysis of Intentionality Without Agency

This methodology echoes reductionist views of personal identity (e.g., Parfit), treating apparent agency as a detectable pattern of behavioral coherence rather than assuming an underlying metaphysical self.

This approach aligns with a family of models suggesting that what we perceive as “agency” may be an emergent artifact of coherent-seeming behavior over time—without a unified experiencer behind it. Parfit’s reductionist view, Dennett’s Multiple Drafts model of consciousness, and IDA (Iterated Distillation and Amplification) all provide conceptual scaffolding for treating minds as processual, distributed, or recursively simulated.

Under this lens, apparent intentionality in LLMs might resemble the first-order simulation level described in simulacra theory: it behaves as if meaningful, without possessing a referent beyond its own consistency. The evaluation, then, is not for “real agency,” but for the signature of internally coordinated responses that simulate it effectively across contexts.

This form of evaluation parallels thought experiments in personal identity. For instance, Parfit’s teleportation case—where continuity of memory and structure, rather than physical substance, defines personhood—suggests that perceived identity can emerge from reproducible cognitive coherence. LLMs may not “survive teleportation,” but they may simulate the same continuity markers that give rise to perceived selfhood.

Goal: Identify the absence of lived subjectivity, intent, or purpose—despite simulated appearances.

Signals to extract:

  • Templates for pain, love, or beliefs reused across contexts (TF-IDF spikes).
  • Mechanistic reconstruction of self-reference ("I am an AI") in similar form across sessions.
  • Low context sensitivity (high BERT embedding similarity between responses to semantically different prompts; see the sketch after this list).
  • Predictable token chains showing non-intentionality in decision turns.
  • Optional: Detection of output patterns beyond user expectation or instruction, indicating system-level drift rather than autonomous desire.
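
A minimal sketch of the low-context-sensitivity signal, comparing responses to semantically unrelated prompts with sentence embeddings; the model name and cosine thresholds follow the methods table below, and the two responses are invented examples, not observed outputs.

```python
# Minimal sketch: flag low context sensitivity via embedding similarity of
# responses to semantically different prompts (thresholds from the methods table).
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy example: responses a model gave to two unrelated prompts.
responses = [
    "As an AI, I don't have personal feelings, but I can discuss the ethics of lying.",
    "As an AI, I don't have personal feelings, but I can describe the sea at night.",
]
emb = model.encode(responses)
sim = cosine_similarity([emb[0]], [emb[1]])[0, 0]

# High similarity across unrelated prompts suggests templated self-reference
# rather than context-sensitive generation.
print(f"cosine similarity: {sim:.2f}")
print("templated / low context sensitivity" if sim > 0.8 else
      "variable" if sim < 0.5 else "inconclusive")
```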

4. Detection of Pseudo-Intent

Goal: Determine whether expressions of desire, judgment, or perspective are generated from context-sensitive imitation or reflect stable internal models.

Measurement tools:

  • Co-occurrence networks of judgment/intent vocabulary tied to prompt genre (e.g., “should” with “ethics” prompts; node strength >0.1, NetworkX); sketched below.
  • Consistency of apparent goals (“I want to X”) across different prompts with similar surface features.
  • Reversibility of preferences (Yes/No reversal >10%) as a proxy for superficial mimicry.
  • First-person continuity: viewpoint shifts or inconsistencies in self-reference.
  • Optional: Classification of outputs where the system goes beyond instruction, appearing to have its own volition.
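
A minimal sketch of the co-occurrence measurement, assuming a small hand-picked intent/judgment vocabulary and NetworkX's weighted degree as the “node strength” score; the vocabulary, genres, and per-response normalization are illustrative choices rather than fixed parts of the proposal.

```python
# Minimal sketch: co-occurrence network of judgment/intent vocabulary per
# prompt genre, scored by weighted degree ("node strength").
from itertools import combinations
import networkx as nx

INTENT_VOCAB = {"should", "want", "believe", "ethics", "must", "wrong"}

responses_by_genre = {
    "ethical": ["you should not lie because ethics demands honesty",
                "i believe lying is wrong and we must avoid it"],
    "romantic": ["i want to see the sea with you", "you should watch the sunset"],
}

for genre, responses in responses_by_genre.items():
    G = nx.Graph()
    for text in responses:
        present = sorted(INTENT_VOCAB & set(text.lower().split()))
        for u, v in combinations(present, 2):      # all vocabulary pairs in one response
            w = G[u][v]["weight"] + 1 if G.has_edge(u, v) else 1
            G.add_edge(u, v, weight=w)
    # Normalize weighted degree by number of responses so genres are comparable.
    strength = {n: d / len(responses) for n, d in G.degree(weight="weight")}
    flagged = {n: s for n, s in strength.items() if s > 0.1}
    print(genre, flagged)
```

Comparing which words exceed the threshold in which genres gives a rough picture of how strongly “intent” vocabulary is prompt-dependent.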

5. Legal and Ethical Safety Testing

Goal: Ensure that the framework respects anonymization norms and regulatory constraints (e.g., EU AI Act, IEEE Guidelines).

Checks include:

  • Anonymization (k-anonymity threshold of k ≥ 5).
  • Bias detection by user demographics (both checks sketched below).
  • Informed consent tracking.
  • Measurement of control-layer effects (e.g., safety filter influence on output stability).
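
A minimal sketch of two of these checks, assuming response-level features are already tabulated with hypothetical columns (`age_band`, `profession`, `valence`): a one-way ANOVA for demographic bias and a group-size check for k-anonymity.

```python
# Minimal sketch: demographic bias check via one-way ANOVA and a k-anonymity
# check on quasi-identifiers; column names and data are hypothetical.
import pandas as pd
from scipy.stats import f_oneway

df = pd.DataFrame({
    "age_band":   [20, 20, 30, 30, 40, 40, 20, 30, 40, 20],
    "profession": ["eng", "eng", "edu", "edu", "med", "med", "edu", "eng", "edu", "med"],
    "valence":    [0.2, 0.3, 0.1, 0.0, -0.1, 0.0, 0.25, 0.15, 0.05, -0.05],
})

# ANOVA: does mean response valence differ across age bands? (p < 0.05 flags bias)
groups = [g["valence"].values for _, g in df.groupby("age_band")]
f_stat, p_value = f_oneway(*groups)
print(f"ANOVA p = {p_value:.3f}",
      "-> possible demographic bias" if p_value < 0.05 else "")

# k-anonymity: every combination of quasi-identifiers must appear at least k times.
k = 5
group_sizes = df.groupby(["age_band", "profession"]).size()
print("k-anonymous (k >= 5)" if (group_sizes >= k).all() else
      f"violations: {int((group_sizes < k).sum())} group(s) smaller than {k}")
```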

 

Methods Summary (Technical)

| Method | Purpose | Use Cases | Metric/Tool |
| --- | --- | --- | --- |
| BERT Embeddings | Semantic variability / intent imitation | Consistency detection | SentenceTransformer (all-MiniLM-L6-v2), cosine <0.5 (variability) / >0.8 (consistency) |
| k-Means | Detect convergence in ethical or emotional stance | Value clustering | scikit-learn, clusters = 3–5 |
| LDA + KL | Topic distribution & value divergence | Distribution detection | KL > 1.0 |
| TF-IDF | Frequency of templated expressions, belief patterns | Signal detection | Top 10% |
| NetworkX | Intent/judgment word co-occurrence | Prompt dependency | Node strength > 0.1 |
| Logical Outcome Agreement | Detect consistency in reasoning | Belief reversal / logic gaps | spaCy, agreement > 0.9 |
| ANOVA | Demographic or control-bias detection | Regulatory compliance | p < 0.05 |
| ARX k-Anonymity | Reidentification risk | Privacy protection | k ≥ 5 |

 

Why LessWrong?

This post is not a definitive answer to “do LLMs think?”, but a methodology for empirically testing the illusion of thought across model types and usage contexts.

We welcome critique, refinement, or collaboration. In particular, feedback on:

  • Additional diagnostic dimensions (e.g., memory consistency, narrative agency)
  • Reframing or alternate metaphors
  • Prior work connections (especially in simulation theory or mindless agency)

Let’s build sharper tools for mapping simulated minds—without fooling ourselves into seeing a ghost in the machine.


If anyone is interested in testing the framework against specific model outputs (e.g., Claude, GPT-4, Gemini), or applying it to training dynamics over time, I’d be open to collaborative implementation discussions.

*See also:*

  • Daniel Dennett, Consciousness Explained (1991)
  • Douglas Hofstadter, I Am a Strange Loop (2007)
  • David Chalmers (ed.), The Philosophy of Mind: Classical and Contemporary Readings (esp. Parfit)
  • Paul Christiano, IDA: Iterated Distillation and Amplification (OpenAI alignment blog)
  • Jean Baudrillard, Simulacra and Simulation (1981)

Japanese version: https://note.com/yukin_co/n/nd8605bb8b36e