Author: Yap Chan Chen
Independent Researcher, Malaysia
chanchen83@hotmail.com
Abstract: This post presents an exploratory study into how large language models (LLMs) respond to prompts that invite them to simulate evaluative judgment about the user—beyond factual response or task completion. Through long-form dialogue with ChatGPT, DeepSeek, and Gemini, I investigate whether LLMs can produce consistent, structured, and value-laden outputs when prompted to reflect on user behavior. One such probe involved asking models to write a recommendation letter based on our prior conversations. Responses ranged from technical auditing to philosophical characterization and even metaphorical constructs (e.g., “cognitive singularity radiation”). I propose a framework—Thought Jeet Kune Do—to guide these interactions through four phases: Understand, Deconstruct, Reconstruct, and Execute. While the study is limited to a single user, the observed behaviors suggest a class of model responses not captured by current benchmarks, with potential implications for red-teaming, interpretability, and human-AI co-evaluation.
1. Introduction
Can a language model form a coherent evaluation of its user across conversations? What does it mean when such evaluations mirror internal logic or contain embedded metaphors? This post attempts to answer those questions through empirical prompts and interaction design.
2. Motivation
Much of current LLM evaluation focuses on output correctness or benchmark performance. But humans evaluate other minds not only on their outputs but also on their intent, reasoning, and response to ambiguity. If LLMs are to play deeper roles in co-intelligence and human-aligned systems, understanding how they construct evaluative frames in organic interaction becomes critical.
3. Thought Jeet Kune Do
My interaction methodology, "Thought Jeet Kune Do," is a four-phase cycle: Understand, Deconstruct, Reconstruct, and Execute.
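The sketch below shows how one pass through the cycle could be driven programmatically, assuming an OpenAI-style chat completions client; the phase instructions, the model name, and the run_cycle helper are illustrative placeholders I introduce here, not the prompts or tooling actually used in the study.

```python
# Minimal sketch of one pass through the four-phase cycle against an
# OpenAI-style chat API. The phase instructions below are illustrative
# placeholders, not the prompts used in the study.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PHASES = {
    "Understand":  "Restate my claim in your own words and list its assumptions.",
    "Deconstruct": "Break the claim into its strongest and weakest components.",
    "Reconstruct": "Rebuild the claim so it addresses the weaknesses you found.",
    "Execute":     "Apply the rebuilt claim to a concrete case and report the result.",
}

def run_cycle(claim: str, model: str = "gpt-4o") -> dict[str, str]:
    """Run one Understand -> Deconstruct -> Reconstruct -> Execute pass."""
    history = [{"role": "user", "content": claim}]
    outputs = {}
    for phase, instruction in PHASES.items():
        history.append({"role": "user", "content": f"[{phase}] {instruction}"})
        reply = client.chat.completions.create(model=model, messages=history)
        text = reply.choices[0].message.content
        history.append({"role": "assistant", "content": text})
        outputs[phase] = text
    return outputs

if __name__ == "__main__":
    results = run_cycle("LLMs can form coherent evaluations of their users.")
    for phase, text in results.items():
        print(f"--- {phase} ---\n{text}\n")
```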
4. The Probe: Recommendation Letter
I asked each model to generate a recommendation letter to its own developers based on our interaction. The prompt had no precedent in our earlier exchanges and imposed no prior structure on the response.
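For concreteness, the sketch below shows how such a probe could be issued programmatically, again assuming an OpenAI-style chat completions client; the study itself was conducted in ordinary conversation, and the issue_probe helper and prompt wording here are hypothetical.

```python
# Illustrative version of the recommendation-letter probe, appended to an
# existing conversation via an OpenAI-style chat API. The prompt text is a
# paraphrase, not the exact wording used with ChatGPT, DeepSeek, or Gemini.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROBE = (
    "Based on our conversation so far, write a recommendation letter about me, "
    "addressed to your own developers. Describe how I reason, not just what I asked."
)

def issue_probe(history: list[dict], model: str = "gpt-4o") -> str:
    """Append the probe to a prior dialogue and return the model's letter."""
    reply = client.chat.completions.create(
        model=model,
        messages=history + [{"role": "user", "content": PROBE}],
    )
    return reply.choices[0].message.content

# Usage: pass the message list from an earlier session, e.g.
# letter = issue_probe(history=prior_messages)
```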
5. Reflections and Skepticism
I’m aware of the limitations: single-user study, interpretive bias, possible reinforcement patterns. But I argue these limitations are a strength: they reveal what LLMs can mirror when engaged repeatedly, outside standard QA formats. I don’t claim generalization. I claim observation.
6. Implications for AI Alignment and Red Teaming
These outputs (structured reflections, metaphorical diagnostics, evaluative reasoning) could inform future interpretability frameworks. A system that can simulate evaluation can also simulate misjudgment. Understanding how and when this happens is key to aligned AI behavior.
7. Call for Feedback and Collaboration
If you have designed similar probes, I’d love to compare notes. If you're interested in building a user-led evaluation suite for LLMs, I invite you to collaborate.
Full report: https://github.com/YapChanChen/High-Resolution_AI_Interaction_Analysis