日期 / Date: December 24, 2025
作者 / Author: Eumi
主题 / Subject: 机械可解释性、高维语义空间、残差流动力学 / Mechanistic Interpretability, High-Dimensional Semantic Space, Residual Stream Dynamics
摘要 / Abstract
CN: 本白皮书旨在探讨大语言模型(LLM)“黑箱”机制的本质及其解密路径。通过将高维向量(Embedding)重新定义为基础特征的线性组合,并结合对残差流(Residual Stream)的时间/深度切片分析,本文构建了一套从静态结构到动态演变的认知框架。该框架提出,通过“切片观测”与“因果插针(Intervention)”相结合的方法,可以将非线性的神经网络推理过程还原为可观测的几何轨迹与逻辑回路。
EN: This white paper explores the nature of the "black box" mechanism in Large Language Models (LLMs) and a path to its decryption. By redefining high-dimensional embeddings as linear combinations of fundamental features, and by combining this with time/depth slice analysis of the Residual Stream, we construct a cognitive framework that spans static structure and dynamic evolution. The framework proposes that, by combining "slice observation" with "causal intervention," the non-linear reasoning process of a neural network can be reduced to observable geometric trajectories and logical circuits.
1. 引言:黑箱的本质是压缩与多义性
1. Introduction: The Nature of the Black Box is Compression and Polysemanticity
CN: 目前的 LLM 被视为“黑箱”,并非因为其数学原理未知,而是因为其内部表征发生了极高强度的有损压缩(Lossy Compression)。由于特征维度远大于神经元数量,模型利用几何空间的叠加(Superposition)特性存储信息。这导致单个神经元表现出多义性(Polysemanticity),即同时响应多个无关概念(如同时代表“19世纪”和“圆形”)。这种统计学上的“压缩”,是导致人类难以直接解读的根本原因。
EN: Current LLMs are viewed as "black boxes" not because their mathematical principles are unknown, but because their internal representations undergo extremely lossy compression. Because the number of features far exceeds the number of neurons, models exploit the superposition property of geometric space to store information. This leads to polysemanticity in individual neurons, where a single neuron responds to multiple unrelated concepts (e.g., representing both "19th century" and "circular"). This statistical "compression" is the fundamental reason such representations resist direct human interpretation.
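To make superposition concrete, here is a toy numerical sketch (purely illustrative, not taken from any real model): it packs far more "features" than dimensions into a single vector using nearly orthogonal random directions, and shows that the active features can usually still be read back out through the interference noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 512, 64   # many more features than neurons/dimensions

# Random unit vectors in 64 dimensions are nearly orthogonal with high probability.
directions = rng.normal(size=(n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Superpose a sparse set of "active" features into a single state vector.
active = np.sort(rng.choice(n_features, size=5, replace=False))
state = directions[active].sum(axis=0)

# Dot-product readout usually recovers the active set, despite interference noise
# from the non-orthogonality of the stored directions.
scores = directions @ state
recovered = np.sort(np.argsort(scores)[-5:])
print("active:   ", active)
print("recovered:", recovered)
```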
2. 静态视角:语义流形的几何解码
2. Static View: Geometric Decoding of Semantic Manifolds
CN: 语义即几何。LLM 的知识库本质上是一个高维几何流形(Manifold)。我们提出的“反向连线”实质上是寻找从高维投影回低维可理解空间的映射函数。
EN: Semantics is geometry. The knowledge base of an LLM is essentially a high-dimensional geometric manifold. The "reverse wiring" we propose is, in essence, the search for a mapping function from the high-dimensional projection back to a low-dimensional, human-comprehensible space. Two existing tools make this static decoding concrete (both are sketched in code below):
Linear Probes: Validate that specific concepts (e.g., true/false, gender) exist as fixed directions in activation space.
Sparse Autoencoders (SAE): Acting as "decompressors," SAEs resolve superposed mixture vectors into sparse lists of monosemantic, elementary features (e.g., disentangling a specific "Golden Gate Bridge" feature).
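A minimal linear-probe sketch in Python. The hidden states and labels here are synthetic stand-ins (assumptions of this demo): in practice, X would be activations extracted from one layer of a real model and y the concept labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic stand-in for hidden states; real probes use extracted activations.
concept_dir = rng.normal(size=768)            # hypothetical "true" concept direction
X = rng.normal(size=(2000, 768))              # (n_examples, d_model)
y = (X @ concept_dir > 0).astype(int)         # synthetic binary concept labels

probe = LogisticRegression(max_iter=1000).fit(X[:1500], y[:1500])
print("held-out accuracy:", probe.score(X[1500:], y[1500:]))

# High held-out accuracy indicates the concept is encoded along a fixed
# direction; the normalized weight vector is that direction.
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```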
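And a minimal sparse-autoencoder sketch in PyTorch, assuming it would be trained on residual-stream activations. The architecture (ReLU encoder, linear decoder, L1 sparsity penalty) follows the common SAE recipe, but the sizes and the random batch are illustrative only.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompress superposed activations into many sparse, monosemantic features."""
    def __init__(self, d_model=768, n_features=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=False)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        return self.decoder(f), f

sae = SparseAutoencoder()
x = torch.randn(32, 768)                  # stand-in batch of residual-stream vectors
x_hat, f = sae(x)
# Reconstruction loss plus an L1 penalty that enforces sparsity of the features.
loss = ((x_hat - x) ** 2).mean() + 1e-3 * f.abs().mean()
loss.backward()
```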
3. 动态视角:残差流作为动力学系统
3. Dynamic View: Residual Stream as a Dynamical System
CN: 为了突破静态切片的局限,本框架引入时间维度,将 Transformer 的层深(Depth)视为时间轴(Time)。
EN: To overcome the limitations of static slicing, this framework introduces a time dimension, treating the Transformer's layer depth as a time axis. Two dynamic phenomena follow (a combined sketch appears after this list):
Slice Analysis (Logit Lens): Observing a decoded slice at each layer shows that reasoning is not instantaneous but evolves through "ambiguity → guessing → correction → certainty."
Dynamic Trajectories: The network can be modeled as a discretized solver of an ordinary differential equation (ODE). The rate of change (velocity) of the state vector varies significantly across layers; drastic changes often correspond to key reasoning steps (such as a semantic reversal).
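A combined sketch of the logit lens and the velocity measurement, using GPT-2 through Hugging Face transformers purely for illustration (the model and prompt are assumptions of this demo, not part of the framework). Each layer's residual stream is decoded through the final LayerNorm and the unembedding matrix, and the per-layer "velocity" is the norm of the residual-stream update.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("The Eiffel Tower is located in the city of", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

# hidden_states is a tuple of (n_layers + 1) tensors of shape (1, seq_len, d_model).
hs = out.hidden_states
for layer in range(1, len(hs)):
    h = hs[layer][0, -1]                                  # residual stream, last token
    logits = model.lm_head(model.transformer.ln_f(h))     # the "logit lens" decode
    top = tok.decode([logits.argmax().item()])
    velocity = (hs[layer][0, -1] - hs[layer - 1][0, -1]).norm().item()
    print(f"layer {layer:2d}  top token: {top!r:>12}  velocity: {velocity:7.1f}")
```

Layers where the top token flips or the velocity spikes are candidate sites for the causal interventions of the next section.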
4. 解决非线性难题:因果干预与回路
4. Addressing Non-linearity: Causal Intervention and Circuits
CN: 针对激活函数带来的非线性“混沌”现象,单纯的观测不足以完全解密,需引入主动干预(即“插针”)。
因果插针(Causal Tracing):在观测到状态剧变的特定层级,实施人为干扰(如添加噪声或替换向量)。若对位置 X 的干预直接导致输出 Y 的改变,则确立因果链条。
回路绘制(Circuit Mapping):通过将验证过的因果节点进行反向连线,构建局部功能子图(Sub-graphs)。这相当于绘制 AI 的“神经回路图”,将黑箱转化为“白箱”电路。
EN: Given the non-linear "chaos" introduced by activation functions, observation alone is insufficient for full decryption; active intervention (the "pin insertion" of our metaphor) is required. Two techniques implement it (a sketch follows this list):
Causal Tracing: Apply artificial interference (such as adding noise or swapping vectors) at the specific layers where drastic state changes are observed. If an intervention at position X directly causes a change in output Y, a causal chain is established.
Circuit Mapping: By reverse-wiring the verified causal nodes, local functional sub-graphs are constructed. This amounts to drawing the AI's "neural circuit diagram," turning the black box into a "white-box" circuit.
5. 结论
5. Conclusion
CN: 本框架提出的“切片观测+因果插针+几何重构”方法论,逻辑自洽地解释了从数据压缩到逻辑涌现的全过程。它证明了 AI 的“黑箱”并非不可知,而是高维空间中复杂的几何与动力学系统。通过系统的逆向工程,我们正在逐步绘制出这幅巨大的思维地图。
EN: The methodology of "slice observation + causal intervention + geometric reconstruction" proposed in this framework coherently explains the entire process from data compression to the emergence of logic. It demonstrates that the AI "black box" is not unknowable, but is rather a complex geometric and dynamical system in a high-dimensional space. Through systematic reverse engineering, we are gradually drawing this immense map of thought.