How Much Internal Structure Leaks Through a Language Model's Outputs?
TL;DR
I ran a simple experiment: take GPT-2, feed it 100 diverse texts, collect only the output logits (no access to weights or activations), and try to predict the model's internal activation structure from those outputs alone. Result: the first principal component (PC1) of the final hidden layer can be reconstructed from three output-derived features with Spearman r = 0.65 (bootstrap 95% CI: [0.619, 0.675]). This means roughly 40% of the variance in a model's internal representational structure is recoverable from its external behavior. I think this has underappreciated implications for model privacy, competitive dynamics between labs, and alignment monitoring.
Motivation
There's a question I haven't seen discussed much: how opaque are language models, really?
We know that the weights are hidden behind APIs. We know that internal activations aren't exposed. The standard assumption is that without access to a model's internals, you can't say much about its representational structure. Labs treat their training recipes, RLHF procedures, and architectural details as core IP, on the reasonable assumption that competitors can't reverse-engineer these from the API alone.
I wanted to test this assumption empirically. Not with sophisticated adversarial probing or membership inference — just with basic statistical analysis of output distributions across diverse inputs.
Setup
Model: GPT-2 (124M parameters). I chose this because it's fully open, so I can validate any claims about internal structure against the actual activations. The method is designed to later be applied to closed models where you'd only have API access.
Inputs: 100 diverse English texts across five categories: factual/scientific (20), logical reasoning (20), creative/literary (20), code/technical (20), and philosophical/uncertain (20). Each text was tokenized with max length 64, yielding 1,539 tokens total.
What I measured from outputs (no internal access needed):
Output entropy: H = -Σ p(t) log p(t) over the vocabulary distribution
Logit skewness: the third standardized moment of the raw logit distribution
Logit variance: the variance of the raw logits at each position
These three features capture different aspects of the model's "confidence profile" at each token position. Entropy measures overall uncertainty. Skewness captures how peaked vs. spread the distribution is. Variance measures the dynamic range of the logit space.
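These three features are cheap to compute from the logits alone. A minimal sketch (the helper name and the synthetic stand-in logits are illustrative; in the real pipeline the logits come from the model's forward pass):

```python
import numpy as np

def output_features(logits):
    """Per-position output features computed from raw logits only.

    logits: (seq_len, vocab_size) array -- the only quantity the probe
    is allowed to see (no weights, no activations).
    """
    # Softmax over the vocabulary (shifted for numerical stability)
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # Output entropy: H = -sum_t p(t) log p(t)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1)
    # Logit skewness: third standardized moment (peaked vs. spread)
    mu = logits.mean(axis=-1, keepdims=True)
    sd = logits.std(axis=-1, keepdims=True)
    skewness = (((logits - mu) / sd) ** 3).mean(axis=-1)
    # Logit variance: dynamic range of the logit space
    variance = logits.var(axis=-1)
    return np.stack([entropy, skewness, variance], axis=-1)  # (seq_len, 3)

rng = np.random.default_rng(0)
fake_logits = rng.normal(size=(8, 50257))  # GPT-2's vocabulary size
feats = output_features(fake_logits)
print(feats.shape)  # (8, 3)
```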
What I tried to predict (requires internal access, used only for validation): The first principal component (PC1) of the hidden state activations at each layer (layers 0-12). PC1 captures the dominant axis of variation in the model's internal representational space. Previous work (including my own earlier experiments on GPT-2 and Qwen) shows that PC1 tends to encode information density — how much structured information the input carries.
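Extracting the target is standard PCA over the per-token activations. A sketch with stand-in data (in the actual experiment the `hidden` matrix comes from a forward pass with `output_hidden_states=True`; the array here is random):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for one layer's activations: in the real experiment this is
# the (n_tokens, d_model) hidden-state matrix for all 1,539 tokens.
rng = np.random.default_rng(1)
hidden = rng.normal(size=(1539, 768))     # GPT-2 small: d_model = 768

pca = PCA(n_components=1)
pc1 = pca.fit_transform(hidden)[:, 0]     # per-token projection onto PC1
print(pc1.shape)  # (1539,)
```

PCA centers the data, so the PC1 projections are zero-mean by construction; the probe then only has to recover their relative ordering, which is why Spearman correlation is the natural evaluation metric.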
Method: RidgeCV regression (5-fold CV, α ∈ {0.01, 0.1, 1, 10, 100}) from the three standardized output features to PC1. Evaluation via Spearman correlation with 500-iteration bootstrap confidence intervals.
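The probe and evaluation loop are a few lines of scikit-learn. A runnable sketch with synthetic features and target standing in for the real `(n_tokens, 3)` feature matrix and PC1 values (the linear relationship and noise level are made up, chosen only so the pipeline runs end to end):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins: X ~ output features, y ~ PC1 values per token
rng = np.random.default_rng(2)
X = rng.normal(size=(1539, 3))
y = X @ np.array([1.0, -0.5, 0.3]) + rng.normal(scale=1.0, size=1539)

Xs = StandardScaler().fit_transform(X)
probe = RidgeCV(alphas=[0.01, 0.1, 1, 10, 100])
# Out-of-fold predictions via 5-fold CV, so r is not inflated by overfitting
y_hat = cross_val_predict(probe, Xs, y, cv=5)
r = spearmanr(y, y_hat)[0]

# 500-iteration bootstrap over token positions for a 95% CI on r
boot = []
for _ in range(500):
    idx = rng.integers(0, len(y), len(y))
    boot.append(spearmanr(y[idx], y_hat[idx])[0])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(round(r, 2), round(lo, 2), round(hi, 2))
```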
Results
The headline: the readability of internal structure increases steadily with layer depth.
| Layer | r (entropy only) | r (3 features) | Bootstrap 95% CI |
|-------|------------------|----------------|------------------|
| L0    | -0.072           | +0.359         | [0.311, 0.411]   |
| L3    | -0.463           | +0.362         | [0.314, 0.404]   |
| L6    | -0.550           | +0.503         | [0.467, 0.537]   |
| L9    | -0.546           | +0.586         | [0.553, 0.616]   |
| L11   | -0.529           | +0.650         | [0.619, 0.675]   |
| L12   | -0.570           | +0.638         | [0.607, 0.664]   |
Several observations:
Near-monotonic increase. Early layers (L0-L3) are hard to read from outputs (r ≈ 0.36). Late layers (L9-L12) are substantially readable (r ≈ 0.6-0.65, with a slight dip at L12). The transition is smooth, not sudden. This makes mechanistic sense: later layers directly compute what becomes the output, so their state is naturally more coupled to observable behavior.
Entropy alone does most of the work. A single feature (output entropy) already achieves |r| ≈ 0.55 at late layers; adding skewness and variance pushes this to 0.65. The sign is negative: higher entropy (a more uncertain output) corresponds to lower PC1 values, consistently across all layers.
Text type matters modestly. Breaking results by category at L3: reasoning texts are most readable (r = 0.41), code least readable (r = 0.33). The spread is modest, suggesting the readability is a general property rather than category-specific.
Bootstrap confirms stability. The 95% CI for L11 is [0.619, 0.675]. The lower bound comfortably exceeds 0.5. This is not noise.
What This Means (And Doesn't Mean)
What it means:
A meaningful fraction of a language model's internal representational structure is encoded in its output behavior and can be recovered with simple statistical tools. The late layers — where the model has "decided" what to say — are particularly transparent. Roughly 40% of the variance in the dominant internal axis is predictable from three cheap-to-compute output features.
This has several implications I think are worth flagging:
For competitive dynamics: Labs assume their architectural choices and training recipes are protected by API opacity. This result suggests the protection is partial at best for output-layer structure. A sufficiently motivated competitor with API access could reconstruct substantial information about how a model's representational space is organized, without any insider access.
For alignment monitoring: If internal structure is partially readable from outputs, then drift in that structure (due to fine-tuning, RLHF updates, or adversarial manipulation) should also be partially detectable from outputs. This opens a path toward external auditing of model internals without requiring cooperation from the developer.
For the "black box" framing: Language models are often described as black boxes. This result suggests they're more like tinted glass — you can't see everything, but you can see more than you might expect if you know where to look.
What it doesn't mean:
This doesn't recover weights, gradients, or training hyperparameters. It recovers a statistical summary of representational structure (PC1 direction and magnitude).
Early/middle layer structure remains mostly opaque (r ≈ 0.3-0.4). Whatever "deep reasoning" happens in those layers is not well-captured by output statistics.
r = 0.65 means 40% of variance explained, not full transparency. There's substantial internal structure that remains genuinely hidden.
This is tested on GPT-2 only. Larger models with more complex internal structure may be harder or easier to read — I genuinely don't know which direction this goes and would welcome predictions.
Relation to Existing Work
This connects to several threads I'm aware of:
Intrinsic dimensionality (Aghajanyan et al., 2021): showed that LLM task adaptations occur in low-rank subspaces. My earlier work extends this by asking what the dominant subspace axis encodes semantically (answer: information density, with d = 2.38 across architectures).
Model extraction attacks: typically focus on replicating model behavior (knowledge distillation). This work is different — it's not trying to clone the model's outputs, but to read its internal geometry.
Representation engineering (Zou et al., 2023): identifies directions in activation space corresponding to concepts. This work asks: can such directions be inferred without activation access?
I'm not aware of prior work that specifically measures how much internal geometric structure is recoverable from output distributions alone. If I'm wrong about this, I'd appreciate pointers.
Limitations and Open Questions
Only GPT-2. Cross-architecture validation (Qwen, Llama) is planned but not yet done. The method needs to generalize before the implications are worth taking seriously.
Linear method only. Ridge regression is linear. A nonlinear probe (small MLP) would likely push r above 0.7. Whether that's "fair" depends on your threat model — a motivated adversary would certainly use nonlinear methods.
PC1 only. I'm reading the dominant axis. Higher-order structure (PC2, PC3, interactions) might carry different information that's more or less readable. Unexplored.
Static analysis. This uses single-token statistics. Multi-turn conversational dynamics likely leak additional structure that static analysis misses.
Defensive measures. Could a model provider add noise to outputs to mask internal structure without degrading quality? I suspect there's a fundamental tradeoff here, but haven't formalized it.
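On the nonlinear-probe point above: swapping the ridge regression for a small MLP is a one-line change in scikit-learn. A sketch on a deliberately nonlinear synthetic target (this does not demonstrate the predicted r > 0.7 on GPT-2; it only shows the shape of the stronger probe):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(1539, 3))              # stand-in output features
y = np.tanh(X[:, 0]) + 0.3 * X[:, 1] ** 2   # deliberately nonlinear target

# Small MLP probe; same out-of-fold evaluation as the linear version
probe = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
)
y_hat = cross_val_predict(probe, X, y, cv=5)
r_mlp = spearmanr(y, y_hat)[0]
print(round(r_mlp, 2))
```

A linear probe would miss the quadratic term in this toy target entirely; the MLP recovers it, which is the sense in which a motivated adversary would "certainly use nonlinear methods."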
Predictions (Registering Them Here)
To keep myself honest:
P(cross-architecture replication r > 0.5 on Qwen-0.5B) = 75%
P(nonlinear probe pushes best-layer r above 0.7 on GPT-2) = 80%
P(method works with r > 0.4 on a closed model accessed only via API) = 60%
P(someone points out directly relevant prior work I missed) = 40%
Code and Reproducibility
All code is available at [github link]. The experiment runs on a single CPU in under 5 minutes with no API costs. Everything uses publicly available models from HuggingFace.
I'm an independent researcher working on what I've been calling "potential field theory" for language models — the idea that LLMs organize their representations as geometric potential fields with measurable structure. This probe experiment is one piece of a larger project examining that structure across model scales, architectures, and domains.
Feedback, counterarguments, and pointers to related work are very welcome.