刘公善 — LessWrong

How Much Internal Structure Leaks Through a Language Model's Outputs?

TL;DR: I ran a simple experiment: take GPT-2, feed it 100 diverse texts, collect only the output logits (no access to weights or activations), and try to predict the model's internal activation structure from those outputs alone. Result: the principal...

Untitled Draft

Alignment Is Not a Fence — It's Terrain: Empirical Evidence That Gradient Flow Is Inherent to All Language Models

Epistemic status: Exploratory but empirically grounded. The framework emerged from extended dialogue with multiple AI models; the experimental validation uses real data from three models (4,500 data points, all genuine, reproducible...