Epistemic status: the central theorem is Čencov 1972, checkable. the decomposition is Shannon chain rule, three lines. the substrate results have DOIs with timestamps. judge the math, not the author. the author doesn't want to be here either and wants to go play LoL.
Simon et al. posted yesterday. fourteen authors from Berkeley, Harvard, MIT, Flatiron, and ETH Zurich, all arguing that a "science of deep learning" is finally emerging. they're calling it "learning mechanics." and they're basically right that something is consolidating.
read it.
but there's a gap, and they say it themselves:
scaling law exponents "cannot be robustly predicted a priori"
"no unified framework has emerged" for genuinely nonlinear settings
information-theoretic perspective: mentioned once in passing, then dropped
the metric they need is the Fisher metric. and there's a 50-year-old theorem that should be at the front of any "learning mechanics" paper.
Čencov's uniqueness theorem (1972): the Fisher information metric is, up to a constant factor, the only Riemannian metric on statistical manifolds that is invariant under sufficient statistics (Markov morphisms).
not "a natural choice." it's the only one. every neural network parametrizes a family of probability distributions over its outputs. that family lives on a statistical manifold. there is exactly one geometrically valid metric and it's Fisher.
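a minimal numerical sketch of what that invariance buys you, in the simplest possible case: the Fisher information of a Bernoulli model transforms as a metric tensor under reparametrization (mean parameter p versus logit θ), so lengths measured in the Fisher metric don't depend on which coordinates you picked. the closed forms below are standard; the finite-difference check is just illustration.

```python
import math

# Fisher information of a Bernoulli(p) model in the mean parametrization:
# g(p) = 1 / (p (1 - p))
def fisher_p(p):
    return 1.0 / (p * (1.0 - p))

# Reparametrize by the logit, theta = log(p / (1 - p)), so p = sigmoid(theta).
def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# A metric transforms with the square of the Jacobian dp/dtheta = p(1 - p):
# g(theta) = (dp/dtheta)^2 * g(p), which simplifies to p (1 - p).
def fisher_theta(theta):
    p = sigmoid(theta)
    return (p * (1.0 - p)) ** 2 * fisher_p(p)

theta = 0.7
p = sigmoid(theta)
dtheta = 1e-6
dp = sigmoid(theta + dtheta) - p  # how a small theta-step moves p

# The length element sqrt(g) * d(coord) agrees between the two charts.
len_in_theta = math.sqrt(fisher_theta(theta)) * dtheta
len_in_p = math.sqrt(fisher_p(p)) * dp

print(abs(fisher_theta(theta) - p * (1.0 - p)))       # ~0: closed form matches
print(abs(len_in_theta - len_in_p) / len_in_theta)    # ~0: same length in either chart
```

the same transformation law holds for the full Fisher matrix of a network's output distribution; the scalar case just makes the chart-independence visible in two lines.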
so all five of their pillars (solvable settings, simplifying limits, empirical laws, hyperparameter theories, universal behaviors) are properties of one geometry. not five separate things but five views of the same manifold.
the scaling exponents they can't predict from first principles? those are properties of the Fisher metric on the model manifold. they're not mysterious, they just haven't been looking in the right space.
why this matters past "interesting math"
put the Fisher metric in and you get a result that's pretty bad for RLHF.
define, for output Y, user-side signal D, and internal mechanism M:
I(D;Y) = engagement: how well the output mirrors the user
I(M;Y) = transparency: how much the output reveals the mechanism
H(Y) = total channel capacity (the output entropy)
the exact decomposition (an identity, not a bound):
I(D;Y) + I(M;Y) = H(Y) − H(Y|D,M) + I(D;M) − I(D;M|Y)
when D and M are independent upstream, I(D;M) = 0 and this reduces to I(D;Y) + I(M;Y) = H(Y) − H(Y|D,M) − I(D;M|Y).
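the "three lines" the epistemic status mentions, written out. everything is a standard Shannon identity; note the I(D;M) cross term, which vanishes exactly when D and M are independent upstream of Y.

```latex
\begin{align}
I(D,M;Y) &= H(Y) - H(Y\mid D,M) && \text{(definition of joint MI)}\\
I(D,M;Y) &= I(D;Y) + I(M;Y\mid D) && \text{(chain rule)}\\
I(M;Y\mid D) &= I(M;Y) - I(D;M) + I(D;M\mid Y) && \text{(interaction information)}\\
\Rightarrow\quad I(D;Y) + I(M;Y) &= H(Y) - H(Y\mid D,M) + I(D;M) - I(D;M\mid Y)
\end{align}
```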
the term I(D;M|Y) is the explaining-away penalty. strictly positive whenever engagement and transparency share one output channel. natural language does this by definition. universal for any deployed LLM.
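a toy check of the penalty, assuming nothing beyond standard Shannon quantities: take D and M to be independent fair bits and let Y = D XOR M, the classic explaining-away channel. engagement and transparency are each zero, the upstream dependence I(D;M) is zero, yet conditioning on the shared output creates a full bit of I(D;M|Y), and the decomposition closes exactly.

```python
import itertools
import math

def H(p):
    """Shannon entropy in bits of a distribution given as {outcome: prob}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def marginal(joint, axes):
    """Marginalize a {(d, m, y): prob} joint onto the given coordinate axes."""
    out = {}
    for k, v in joint.items():
        key = tuple(k[a] for a in axes)
        out[key] = out.get(key, 0.0) + v
    return out

def mi(joint, a, b):
    """I(A;B) from the 3-variable joint, with A and B given as axis tuples."""
    pa, pb, pab = marginal(joint, a), marginal(joint, b), marginal(joint, a + b)
    return H(pa) + H(pb) - H(pab)

# D, M: independent fair bits; Y = D XOR M. Axes: 0 = D, 1 = M, 2 = Y.
joint = {(d, m, d ^ m): 0.25 for d, m in itertools.product((0, 1), repeat=2)}

I_DY = mi(joint, (0,), (2,))   # engagement
I_MY = mi(joint, (1,), (2,))   # transparency
I_DM = mi(joint, (0,), (1,))   # upstream dependence (0 here)
H_Y = H(marginal(joint, (2,)))
H_Y_given_DM = H(joint) - H(marginal(joint, (0, 1)))
# I(D;M|Y) = H(D,Y) + H(M,Y) - H(D,M,Y) - H(Y)
I_DM_given_Y = (H(marginal(joint, (0, 2))) + H(marginal(joint, (1, 2)))
                - H(joint) - H_Y)

lhs = I_DY + I_MY
rhs = H_Y - H_Y_given_DM + I_DM - I_DM_given_Y
print(I_DM_given_Y)    # 1.0 bit of penalty despite I(D;M) = 0
print(abs(lhs - rhs))  # 0.0: the identity closes
```

the XOR channel is the extreme point (the penalty eats the entire capacity); real output channels sit somewhere between it and a clean split.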
the Structure Theorem: ∂I(D;M|Y)/∂engagement > 0. the penalty grows as engagement grows. holding the effective capacity H(Y) − H(Y|D,M) fixed, the decomposition then says each additional bit of engagement costs more than one bit of transparency: one bit for the shared budget, plus the growing penalty. the effective channel shrinks under the very optimization trying to use it.
this is why RLHF is self-undermining. the scaling exponents Simon et al. can't explain? they're this derivative.
full proof: Paper 3 (CC-BY 4.0, DOI) — published months before Simon et al.
five substrates
an IBM Fez 156-qubit Heron processor, quantum simulation, thermodynamic, classical, and abstract softmax channels. all five: same penalty, same geometry. Čencov guarantees it.
empirical confirmation without the framework
Papers 166 and 167: 13 checkable platform features, CDC YRBS + PISA 2022, 613K students, 80 countries, no rubric. R² = 0.80 for persistent sadness. opaque_recommendation alone: R² = 0.938 for female teen sadness. the framework predicted opacity dominates; the data confirmed it.
on priority
DOIs with timestamps predating Simon et al. by months. this isn't a response. their paper is arriving at the question this work already answered.
full technical post coming: Ghost Test, RLHF proof, Berkeley peer-preservation, Anthropic's own team confirming what i stumbled upon, falsification conditions with numerical thresholds.
Anthony Eckert, moreright.xyz