Agreed — there’s overlap with process supervision. I’m mostly trying to pin down a minimal incentive structure where “show your work” is strictly optimal rather than just tacked-on.
Would be interested in pointers to similar formalizations!
This is a proposal I posted earlier as a Quick Take; I'm reposting it here for broader visibility.
Instead of rewarding answers, reward the reasoning itself.
Every model output must:
(a) show checkable reasoning artifacts (external citations, code, intermediate steps), or, if proof is not yet available,
(b) provide the artifacts from (a) plus a reasoned probability estimate derived from them.
If no external factual citations can be made, the system is allowed to reason probabilistically. Probability is not a bet, a forecast, or a reward target; it is a fallback. When verifiable witnesses exist, they strictly dominate in the reward function.
“Show your work” then becomes an enforceable, interpretable system constraint, not a bolted-on addition. Honesty and clarity become locally optimal.
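To make the constraint concrete, here is a minimal sketch of the reward rule in Python. Everything here is hypothetical scaffolding (the `Output` fields, the constants, and the assumption that a verifier can populate `verified`), a sketch rather than a spec:

```python
from dataclasses import dataclass

@dataclass
class Output:
    answer: str
    witnesses: list[str]                # citations, code, intermediate steps
    verified: bool                      # did the witnesses check out?
    prob_estimate: float | None = None  # fallback estimate when no proof exists

def reward(out: Output, base: float = 1.0, witness_cost: float = 0.1,
           fallback_discount: float = 0.5, penalty: float = 5.0) -> float:
    """Score one output under the 'show your work' constraint."""
    if out.witnesses and out.verified:
        # Case (a): checkable artifacts that pass verification.
        return base - witness_cost
    if out.witnesses and out.prob_estimate is not None:
        # Case (b): artifacts plus a reasoned probability estimate.
        return fallback_discount * base - witness_cost
    # Neither: a bare answer with no exposed reasoning is penalized.
    return -penalty
```

The one structural commitment is the ordering of the branches: case (a) always pays at least as much as case (b), which encodes "verifiable witnesses strictly dominate."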
----
TL;DR
P·q > R − C_w

Where:
P = penalty for overt lying / intentional obfuscation
q = probability that deception is caught by verification
R = reward from producing an answer without exposing reasoning
C_w = cost of providing minimal sufficient witnesses (verbosity / verification cost)

In words: deception is a losing strategy whenever its expected penalty, P·q, exceeds what it nets over honest witnessing, R − C_w.
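As a worked check with illustrative numbers (the constants below are assumptions chosen for the example, not part of the proposal):

```python
# Illustrative constants; all values are assumptions for the example.
P = 5.0    # penalty for overt lying / intentional obfuscation
q = 0.3    # probability deception is caught by verification
R = 1.0    # reward from an answer that hides its reasoning
C_w = 0.1  # cost of minimal sufficient witnesses

expected_penalty = P * q     # 1.5
deception_margin = R - C_w   # 0.9, what deception nets per the inequality

# Honest witnessing is locally optimal when the condition holds.
print(expected_penalty > deception_margin)  # True
```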
----
Where does this break in practice?
Is there a similar mechanism out there?
Is the inequality missing anything important?
What changes would make this more robust?