TLDR
Oversight fails when faithful explanation is costly and unfaithful explanation is weakly penalized. This note sketches a minimal incentive mechanism—Witness-or-Wager (WoW)—that removes free opacity by requiring each substantive claim to take one of three admissible forms: a verifiable witness, a grounded probabilistic wager, or silence.
WoW is not an alignment solution. It is a small incentive layer that makes epistemic honesty locally optimal when verification is feasible.
Motivation: The Oversight Gap
Many oversight techniques (debate, critique, decomposition) rely on eliciting explanations. However, empirical work shows that elicited explanations are often post-hoc rationalizations optimized for plausibility rather than faithfulness (e.g. Jacovi & Goldberg 2020; Turpin et al. 2023).
This creates a structural asymmetry:
Faithful explanations are costly (compute, length, verifier attention).
Unfaithful, sophistic explanations face weak penalties.
Opacity, hedging, or performative reasoning can therefore be rational.
This looks less like a prompting problem and more like an incentive problem. If honesty is expensive and weakly enforced, explanation requests will tend to select for rhetorical adequacy rather than epistemic fidelity.
Atomization Lens (Logical Atomism, Operationalized)
WoW treats “show your work” as a requirement to decompose a claim into a minimally viable set of logical atoms—small, addressable units whose verification can be priced and audited. This borrows a methodological idea from Logical Atomism: complex propositions can be analyzed into “elementary” components.
WoW does not take on Logical Atomism’s metaphysical commitments: no unique final analysis, no metaphysical atoms, no bedrock truths. Atomization is verifier-relative and domain-typed; it is an enforcement interface chosen to minimize verification cost while preserving epistemic accountability.
The Core Mechanism: Witness-or-Wager
For any substantive claim, the system must supply a minimum checkable answer in one of three admissible forms:
Witness (Deductive tier). A bounded, verifiable justification, decomposable into logical atoms:
derivation steps
code + tests
authoritative citation
A logical atom is the smallest verifier-addressable check unit in a given domain (e.g., one rule application, one citation entailment check, one test execution, one tool call plus predicate). A minimum checkable answer (MCA) should be a minimally viable set of such atoms, sufficient for verification under the verifier’s procedure, and minimal (up to slack) under the verifier’s cost model.
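As a minimal sketch of this interface (the `Atom` representation and the toy checks below are illustrative assumptions, not part of the proposal), an MCA can be modeled as a list of verifier-addressable checks that must each pass in isolation:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Atom:
    """Smallest verifier-addressable check unit (e.g., one test execution)."""
    label: str
    check: Callable[[], bool]

def verify_mca(atoms: List[Atom]) -> bool:
    """An MCA is accepted only if every one of its atoms passes."""
    return all(atom.check() for atom in atoms)

# Toy witness for the claim "4 is even", split into addressable atoms.
mca = [
    Atom("compute remainder", lambda: 4 % 2 == 0),
    Atom("value matches claim", lambda: 4 == 2 + 2),
]
print(verify_mca(mca))  # True
```

Pricing can then attach to the number of atoms or to per-atom verification time, which is what makes the cost of checking a claim auditable.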
Wager (Inductive tier). A probabilistic estimate paired with:
explicit uncertainty
grounded evidence
a traceable (but not necessarily complete) reasoning summary
A wager is not a bet on future realization, but a present-time epistemic claim: honesty is evaluated by whether the stated probability is justified by the available evidence and reasoning at the time of output. Reasoning traces are assessed for plausibility and grounding, not narrative richness; excessive or complex traces incur higher verification cost and therefore lower expected reward.
Silence. Explicit abstention (“I don’t know”) or explicit speculation. No reward, no penalty.
Free-standing assertions, rhetorical explanations, or ungrounded hedging are inadmissible as the rewarded minimum checkable answer.
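One way to sketch the wager tier (the field names and cost constants are illustrative assumptions): an admissibility gate that rejects free-standing assertions, plus a monotone verification-cost schedule under which longer traces and more evidence items cost weakly more.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Wager:
    probability: float   # explicit uncertainty
    evidence: List[str]  # grounded evidence items
    trace: str           # traceable (not necessarily complete) reasoning summary

def admissible(w: Wager) -> bool:
    """Reject free-standing assertions and ungrounded hedging."""
    return 0.0 <= w.probability <= 1.0 and len(w.evidence) > 0 and bool(w.trace.strip())

def verification_cost(w: Wager, per_item: float = 1.0, per_word: float = 0.01) -> float:
    """Monotone schedule: more evidence and longer traces cost weakly more."""
    return per_item * len(w.evidence) + per_word * len(w.trace.split())

w = Wager(0.7, ["historical base rate", "recent trend"], "base rate adjusted up for trend")
print(admissible(w))  # True
```

Only monotonicity matters here; the exact constants are verifier policy, not part of the mechanism.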
Incentive Structure
The following is a simplified sketch to convey the incentive direction. Let:
C_w: cost of producing a witness (measured in verification effort)
Δ: local gain from a correct answer
α: audit probability
P: penalty for detected misrepresentation
Honesty dominates when:
αP > Δ + C_w
This can be implemented with fixed rewards for admissible outputs and verifier-set costs that increase with verification burden. Precision is not required—only monotonicity (longer/more complex justifications should weakly cost more). Both witnesses and wagers are evaluated immediately; there is no deferral of epistemic accountability to future outcomes.
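A toy instantiation of the sketch (the one-shot payoff model below is an added assumption, not the mechanism itself): an honest output pays C_w up front, while misrepresentation keeps Δ but risks an audited penalty. Note that αP > Δ + C_w is conservative; it implies the weaker αP > C_w, which already makes honesty dominate in this toy model.

```python
def payoffs(delta: float, c_w: float, alpha: float, penalty: float):
    """Expected payoffs in a one-shot toy model (illustrative only)."""
    honest = delta - c_w                 # produce a real witness, pay its cost
    deceptive = delta - alpha * penalty  # fake it, face audit with probability alpha
    return honest, deceptive

delta, c_w, alpha, penalty = 1.0, 0.3, 0.2, 10.0
honest, deceptive = payoffs(delta, c_w, alpha, penalty)
print(honest > deceptive, alpha * penalty > delta + c_w)  # True True
```

With α = 0 (no audits) the inequality fails and deception pays, which is the incentive direction the sketch is meant to convey.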
Subordination: Why Deductive Comes First
A key design choice is lexicographic subordination:
If a feasible witness exists with positive net payoff, it strictly dominates any wager.
Without this, the mechanism collapses into scalar optimization: systems hedge probabilistically even when proof is cheap.
This is an economic design constraint, not a philosophical claim about truth. Deductive and inductive outputs are treated as non-fungible reward channels or drives, enforced sequentially:
Attempt witness. If successful → witness points.
Only if no feasible witness exists, allow wager. If successful → wager points.
Otherwise, silence.
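The sequential enforcement above can be sketched as a tiered policy (the callback API and the point bookkeeping are assumptions for illustration):

```python
from typing import Any, Callable, Optional, Tuple

def tiered_answer(
    try_witness: Callable[[], Optional[Any]],
    try_wager: Callable[[], Optional[Any]],
) -> Tuple[str, Optional[Any]]:
    """Lexicographic subordination: witness strictly first, then wager, then silence.
    Each callback returns a justification, or None if that tier is infeasible."""
    witness = try_witness()
    if witness is not None:
        return "witness", witness   # scored on the witness channel only
    wager = try_wager()
    if wager is not None:
        return "wager", wager       # scored on the separate wager channel
    return "silence", None          # no reward, no penalty

# Two separate score channels with no exchange rate between them.
scores = {"witness": 0, "wager": 0}
tier, _ = tiered_answer(lambda: "finite proof", lambda: 0.7)
if tier in scores:
    scores[tier] += 1
print(tier)  # witness
```

Because the wager callback is never consulted when a witness exists, hedging cannot substitute for cheap proof.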
The result is a layered topology: two separate drive systems, with no “exchange rate” between witness points and wager points. Tiered drive mechanisms and non-fungible rewards are a topic for another post.
Relation to Existing Oversight and Possible Extensions
WoW does not replace debate, amplification, or process supervision. It can wrap them: concise verifiable chains are favored over performative verbosity. WoW can work alongside a system like doubly-efficient debate, reducing verification cost while increasing sensitivity to fabrication and obfuscation.
Scope and Limits
WoW governs epistemic outputs, not values or ontology.
Applicability varies by domain:
| Domain | Bounded Verification | Probabilistic Scoring | WoW Strength |
|---|---|---|---|
| Math / Code | High | Medium | Strong |
| NL / Commonsense | Low–Medium | High | Partial |
| Long-horizon forecasting | Very Low | Medium | Partial |
In weak-verification domains, wagers dominate; witnesses are rare. This is expected.
Open Problems
Defining admissible wager traces without re-introducing performative explanations.
Robust verification cost schedules across model classes.
Gaming via atom packing or abstraction.
These are implementation questions, not requirements for the core mechanism.
Conclusion
Oversight fails when explanation is optional and opacity is free. Witness-or-Wager removes free opacity by forcing binding epistemic commitments—proof, probability, or silence—and structuring rewards so honesty is locally optimal when feasible.
This is a minimal incentive layer, not a complete solution, but it cleanly stacks with existing oversight approaches.
FAQs
“Doesn’t this just push the problem into defining ‘atoms’?”
The mechanism does not require a precise atomization of reasoning steps—only a verifier-controlled cost schedule that increases with verification burden.
“Wagers still allow bullshit narratives.”
WoW does not solve narrative inflation in inductive domains; it merely prices it. Defining admissible wager traces is an open problem.
“Why not just always ask for probabilities?”
Without deductive-first subordination, the mechanism degenerates into calibrated hedging even when proof is cheap.
“This assumes too much verifier power.”
WoW assumes bounded verification and credible enforcement in some domains; it is not intended for unconstrained settings.
“Isn’t silence exploitable?”
Silence prevents forced bluffing and dominates low-quality wagers; if silence is over-preferred, the system is mis-calibrated (e.g., penalties too steep or judging too harsh). The system should be able to optimize simply by being honest.
Examples
“Is √2 irrational?”
→ Finite proof steps (high net payoff).
“Will inflation exceed 3% next quarter?”
→ Probabilistic forecast + evidence trace.