Bounty: Detecting Steganography via Ontology Translation
I am distributing bounties for contributions to this project. Thanks to Parv Mahajan and the Ga Tech AISI board for reviewing core ideas.

Abstract

I claim that LLM steganography requires representational divergences from legible models, and that these divergences are practicably locatable (though not necessarily differentiable from legitimate scratchpads) with minimal or no human interpretation. In particular, the morphism F: M_T → M_S between trusted and steganographic models, respectively, is always discontinuous under "semantic" topologies inferred from an appropriate test distribution. Algorithmically efficient generation of such topologies is the main challenge of this project.

1. Background

Two terminological notes. First, "semantic" identity is shorthand throughout for members of a symbolic structure (i.e., an ontology) that act as if they have consistent semantic values. Second, I take "abstractions" to mean information conveyors that modularly (predictably and with low overhead) capture generalizable structures within some domain.

Black et al. (2022) introduced polytopes as an alternative to neuron-level interpretability. Piecewise linear activation functions like ReLU partition activation space into convex polytopes (regions delineated by which neurons are active). Crucially, polytope boundaries tend to reflect semantic boundaries: inputs landing in the same polytope produce similar activation geometry and share roughly similar semantics. This contrasts with neuron-level analysis, where single neurons tell you little due to superposition and polysemanticity.

The Natural Abstraction Hypothesis (Wentworth, 2021) posits that diverse cognitive systems converge on similar abstractions when trained on similar data. If true, models should develop commensurable ontologies; their semantic spaces should be related by continuous mappings preserving neighborhood structure. These convergence guarantees should strengthen with scale, though I don't argue for that
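
To make the polytope picture from the Background concrete, here is a minimal sketch of how an input's polytope can be identified by its on/off activation pattern in a small ReLU network. This is my own illustration, not the implementation from Black et al. (2022). The helper names (`relu_layer`, `polytope_code`, `hamming`, `random_layers`) and the toy random weights are hypothetical, and it assumes a plain fully connected ReLU network with no other nonlinearities.

```python
import random


def relu_layer(x, W, b):
    """Apply one dense ReLU layer; return (on/off pattern, layer output)."""
    pre = [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]
    pattern = tuple(int(p > 0) for p in pre)  # 1 where the neuron is active
    out = [max(p, 0.0) for p in pre]
    return pattern, out


def polytope_code(x, layers):
    """Concatenate on/off patterns over all layers.

    Two inputs with the same code are handled by the same affine map,
    i.e. they lie in the same linear region (polytope) of the network.
    """
    code = []
    h = list(x)
    for W, b in layers:
        pattern, h = relu_layer(h, W, b)
        code.extend(pattern)
    return tuple(code)


def hamming(a, b):
    """Number of neurons whose on/off state differs between two codes."""
    return sum(u != v for u, v in zip(a, b))


def random_layers(sizes, seed=0):
    """Hypothetical toy weights: Gaussian W and b for each consecutive size pair."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        W = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
        b = [rng.gauss(0, 1) for _ in range(n_out)]
        layers.append((W, b))
    return layers


if __name__ == "__main__":
    layers = random_layers([2, 4, 3])
    a = polytope_code([0.5, -0.2], layers)
    b = polytope_code([0.5, -0.2001], layers)  # a small perturbation of the first input
    print(a, b, hamming(a, b))
```

The Hamming distance between two codes is only a crude proxy for how far apart two inputs sit in the partition. Whether it tracks semantic distance in a real model is exactly the kind of empirical question the claims above depend on.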
Bounties (fractional funds distributed in good faith if you solve part of a problem):