I am distributing bounties for contributions to this project. Thanks to Parv Mahajan and the Ga Tech AISI board for reviewing core ideas.
Abstract
I claim that LLM steganography requires representational divergences from legible models, and that these divergences are practicably locatable (though not necessarily distinguishable from legitimate scratchpads) with minimal or no human interpretation. In particular, the morphism F: M_T → M_S between trusted and steganographic models, respectively, is always discontinuous under "semantic" topologies inferred from an appropriate test distribution. Algorithmically efficient generation of such topologies is the main challenge of this project.
1. Background
Two terminological notes. First, "semantic" identity is shorthand throughout for members of a symbolic structure (i.e., an ontology) that act as if they have consistent semantic values. Second, I take "abstractions" to mean information conveyors that modularly (predictably and with low overhead) capture generalizable structures within some domain.
Black et al. (2022) introduced polytopes as an alternative to neuron-level interpretability. Piecewise linear activation functions like ReLU partition activation space into convex polytopes (regions delineated by which neurons are active). Crucially, polytope boundaries tend to reflect semantic boundaries: inputs landing in the same polytope produce similar activation geometry and share roughly similar semantics. This contrasts with neuron-level analysis, where single neurons tell you little due to superposition and polysemanticity.
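As a toy illustration of the polytope picture (my own sketch, not Black et al.'s code): the activation pattern of a ReLU network, i.e., which units fire, serves as a discrete identifier of the polytope an input lands in, and nearby inputs typically share it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
mlp = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 16), nn.ReLU())

def activation_pattern(x):
    """Binary code of which hidden units fire; identifies the polytope containing x."""
    pattern = []
    h = x
    for layer in mlp:
        h = layer(h)
        if isinstance(layer, nn.ReLU):
            pattern.append(tuple((h > 0).int().flatten().tolist()))
    return tuple(pattern)

x = torch.randn(1, 8)
x_nearby = x + 1e-4 * torch.randn(1, 8)  # a tiny perturbation usually stays in the same polytope
print(activation_pattern(x) == activation_pattern(x_nearby))
```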
The Natural Abstraction Hypothesis (Wentworth, 2021) posits that diverse cognitive systems converge on similar abstractions when trained on similar data. If true, models should develop commensurable ontologies; their semantic spaces should be related by continuous mappings preserving neighborhood structure. These convergence guarantees should strengthen with scale, though I don't argue for that here.
Recent work (Andresen, 2026) has shown that LLMs develop compressed internal languages ("thinkish") and hidden computation patterns that evade human oversight. Such opaque reasoning is currently detectable through behavioral probing, but improved models will likely learn to suppress overt signatures while preserving the underlying computational structure.
2. Theory
Consider two models sharing symbolic space (token sequences) but with distinct semantic spaces (activation geometry). Each model induces a symbolic→semantic map. Let f_S and f_L denote these maps for a suspected steganographic model S and a clean reference model L. The translation functor T between semantic spaces is implicitly defined by shared symbolic grounding:
T ∘ f_S = f_L
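A minimal sketch of how T might be estimated in practice, assuming we can collect activations from both models on the same token sequences and are willing to approximate T as a linear map fit by least squares (one convenient parameterization, not the only one). The activation matrices below are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
acts_S = rng.normal(size=(1000, 64))          # stand-in for f_S(inputs) at some layer
acts_L = acts_S @ rng.normal(size=(64, 48))   # stand-in for f_L(inputs) on the same inputs

# Fit T so that acts_S @ T ≈ acts_L, i.e. T∘f_S ≈ f_L on the test distribution.
T, *_ = np.linalg.lstsq(acts_S, acts_L, rcond=None)
residual = np.linalg.norm(acts_S @ T - acts_L) / np.linalg.norm(acts_L)
print(f"relative translation residual: {residual:.3f}")  # large residual ⇒ no good linear T
```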
I claim that steganography splits activation patterns into overt (human-interpretable) and covert (hidden-channel) computational streams. Hidden channels require clandestine repurposing of symbols (distinct representations for covert communication), so the token→abstraction map must diverge somewhere between steganographic and clean models. This divergence need not be human-interpretable, but it must exist structurally.
Discontinuities detected in layer m imply clandestinely repurposed segmentations in layer m−1. More precisely: if layer m exhibits F-discontinuous translation, then some polytope at layer m−1 must be semantically different between models. Diagnosis can proceed backward layer-by-layer to localize the root divergence; note that models need not have equal depth or width for this approach.
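A hypothetical localization loop along these lines, assuming a chosen layer correspondence between the two models (they need not have equal depth; the pairing is part of the experimental setup) and per-layer activation matrices collected on a shared test distribution. The linear-residual criterion is my simplification of "F-discontinuous translation".

```python
import numpy as np

def translation_residual(A_s, A_l):
    """Best-linear-translation residual between paired activation matrices."""
    T, *_ = np.linalg.lstsq(A_s, A_l, rcond=None)
    return float(np.linalg.norm(A_s @ T - A_l) / np.linalg.norm(A_l))

def localize_divergence(acts_S, acts_L, threshold=0.2):
    """acts_S, acts_L: lists of per-layer activation matrices, paired by the chosen
    layer correspondence. Returns the shallowest paired layer whose residual exceeds
    the threshold, i.e. the candidate root of the divergence."""
    bad = [m for m in range(len(acts_S))
           if translation_residual(acts_S[m], acts_L[m]) > threshold]
    return min(bad) if bad else None
```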
3. Aims
The primary aim is to develop a geometric test for steganographic divergence that operates without human interpretation of activations. Success would mean an automatic interpretability tool capable of flagging models with covert channels based purely on structural analysis.
Secondarily, empirically probe the Natural Abstraction Hypothesis: if implicative/relational structures are preserved and extended across training and model scales, that suggests the underlying abstractions are preserved as well.
4. Actionables
First, discretize activation spaces into input-token-mapped polytopes. Semantic distance is measured by human-interpretable similarity in triggering tokens (generation of such sentence sets can be automated with something like inverse vector embedding search on internet and LLM text). The main technical challenge is algorithmically efficient traversal of the input space; Cartesian enumeration of a 96-dimensional vector embedding is not tractable. Gradient descent on inputs given a target polytope activation offers a more promising approach (Olah et al., 2017).
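A sketch of that input-space search, in the spirit of feature-visualization-style optimization (Olah et al., 2017). The toy layer and target pattern below are placeholders, and the hinge loss is just one way to push pre-activations onto the desired side of zero.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(8, 16)                          # toy stand-in for one model layer
target = (torch.rand(16) > 0.5).float()           # desired on/off pattern (the target polytope)

x = torch.randn(1, 8, requires_grad=True)
opt = torch.optim.Adam([x], lr=0.05)
for _ in range(500):
    pre = layer(x)                                # pre-activations determine which polytope x is in
    # hinge loss: want pre > 0 where target == 1 and pre < 0 where target == 0
    margin = torch.where(target.bool(), -pre, pre)
    loss = torch.relu(margin + 0.1).mean()
    opt.zero_grad(); loss.backward(); opt.step()

achieved = (layer(x) > 0).float()
print((achieved == target).float().mean().item())  # fraction of units on the intended side
```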
Given input-based polytope clusters, we can trace how layer-m neighborhoods emerge from upstream structure. The question becomes: which polytopes at layer k map to which neighborhoods at layer m? Consistent mappings would indicate stable internal semantics. Inconsistent or many-to-many mappings would suggest either representational flexibility or, in cross-model comparison, divergent ontologies. This analysis enriches the inferred ontology by revealing not just what activations occur, but what computational role they serve.
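One crude way to quantify mapping consistency, assuming each input has already been assigned a polytope id at layer k and a neighborhood/cluster id at layer m (both labelings are upstream outputs, not produced here):

```python
import numpy as np
from collections import Counter

def mapping_purity(polytope_ids, cluster_ids):
    """Average, over layer-k polytopes, of the fraction of each polytope's inputs that
    land in its single most common layer-m neighborhood. 1.0 means a perfectly
    consistent mapping; low values indicate many-to-many mappings."""
    purities = []
    for p in set(polytope_ids):
        members = [c for pi, c in zip(polytope_ids, cluster_ids) if pi == p]
        purities.append(Counter(members).most_common(1)[0][1] / len(members))
    return float(np.mean(purities))

# e.g. mapping_purity(["A", "A", "B", "B"], [0, 0, 1, 2]) == 0.75
```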
With polytopes and the aligned ontologies in hand, analyze between-model transformations. Divergence between models should manifest as tears in this neighborhood structure.
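A rough continuity probe, assuming activation matrices for the same inputs in both models: compare each input's k nearest neighbors in the two semantic spaces, and flag inputs whose neighborhoods barely overlap as candidate tears.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighborhood_overlap(acts_A, acts_B, k=10):
    """For each input, the fraction of its k nearest neighbors shared between the two
    models' activation spaces. Values near 0 mark candidate tears."""
    idx_A = NearestNeighbors(n_neighbors=k + 1).fit(acts_A).kneighbors(acts_A, return_distance=False)[:, 1:]
    idx_B = NearestNeighbors(n_neighbors=k + 1).fit(acts_B).kneighbors(acts_B, return_distance=False)[:, 1:]
    return np.array([len(set(a) & set(b)) / k for a, b in zip(idx_A, idx_B)])
```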
Finally, study models during steganography acquisition, not just post-hoc.
5. Caveats
Nonsteganographic divergence. Representational divergence isn't unique to steganography; models legitimately develop novel abstractions during training that create discontinuities relative to reference models. However, the shared symbolic grounding provides a diagnostic lever. For legitimate concept learning, the inputs mapping to divergent activation regions should exhibit semantic coherence; we can inspect the triggering token sequences and recognize them as instances of some interpretable category the model has learned to represent distinctly. Steganographic divergence, by contrast, should produce semantically incoherent input clusters: tokens grouped not by human-legible meaning but by their covert communicative function.
The test thus shifts from "does divergence exist?" to "is the divergence semantically explicable?" This adds interpretive overhead but remains substantially less demanding than full mechanistic interpretability, since we're classifying input clusters rather than reverse-engineering circuits. Crucially, tools like reverse vector embedding search can automate this classification by generating candidate input sets with respect to smaller models we assume are nonsteganographic, reducing reliance on manual inspection.
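A hypothetical coherence score for a divergent region's triggering inputs, assuming an off-the-shelf sentence embedder is acceptable (the package and model name below are illustrative choices, not requirements). Coherent clusters from legitimate concept learning should score high; covertly grouped inputs should score low.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed available; any embedder works

embedder = SentenceTransformer("all-MiniLM-L6-v2")      # illustrative choice of embedder

def coherence(trigger_texts):
    """Mean pairwise cosine similarity of the triggering inputs for one divergent region."""
    E = embedder.encode(trigger_texts, normalize_embeddings=True)
    sims = E @ E.T
    n = len(trigger_texts)
    return float((sims.sum() - n) / (n * (n - 1)))      # drop the n self-similarities
```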
Filler tokens. Greenblatt (2025) demonstrated that recent LLMs (from Opus 3 onward) can leverage filler tokens—semantically vacuous padding like "1 2 3 . . . 300"—to improve no-CoT math performance by ~6 percentage points. I therefore strongly suspect that filler tokens map to activations functionally unrelated to their nominal semantics in models that benefit from them for multi-hop, single-forward-pass inference. The approach here needs to extend to cases where (some) embeddings don't correspond to their underlying tokens. This complicates the "semantic coherence" criterion from the previous caveat: filler token regions would fail the explicability test despite serving legitimate computational functions. The distinction may require further insights, or might genuinely be intractable—steganographic codes needn't be arbitrary, and a model could develop stable, consistent hidden channels that are functionally indistinguishable from legitimate computational scratchpads like filler tokens. But it's worth a shot.
(Non)convexity. ReLU networks guarantee convex polytopes by construction, but deeper networks with other nonlinearities lose this guarantee. This is hopefully irrelevant to the core method, likely algorithmically salient for efficient enumeration, and possibly method-breaking if non-convex regions resist tractable analysis.
Error margin. Detection fidelity will likely be quite limited, especially across architectures. Compute and test distribution constrain how finely we can resolve the semantic topology.
Bounties (fractional funds distributed in good faith if you solve part of a problem):
$1500 for an algo which individuates a sufficient portion of activation space into semantically meaningful polytopes (or fuzzy loci) such that we can detect steganography during training with minimal human oversight, in polynomial (constant exponent across architectures) or faster time
$750 for strong handles on the sorts of downstream activation patterns by which we can cluster upstream polytopes, and an additional $300 for a polynomial or faster clustering algo
Happy to fund solutions to other subproblems as well. Comment or dm.