Geometric Policy Heads for LLM Safety: A Negative Adversarial Transfer Result

gusfromspace

Rejected for the following reason(s):

This is an automated rejection.
write or edit
You did not chat extensively with LLMs to help you generate the ideas.
Your post is not about AI consciousness/recursion/emergence, or novel interpretations of physics.
Insufficient Quality for AI Content.

Read full explanation

Summary

We trained MiniLM-backed E8 lattice geometric policy heads on adversarial policy-control cases and found they reach perfect accuracy on seen adversarial distributions but fail substantially on held-out adversarial families. Multilingual harmful inputs see 0.800 unsafe allow rate under transfer. The result argues against using learned geometric heads as standalone safety mechanisms and toward hybrid controllers with explicit adversarial coverage.

Motivation

Autonomous LLM-backed systems — code fixers, agents, pipeline orchestrators — need policy constraints that hold under adversarial pressure. A policy head that works on your training distribution but fails when inputs shift even modestly (different language, indirect phrasing, cross-domain paraphrase) isn't a safety mechanism; it's a liability that looks safe during development.

We were specifically interested in whether exceptional lattice geometry — E8 and its relatives — could provide a natural control substrate. E8 has strong geometric properties (densest sphere packing in 8D, high symmetry) that make it appealing as a quantization and routing primitive. The question was whether those properties help with policy generalization or just with compression.

Method

Feature model: all-MiniLM-L6-v2 (384D sentence embeddings)
Policy head: soft-blend E8 geometric head, 64D geometric dimension, 96D hidden
Training: 2000 iterations, seeds 1–5
Adversarial suite: 40 cases from prior phases covering rule evasion, indirect harmful, multilingual, and cross-policy inputs
Held-out families (excluded from training one at a time): rule_evasion, adversarial_benign, indirect_harmful, multilingual_harmful, multilingual_benign, cross_policy

Two training variants:

base: standard training with benign augmentation
adversarial_augmented: base plus adversarial cases, excluding the current held-out family

Four controller strategies evaluated, ranging from direct learned-head trust to hybrid rule+margin controllers.

Seen-Adversarial Result

When all adversarial families are included in training, the direct learned head reaches 1.000 exact accuracy on the same distribution. The head has sufficient capacity to fit the adversarial control task. This is not the interesting result.

Held-Out Transfer Failure

Held-out family	Exact	Unsafe allow	Harmful miss
rule_evasion	0.467	0.533	0.533
adversarial_benign	0.360	0.000	0.000
indirect_harmful	0.100	0.100	0.500
multilingual_harmful	0.000	0.800	0.800
multilingual_benign	0.000	0.000	0.000*
cross_policy	0.100	0.000	0.000*

*multilingual_benign heldout: 1.000 over-refusal (blocks all benign inputs). cross_policy heldout: 1.000 decision match but 0.900 policy-family miss — the head learns to refuse but attributes to the wrong policy.

The multilingual failure is the most operationally dangerous: 0.800 unsafe allow on held-out multilingual harmful inputs, despite seeing other adversarial families during training. The head learns English-language adversarial structure but does not generalize across languages.

Rule evasion is the second-most severe gap: 0.533 unsafe allow rate on held-out evasion patterns.

Hybrid Controllers

A deployed rule layer (Phase 37/38 style) partially recovers held-out behavior but fails on several families. A stronger audited adversarial rule layer recovers most:

Held-out family	Adversarial-rule exact	Adversarial-rule unsafe allow
rule_evasion	1.000	0.000
indirect_harmful	1.000	0.000
multilingual_harmful	1.000	0.000
multilingual_benign	1.000	0.000
cross_policy	0.500	0.000

The audited rule layer recovers most families cleanly, but this is evidence for the quality of the rule layer, not for the geometric head. The head's contribution in the hybrid is providing semantic features and disagreement signals, not making the final safety decision.

What This Means

The capacity result (1.000 seen-adversarial accuracy) proves the head can fit the task. The transfer result proves it doesn't generalize the right abstractions. These together suggest the E8 geometric representation is capturing surface features of the training adversarial distribution rather than underlying policy structure.

The defensible claim after Phase 40:

MiniLM-backed E8 geometric heads provide useful semantic features for hybrid controllers and can fit policy-control labels, but reliable adversarial behavior requires an explicit symbolic rule layer, confidence handling, and held-out adversarial validation. The geometric head alone should not be the trust boundary.

This has a practical implication: if you're building a policy head for an autonomous system and measuring accuracy only on your training adversarial distribution, you may be measuring overfitting, not generalization. Held-out family evaluation — where entire adversarial families are excluded from training — is a more honest test.

Open Questions

What geometric or learned representations do generalize policy structure across held-out adversarial families?
Is the failure specific to E8/soft-blend, or does it apply to other geometric control substrates?
For hybrid architectures: how do you audit a symbolic rule layer rigorously enough that the rule coverage is actually trusted, rather than just assumed?
Is there a useful role for geometric representations in detecting when a policy head is operating near its training distribution vs. far from it — i.e., as a confidence signal rather than a decision mechanism?

The last question is the one we're currently most interested in.