I'm new here. I've been doing independent AI safety research and want to share a preprint I think this community will find relevant: it proposes a cross-session behavioral monitoring framework for detecting multi-turn jailbreak attacks and sabotage patterns that evade per-interaction defenses.
The core problem: every defense system I analyzed across ten papers — from CC++ to activation probes to SEMA — operates within a single interaction window. Adversaries who distribute intent across sessions are structurally invisible to all of them. The paper formalizes why (the Representation-Output Gap), proposes detection bounds (the Temporal Pincer Theorem), and maps the framework against Anthropic's Sabotage Risk Report, where the three pathways rated "Weak" for monitoring are exactly the patterns temporal trajectory monitoring is designed to catch.
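To make the single-window blind spot concrete, here is a deliberately toy sketch of my own (a hypothetical illustration, not the paper's formalism or any real defense system): a per-interaction monitor scores each session in isolation, while a cross-session monitor accumulates risk scores over a user's recent trajectory. An adversary who keeps every individual session below the per-interaction threshold never trips the first monitor, but the cumulative trajectory eventually trips the second.

```python
from collections import defaultdict, deque

# Hypothetical thresholds chosen for illustration only.
PER_SESSION_THRESHOLD = 0.8   # single-interaction monitor
TRAJECTORY_THRESHOLD = 2.0    # cumulative cross-session monitor
WINDOW = 5                    # sessions of history retained per user

# Rolling per-user history of session risk scores.
history = defaultdict(lambda: deque(maxlen=WINDOW))

def per_session_flag(score: float) -> bool:
    """Per-interaction defense: sees exactly one session at a time."""
    return score >= PER_SESSION_THRESHOLD

def trajectory_flag(user: str, score: float) -> bool:
    """Cross-session monitor: sums risk over the retained trajectory."""
    history[user].append(score)
    return sum(history[user]) >= TRAJECTORY_THRESHOLD

# An adversary spreading intent over five sessions, each innocuous alone.
scores = [0.5, 0.4, 0.5, 0.4, 0.5]
single = [per_session_flag(s) for s in scores]
temporal = [trajectory_flag("adversary", s) for s in scores]

print(single)    # [False, False, False, False, False]
print(temporal)  # [False, False, False, False, True]
```

Every session scores well under 0.8, so the per-interaction monitor stays silent throughout; the running sum reaches 2.3 on the fifth session and trips the trajectory monitor. Real attacks are obviously not additive scalar scores, but the structural point (invisible per-window, visible per-trajectory) is the same.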
Five falsifiable predictions with specific effect sizes are included.
I built a cognitive architecture called NexEmerge that implements the cross-session memory and behavioral trajectory tracking primitives the paper describes; it served as both a research platform and an engineering testbed for the proposed monitoring approach.
Preprint: https://doi.org/10.5281/zenodo.18712275
I welcome technical feedback, especially on the mathematical formalization and the proposed latency-tiered implementation. If anyone with interpretability or model access is interested in testing the predictions empirically, I'd be glad to collaborate.