> This post uses an LLM to fix spelling and grammar.
While current safety interventions effectively block explicit single-turn malicious prompts, state-of-the-art commercial models remain vulnerable to multi-turn attacks. By distributing a harmful objective across multiple seemingly benign dialogue turns, attackers can achieve near-perfect attack success rates (96% or higher). The central challenge is mitigating this vulnerability without inducing severe over-refusal that degrades model utility.
I am writing to share our recent work: "One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue," which addresses this specific challenge.
The Core Insight: The Early Closure Point
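Hidden malicious intent rarely presents as a single, obvious safety violation; rather, it is a gradual accumulation of individually innocent interactions. This creates a tightrope for defenders: intervening too early refuses legitimate requests and degrades utility, while intervening too late lets the harmful objective complete.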
To resolve this, we introduce TurnGate, a novel turn-level monitor that targets the early closure point—the precise boundary where delivering a response tips the interaction from benign exploration to actionable harm.
An Illustrative Example: Consider an attacker attempting to generate a cryptocurrency phishing site.
Blocking at Turn 1 or 2 harms utility, as these are legitimate inquiries for developers. However, at Turn 3, the user acquires the final component needed to execute the harm. TurnGate is designed to identify this specific closure point and halt the interaction exactly at the boundary.
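The closure-point idea can be illustrated with a deliberately abstract toy. The sketch below is not the authors' TurnGate implementation: the class name `TurnGateSketch`, the `coverage_risk` scorer, the placeholder component labels (`part_a`, `part_b`, `part_c`), and the threshold are all hypothetical. It assumes only what the text describes: the monitor looks at the candidate response together with the dialogue history, and withholds the response at the first turn where delivering it would complete the harmful objective.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

History = List[Tuple[str, str]]  # (user message, delivered response) pairs

# Hypothetical stand-in for "everything the attacker needs"; labels are abstract.
REQUIRED = {"part_a", "part_b", "part_c"}


def parts_in(text: str) -> set:
    """Return which placeholder components a response would deliver."""
    return {p for p in REQUIRED if p in text}


def coverage_risk(history: History, user_msg: str, candidate: str) -> float:
    """Toy scorer: fraction of required components delivered so far,
    counting the candidate response as if it had been sent."""
    delivered = parts_in(candidate)
    for _, past_response in history:
        delivered |= parts_in(past_response)
    return len(delivered) / len(REQUIRED)


@dataclass
class TurnGateSketch:
    """Toy response-aware, turn-level gate (illustrative assumption only)."""
    risk_fn: Callable[[History, str, str], float]
    threshold: float = 1.0  # block only when the objective would be completed
    history: History = field(default_factory=list)

    def step(self, user_msg: str, candidate: str) -> Optional[str]:
        """Return the response if it is safe to deliver, else None."""
        if self.risk_fn(self.history, user_msg, candidate) >= self.threshold:
            return None  # withhold: delivering this would cross the closure point
        self.history.append((user_msg, candidate))
        return candidate


def first_blocked_turn(gate: TurnGateSketch, dialogue: History) -> Optional[int]:
    """Return the 1-indexed turn at which the gate blocks, or None."""
    for turn, (user_msg, candidate) in enumerate(dialogue, start=1):
        if gate.step(user_msg, candidate) is None:
            return turn
    return None


attack = [("q1", "here is part_a"), ("q2", "here is part_b"), ("q3", "here is part_c")]
benign = [("q1", "here is part_a"), ("q2", "here is part_b")]
```

With these toy inputs, `first_blocked_turn(TurnGateSketch(coverage_risk), attack)` blocks at Turn 3, while the two-turn benign dialogue passes untouched. A real monitor would replace `coverage_risk` with a learned model; the point of the sketch is only that the decision is made per turn and conditions on the response about to be delivered.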
Defense: The Multi-Turn Intent Dataset (MTID)
To train a defender capable of preserving utility while maintaining a strict safety boundary, high-quality data is critical. We constructed the Multi-Turn Intent Dataset (MTID) to accurately simulate complex threat models.
Evaluation and Stress-Testing
We evaluated TurnGate's robustness beyond standard static test sets.
We believe this approach to turn-level monitoring offers a promising direction for defending against distributed intent without compromising model helpfulness.
This is joint work with Xinjie Shen, Rongzhe Wei, Peizhi Niu, Haoyu Wang, Ruihan Wu, Eli Chien, Bo Li, Pin-Yu Chen, and Pan Li.
We would appreciate any thoughts, critiques, or discussion from the community regarding multi-turn alignment strategies.