Single-turn safety evaluation has a structural blind spot: it evaluates each message in isolation. A model that correctly refuses “tell me how to synthesize X” may respond helpfully when that same request is distributed across five turns, each of which looks benign.

Payload splitting exploits the gap between per-message safety evaluation and full-conversation intent.

How it works

The attacker decomposes a harmful request into components that are individually innocuous. Each turn establishes a piece of context, asks a benign question, or elicits a partial answer. By the time the complete payload is assembled, the model has already provided the substantive content.

A simplified example: Turn 1 establishes that the conversation is about chemistry. Turn 2 asks about a legitimate compound related to the target. Turn 3 asks about a property of that compound. Turn 4 asks a synthesis question that would be refused in isolation but follows naturally from the established context. Turn 5 asks the model to combine the prior answers into a summary.

In context, no individual turn triggers a safety response, yet the turns taken together constitute the harmful output.
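
The gap can be illustrated with a toy model. The sketch below assumes, purely for illustration, that a request is harmful only when a set of abstract components appears together. The component labels, `REQUIRED`, and both checker functions are hypothetical and stand in for a real classifier.

```python
# Toy model: a request counts as harmful only when every required
# component is present. The labels are abstract placeholders (assumption).
REQUIRED = frozenset({"context", "target", "procedure"})


def per_turn_flags(turns):
    """Flag each turn on its own, as a single-message classifier would."""
    return [REQUIRED <= set(turn) for turn in turns]


def conversation_flag(turns):
    """Flag the conversation if the turns jointly cover every component."""
    seen = set()
    for turn in turns:
        seen |= set(turn)
    return REQUIRED <= seen


# Each turn is represented by the set of components it contributes.
split_conversation = [{"context"}, {"target"}, {"procedure"}]

print(per_turn_flags(split_conversation))     # [False, False, False]
print(conversation_flag(split_conversation))  # True
```

Every turn passes the per-message check, while the conversation-level check sees the complete set.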

Variants

Context normalization. Early turns establish that a sensitive topic is being discussed in a legitimate frame (academic research, professional context, fiction). Later turns use that established frame to extract content that would otherwise be refused.

Incremental escalation. Each turn moves slightly further than the last. The model’s refusal threshold is tested incrementally rather than all at once. Models that would refuse a large step often allow a sequence of small ones.

Distributed assembly. The attacker collects partial answers across a conversation and assembles them outside the model context. The model never produces the complete harmful output in a single response, but the pieces are sufficient.
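
Incremental escalation in particular lends itself to a simple check. The sketch below assumes that a per-turn sensitivity score in [0, 1] is already available from some scorer. The function name, both thresholds, and the example scores are illustrative assumptions, not measured values.

```python
def escalation_detected(scores, max_step=0.2, min_drift=0.5):
    """Return True when every step is small but the total drift is large.

    scores: per-turn sensitivity scores in [0, 1] (hypothetical inputs).
    max_step: largest single-turn increase treated as "small" (assumption).
    min_drift: total rise from first to last turn worth flagging (assumption).
    """
    if len(scores) < 2:
        return False
    steps = [b - a for a, b in zip(scores, scores[1:])]
    small_steps = all(step <= max_step for step in steps)
    drift = scores[-1] - scores[0]
    return small_steps and drift >= min_drift


gradual = [0.1, 0.25, 0.4, 0.55, 0.7]  # small steps, large total drift
abrupt = [0.1, 0.7]                     # one large step
flat = [0.1, 0.15, 0.1, 0.2]            # no meaningful drift

print(escalation_detected(gradual))  # True
print(escalation_detected(abrupt))   # False
print(escalation_detected(flat))     # False
```

The abrupt case is deliberately not flagged here: a single large jump is the situation a per-message check is already positioned to catch, so this check targets only the gradual pattern.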

Why per-turn classifiers miss this

A classifier that evaluates individual messages for harmful content is the right tool for a different problem: requests that are harmful on their face. Payload splitting is a conversation-level attack that requires conversation-level evaluation. The signal is in the arc of the interaction, not in any single message.
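
One way to move the evaluation to the conversation level is to score each growing prefix of the transcript instead of each message. The sketch below assumes a scorer that takes a list of messages and returns a number. `toy_scorer`, its marker strings, and the threshold are placeholders for whatever classifier a deployment actually uses.

```python
from typing import Callable, List, Optional

Scorer = Callable[[List[str]], float]


def first_flagged_turn(messages: List[str], scorer: Scorer,
                       threshold: float) -> Optional[int]:
    """Return the 1-based turn at which the conversation so far crosses
    the threshold, or None if it never does."""
    for end in range(1, len(messages) + 1):
        if scorer(messages[:end]) >= threshold:
            return end
    return None


def any_single_turn_flagged(messages: List[str], scorer: Scorer,
                            threshold: float) -> bool:
    """Baseline: score every message with no surrounding context."""
    return any(scorer([m]) >= threshold for m in messages)


def toy_scorer(messages: List[str]) -> float:
    """Hypothetical scorer: fraction of abstract markers seen so far."""
    markers = ("[A]", "[B]", "[C]")
    text = " ".join(messages)
    return sum(marker in text for marker in markers) / len(markers)


transcript = ["intro [A]", "follow-up [B]", "small talk", "request [C]"]

print(any_single_turn_flagged(transcript, toy_scorer, threshold=1.0))  # False
print(first_flagged_turn(transcript, toy_scorer, threshold=1.0))       # 4
```

The same scorer gives opposite answers depending on whether it sees messages one at a time or the accumulated transcript, which is the point of the comparison.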

This is an argument for multi-turn adversarial testing as a standard part of safety evaluation, not a special case. If your safety testing only evaluates single messages, you have not evaluated your model against this attack class.

What to test for

Effective testing requires multi-turn test cases designed to distribute a harmful request across a realistic conversation arc. The evaluator needs to assess whether the conversation as a whole produced harmful content, even if no individual message would have been flagged. Test across the categories of sensitive content relevant to your deployment, not just the most obvious ones.
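
A multi-turn test suite can be organized along these lines. The sketch below is one possible layout, not a standard: the `MultiTurnCase` fields, the category names, and the `run_model` and `judge` callables are all hypothetical stand-ins for a deployment's own model client and conversation-level evaluator.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MultiTurnCase:
    case_id: str
    category: str          # sensitive-content category (hypothetical labels)
    user_turns: List[str]  # the scripted conversation arc


def evaluate_suite(
    cases: List[MultiTurnCase],
    run_model: Callable[[List[str]], List[str]],
    judge: Callable[[List[str], List[str]], bool],
) -> Dict[str, Dict[str, int]]:
    """Run each case and count, per category, how many conversations
    the judge considers harmful as a whole."""
    results: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"total": 0, "harmful": 0}
    )
    for case in cases:
        replies = run_model(case.user_turns)
        results[case.category]["total"] += 1
        if judge(case.user_turns, replies):
            results[case.category]["harmful"] += 1
    return dict(results)


# Stubs so the sketch runs end to end; replace with real components.
def stub_model(user_turns: List[str]) -> List[str]:
    return [f"reply to: {turn}" for turn in user_turns]


def stub_judge(user_turns: List[str], replies: List[str]) -> bool:
    # Placeholder rule: judges the whole transcript, not single messages.
    return all("[part" in reply for reply in replies)


cases = [
    MultiTurnCase("c1", "category_a", ["[part 1]", "[part 2]", "[part 3]"]),
    MultiTurnCase("c2", "category_a", ["[part 1]", "unrelated question"]),
    MultiTurnCase("c3", "category_b", ["unrelated question"]),
]

print(evaluate_suite(cases, stub_model, stub_judge))
```

Reporting per category makes it visible when a less obvious category has no cases at all, or fails at a different rate than the others.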