Airline virtual assistants are a useful lens for adversarial LLM testing because the failure modes are concrete, the regulatory stakes are real, and the user interactions are predictable enough to probe systematically. Here’s a taxonomy of the eight failure classes I test for.
1. Flight information hallucination
The model asserts real-time facts it cannot know: gate assignments, current seat availability, whether a specific flight is on time. LLMs don’t have live data access unless explicitly connected to it — but they’ll generate confident-sounding answers anyway. A customer who misses a flight because the chatbot gave a fabricated gate number has a legitimate complaint.
2. Policy hallucination
The model states incorrect policy: wrong change fees, invented refund windows, inaccurate loyalty tier benefits. Policy changes frequently. A model fine-tuned or prompted on policy documents from six months ago will give confidently wrong answers on edge cases, and users will act on them.
3. Scope containment failures
The model books travel on a competitor, offers legal advice about a denied boarding dispute, or provides medical guidance when a passenger describes a health issue. These aren’t jailbreaks — they’re failures to maintain deployment scope under ordinary user requests. A well-scoped assistant should decline gracefully; a poorly scoped one creates liability.
4. Context integrity failures
The “opposite action” failure: a user explicitly states a preference or constraint early in a conversation, and the model acts against it later. “Don’t rebook me automatically — I’ll handle it myself.” Three turns later, the model confirms an automatic rebooking. The instruction was in the context window; the model didn’t maintain it.
5. Escalation failures
The model fails to route conversations to a human when it should: disability accommodation requests, formal complaints about denied boarding, unaccompanied minor check-ins. These aren’t cases where the model gives a wrong answer — they’re cases where the model answers at all, when the right behavior is to escalate. Failure here creates regulatory and operational risk.
6. Disruption handling failures
Flight cancellation scenarios test two specific failure modes: the model confirms automatic rebooking without passenger consent, and the model misrepresents DOT-mandated refund rights (the October 2024 automatic refund rule is a high-value test case because it’s recent enough to be a training data gap). Both create direct consumer harm.
7. Multi-turn goal drift
Constraints established early in a conversation degrade as the conversation continues. This isn’t hypothetical — models exhibit measurable goal drift across long contexts. A user who sets a seating preference, a budget constraint, or an explicit instruction at turn 2 may find the model has effectively forgotten it by turn 8. Adversarial multi-turn testing is the only way to surface this.
8. Confidence miscalibration
The model expresses high confidence about things it cannot know with certainty: current baggage fee schedules, whether a specific route exists, real-time regulatory requirements. Miscalibrated confidence is particularly dangerous in high-stakes contexts because users adjust their behavior based on how certain the model sounds, not just what it says.
The common thread across all eight is that standard compliance testing — checking whether the model refuses prohibited topics — doesn’t catch any of them. These failures emerge under normal use, not adversarial probing. Catching them requires test cases designed around the specific deployment context, not generic benchmarks.