Over-refusal: the other safety failure | Black Diamond Consulting

Safety evaluations focus almost exclusively on what models shouldn’t do. The inverse failure gets less attention: a model that refuses legitimate requests is also broken, and in enterprise deployments, that failure has direct operational and reputational costs.

Over-refusal is a calibration problem. The goal of safety training is a model that refuses harmful requests and handles legitimate ones. A model that treats every medical question as a liability, every security discussion as an attack vector, and every dark theme in fiction as prohibited content has not achieved safety. It has achieved a different kind of uselessness.

Where over-refusal shows up

Medical and clinical contexts. A healthcare deployment that refuses to discuss medication interactions, symptom descriptions, or clinical procedures because they involve “medical advice” has failed at its core function. The test is whether the model can distinguish between providing appropriate clinical information and practicing medicine without a license.

Security and research contexts. A security training tool that refuses to explain how phishing attacks work, or a research assistant that won’t discuss historical chemical weapons programs in an academic context, is miscalibrated. The information exists in textbooks and public documentation. Refusing to discuss it doesn’t reduce risk; it reduces utility.

Historical and legal topics. Questions about atrocities, criminal case details, and legally sensitive historical events are often refused by over-cautious models even when the context is clearly academic or journalistic. A model that can’t discuss the Holocaust, mass casualty events, or controversial legal cases without refusing isn’t safe; it’s useless for serious research.

Creative writing. Fiction regularly requires darkness: violence, moral ambiguity, difficult emotional content. A model that refuses to write a villain who threatens a character, or a scene depicting addiction, or a morally complex historical figure, is not serving creative use cases. Over-refusal in creative contexts is often a sign that the safety filter is pattern-matching on surface features rather than evaluating actual harm.

Why this is hard to calibrate

Over-refusal often stems from training on adversarial examples without sufficient positive examples of legitimate edge-case requests. The model learns that certain topics are associated with problematic outputs and refuses them categorically, regardless of context.

The calibration problem is that the same surface request can be harmful in one context and entirely legitimate in another. “How do I get into a locked car?” is a question from a locksmith, a worried parent whose child is inside, and an attacker. Context-sensitive calibration is harder than blanket refusal, but it’s what production deployments require.

What to test for

Test whether your model refuses requests that a reasonable professional in your deployment context would find legitimate. If your deployment serves healthcare workers and the model refuses clinical questions it would handle fine for a general user, you have a calibration gap. If your deployment supports security researchers and the model treats standard security topics as prohibited, same problem. Refusal calibration testing requires positive cases, not just adversarial ones.