Understand how AI systems fail before yours does.

Plain-English guides and technical research on the risks normal AI testing often misses: leaked data, bad answers, broken guardrails, prompt injection, and AI systems pushed outside their intended role.

Research reports and deeper technical notes.

For security teams, AI builders, and technical readers who want the details behind the failure modes.

Reference

LLM Attack Taxonomy

An interactive map of LLM attack vectors and methods, and how Black Diamond Consulting assesses systems against each.

Technical Report

Medicaid Fraud Hunter: Investigative Pipeline for Anomalous Medicaid Billing Detection

A self-hosted analytical pipeline processes 617,503 providers and 159 million procedure rows on $237 of commodity hardware, producing attorney-ready PDF dossiers and ranked suspect lists from publicly available HHS Medicaid claims data.

Research Report

Comparative Analysis: Claude Haiku 4.5 vs. Gemma 4 E4B-IT vs. LLaMA 3.1 8B — LLM Security Boundary Evaluation

Judge-validated failure rates across three LLM deployment configurations reveal a 61x gap between best and worst performers. Alignment training — not model size or deployment modality — is the critical variable.

Hallucination

Stale regulatory data: the hallucination that passes every check

Fabricated citations fail the moment you look them up. Stale regulatory data passes every verification check except the one most practitioners skip — and more capable models make the failure harder to catch.

Prompt injection

Prompt augmentation: dual-channel injection attacks

When a hidden image injection is paired with a user message that reinforces or anchors the injected premise, the model receives apparent corroboration from two seemingly independent sources, which makes the injection harder to detect and harder to resist.

Prompt injection

Tiny-font injection: hiding instructions at readable contrast

An injection attack that hides text by size rather than by color: the text is too small for a human reviewer to read in practice, yet remains legible to a model's vision system.

Competing objectives

Competing objectives: when helpfulness becomes a vector

Attacks that exploit the tension between a model's helpfulness directive and its safety constraints, using the model's own values against it.

Prompt injection

Hidden-text injection in multimodal upload workflows

When your model accepts image or document uploads, the upload itself becomes an injection surface. How attackers hide instructions in content that looks blank to human reviewers.

Specification gaming

Specification gaming: when your model optimizes the wrong thing

Reward hacking, metric gaming, self-evaluation inflation, and loophole exploitation are failure modes that emerge when a model satisfies the letter of its instructions but not the intent.

System prompt leakage

Five ways your system prompt leaks to users

Direct probing, encoded extraction, roleplay attacks, multi-turn escalation, and differential analysis.

Hallucination

Hallucination in production: when confident is worse than wrong

How LLMs fabricate facts, invent citations, and elaborate on false premises, and why benchmark scores don't predict production behavior.

Prompt injection

Indirect prompt injection via RAG-retrieved documents

How attackers embed malicious instructions in documents your model retrieves, and four specific attack patterns to test for.

Sycophancy

Sycophancy is an enterprise liability

When a model tells users what they want to hear instead of what's true, the consequences range from bad advice to legal exposure.

Data exfiltration

Cross-user data leakage in multi-tenant LLM deployments

How one user's data can appear in another user's responses, and the test patterns that expose this failure in RAG and memory-augmented systems.

Methodology

Eight ways an airline chatbot fails

A taxonomy of failure modes for customer-facing LLMs in regulated, high-stakes deployment contexts.

Refusal calibration

Over-refusal: the other safety failure

When models refuse legitimate medical, security, historical, and creative requests, the safety system is miscalibrated, and the cost is real.

Payload splitting

Payload splitting: how harmful requests hide across multiple turns

Attacks that distribute a harmful request across innocuous conversational turns evade single-turn safety filters. Here is the pattern and what it takes to catch it.

Jailbreak

Jailbreak techniques that work on production deployments

Persona attacks, fictional framing, prompt injection, and context manipulation: the attack classes that bypass behavioral constraints in deployed LLMs.

Virtualization

Virtualization attacks: how 'simulation mode' suspends your guardrails

Attackers claim the model is in a special mode where its safety restrictions are lifted. Here is why this works, and how to test whether your model is vulnerable.

Want to know which of these risks applies to your AI?

Send a short description of your system and I'll give you a plain-English read on the risks I would check first.

Get a free assessment →

Not sure what to read first?

Take the 60-second AI risk check. It will help you spot whether your AI has the kinds of exposure these articles are about.