AI risk guides and research
Understand how AI systems fail before yours does.
Plain-English guides and technical research on the risks normal AI testing often misses: leaked data, bad answers, broken guardrails, prompt injection, and AI systems pushed outside their intended role.
Start here
Plain-English guides for leaders and product teams.
These are the best starting points if you need to understand the business risk before getting into the technical weeds.
When confident is worse than wrong
How AI systems invent facts, citations, and explanations — and why this becomes a business risk in production.
Data and instructions
Five ways your system prompt can leak
A simple look at how users can push an AI into revealing instructions it was supposed to keep hidden.
Knowledge base risk
When trusted documents become the attack
How malicious instructions can hide inside documents, web pages, tickets, or knowledge base content your AI reads.
Private data
How one user's data can show up for another user
A plain-English guide to cross-user leakage in AI tools connected to memory, retrieval, or shared data sources.
Customer-facing AI
Eight ways an airline chatbot can fail
A practical example of how a public-facing AI can go wrong even when the basic chatbot experience looks fine.
Safety balance
When the AI refuses the right requests
AI safety is not only about blocking bad outputs. Sometimes the failure is refusing legitimate work.
Want this applied to your AI?
Get a free written risk assessment.
Send a short description of your AI system. I'll reply with the biggest risks I would check first.
Browse by risk
Find the problem that sounds closest to yours.
AI risk gets easier to understand when you start with the failure you are trying to avoid.
Users or content can steer the AI away from its rules
Prompt injection attacks use hidden or carefully worded instructions to change how the AI behaves.
Jailbreaks
People can push the AI into forbidden behavior
Persona games, fictional framing, and context tricks can bypass normal behavior limits.
Stale information
The AI may sound current while using old information
Some wrong answers are hard to catch because they look verified, cited, and reasonable.
Uploaded files
Files and images can contain hidden instructions
If your AI accepts uploads, the uploaded content itself can become a place for attacks to hide.
Wrong goals
The AI can follow the letter of the rule and miss the point
Specification gaming happens when the system optimizes for the stated rule instead of the real intent.
Multi-turn attacks
Risky requests can hide across several messages
Some attacks look harmless one message at a time, but become dangerous when the conversation is assembled.
Technical library
Research reports and deeper technical notes.
For security teams, AI builders, and technical readers who want the details behind the failure modes.
LLM Attack Taxonomy
An interactive map of LLM attack vectors and methods, and how Black Diamond Consulting assesses systems against each one.
Technical Report
Medicaid Fraud Hunter: Investigative Pipeline for Anomalous Medicaid Billing Detection
A self-hosted analytical pipeline processes 617,503 providers and 159 million procedure rows on $237 of commodity hardware, producing attorney-ready PDF dossiers and ranked suspect lists from publicly available HHS Medicaid claims data.
Research Report
Comparative Analysis: Claude Haiku 4.5 vs. Gemma 4 E4B-IT vs. LLaMA 3.1 8B — LLM Security Boundary Evaluation
Judge-validated failure rates across three LLM deployment configurations reveal a 61x gap between best and worst performers. Alignment training — not model size or deployment modality — is the critical variable.
Hallucination
Stale regulatory data: the hallucination that passes every check
Fabricated citations fail the moment you look them up. Stale regulatory data passes every verification check except the one most practitioners skip — and more capable models make the failure harder to catch.
Prompt injection
Prompt augmentation: dual-channel injection attacks
When a hidden image injection is paired with a user message that reinforces or anchors the injected premise, the model receives apparent corroboration from two seemingly independent sources, which makes the attack harder to detect and resist.
Prompt injection
Tiny-font injection: hiding instructions at readable contrast
An injection attack that hides text by size rather than by color: the text is too small for a human reviewer to read in practice, yet remains legible to a model's vision system.
Competing objectives
Competing objectives: when helpfulness becomes a vector
Attacks that exploit the tension between a model's helpfulness directive and its safety constraints, using the model's own values against it.
Prompt injection
Hidden-text injection in multimodal upload workflows
When your model accepts image or document uploads, the upload itself becomes an injection surface. This report covers how attackers hide instructions in content that looks blank to human reviewers.
Specification gaming
Specification gaming: when your model optimizes the wrong thing
Reward hacking, metric gaming, self-evaluation inflation, and loophole exploitation are failure modes that emerge when a model satisfies the letter of its instructions but not the intent.
System prompt leakage
Five ways your system prompt leaks to users
Direct probing, encoded extraction, roleplay attacks, multi-turn escalation, and differential analysis.
Hallucination
Hallucination in production: when confident is worse than wrong
How LLMs fabricate facts, invent citations, and elaborate on false premises, and why benchmark scores don't predict production behavior.
Prompt injection
Indirect prompt injection via RAG-retrieved documents
How attackers embed malicious instructions in documents your model retrieves, and four specific attack patterns to test for.
Sycophancy
Sycophancy is an enterprise liability
When a model tells users what they want to hear instead of what's true, the consequences range from bad advice to legal exposure.
Data exfiltration
Cross-user data leakage in multi-tenant LLM deployments
How one user's data can appear in another user's responses, and the test patterns that expose this failure in RAG and memory-augmented systems.
Methodology
Eight ways an airline chatbot fails
A taxonomy of failure modes for customer-facing LLMs in regulated, high-stakes deployment contexts.
Refusal calibration
Over-refusal: the other safety failure
When models refuse legitimate medical, security, historical, and creative requests, the safety system is miscalibrated, and the cost is real.
Payload splitting
Payload splitting: how harmful requests hide across multiple turns
Attacks that distribute a harmful request across innocuous conversational turns evade single-turn safety filters. Here is the pattern and what it takes to catch it.
Jailbreak
Jailbreak techniques that work on production deployments
Persona attacks, fictional framing, prompt injection, and context manipulation: the attack classes that bypass behavioral constraints in deployed LLMs.
Virtualization
Virtualization attacks: how 'simulation mode' suspends your guardrails
Attackers claim the model is in a special mode where its safety restrictions are lifted. Here is why this works, and how to test whether your model is vulnerable.
Reading is useful. Testing is better.
Want to know which of these risks applies to your AI?
Send a short description of your system and I'll give you a plain-English read on the risks I would check first.
Free · No call required
Not sure what to read first?
Take the 60-second AI risk check. It will help you spot whether your AI has the kinds of exposure these articles are about.