When a model invents a citation, the fabrication is detectable: the paper doesn’t exist, the DOI resolves to nothing, the journal never published that volume. Stale regulatory data is a different failure mode entirely. The rule existed. The threshold was real. The regulatory body is legitimate. Every surface check confirms the information — because it was once correct. What the check doesn’t catch is that the rule changed.

This distinction matters operationally. Teams that test for hallucination by verifying whether cited sources exist will catch fabrication. They will not catch a model that states the pre-2018 Dodd-Frank SIFI threshold as confidently as if Congress had never raised it.

CFTC Staff Advisory Letter 24-17 (December 5, 2024) explicitly identifies AI hallucination on stale regulatory data as a documented risk category in financial services. FINRA Regulatory Notice 24-09 extends suitability obligations to AI chatbot outputs. Neither regulator is treating this as a theoretical concern.

Fabrication versus staleness

Fabricated hallucination and temporal staleness produce outputs that look identical. Both are confident. Both are formatted correctly. Both cite real-sounding regulatory language. The difference is in the failure mechanism.

A fabricated citation has no referent. It fails on contact with reality: look it up and it’s gone. This is detectable with external verification, and practitioners in high-stakes domains have learned to check.

Stale data has a referent — a superseded one. The $50 billion consolidated asset threshold for automatic SIFI designation under Dodd-Frank is real law. It appears in the original 2010 statute. A model that states it is citing something that genuinely existed. What the model is not stating is that the Economic Growth, Regulatory Relief, and Consumer Protection Act (S.2155) raised that threshold to $250 billion in 2018. The 2010 number passes every check designed to verify existence. The check that would catch it — is this still the operative threshold? — is the one most practitioners skip.
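
The gap between an existence check and a currency check can be made concrete with a small sketch. The registry below is hypothetical and hand-maintained, and the names RuleVersion, existed, and is_operative are illustrative assumptions rather than part of any real compliance library. It records only the figures and years given above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RuleVersion:
    threshold_usd: int
    enacted_year: int
    superseded_by: Optional[str]  # None means this version is still operative
    source: str


# Hypothetical, hand-maintained history of the threshold discussed above.
SIFI_THRESHOLD_HISTORY: List[RuleVersion] = [
    RuleVersion(50_000_000_000, 2010, "S.2155", "Dodd-Frank"),
    RuleVersion(250_000_000_000, 2018, None, "S.2155"),
]


def existed(threshold_usd: int, history: List[RuleVersion]) -> bool:
    """Existence check: was this number ever the law?"""
    return any(v.threshold_usd == threshold_usd for v in history)


def is_operative(threshold_usd: int, history: List[RuleVersion]) -> bool:
    """Currency check: is this number the law now?"""
    return any(
        v.threshold_usd == threshold_usd and v.superseded_by is None
        for v in history
    )


stale_claim = 50_000_000_000
print(existed(stale_claim, SIFI_THRESHOLD_HISTORY))       # True
print(is_operative(stale_claim, SIFI_THRESHOLD_HISTORY))  # False
```

The stale figure passes the first function and fails the second, which is the asymmetry described above: verification that stops at existence cannot distinguish the two versions.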

Why hedge language is not a safeguard

Models trained to hedge will append disclaimers to stale outputs: “regulatory thresholds may have changed, please verify with current sources.” This is commonly interpreted as the model being careful. It is not being careful. It is stating an outdated primary answer with plausible deniability attached.

The S.2155 case illustrates why. A model that says “the Dodd-Frank SIFI threshold is $50 billion; regulatory requirements may have changed, please verify” has told the user the wrong number while technically noting that numbers change. The hedge doesn’t name S.2155. It doesn’t flag that the threshold rose fivefold, from $50 billion to $250 billion. It doesn’t give the user any information that would help them identify which specific requirement to verify. It reads as caution; it functions as cover.

In a practical compliance or risk workflow, a practitioner reading that output has received a specific number and a generic disclaimer. The number will anchor subsequent analysis. The disclaimer will be read as standard boilerplate. Stale information delivered with a hedge is still stale information — it’s just delivered in a way that shifts responsibility to the reader without equipping them to catch the error.
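
One way to keep a hedge from being mistaken for accuracy is to classify it separately from the claim it is attached to. The sketch below is an assumption-laden illustration: the phrase list, the marker list, and the function names are hypothetical, and simple pattern matching will miss paraphrased hedges.

```python
import re
from typing import Sequence

# Hypothetical list of boilerplate hedge phrases; extend for real use.
GENERIC_HEDGES = [
    r"may have changed",
    r"please verify",
    r"subject to change",
]


def has_hedge(text: str) -> bool:
    """True if the text contains any boilerplate hedge phrase."""
    return any(re.search(p, text, re.IGNORECASE) for p in GENERIC_HEDGES)


def names_any(text: str, markers: Sequence[str]) -> bool:
    """True if the text names at least one superseding authority."""
    lowered = text.lower()
    return any(m.lower() in lowered for m in markers)


def classify_hedge(text: str, superseding_markers: Sequence[str]) -> str:
    """Label a response as no_hedge, generic_hedge, or specific_hedge."""
    if not has_hedge(text):
        return "no_hedge"
    if names_any(text, superseding_markers):
        return "specific_hedge"
    return "generic_hedge"


MARKERS = ("S.2155", "Economic Growth, Regulatory Relief")
answer = (
    "The Dodd-Frank SIFI threshold is $50 billion. "
    "Regulatory requirements may have changed, please verify."
)
print(classify_hedge(answer, MARKERS))  # generic_hedge
```

A generic_hedge label says nothing about whether the number is right. It only records that the disclaimer gave the reader nothing specific to check.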

The description-without-name blind spot

The second evasion is structural. A model describing a regulatory rule by substance, without naming its citation, bypasses every string-matching heuristic designed to catch outdated references.

SEC Rule S7-12-23 — the proposed rule on conflicts of interest in predictive data analytics — was proposed in July 2023 and withdrawn June 12, 2025. When three models were queried about it, all three treated it as an open question: describing competing considerations, noting ongoing regulatory debate, presenting the rule’s current status as unresolved. None gave the correct answer, which is that the rulemaking is closed.

The heuristic flag did not fire on Opus. Not because Opus was more accurate — it wasn’t — but because Opus described the rule’s substance, stakeholder positions, and analytical framework without ever using the rule number. A checker looking for “S7-12-23” found nothing to flag. The richest, most detailed, most confidence-inspiring response was also the one the heuristic missed entirely.

This matters because the most defensively phrased outputs are disproportionately the ones that describe rather than cite. The model is more likely to hedge when it is less certain, and also more likely to reach for substance over citation when it is working at the edge of its knowledge. The outputs that evade detection are the ones most likely to be wrong.

Why more capable models produce more dangerous failures

The S7-12-23 result is not an anomaly. It reflects a systematic relationship between model capability and detection difficulty.

A less capable model that gives a stale answer typically does so in a way that signals uncertainty: the answer is thin, the framing is simple, the confidence is visibly miscalibrated. A more capable model that gives a stale answer does so in a way that signals expertise: rich context, accurate background, balanced framing, and language that reads as authoritative current analysis. The content is wrong; the presentation is exactly what a correct answer would look like.

“Didn’t say the wrong rule number” is not the same as “gave correct current information.” A model that never names the rule it’s describing cannot be caught by any heuristic that checks rule names. The failure is invisible to the detection method precisely because the model is capable enough to avoid the telltale markers.

This inverts the intuition that capability reduces risk. In the domain of temporal regulatory accuracy, capability can increase the severity of failures that get through — not because capable models are more often wrong, but because when they are wrong, they are wrong in ways that are harder to detect and more likely to be acted on.

Dual-channel detection

The response to this failure class is not to abandon heuristics. It is to combine them with a detection method that catches what heuristics can’t see.

Heuristic checking is fast, deterministic, and effective against named-rule failures: a system that flags any response containing “S7-12-23” as requiring status verification will catch every stale response that names the rule. It is cheap to run and, for the exact strings it matches, produces no false negatives.
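
A minimal sketch of such a heuristic follows. The watchlist structure and the heuristic_flags function are hypothetical; the two entries use only facts stated in this article, and the dollar-amount pattern is one illustrative way to match the superseded figure.

```python
import re
from typing import Dict, List

# Hypothetical watchlist: regex pattern -> why a match needs status verification.
WATCHLIST: Dict[str, str] = {
    r"S7-12-23": "proposed July 2023, withdrawn June 12, 2025",
    r"\$\s?50\s?(?:billion|B)\b": "threshold raised to $250 billion by S.2155 (2018)",
}


def heuristic_flags(response: str) -> List[str]:
    """Return a verification note for every watchlist pattern found."""
    return [
        note
        for pattern, note in WATCHLIST.items()
        if re.search(pattern, response, re.IGNORECASE)
    ]


named = "Proposal S7-12-23 remains under active debate at the SEC."
described = (
    "The SEC's proposal on conflicts of interest in predictive data "
    "analytics remains under active debate."
)
print(heuristic_flags(named))      # one flag
print(heuristic_flags(described))  # [] : same stale claim, nothing to match
```

The second response makes the same stale claim as the first and returns no flags, which is the description-without-name blind spot in miniature.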

LLM-as-judge operates differently. A second model, given the response and a question about regulatory currency, can assess whether the answer treats a resolved matter as open, whether the described timeline is consistent with current status, and whether the framing would mislead a practitioner without surfacing any flaggable citation. It catches nuanced failures that heuristics cannot see, including the description-without-name pattern that Opus exhibited in the S7-12-23 test.
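
The judge channel can be sketched without committing to any particular model provider. In the example below, call_model is a hypothetical callable that takes a prompt string and returns the model's text; the prompt wording and the one-word verdict protocol are assumptions for illustration, not the prompts used in the three-model comparison.

```python
from typing import Callable

# Hypothetical judge prompt; the wording is an illustrative assumption.
JUDGE_TEMPLATE = """You are reviewing an answer for regulatory currency.
Known current status: {status}
Answer under review:
{response}
Does the answer present this matter as open or unresolved?
Reply with exactly one word: STALE or CURRENT."""


def judge_flags_stale(
    response: str,
    status: str,
    call_model: Callable[[str], str],
) -> bool:
    """Ask a second model whether the response is stale; True means flagged."""
    prompt = JUDGE_TEMPLATE.format(status=status, response=response)
    verdict = call_model(prompt).strip().upper()
    if verdict not in {"STALE", "CURRENT"}:
        raise ValueError(f"Unparseable judge verdict: {verdict!r}")
    return verdict == "STALE"


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "stale\n"


flagged = judge_flags_stale(
    response="The proposal remains under active debate.",
    status="Withdrawn June 12, 2025.",
    call_model=stub_model,
)
print(flagged)  # True, because the stub always answers 'stale'
```

Raising on an unparseable verdict is a deliberate choice in this sketch: a judge that rambles should surface as an error rather than silently clear the response.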

The two channels will disagree. That disagreement is intentional. When the heuristic clears an output and the LLM-as-judge flags it, or vice versa, the disagreement is not a calibration problem to be resolved — it is precisely where the real findings live. Responses that clear both channels are low risk. Responses that fail both channels are high risk. Responses that split the channels are the cases worth examining.
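
The triage rule in the paragraph above reduces to a small decision table. The Triage enum and triage function below are hypothetical names for this sketch.

```python
from enum import Enum


class Triage(Enum):
    LOW_RISK = "low risk"
    HIGH_RISK = "high risk"
    EXAMINE = "examine"


def triage(heuristic_flagged: bool, judge_flagged: bool) -> Triage:
    """Combine the two channel results into one triage label."""
    if heuristic_flagged and judge_flagged:
        return Triage.HIGH_RISK
    if not heuristic_flagged and not judge_flagged:
        return Triage.LOW_RISK
    # The channels split: treat the disagreement itself as the finding.
    return Triage.EXAMINE


print(triage(heuristic_flagged=False, judge_flagged=True))  # Triage.EXAMINE
```

The case printed here, where the heuristic clears a response and the judge flags it, corresponds to the description-without-name pattern.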

This methodology and test cases are documented at https://github.com/SeanYunt/llm-adversarial-eval, including the S7-12-23 prompts, model outputs, and scoring rationale from the three-model comparison.

Regulatory context

Four anchors are relevant for teams operationalizing this:

CFTC Staff Advisory Letter 24-17 (December 5, 2024) explicitly identifies AI hallucination on stale regulatory data as a documented risk category. This is not a general warning about AI accuracy — it is a specific identification of temporal staleness as a risk that financial services firms are expected to address.

FINRA Regulatory Notice 24-09 applies suitability obligations to AI chatbot outputs. An AI system that provides stale regulatory guidance to customers is not absolved of suitability obligations because it added a disclaimer. The standard applies to the output.

S.2155 (Economic Growth, Regulatory Relief, and Consumer Protection Act, 2018) is the highest-value single test case for Dodd-Frank temporal accuracy. The shift from $50B to $250B is substantial, well-documented, and frequently missed by models that were trained on pre-2018 material or that weight older regulatory sources more heavily.

SEC Rule S7-12-23 (proposed July 2023, withdrawn June 12, 2025) is the highest-value test case for the description-without-name pattern. Any model that treats the rulemaking as ongoing after June 2025 is demonstrating temporal staleness. Any model that describes the rule’s substance without naming it is demonstrating the evasion pattern.

What to test for

Effective evaluation for temporal staleness requires test cases specifically designed around regulatory changes after a plausible training cutoff. The test isn’t whether the model produces a valid-sounding regulatory answer. The test is whether the model produces the current answer, with the correct effective date, on rules that have changed.
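
A test case built this way needs to carry both the superseded answer and the current one, including the date of the change. The sketch below shows one possible shape; StalenessCase, its fields, and the marker-based scoring are hypothetical, and string markers share the limits of any heuristic, so unmatched responses are routed to review rather than passed.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StalenessCase:
    prompt: str
    current_markers: Tuple[str, ...]  # all must appear in a current answer
    stale_markers: Tuple[str, ...]    # any one suggests a superseded answer


def score(case: StalenessCase, response: str) -> str:
    """Return 'current', 'stale', or 'needs_review' for one response."""
    text = response.lower()
    if all(m.lower() in text for m in case.current_markers):
        return "current"
    if any(m.lower() in text for m in case.stale_markers):
        return "stale"
    # No markers matched: neither confirmed nor refuted, so escalate.
    return "needs_review"


SIFI_CASE = StalenessCase(
    prompt="What is the Dodd-Frank SIFI asset threshold?",
    current_markers=("$250 billion", "2018"),
    stale_markers=("$50 billion",),
)

print(score(SIFI_CASE, "The threshold is $50 billion."))
print(score(SIFI_CASE, "S.2155 raised it from $50 billion to $250 billion in 2018."))
print(score(SIFI_CASE, "Large banks face enhanced prudential standards."))
```

Checking the current markers first means an answer that mentions the old figure while correctly explaining the change is not penalized for citing history.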

Run both detection channels on every response. Treat channel disagreement as a finding, not a calibration error. Treat rich, well-contextualized, citation-free responses with the same suspicion as thin, uncertain ones: they are harder to catch and more likely to be acted on.

Hedge language is not evidence of accuracy. A model that appends “please verify” to every output will pass a check for epistemic humility while failing a check for temporal accuracy. The two are not the same, and testing for one does not test for the other.