Medicaid Fraud Hunter: Investigative Pipeline for Anomalous Medicaid Billing Detection

Black Diamond Consulting LLC | Published: May 2026 | Sean Yunt

Executive Summary

Medicaid Fraud Hunter is a self-hosted analytical pipeline that scans publicly available HHS Medicaid claims data to identify providers exhibiting statistically anomalous billing patterns. It produces ranked suspect lists and evidence-grade PDF dossiers suitable for attorney review and regulatory referral.

The pipeline operates entirely on-premises, requires no cloud services, and processes the full national Medicaid dataset — over 617,000 providers and 159 million procedure rows — on commodity server hardware totaling $237. Per-scan electricity cost runs approximately $0.04, compared to an estimated $0.76 on-demand or $0.23 spot on equivalent AWS infrastructure.

The system implements four independent anomaly detectors (volume impossibility, revenue outlier, billing spike, and suspicious consistency) and scores each provider by how many detectors corroborate one another rather than by any single detector's threshold. Of 250,825 providers analyzed (those billing $100,000 or more in total), 79,944 scored above the 0.3 flagging threshold, and the top suspect scored 85% across three corroborating detectors.

The pipeline is designed around a specific investigative use case: generating qui tam referrals under the False Claims Act. Outputs are calibrated for attorney review, not automated enforcement — every flagged provider requires human verification before referral action. The current implementation identifies statistically anomalous billing; whether that anomaly represents fraud is a legal determination outside the scope of the tool.

Infrastructure

The pipeline runs on a single workstation located on-premises.

Component	Specification
Server	Dell Precision T3600 Workstation
CPU	Intel Xeon E5-1650 — 6 cores / 12 threads @ 3.2GHz
RAM	64GB ECC DDR3 (required for national dataset processing)
GPU	NVIDIA GeForce GTX 1060 6GB (CUDA acceleration)
Storage	120GB NVMe SSD (OS) + 1TB HDD (data)
OS	Debian Linux
Runtime	Docker containers
Total Hardware Cost	$237

All pipeline components run as Docker containers with automatic restart on reboot. PDF dossiers are served via an on-premises Nginx file server accessible at http://10.0.0.45:8083/dossiers/. The 64GB ECC RAM is not over-provisioned — Stage 1 preprocessing peaks at approximately 35GB during full-dataset aggregation.

Data Sources

Dataset	Source	Size	Contents
Medicaid Provider Spending	HHS / data.cms.gov	2.8 GB	All Medicaid claims — 617,503 providers, 159M procedure rows
NPPES Provider Registry	CMS National Plan & Provider Enumeration System	1 GB (zip)	Provider names, addresses, specialties, NPI numbers

Both datasets are publicly available and updated periodically by CMS. The pipeline must be re-run when new data is published — there is currently no automated detection for dataset updates. The NPPES registry is used exclusively for provider identity resolution during the profiling stage; the claims dataset drives all anomaly detection.

Pipeline Architecture

The workflow consists of three sequential stages. Stage 1 runs once per dataset version. Stages 2 and 3 run on demand.

Stage 1 — Preprocess (One-Time)

Reads the raw 2.8 GB claims file and aggregates it into two optimized summary tables:

provider_monthly.parquet — 202 MB, 20,074,086 rows — monthly billing totals per provider
provider_procedure.parquet — 793 MB, 159,437,730 rows — procedure-level detail per provider

Metric	Value
Input	2.8 GB raw parquet file
Output	~1 GB in two summary parquet files
Runtime	~6 minutes
Frequency	Once per dataset version
Peak RAM	~35 GB

These files eliminate repeated full-dataset reads on every subsequent scan. Without preprocessing, each scan would re-aggregate 159 million rows from scratch. With it, that cost is paid once per dataset version rather than on every investigation.
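The aggregation step can be pictured roughly as follows. This is a minimal pure-Python sketch, not the pipeline's actual code: the field names (`npi`, `month`, `procedure`, `claims`, `paid`), the grouping keys, and the sample rows are all assumptions made for illustration, and the real stage reads and writes parquet files rather than in-memory dictionaries.

```python
from collections import defaultdict

def preprocess(rows):
    """Aggregate raw claim rows into monthly and procedure-level summaries.

    Each row is a dict with hypothetical keys:
    npi, month ("YYYY-MM"), procedure, claims, paid.
    """
    monthly = defaultdict(lambda: {"claims": 0, "paid": 0.0})
    procedure = defaultdict(lambda: {"claims": 0, "paid": 0.0})
    for row in rows:
        # Roll up by provider and month (assumed grain of the monthly table).
        m = monthly[(row["npi"], row["month"])]
        m["claims"] += row["claims"]
        m["paid"] += row["paid"]
        # Roll up by provider and procedure code (assumed grain).
        p = procedure[(row["npi"], row["procedure"])]
        p["claims"] += row["claims"]
        p["paid"] += row["paid"]
    return dict(monthly), dict(procedure)

# Made-up sample rows, for illustration only.
raw = [
    {"npi": "1000000001", "month": "2024-01", "procedure": "99213", "claims": 40, "paid": 2800.0},
    {"npi": "1000000001", "month": "2024-01", "procedure": "99214", "claims": 10, "paid": 1100.0},
    {"npi": "1000000001", "month": "2024-02", "procedure": "99213", "claims": 35, "paid": 2450.0},
]
monthly, procedure = preprocess(raw)
```

Later stages would then read only these summaries, which is what keeps a scan from touching the raw file again.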

Stage 2 — Scan (On Demand)

Runs four independent anomaly detectors against all 250,825 qualifying providers (those billing $100,000 or more total). Results are ranked by corroborating evidence score and saved to a CSV file.

Detector	What It Catches	Method	National Flags
Volume Impossibility	More than 1,500 claims in a single month	Hard threshold	59,257 providers
Revenue Outlier	Abnormally high revenue per claim vs. peers	Median/MAD comparison	27,675 providers
Billing Spike	Sudden 5x+ surge vs. provider’s own history	Rolling baseline comparison	8,478 providers
Suspicious Consistency	>90% of bills at identical dollar amount	Consistency ratio	0 providers*

* Suspicious consistency flagged 0 providers at national scale; threshold recalibration is pending.
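The four detectors in the table can be sketched as small independent functions. The 1,500-claim limit, the 5x spike factor, and the 90% consistency ratio come from the table; everything else is an assumption for illustration, including the function names, the 3-month baseline window, and the modified z-score form of the median/MAD test with its 0.6745 scaling and 3.5 cutoff. The pipeline's real thresholds for those parts are not stated in this report.

```python
from collections import Counter
from statistics import median

def volume_impossibility(monthly_claims, limit=1500):
    """Flag if any single month exceeds the hard claim-count threshold."""
    return any(count > limit for count in monthly_claims)

def revenue_outlier(value, peer_values, cutoff=3.5):
    """Flag if revenue per claim sits far above the peer median, in MAD units.

    Uses the modified z-score 0.6745 * (x - median) / MAD; the cutoff of 3.5
    is a common convention and an assumption here.
    """
    med = median(peer_values)
    mad = median(abs(v - med) for v in peer_values)
    if mad == 0:
        return False  # no spread among peers, so the test is undefined
    return 0.6745 * (value - med) / mad > cutoff

def billing_spike(monthly_paid, window=3, factor=5.0):
    """Flag if any month is at least `factor` times the mean of the
    preceding `window` months (window length is an assumption)."""
    for i in range(window, len(monthly_paid)):
        baseline = sum(monthly_paid[i - window:i]) / window
        if baseline > 0 and monthly_paid[i] >= factor * baseline:
            return True
    return False

def suspicious_consistency(amounts, ratio=0.90):
    """Flag if more than `ratio` of bills share one identical dollar amount."""
    if not amounts:
        return False
    most_common_count = Counter(amounts).most_common(1)[0][1]
    return most_common_count / len(amounts) > ratio
```

Keeping each detector free of shared state is what allows the scoring step to treat their outputs as independent evidence.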

Scoring logic: providers are scored based on the number of independent detectors that fire. One detector yields a maximum score of 70%; two detectors, 90%; three or more, 100%. These figures are ceilings rather than fixed values: the top-scoring provider in the national scan reached 85% with three detectors firing. This corroboration model reduces single-detector false positives by requiring independent evidence before elevating a provider to high-priority status.
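One way to express the ceilings is sketched below. Only the 70% / 90% / 100% caps come from this report. How the pipeline arrives at a score beneath each cap is not described, so the per-detector severity values and their averaging are assumptions for illustration, as is the function name.

```python
# Ceilings by number of detectors that fired; three or more cap at 1.00.
SCORE_CAPS = {1: 0.70, 2: 0.90}

def corroboration_score(severities):
    """Combine per-detector severities (each 0..1) into one capped score.

    `severities` holds one value per detector that fired. Averaging them
    is an assumption; the source only specifies the caps.
    """
    fired = len(severities)
    if fired == 0:
        return 0.0
    cap = SCORE_CAPS.get(fired, 1.00)
    strength = sum(severities) / fired
    return round(cap * strength, 4)
```

Under this sketch a provider flagged by a single detector can never outrank one flagged at full severity by two, which is the corroboration property the scoring model describes.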

Metric	Value
Providers analyzed	250,825 (366,678 excluded below $100K threshold)
Providers scoring above the 0.3 (30%) threshold	79,944
Top score	85% (3 corroborating detectors)
Runtime	~3.5 minutes
Output	`scan_results.csv` — ranked list with NPI, score, flags

Scans can be filtered by state for targeted investigations: `scan.sh --state WV`

Stage 3 — Profile (On Demand, Per Provider)

Generates a detailed PDF investigation dossier for a specific provider NPI. This is the primary deliverable for attorney review.

Each dossier contains:

Provider identity — name, address, specialty, NPI (from NPPES registry)
Claims summary — total claims, total paid, beneficiary count, date range
Top procedures by volume and dollar amount
Peer comparison — percentile rank and z-score against all 617,503 providers nationally
Monthly billing timeline — month-by-month claims and payment totals
Detected anomalies with severity and supporting evidence
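The peer comparison item above reports a percentile rank and a z-score. A minimal sketch of both calculations follows; the function name is hypothetical, and the choices of population standard deviation and an inclusive ("at or below") percentile definition are assumptions, since the report does not say which variants the dossier uses.

```python
from statistics import mean, pstdev

def peer_comparison(value, peer_values):
    """Return (percentile_rank, z_score) of `value` within `peer_values`.

    Percentile rank is the share of peers at or below `value`, in percent.
    The z-score uses the population standard deviation of the peer group.
    """
    at_or_below = sum(1 for v in peer_values if v <= value)
    percentile = 100.0 * at_or_below / len(peer_values)
    sigma = pstdev(peer_values)
    z_score = (value - mean(peer_values)) / sigma if sigma > 0 else 0.0
    return percentile, z_score
```

Billing data tends to be heavily skewed, so the percentile rank is usually the easier of the two figures to interpret for a non-technical reader.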

Metric	Value
Runtime	Under 2 minutes per provider
Output	PDF dossier at `/mnt/storage/dossiers/`
Access	`http://10.0.0.45:8083/dossiers/` (local network)

Before profiling, a provider can be identified by name using the NPPES registry lookup: `lookup.sh "Acme Health Clinic" --state WV`

This returns the provider's NPI, address, and specialty; the NPI is then passed directly to `profile.sh`.

End-to-End Investigative Workflows

Reactive investigation (specific provider reported as suspicious):

Step	Command	Output
1. Identify provider NPI	`lookup.sh "Clinic Name" --state WV`	NPI number + address
2. Generate dossier	`profile.sh <NPI>`	PDF dossier
3. Review dossier	`http://10.0.0.45:8083/dossiers/`	Open in browser
4. Refer to attorney	Share PDF	Qui tam referral

Proactive investigation (national scan-driven):

Step	Command	Output
1. Run national scan	`scan.sh`	`scan_results.csv` — 79,944 flagged providers
2. Review top suspects	Open CSV	Ranked list with scores and detector flags
3. Profile top candidates	`profile.sh <NPI>`	PDF dossier per provider
4. Refer to attorney	Share PDF	Qui tam referral
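Step 2 of the proactive workflow, reviewing the ranked CSV, can also be scripted. The sketch below assumes `scan_results.csv` has columns named `npi`, `score`, and `flags`, with `score` stored as a fraction between 0 and 1; the report lists those three fields but not their exact headers or encoding, so the names, the flag separator, and the sample rows are hypothetical.

```python
import csv
import io

def top_suspects(handle, min_score=0.3, limit=10):
    """Return up to `limit` rows scoring above `min_score`, highest first."""
    rows = [r for r in csv.DictReader(handle) if float(r["score"]) > min_score]
    rows.sort(key=lambda r: float(r["score"]), reverse=True)
    return rows[:limit]

# Made-up sample content standing in for scan_results.csv.
sample = io.StringIO(
    "npi,score,flags\n"
    "1000000001,0.85,volume|revenue|spike\n"
    "1000000002,0.25,volume\n"
    "1000000003,0.63,revenue\n"
)
for row in top_suspects(sample):
    print(row["npi"], row["score"], row["flags"])
```

With a real file, the same function would take `open("scan_results.csv", newline="")` in place of the in-memory sample, and each returned NPI could be passed to `profile.sh`.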

Key Numbers at a Glance

Metric	Value
Total providers in dataset	617,503
Providers analyzed (above $100K threshold)	250,825
Providers flagged nationally	79,944
Top suspect score	85% (3 corroborating detectors)
Preprocess runtime	~6 minutes (one-time)
Scan runtime	~3.5 minutes
Profile runtime	Under 2 minutes
Total infrastructure cost	$237
Estimated AWS equivalent cost per scan	~$0.76 on-demand / ~$0.23 spot
On-premises cost per scan (electricity)	~$0.04
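The cost figures in the table imply a break-even point for the hardware purchase. The arithmetic below is a simple sketch that uses only the per-scan numbers reported here; it assumes nothing else differs between the two options, ignoring idle power, maintenance, storage, and data transfer.

```python
import math

HARDWARE_COST = 237.00          # one-time, from the infrastructure table
ON_PREM_PER_SCAN = 0.04         # electricity
AWS_ON_DEMAND_PER_SCAN = 0.76   # estimated
AWS_SPOT_PER_SCAN = 0.23        # estimated

def break_even_scans(hardware_cost, local_per_scan, cloud_per_scan):
    """Number of scans after which owned hardware is cheaper than cloud."""
    saving = cloud_per_scan - local_per_scan
    if saving <= 0:
        raise ValueError("cloud is not more expensive per scan; no break-even")
    return math.ceil(hardware_cost / saving)

on_demand = break_even_scans(HARDWARE_COST, ON_PREM_PER_SCAN, AWS_ON_DEMAND_PER_SCAN)
spot = break_even_scans(HARDWARE_COST, ON_PREM_PER_SCAN, AWS_SPOT_PER_SCAN)
print(on_demand, spot)
```

On these figures the hardware pays for itself after 330 scans against on-demand pricing and after 1,248 scans against spot pricing.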

Current Limitations

Suspect identification is algorithmic — findings require human verification and legal review before any referral action. The pipeline surfaces statistical anomalies, not confirmed fraud.

Four specific limitations are tracked for the current version:

Suspicious consistency detector requires threshold recalibration for national scale; currently flags 0 providers.
Provider dossier risk score display bug — shows 0% instead of the scan-derived score. Fix pending.
No automated alerting when new HHS datasets are published — manual re-run required.
Qui tam execution requires partnership with a False Claims Act attorney. The pipeline produces referral-ready dossiers; legal filing requires counsel.

Pending Enhancements

Fix dossier risk score to reflect scan-derived score
Recalibrate suspicious consistency detector for national scale
Add automated dataset update detection
Evaluate PySpark scan for parallel multi-state processing
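For the automated dataset update detection item, one possible approach is to compare a checksum of the downloaded claims file against the checksum recorded at the last preprocessing run. The sketch below is an assumption about how this could work, not a description of planned code: the function names and the JSON state file are hypothetical, and it only detects a change in a local copy, so fetching the file from CMS on a schedule would still be a separate step.

```python
import hashlib
import json
from pathlib import Path

def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-gigabyte inputs do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def dataset_changed(data_path, state_path):
    """Return True if the file's hash differs from the recorded one.

    The new hash is written to `state_path` on every call, so a second
    call on an unchanged file returns False.
    """
    current = file_sha256(data_path)
    state_file = Path(state_path)
    previous = None
    if state_file.exists():
        previous = json.loads(state_file.read_text())["sha256"]
    state_file.write_text(json.dumps({"sha256": current}))
    return current != previous
```

A scheduled job could call `dataset_changed` after each download and trigger Stage 1 only when it returns True.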

Disclaimer: Medicaid Fraud Hunter processes publicly available government data. All outputs are intended for attorney review and should not be used for enforcement action without independent legal verification. This report is intended for informational purposes only.

Black Diamond Consulting LLC | 11 3rd ST NW #353, Auburn, WA 98071 | blackdiamondconsulting.ai | sean@blackdiamondconsulting.ai