Black Diamond Consulting LLC | Published: May 2026 | Sean Yunt
Executive Summary
Medicaid Fraud Hunter is a self-hosted analytical pipeline that scans publicly available HHS Medicaid claims data to identify providers exhibiting statistically anomalous billing patterns. It produces ranked suspect lists and evidence-grade PDF dossiers suitable for attorney review and regulatory referral.
The pipeline operates entirely on-premises, requires no cloud services, and processes the full national Medicaid dataset — over 617,000 providers and 159 million procedure rows — on commodity server hardware totaling $237. Per-scan electricity cost runs approximately $0.04, compared to an estimated $0.76 on-demand or $0.23 spot on equivalent AWS infrastructure.
The system implements four independent anomaly detectors — volume impossibility, revenue outlier, billing spike, and suspicious consistency — scored by corroborating evidence count rather than single-detector threshold. Of 250,825 providers analyzed (those billing $100,000 or more total), 79,944 exceeded the detection threshold, with the top suspect scoring 85% across three corroborating detectors.
The pipeline is designed around a specific investigative use case: generating qui tam referrals under the False Claims Act. Outputs are calibrated for attorney review, not automated enforcement — every flagged provider requires human verification before referral action. The current implementation identifies statistically anomalous billing; whether that anomaly represents fraud is a legal determination outside the scope of the tool.
Infrastructure
The pipeline runs on a single workstation located on-premises.
| Component | Specification |
|---|---|
| Server | Dell Precision T3600 Workstation |
| CPU | Intel Xeon E5-1650 — 6 cores / 12 threads @ 3.2GHz |
| RAM | 64GB ECC DDR3 (required for national dataset processing) |
| GPU | NVIDIA GeForce GTX 1060 6GB (CUDA acceleration) |
| Storage | 120GB NVMe SSD (OS) + 1TB HDD (data) |
| OS | Debian Linux |
| Runtime | Docker containers |
| Total Hardware Cost | $237 |
All pipeline components run as Docker containers with automatic restart on reboot. PDF dossiers are served via an on-premises Nginx file server accessible at http://10.0.0.45:8083/dossiers/. The 64GB ECC RAM is not over-provisioned — Stage 1 preprocessing peaks at approximately 35GB during full-dataset aggregation.
Data Sources
| Dataset | Source | Size | Contents |
|---|---|---|---|
| Medicaid Provider Spending | HHS / data.cms.gov | 2.8 GB | All Medicaid claims — 617,503 providers, 159M procedure rows |
| NPPES Provider Registry | CMS National Plan and Provider Enumeration System (NPPES) | 1 GB (zip) | Provider names, addresses, specialties, NPI numbers |
Both datasets are publicly available and updated periodically by CMS. The pipeline must be re-run when new data is published — there is currently no automated detection for dataset updates. The NPPES registry is used exclusively for provider identity resolution during the profiling stage; the claims dataset drives all anomaly detection.
Pipeline Architecture
The workflow consists of three sequential stages. Stage 1 runs once per dataset version. Stages 2 and 3 run on demand.
Stage 1 — Preprocess (One-Time)
Reads the raw 2.8GB claims file and aggregates it into two optimized summary tables:
- provider_monthly.parquet (202 MB, 20,074,086 rows): monthly billing totals per provider
- provider_procedure.parquet (793 MB, 159,437,730 rows): procedure-level detail per provider
| Metric | Value |
|---|---|
| Input | 2.8 GB raw parquet file |
| Output | ~1 GB in two summary parquet files |
| Runtime | ~6 minutes |
| Frequency | Once per dataset version |
| Peak RAM | ~35 GB |
These files eliminate repeated full-dataset reads on every subsequent scan. Without preprocessing, each scan would re-aggregate 159 million rows from scratch. With it, that cost is paid once per dataset version rather than on every investigation.
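The monthly roll-up behind provider_monthly.parquet can be sketched in a few lines. This is a minimal illustration in plain Python, not the pipeline's implementation: the function name `aggregate_monthly` and the field names (`npi`, `month`, `claims`, `paid`) are assumptions, the sample rows are invented, and the real job works on parquet files at far larger scale.

```python
from collections import defaultdict


def aggregate_monthly(rows):
    """Collapse procedure-level rows into one total per (provider, month).

    Each row is a dict with hypothetical keys: npi, month, claims, paid.
    """
    totals = defaultdict(lambda: {"claims": 0, "paid": 0.0})
    for row in rows:
        key = (row["npi"], row["month"])
        totals[key]["claims"] += row["claims"]
        totals[key]["paid"] += row["paid"]
    return dict(totals)


# Invented sample rows: two procedures in January, one in February.
rows = [
    {"npi": "1000000001", "month": "2024-01", "claims": 10, "paid": 800.0},
    {"npi": "1000000001", "month": "2024-01", "claims": 5, "paid": 600.0},
    {"npi": "1000000001", "month": "2024-02", "claims": 7, "paid": 560.0},
]
monthly = aggregate_monthly(rows)
```

The three input rows collapse to two provider-month totals, which is the same kind of reduction that takes the national data from 159 million procedure rows to roughly 20 million monthly rows.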
Stage 2 — Scan (On Demand)
Runs four independent anomaly detectors against all 250,825 qualifying providers (those billing $100,000 or more total). Results are ranked by corroborating evidence score and saved to a CSV file.
| Detector | What It Catches | Method | National Flags |
|---|---|---|---|
| Volume Impossibility | More than 1,500 claims in a single month | Hard threshold | 59,257 providers |
| Revenue Outlier | Abnormally high revenue per claim vs. peers | Median/MAD comparison | 27,675 providers |
| Billing Spike | Sudden 5x+ surge vs. provider’s own history | Rolling baseline comparison | 8,478 providers |
| Suspicious Consistency | >90% of bills at identical dollar amount | Consistency ratio | 0 providers* |
*Note: the suspicious consistency detector flagged 0 providers at national scale; threshold recalibration is pending.
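The four detectors in the table can each be sketched as a small function. This is an illustration under stated assumptions rather than the pipeline's actual code. The 1,500-claim limit, the 5x factor, and the 90% ratio come from the table above; the function names, the 0.6745 scaling constant and 3.5 cutoff (the conventional modified z-score), and the six-month mean baseline are assumptions added for the example.

```python
from collections import Counter
from statistics import median


def volume_impossible(monthly_claims, limit=1500):
    """Hard threshold: any single month with more than `limit` claims."""
    return any(count > limit for count in monthly_claims)


def revenue_outlier(value, peer_values, cutoff=3.5):
    """Median/MAD comparison via the modified z-score (cutoff is assumed)."""
    med = median(peer_values)
    mad = median(abs(v - med) for v in peer_values)
    if mad == 0:
        return False  # peers are identical, so there is no scale to compare against
    # 0.6745 makes the MAD comparable to a standard deviation for normal data.
    return 0.6745 * (value - med) / mad > cutoff


def billing_spike(monthly_paid, window=6, factor=5.0):
    """Rolling baseline: a month at least `factor` times the prior-window mean."""
    for i in range(window, len(monthly_paid)):
        baseline = sum(monthly_paid[i - window:i]) / window
        if baseline > 0 and monthly_paid[i] >= factor * baseline:
            return True
    return False


def suspiciously_consistent(amounts, ratio=0.90):
    """Consistency ratio: share of bills at the single most common amount."""
    if not amounts:
        return False
    top_count = Counter(amounts).most_common(1)[0][1]
    return top_count / len(amounts) > ratio
```

Each function looks only at its own input, so the detectors stay independent of one another, which is what the corroboration scoring relies on.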
Scoring logic: providers are scored based on the number of independent detectors that fire. One detector yields a maximum score of 70%; two detectors, 90%; three or more, 100%. This corroboration model reduces single-detector false positives by requiring independent evidence before elevating a provider to high-priority status.
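The cap structure described above can be expressed as a lookup. The sketch below assumes a hypothetical `raw_strength` value in the range 0 to 1 that summarizes how strong the evidence is; how the pipeline actually derives that value is not stated here, and the zero-detector case (a cap of 0) is also an assumption.

```python
# Maximum score allowed for a given number of fired detectors.
# The 0.70 and 0.90 caps are from the text; the 0-detector entry is assumed.
CAPS = {0: 0.0, 1: 0.70, 2: 0.90}


def score_cap(detectors_fired):
    """Return the score ceiling; three or more detectors allow up to 1.0."""
    if detectors_fired < 0:
        raise ValueError("detectors_fired must be non-negative")
    return CAPS.get(detectors_fired, 1.0)


def capped_score(raw_strength, detectors_fired):
    """Clamp a hypothetical raw evidence strength to the corroboration cap."""
    raw = min(max(raw_strength, 0.0), 1.0)
    return min(raw, score_cap(detectors_fired))
```

Under this reading, a provider with strong evidence from a single detector cannot exceed 70%, while a score of 85% with three detectors sits below the 100% ceiling and passes through unchanged.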
| Metric | Value |
|---|---|
| Providers analyzed | 250,825 (366,678 excluded below $100K threshold) |
| Providers flagged (score above 0.3, i.e. 30%) | 79,944 |
| Top score | 85% (3 corroborating detectors) |
| Runtime | ~3.5 minutes |
| Output | scan_results.csv — ranked list with NPI, score, flags |
Scans can be filtered by state for targeted investigations: scan.sh --state WV
Stage 3 — Profile (On Demand, Per Provider)
Generates a detailed PDF investigation dossier for a specific provider NPI. This is the primary deliverable for attorney review.
Each dossier contains:
- Provider identity — name, address, specialty, NPI (from NPPES registry)
- Claims summary — total claims, total paid, beneficiary count, date range
- Top procedures by volume and dollar amount
- Peer comparison — percentile rank and z-score against all 617,503 providers nationally
- Monthly billing timeline — month-by-month claims and payment totals
- Detected anomalies with severity and supporting evidence
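The peer comparison figures in the dossier, percentile rank and z-score, can be computed as follows. This is a minimal sketch with hypothetical function names and made-up peer values; it assumes the percentile counts peers at or below the provider's value and that the z-score uses the population standard deviation, neither of which is confirmed by the source.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def percentile_rank(value, peers):
    """Percentage of peer values at or below `value`."""
    ordered = sorted(peers)
    return 100.0 * bisect_right(ordered, value) / len(ordered)


def z_score(value, peers):
    """Distance from the peer mean in population standard deviations."""
    spread = pstdev(peers)
    if spread == 0:
        return 0.0  # all peers identical; no meaningful distance
    return (value - mean(peers)) / spread


# Made-up peer totals for illustration only.
peers = [10, 20, 30, 40, 50]
rank = percentile_rank(40, peers)
z = z_score(50, peers)
```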
| Metric | Value |
|---|---|
| Runtime | Under 2 minutes per provider |
| Output | PDF dossier at /mnt/storage/dossiers/ |
| Access | http://10.0.0.45:8083/dossiers/ (local network) |
Before profiling, a provider can be identified by name using the NPPES registry lookup:
lookup.sh "Acme Health Clinic" --state WV
This returns the provider's NPI, address, and specialty. The NPI is then passed directly to profile.sh.
End-to-End Investigative Workflows
Reactive investigation (specific provider reported as suspicious):
| Step | Command | Output |
|---|---|---|
| 1. Identify provider NPI | lookup.sh "Clinic Name" --state WV | NPI number + address |
| 2. Generate dossier | profile.sh <NPI> | PDF dossier |
| 3. Review dossier | http://10.0.0.45:8083/dossiers/ | Open in browser |
| 4. Refer to attorney | Share PDF | Qui tam referral |
Proactive investigation (national scan-driven):
| Step | Command | Output |
|---|---|---|
| 1. Run national scan | scan.sh | scan_results.csv — 79,944 flagged providers |
| 2. Review top suspects | Open CSV | Ranked list with scores and detector flags |
| 3. Profile top candidates | profile.sh <NPI> | PDF dossier per provider |
| 4. Refer to attorney | Share PDF | Qui tam referral |
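Steps 2 and 3 of the proactive workflow can be partly scripted. The sketch below reads a ranked results file and prints one profile.sh command per top candidate. The column names (`npi`, `score`, `flags`), the `top_candidates` helper, and the sample rows are assumptions for illustration; the actual layout of scan_results.csv may differ.

```python
import csv
import io


def top_candidates(csv_text, min_score=0.3, limit=10):
    """Return NPIs scoring above `min_score`, highest score first."""
    rows = csv.DictReader(io.StringIO(csv_text))
    kept = [row for row in rows if float(row["score"]) > min_score]
    kept.sort(key=lambda row: float(row["score"]), reverse=True)
    return [row["npi"] for row in kept[:limit]]


# Invented sample in an assumed column layout; NPIs are placeholders.
sample = """npi,score,flags
1000000001,0.85,volume|revenue|spike
1000000002,0.20,
1000000003,0.70,volume
"""

commands = [f"profile.sh {npi}" for npi in top_candidates(sample)]
for command in commands:
    print(command)
```

Printing the commands rather than running them keeps a reviewer in the loop before any dossier is generated.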
Key Numbers at a Glance
| Metric | Value |
|---|---|
| Total providers in dataset | 617,503 |
| Providers analyzed (above $100K threshold) | 250,825 |
| Providers flagged nationally | 79,944 |
| Top suspect score | 85% (3 corroborating detectors) |
| Preprocess runtime | ~6 minutes (one-time) |
| Scan runtime | ~3.5 minutes |
| Profile runtime | Under 2 minutes |
| Total infrastructure cost | $237 |
| Estimated AWS equivalent cost per scan | ~$0.76 on-demand / ~$0.23 spot |
| On-premises cost per scan (electricity) | ~$0.04 |
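The cost figures above imply a break-even point: the number of scans after which the $237 hardware outlay is recovered through cheaper per-scan cost. The sketch below works that out from the table's numbers alone. It assumes electricity is the only on-premises running cost and ignores labor, maintenance, and AWS storage or transfer charges, so it is a rough comparison rather than a full cost model.

```python
import math

HARDWARE_COST = 237.00          # one-time hardware cost from the table
ON_PREM_PER_SCAN = 0.04         # electricity per scan from the table
AWS_PER_SCAN = {"on-demand": 0.76, "spot": 0.23}


def break_even_scans(aws_cost_per_scan):
    """Scans needed before per-scan savings cover the hardware cost."""
    saving = aws_cost_per_scan - ON_PREM_PER_SCAN
    return math.ceil(HARDWARE_COST / saving)


for tier, cost in AWS_PER_SCAN.items():
    print(f"{tier}: {break_even_scans(cost)} scans")
```

Under these assumptions the hardware pays for itself after about 330 scans against on-demand pricing and about 1,248 scans against spot pricing.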
Current Limitations
Suspect identification is algorithmic — findings require human verification and legal review before any referral action. The pipeline surfaces statistical anomalies, not confirmed fraud.
Four specific limitations are tracked for the current version:
- Suspicious consistency detector requires threshold recalibration for national scale; currently flags 0 providers.
- Provider dossier risk score display bug — shows 0% instead of the scan-derived score. Fix pending.
- No automated alerting when new HHS datasets are published — manual re-run required.
- Qui tam execution requires partnership with a False Claims Act attorney. The pipeline produces referral-ready dossiers; legal filing requires counsel.
Pending Enhancements
- Fix dossier risk score to reflect scan-derived score
- Recalibrate suspicious consistency detector for national scale
- Add automated dataset update detection
- Evaluate PySpark scan for parallel multi-state processing
Disclaimer: Medicaid Fraud Hunter processes publicly available government data. All outputs are intended for attorney review and should not be used for enforcement action without independent legal verification. This report is intended for informational purposes only.
Black Diamond Consulting LLC | 11 3rd ST NW #353, Auburn, WA 98071 | blackdiamondconsulting.ai | sean@blackdiamondconsulting.ai