Black Diamond Consulting LLC | Published: May 2026 | Sean Yunt


Executive Summary

Medicaid Fraud Hunter is a self-hosted analytical pipeline that scans publicly available HHS Medicaid claims data to identify providers exhibiting statistically anomalous billing patterns. It produces ranked suspect lists and evidence-grade PDF dossiers suitable for attorney review and regulatory referral.

The pipeline operates entirely on-premises, requires no cloud services, and processes the full national Medicaid dataset — over 617,000 providers and 159 million procedure rows — on commodity server hardware totaling $237. Per-scan electricity cost runs approximately $0.04, compared to an estimated $0.76 on-demand or $0.23 spot on equivalent AWS infrastructure.

The system implements four independent anomaly detectors — volume impossibility, revenue outlier, billing spike, and suspicious consistency — scored by corroborating evidence count rather than single-detector threshold. Of 250,825 providers analyzed (those billing $100,000 or more total), 79,944 exceeded the detection threshold, with the top suspect scoring 85% across three corroborating detectors.

The pipeline is designed around a specific investigative use case: generating qui tam referrals under the False Claims Act. Outputs are calibrated for attorney review, not automated enforcement — every flagged provider requires human verification before referral action. The current implementation identifies statistically anomalous billing; whether that anomaly represents fraud is a legal determination outside the scope of the tool.


Infrastructure

The pipeline runs on a single workstation located on-premises.

Component | Specification
Server | Dell Precision T3600 Workstation
CPU | Intel Xeon E5-1650, 6 cores / 12 threads @ 3.2GHz
RAM | 64GB ECC DDR3 (required for national dataset processing)
GPU | NVIDIA GeForce GTX 1060 6GB (CUDA acceleration)
Storage | 120GB NVMe SSD (OS) + 1TB HDD (data)
OS | Debian Linux
Runtime | Docker containers
Total Hardware Cost | $237

All pipeline components run as Docker containers with automatic restart on reboot. PDF dossiers are served via an on-premises Nginx file server accessible at http://10.0.0.45:8083/dossiers/. The 64GB ECC RAM is not over-provisioned — Stage 1 preprocessing peaks at approximately 35GB during full-dataset aggregation.


Data Sources

Dataset | Source | Size | Contents
Medicaid Provider Spending | HHS / data.cms.gov | 2.8 GB | All Medicaid claims: 617,503 providers, 159M procedure rows
NPPES Provider Registry | CMS National Plan & Provider | 1 GB (zip) | Provider names, addresses, specialties, NPI numbers

Both datasets are publicly available and updated periodically by CMS. The pipeline must be re-run when new data is published — there is currently no automated detection for dataset updates. The NPPES registry is used exclusively for provider identity resolution during the profiling stage; the claims dataset drives all anomaly detection.
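Since re-runs are currently triggered by hand, one way to tell whether a downloaded file differs from the one last processed is to compare content fingerprints. The following is a minimal sketch only, not part of the described pipeline: the function names, the marker file, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_fingerprint(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dataset_changed(data_file: Path, marker_file: Path) -> bool:
    """True if the data file differs from the fingerprint stored in the marker file.

    The marker file is a hypothetical text file holding the digest recorded
    after the last successful preprocess run. A missing marker counts as changed.
    """
    previous = marker_file.read_text().strip() if marker_file.exists() else None
    return file_fingerprint(data_file) != previous


def record_fingerprint(data_file: Path, marker_file: Path) -> None:
    """Store the current fingerprint so later checks can compare against it."""
    marker_file.write_text(file_fingerprint(data_file) + "\n")
```

Under these assumptions, Stage 1 would only need to be repeated when dataset_changed returns True, and record_fingerprint would be called after preprocessing completes.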


Pipeline Architecture

The workflow consists of three sequential stages. Stage 1 runs once per dataset version. Stages 2 and 3 run on demand.

Stage 1 — Preprocess (One-Time)

Reads the raw 2.8GB claims file and aggregates it into two optimized summary tables:

  • provider_monthly.parquet (202 MB, 20,074,086 rows): monthly billing totals per provider
  • provider_procedure.parquet (793 MB, 159,437,730 rows): procedure-level detail per provider

Metric | Value
Input | 2.8 GB raw parquet file
Output | ~1 GB in two summary parquet files
Runtime | ~6 minutes
Frequency | Once per dataset version
Peak RAM | ~35 GB

These files eliminate repeated full-dataset reads on every subsequent scan. Without preprocessing, each scan would re-aggregate 159 million rows from scratch — a cost paid once per dataset version rather than on every investigation.
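The idea of paying the aggregation cost once can be illustrated on a toy scale. The sketch below is an assumption-laden illustration, not the pipeline's code: the row layout, the sample values, and the grouping keys (provider and month for one table, provider and procedure code for the other) are hypothetical, and the real stage reads and writes parquet files, which would require a library such as pyarrow that is not shown here.

```python
from collections import defaultdict

# Hypothetical raw claim rows: (npi, month, procedure_code, claims, paid)
RAW_ROWS = [
    ("1000000001", "2024-01", "99213", 120, 8400.0),
    ("1000000001", "2024-01", "99214", 30, 3300.0),
    ("1000000001", "2024-02", "99213", 100, 7000.0),
    ("1000000002", "2024-01", "99213", 10, 700.0),
]


def aggregate(rows):
    """Build two summary tables in a single pass over the raw rows.

    Returns (monthly, procedure), each mapping a key to [total_claims, total_paid].
    """
    monthly = defaultdict(lambda: [0, 0.0])    # key: (npi, month)
    procedure = defaultdict(lambda: [0, 0.0])  # key: (npi, procedure_code)
    for npi, month, code, claims, paid in rows:
        monthly[(npi, month)][0] += claims
        monthly[(npi, month)][1] += paid
        procedure[(npi, code)][0] += claims
        procedure[(npi, code)][1] += paid
    return dict(monthly), dict(procedure)


monthly, procedure = aggregate(RAW_ROWS)
```

Later stages can then read the small summary tables instead of repeating the pass over every raw row.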

Stage 2 — Scan (On Demand)

Runs four independent anomaly detectors against all 250,825 qualifying providers (those billing $100,000 or more total). Results are ranked by corroborating evidence score and saved to a CSV file.

Detector | What It Catches | Method | National Flags
Volume Impossibility | More than 1,500 claims in a single month | Hard threshold | 59,257 providers
Revenue Outlier | Abnormally high revenue per claim vs. peers | Median/MAD comparison | 27,675 providers
Billing Spike | Sudden 5x+ surge vs. provider’s own history | Rolling baseline comparison | 8,478 providers
Suspicious Consistency | >90% of bills at identical dollar amount | Consistency ratio | 0 providers*

* Suspicious consistency flagged 0 providers at national scale; threshold recalibration is pending.
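The four detectors in the table can be sketched as small predicate functions. This is a hedged illustration rather than the pipeline's implementation: the 1,500-claim limit, the 5x factor, and the 90% ratio come from the table, while the function names, the 6-month baseline window, the 0.6745 scaling constant, and the 3.5 robust z-score cutoff are assumptions chosen for the example.

```python
from collections import Counter
from statistics import median


def volume_impossibility(monthly_claims, limit=1500):
    """Flag if any single month exceeds the hard claim-count limit."""
    return any(count > limit for count in monthly_claims)


def revenue_outlier(revenue_per_claim, peer_values, cutoff=3.5):
    """Flag if revenue per claim sits far above the peer median, measured in MAD units.

    The cutoff and the 0.6745 scaling (the usual modified z-score) are assumptions.
    """
    med = median(peer_values)
    mad = median([abs(v - med) for v in peer_values])
    if mad == 0:
        return False  # no spread among peers, so no meaningful comparison
    return 0.6745 * (revenue_per_claim - med) / mad > cutoff


def billing_spike(monthly_paid, window=6, factor=5.0):
    """Flag if any month is at least `factor` times the mean of the prior `window` months."""
    for i in range(window, len(monthly_paid)):
        baseline = sum(monthly_paid[i - window:i]) / window
        if baseline > 0 and monthly_paid[i] >= factor * baseline:
            return True
    return False


def suspicious_consistency(bill_amounts, ratio=0.90):
    """Flag if more than `ratio` of bills share one identical dollar amount."""
    if not bill_amounts:
        return False
    most_common_count = Counter(bill_amounts).most_common(1)[0][1]
    return most_common_count / len(bill_amounts) > ratio
```

For example, billing_spike([100] * 6 + [600]) fires because the seventh month is six times the preceding six-month mean, while a seventh month of 400 would not.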

Scoring logic: providers are scored based on the number of independent detectors that fire. One detector yields a maximum score of 70%; two detectors, 90%; three or more, 100%. This corroboration model reduces single-detector false positives by requiring independent evidence before elevating a provider to high-priority status.
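The caps described above can be written down directly. In the sketch below, the 70% / 90% / 100% ceilings come from the text; how the pipeline arrives at a score beneath each ceiling is not stated, so the per-detector severity values in the range 0 to 1 and the rule "ceiling times mean severity" are purely illustrative assumptions, as are the function names.

```python
SCORE_CAPS = {1: 0.70, 2: 0.90}  # three or more detectors cap at 1.00


def score_cap(detectors_fired: int) -> float:
    """Maximum score allowed for a given number of firing detectors."""
    if detectors_fired <= 0:
        return 0.0
    return SCORE_CAPS.get(detectors_fired, 1.00)


def corroborated_score(severities: dict[str, float]) -> float:
    """Combine per-detector severities (0..1, hypothetical) into one capped score.

    Assumption: score = cap for the number of firing detectors * mean severity
    of those detectors. The source does not specify this formula.
    """
    fired = [s for s in severities.values() if s > 0]
    if not fired:
        return 0.0
    return round(score_cap(len(fired)) * sum(fired) / len(fired), 4)


example = corroborated_score(
    {"volume": 1.0, "revenue": 0.8, "spike": 0.75, "consistency": 0.0}
)
```

Under this assumed formula, a provider with a single detector firing at full severity can never score above 0.70, which is the property the corroboration model relies on.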

Metric | Value
Providers analyzed | 250,825 (366,678 excluded below $100K threshold)
Providers flagged above 0.3 threshold | 79,944
Top score | 85% (3 corroborating detectors)
Runtime | ~3.5 minutes
Output | scan_results.csv (ranked list with NPI, score, flags)

Scans can be filtered by state for targeted investigations: scan.sh --state WV

Stage 3 — Profile (On Demand, Per Provider)

Generates a detailed PDF investigation dossier for a specific provider NPI. This is the primary deliverable for attorney review.

Each dossier contains:

  • Provider identity — name, address, specialty, NPI (from NPPES registry)
  • Claims summary — total claims, total paid, beneficiary count, date range
  • Top procedures by volume and dollar amount
  • Peer comparison — percentile rank and z-score against all 617,503 providers nationally
  • Monthly billing timeline — month-by-month claims and payment totals
  • Detected anomalies with severity and supporting evidence

Metric | Value
Runtime | Under 2 minutes per provider
Output | PDF dossier at /mnt/storage/dossiers/
Access | http://10.0.0.45:8083/dossiers/ (local network)
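The peer comparison in the dossier reports a percentile rank and a z-score. A minimal sketch of both statistics follows; the function names are hypothetical, and the specific conventions used here (strictly-below counting for the percentile, population standard deviation for the z-score) are assumptions, since the source does not say which variants the pipeline uses.

```python
from statistics import mean, pstdev


def percentile_rank(value: float, peers: list[float]) -> float:
    """Percentage of peers whose value is strictly below the given value."""
    below = sum(1 for p in peers if p < value)
    return 100.0 * below / len(peers)


def z_score(value: float, peers: list[float]) -> float:
    """Distance from the peer mean in population standard deviations."""
    spread = pstdev(peers)
    if spread == 0:
        return 0.0  # all peers identical, so deviation is undefined; report 0
    return (value - mean(peers)) / spread


peers = [10.0, 20.0, 30.0, 40.0, 50.0]
rank = percentile_rank(40.0, peers)  # 3 of 5 peers are below 40
z = z_score(50.0, peers)
```

Billing totals tend to be heavily skewed, so the percentile rank is often the easier of the two figures to interpret alongside the z-score.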

Before profiling, a provider can be identified by name using the NPPES registry lookup: lookup.sh "Acme Health Clinic" --state WV

This returns the provider’s NPI, address, and specialty. The NPI is then passed directly to profile.sh.
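A name lookup of this kind amounts to a filtered search over registry records. The sketch below illustrates one way it could work and is not the behavior of lookup.sh itself: the record fields, the sample entries and NPIs, and the case-insensitive substring matching rule are all hypothetical.

```python
# Hypothetical registry records; real NPPES data has many more fields.
SAMPLE_REGISTRY = [
    {"npi": "1000000001", "name": "Acme Health Clinic", "state": "WV",
     "specialty": "Family Medicine"},
    {"npi": "1000000002", "name": "Acme Health Clinic", "state": "OH",
     "specialty": "Family Medicine"},
    {"npi": "1000000003", "name": "Riverside Pediatrics", "state": "WV",
     "specialty": "Pediatrics"},
]


def lookup(records, name, state=None):
    """Return records whose name contains `name` (case-insensitive).

    If `state` is given, only records in that two-letter state are returned.
    """
    needle = name.casefold()
    wanted_state = state.upper() if state else None
    return [
        r for r in records
        if needle in r["name"].casefold()
        and (wanted_state is None or r["state"] == wanted_state)
    ]


matches = lookup(SAMPLE_REGISTRY, "acme health", state="wv")
```

Restricting by state matters because provider names are not unique; in the sample data the same clinic name appears in two states under different NPIs.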


End-to-End Investigative Workflows

Reactive investigation (specific provider reported as suspicious):

Step | Command | Output
1. Identify provider NPI | lookup.sh "Clinic Name" --state WV | NPI number + address
2. Generate dossier | profile.sh <NPI> | PDF dossier
3. Review dossier | http://10.0.0.45:8083/dossiers/ | Open in browser
4. Refer to attorney | Share PDF | Qui tam referral

Proactive investigation (national scan-driven):

Step | Command | Output
1. Run national scan | scan.sh | scan_results.csv (79,944 flagged providers)
2. Review top suspects | Open CSV | Ranked list with scores and detector flags
3. Profile top candidates | profile.sh <NPI> | PDF dossier per provider
4. Refer to attorney | Share PDF | Qui tam referral
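Step 2 of the proactive workflow, reviewing the top suspects, can also be done programmatically. The sketch below assumes a column layout for scan_results.csv (npi, score, flags, with flags separated by semicolons and scores as fractions between 0 and 1); the source only says the file contains NPI, score, and flags, so the exact header names, the delimiter, and the sample rows are hypothetical.

```python
import csv
import io

# Hypothetical sample in the assumed layout of scan_results.csv.
SAMPLE_CSV = """npi,score,flags
1000000001,0.85,volume;revenue;spike
1000000002,0.25,volume
1000000003,0.63,revenue
1000000004,0.70,volume
"""


def top_suspects(csv_file, n=10, min_score=0.3):
    """Return the n highest-scoring rows at or above min_score, best first."""
    rows = [r for r in csv.DictReader(csv_file) if float(r["score"]) >= min_score]
    rows.sort(key=lambda r: float(r["score"]), reverse=True)
    return rows[:n]


candidates = top_suspects(io.StringIO(SAMPLE_CSV), n=2)
npis_to_profile = [row["npi"] for row in candidates]
```

Each NPI in the resulting list would then be passed to profile.sh, as in step 3.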

Key Numbers at a Glance

Metric | Value
Total providers in dataset | 617,503
Providers analyzed (above $100K threshold) | 250,825
Providers flagged nationally | 79,944
Top suspect score | 85% (3 corroborating detectors)
Preprocess runtime | ~6 minutes (one-time)
Scan runtime | ~3.5 minutes
Profile runtime | Under 2 minutes
Total infrastructure cost | $237
Estimated AWS equivalent cost per scan | ~$0.76 on-demand / ~$0.23 spot
On-premises cost per scan (electricity) | ~$0.04
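The cost figures above imply a break-even point: the number of scans after which the $237 hardware purchase has paid for itself relative to the estimated AWS per-scan cost. The calculation below uses only the numbers in the table; it is a rough sketch that ignores preprocessing and profiling runs, storage, and data transfer, and the function name is hypothetical.

```python
import math

HARDWARE_COST = 237.00   # one-time, from the table above
ON_PREM_PER_SCAN = 0.04  # electricity
AWS_ON_DEMAND_PER_SCAN = 0.76
AWS_SPOT_PER_SCAN = 0.23


def break_even_scans(hardware_cost, local_per_scan, cloud_per_scan):
    """Smallest whole number of scans at which on-premises total cost is lower."""
    saving_per_scan = cloud_per_scan - local_per_scan
    return math.ceil(hardware_cost / saving_per_scan)


vs_on_demand = break_even_scans(HARDWARE_COST, ON_PREM_PER_SCAN, AWS_ON_DEMAND_PER_SCAN)
vs_spot = break_even_scans(HARDWARE_COST, ON_PREM_PER_SCAN, AWS_SPOT_PER_SCAN)
```

With these inputs the hardware breaks even after 330 scans against on-demand pricing and 1,248 scans against spot pricing.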

Current Limitations

Suspect identification is algorithmic — findings require human verification and legal review before any referral action. The pipeline surfaces statistical anomalies, not confirmed fraud.

Four specific limitations are tracked for the current version:

  • Suspicious consistency detector requires threshold recalibration for national scale; currently flags 0 providers.
  • Provider dossier risk score display bug — shows 0% instead of the scan-derived score. Fix pending.
  • No automated alerting when new HHS datasets are published — manual re-run required.
  • Qui tam execution requires partnership with a False Claims Act attorney. The pipeline produces referral-ready dossiers; legal filing requires counsel.

Pending Enhancements

  • Fix dossier risk score to reflect scan-derived score
  • Recalibrate suspicious consistency detector for national scale
  • Add automated dataset update detection
  • Evaluate PySpark scan for parallel multi-state processing

Disclaimer: Medicaid Fraud Hunter processes publicly available government data. All outputs are intended for attorney review and should not be used for enforcement action without independent legal verification. This report is intended for informational purposes only.

Black Diamond Consulting LLC | 11 3rd ST NW #353, Auburn, WA 98071 | blackdiamondconsulting.ai | sean@blackdiamondconsulting.ai