The National Provider Identifier (NPI) is used across healthcare in the US. It's a unique 10-digit number issued to providers, used in contracts, billing transactions, prescriptions, EHRs, and interoperability between systems. Health plans and clearinghouses are mandated by HIPAA to use it to identify providers.
An NPI is permanent, and is supposed to identify a provider even when they change their name, job, or specialization.
However, we found that thousands of providers have duplicate records, and thus multiple NPIs. This can compromise billing, reporting, and overall data integrity, and may lead to denied claims or misattributed care.
How We Found the Duplicates
Master data management (MDM) solutions help ensure consistent and accurate identity data. They perform identity resolution and matching to find duplicates.
For this analysis, we used our product MDMbox, an MDM and eMPI platform with probabilistic matching that can find duplicates across millions of records in parallel. We downloaded the complete set of NPI records (which is published by CMS here), filtered out organizations to keep only individual providers, and fed the result to MDMbox. We trained our ML model on the data, and then ran a bulk match operation. The full matching run completed in less than an hour.
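The filtering step can be sketched in a few lines of Python. This is not the code we ran; it is a minimal illustration that assumes the CSV layout of the public NPPES download, where the "Entity Type Code" column is 1 for individuals and 2 for organizations. The sample rows are made up, and the column names should be checked against the header of the file you actually download.

```python
import csv
import io

# Column name and codes follow the public NPPES file layout (assumption:
# verify against the header row of your download).
ENTITY_TYPE_COLUMN = "Entity Type Code"
INDIVIDUAL = "1"  # "2" marks organizations


def individual_providers(csv_file):
    """Yield only the rows that describe individual providers."""
    reader = csv.DictReader(csv_file)
    for row in reader:
        if row.get(ENTITY_TYPE_COLUMN, "").strip() == INDIVIDUAL:
            yield row


# Tiny made-up sample standing in for the real multi-gigabyte file.
SAMPLE = io.StringIO(
    "NPI,Entity Type Code,Provider Last Name (Legal Name)\n"
    "0000000001,1,DOE\n"
    "0000000002,2,\n"
    "0000000003,1,ROE\n"
)

individuals = list(individual_providers(SAMPLE))
print([row["NPI"] for row in individuals])
```

Streaming rows through a generator keeps memory use flat, which matters when the input has millions of records.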
Out of more than 7 million records, we found 1,000 duplicates (>90% match probability) and 3,500 potential duplicates (70-89% match probability).
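To make the two bands concrete, here is a small sketch of how a match probability could be bucketed. The function name and labels are hypothetical, and the handling of a score of exactly 0.90 is our assumption for the example, since the bands above do not say where it falls.

```python
DUPLICATE_THRESHOLD = 0.90  # the ">90%" band
POTENTIAL_THRESHOLD = 0.70  # lower edge of the "70-89%" band


def classify_match(probability):
    """Bucket a match probability into the bands used in this article."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability > DUPLICATE_THRESHOLD:
        return "duplicate"
    # Assumption: a score of exactly 0.90 is treated as a potential duplicate.
    if probability >= POTENTIAL_THRESHOLD:
        return "potential duplicate"
    return "no match"


for score in (0.97, 0.75, 0.40):
    print(score, classify_match(score))
```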
Most of these duplicates were not caused by fraud or bad data imports. They were operational artifacts accumulated over years: providers moving between organizations, re-registering instead of updating records, inconsistent naming conventions, and manual data entry differences across systems.
Many of these records would also be difficult to detect with exact matching alone. The duplicates only become visible when multiple weak identity signals are evaluated together.
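One common way to combine weak signals is the Fellegi-Sunter model, where each field adds or subtracts evidence depending on whether it agrees. The sketch below illustrates that general idea only; it is not MDMbox's model, and the fields, the m/u probabilities, and the prior are invented for the example.

```python
import math

# Hypothetical parameters per field:
#   m = P(field agrees | same provider)
#   u = P(field agrees | different providers)
FIELD_PARAMS = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.02),
    "specialty": (0.85, 0.10),
    "phone": (0.70, 0.001),
    "address": (0.60, 0.005),
}


def field_weight(field, agrees):
    """Log2 likelihood ratio contributed by one field comparison."""
    m, u = FIELD_PARAMS[field]
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_probability(agreement, prior=1e-6):
    """Turn the summed field weights into a posterior match probability."""
    total = sum(field_weight(f, a) for f, a in agreement.items())
    prior_odds = prior / (1 - prior)
    odds = prior_odds * 2 ** total
    return odds / (1 + odds)


# A provider who moved: everything agrees except the address.
moved = {"last_name": True, "first_name": True, "specialty": True,
         "phone": True, "address": False}
# Two different people who merely share a name.
name_only = {"last_name": True, "first_name": True, "specialty": False,
             "phone": False, "address": False}

print(match_probability(moved))
print(match_probability(name_only))
```

With these made-up numbers, no single field decides the outcome. A shared name alone leaves the probability very low, while name, specialty, and phone together outweigh the address mismatch.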
What the Duplicates Look Like
The duplicates fall into several categories:
Exact matches. Some providers have two records with identical names and addresses but different NPIs. Every field is the same, yet they were issued separate identifiers. These are the easiest duplicates to detect, and they receive the highest match scores.
Same provider, different addresses. A provider moves to another state and registers again instead of updating their existing record. The addresses differ, but the combination of name, specialization, and phone number confirms it's the same person. The matching model is configured to weigh these fields together.
Name variations. A provider's name is recorded differently across records. For instance, a last name might appear with or without a prefix ("Cruz" vs. "de la Cruz"), or a first name might be shortened. The ML algorithm we used recognizes prefixes and alternative variants of names.
Typos. Data entry errors create duplicates that are hard to catch with exact matching. A single misspelled letter in a name is enough to generate a second record. Probabilistic matching catches these because it weighs all attributes together, not just the name, and because two names that are similar enough still count as evidence for a match.
Partial attribute overlap. Two records share a name and medical license number but have different addresses. Or the address changed but the phone number stayed the same. Probabilistic matching picks up on these cross-attribute signals.
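The name-variation and typo categories above can be illustrated with a short Python sketch. It is a simplified stand-in, not the ML model we used: the prefix and nickname tables are small, hypothetical examples, and the similarity score comes from the standard library's difflib rather than a trained comparator.

```python
import difflib
import re

# Illustrative, deliberately incomplete lookup tables (assumptions).
# Longer prefixes come first so "de la " is tried before "de ".
SURNAME_PREFIXES = ("de la ", "van der ", "del ", "de ", "van ", "von ")
NICKNAMES = {"bill": "william", "bob": "robert", "liz": "elizabeth"}


def normalize_last_name(name):
    """Lowercase, drop punctuation, and strip a leading surname prefix."""
    cleaned = re.sub(r"[^a-z ]", "", name.lower())
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    for prefix in SURNAME_PREFIXES:
        if cleaned.startswith(prefix):
            return cleaned[len(prefix):]
    return cleaned


def normalize_first_name(name):
    """Map a known short form to its full form; otherwise return as-is."""
    cleaned = re.sub(r"[^a-z]", "", name.lower())
    return NICKNAMES.get(cleaned, cleaned)


def name_similarity(a, b):
    """Similarity in [0, 1]; values near 1 tolerate small typos."""
    return difflib.SequenceMatcher(None, a, b).ratio()


print(normalize_last_name("de la Cruz") == normalize_last_name("Cruz"))
print(normalize_first_name("Bill"))
print(name_similarity("anderson", "andersen"))
print(name_similarity("anderson", "martinez"))
```

In a probabilistic matcher, a similarity score like this would typically feed into the overall match score as one graded signal instead of acting as a hard yes/no rule.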
False Positives: The Family Practice Problem
Not every match is a true duplicate. We initially had hundreds of false positives from a pattern we didn't expect: doctors who give their children the same name and employ them at the same clinic. These parent-child (and sometimes multi-generational) pairs share the same name and address but have different name suffixes: "Sr.", "Jr.", "III", "IV." We had to add suffix comparison to the matching rules to filter these out.
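A suffix rule of this kind might look like the sketch below. It is an illustration under our own assumptions, not the rule as implemented in MDMbox: the suffix table, the function names, and the choice to flag a conflict only when both records carry a suffix are all made up for the example.

```python
import re

# Map common spellings of generational suffixes to a canonical form.
SUFFIXES = {
    "sr": "SR", "senior": "SR",
    "jr": "JR", "junior": "JR",
    "ii": "II", "iii": "III", "iv": "IV",
}


def split_suffix(full_name):
    """Return (lowercased name without suffix, canonical suffix or None)."""
    tokens = re.sub(r"[.,]", " ", full_name).split()
    if tokens and tokens[-1].lower() in SUFFIXES:
        return " ".join(tokens[:-1]).lower(), SUFFIXES[tokens[-1].lower()]
    return " ".join(tokens).lower(), None


def suffix_conflict(name_a, name_b):
    """True when both names carry a suffix and the suffixes differ."""
    _, suffix_a = split_suffix(name_a)
    _, suffix_b = split_suffix(name_b)
    return (suffix_a is not None
            and suffix_b is not None
            and suffix_a != suffix_b)


# Fictional names for illustration.
print(suffix_conflict("John Smith Sr.", "John Smith, Jr."))
print(suffix_conflict("John Smith III", "John Smith III"))
print(suffix_conflict("John Smith", "John Smith Jr"))
```

Treating a missing suffix as "unknown" rather than as a conflict is a design choice: it avoids rejecting true duplicates where one record simply omits the suffix, at the cost of letting some parent-child pairs through for manual review.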
Explore the Duplicates Yourself
We built an interactive tool where you can search for any provider and see their potential duplicates.

You can enter practitioner data such as name, address, or phone number, or enter an NPI, to get a list of potential duplicates. The tool also suggests a few interesting examples to get you started.

You can also adjust the threshold—the cutoff value for displaying matches—to make the search more strict or more fuzzy, and see more or fewer potential duplicates.

You can also review the list of known duplicates on the Known Matches tab. We’ve included all identified practitioner pairs with a match probability greater than 90%. To see how a match score is composed, click on a row to open a chart that shows each field’s contribution to the total score.

What This Reveals About Healthcare Identity Data
Duplicate records can lead to inaccurate claims, compromise interoperability between healthcare systems, and affect analytics and reporting. Even in a system built around a national unique identifier, identity fragmentation still happens. In practice, healthcare identity data evolves through operational workflows, migrations, staffing changes, and human data entry. Exact identifiers help, but they do not eliminate duplication problems entirely.
The same patterns appear across many healthcare datasets, including patient registries, provider networks, and organizational records. Detecting them usually requires probabilistic matching approaches that evaluate multiple weak identity signals together rather than relying only on exact identifiers.
The analysis presented in this article was performed using MDMbox, our Master Data Management and Enterprise Master Patient Index platform for healthcare identity resolution and duplicate detection. The same matching approach can be applied to patients, practitioners, organizations, and other large healthcare datasets where identity quality is critical.
If you'd like to explore the dataset yourself, you can use the interactive duplicate explorer.





