Matching models
A matching model defines how MDMbox compares records to find duplicates. It specifies which fields to compare, how to compare them, and what scores to assign.
MDMbox supports two types of models:
- MatchingModel: for online $match queries against individual resources
- BulkMatchingModel: for batch processing across the entire dataset
Both are stored as FHIR resources and managed via the REST API or the Admin UI.
Concepts
Variables
Variables extract values from FHIR resources using SQL expressions. They are referenced by blocks and features.
{
"variable": [
{"name": "dob", "expression": "(#.resource->>'birthDate')"},
{"name": "family", "expression": "immutable_unaccent_upper(#.resource->'name'->0->>'family')"}
]
}
The # prefix is replaced with the table alias at query time (l. for the left record, r. for the right).
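The substitution can be pictured as a plain text replacement. The sketch below assumes exactly that; `expand_variable` is a hypothetical helper for illustration, not part of MDMbox, and the real implementation may differ.

```python
def expand_variable(expression: str, alias: str) -> str:
    """Replace the '#.' placeholder with a table alias such as 'l' or 'r'.

    Assumption: the placeholder is substituted textually.
    """
    return expression.replace("#.", f"{alias}.")


expr = "(#.resource->>'birthDate')"
left = expand_variable(expr, "l")
right = expand_variable(expr, "r")

print(left)   # (l.resource->>'birthDate')
print(right)  # (r.resource->>'birthDate')
```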
Blocks
Blocks define how candidate pairs are selected before comparison. Each block is a condition that narrows the search space. Blocks are combined with OR — a pair only needs to match one block.
{
"block": [
{"name": "dob", "variable": "dob"},
{"name": "fn", "variable": "family"}
]
}
A block with "variable": "dob" means "candidates must have the same date of birth". For custom logic, use "sql" instead of "variable":
{"name": "custom", "sql": "l.some_column = r.some_column"}
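To make the OR semantics concrete, here is a minimal sketch of how a list of blocks could be turned into a single candidate-selection condition. It assumes that a variable block means equality of the same variable on both sides and that `#.` is substituted textually; `blocking_condition` is a hypothetical helper, and the SQL that MDMbox actually generates may look different.

```python
def blocking_condition(blocks, variables):
    """Combine blocks into one SQL condition joined with OR (illustrative only)."""
    expressions = {v["name"]: v["expression"] for v in variables}
    parts = []
    for block in blocks:
        if "sql" in block:
            # Custom blocks already reference the l. and r. aliases.
            parts.append(f"({block['sql']})")
        else:
            expr = expressions[block["variable"]]
            left = expr.replace("#.", "l.")
            right = expr.replace("#.", "r.")
            parts.append(f"({left} = {right})")
    return " OR ".join(parts)


variables = [{"name": "dob", "expression": "(#.resource->>'birthDate')"}]
blocks = [
    {"name": "dob", "variable": "dob"},
    {"name": "custom", "sql": "l.some_column = r.some_column"},
]

condition = blocking_condition(blocks, variables)
print(condition)
```

A pair that satisfies either parenthesised condition becomes a candidate and moves on to feature comparison.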
Features
Features define the comparison logic. Each feature has a list of cases evaluated top to bottom. The first matching case determines the weight (log2 Bayes factor).
{
"feature": [
{
"name": "dob",
"case": [
{"expression": "l.#dob = r.#dob", "weight": 10.59},
{"expression": "levenshtein(l.#dob, r.#dob) <= 1", "weight": 3.99},
{"else": -10.32}
]
}
]
}
The total match score is the sum of all feature weights.
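The scoring rule can be sketched in a few lines. This example replaces the SQL case expressions with Python predicates (the hypothetical `when` key) so that it runs on its own; it illustrates the first-match-wins rule and the summation, not MDMbox's query execution.

```python
def feature_weight(cases, left, right):
    """Return the weight of the first case that applies, top to bottom."""
    for case in cases:
        if "else" in case:
            return case["else"]
        if case["when"](left, right):
            return case["weight"]
    return 0.0  # assumption: a feature with no matching case contributes nothing


def total_score(features, left, right):
    """Sum the weights chosen for every feature."""
    return sum(feature_weight(f["case"], left, right) for f in features)


features = [
    {"name": "dob", "case": [
        {"when": lambda l, r: l["dob"] == r["dob"], "weight": 10.59},
        {"else": -10.32},
    ]},
    {"name": "sex", "case": [
        {"when": lambda l, r: l["gender"] == r["gender"], "weight": 1.85},
        {"else": -4.84},
    ]},
]

left = {"dob": "1980-02-14", "gender": "female"}
right = {"dob": "1980-02-14", "gender": "male"}

score = total_score(features, left, right)
print(f"{score:.2f}")  # 10.59 + (-4.84) = 5.75
```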
Thresholds
Thresholds classify match results into grades:
{
"thresholds": {
"certain": 25,
"probable": 16
}
}
- Score >= certain: match grade certain (high confidence)
- Score >= probable: match grade probable (review recommended)
- Score < probable: match grade possible (only returned if the threshold is overridden below probable)
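The grading rules above amount to two comparisons. A minimal sketch, where `match_grade` is a hypothetical helper name and the thresholds are the example values from this section:

```python
def match_grade(score, thresholds):
    """Map a total match score to a grade using the model thresholds."""
    if score >= thresholds["certain"]:
        return "certain"
    if score >= thresholds["probable"]:
        return "probable"
    return "possible"


thresholds = {"certain": 25, "probable": 16}

for score in (25.78, 16.0, 5.75):
    print(score, match_grade(score, thresholds))
```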
MatchingModel
Used by the $match operation for online queries. Works directly against FHIR resource JSONB columns.
Create a model
POST /api/models
Content-Type: application/json
{
"resourceType": "MatchingModel",
"id": "patient-model",
"resource": "Patient",
"thresholds": {
"certain": 25,
"probable": 16
},
"variable": [
{"name": "dob", "expression": "(#.resource->>'birthDate')"},
{"name": "given", "expression": "immutable_unaccent_upper(#.resource->'name'->0->>'given')"},
{"name": "family", "expression": "immutable_unaccent_upper(#.resource->'name'->0->>'family')"},
{"name": "gender", "expression": "(#.resource->>'gender')"}
],
"block": [
{"name": "dob", "variable": "dob"},
{"name": "fn", "variable": "family"}
],
"feature": [
{
"name": "dob",
"case": [
{"expression": "l.#dob = r.#dob", "weight": 10.59},
{"expression": "levenshtein(l.#dob, r.#dob) <= 1", "weight": 3.99},
{"else": -10.32}
]
},
{
"name": "name",
"case": [
{"expression": "l.#given = r.#given AND l.#family = r.#family", "weight": 13.34},
{"expression": "l.#given = r.#family AND l.#family = r.#given", "weight": 13.10},
{"expression": "l.#family = r.#family", "weight": 2.40},
{"else": -12.37}
]
},
{
"name": "sex",
"case": [
{"expression": "l.#gender = r.#gender", "weight": 1.85},
{"else": -4.84}
]
}
]
}
Model CRUD
| Method | Path | Description |
|---|---|---|
| GET | /api/models | List all models (optional ?resource=Patient) |
| POST | /api/models | Create a model |
| GET | /api/models/:id | Get model by ID |
| PUT | /api/models/:id | Update model |
| DELETE | /api/models/:id | Delete model |
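The routes above can be called from any HTTP client. The sketch below builds the requests with Python's standard library without sending them. The base URL is an assumption, authentication is not covered, and the model body is truncated; a real model also needs the variable, block, and feature sections shown earlier.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: replace with your MDMbox URL


def build_model_request(method, model_id=None, model=None, base_url=BASE_URL):
    """Build (but do not send) a request for one of the /api/models routes."""
    path = "/api/models" if model_id is None else f"/api/models/{model_id}"
    data = None if model is None else json.dumps(model).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(
        base_url + path, data=data, headers=headers, method=method
    )


# Truncated model body, for illustration only.
model = {"resourceType": "MatchingModel", "id": "patient-model", "resource": "Patient"}

create = build_model_request("POST", model=model)
update = build_model_request("PUT", model_id="patient-model", model=model)
delete = build_model_request("DELETE", model_id="patient-model")

print(create.get_method(), create.full_url)

# To send a request against a running server:
# with urllib.request.urlopen(create) as response:
#     print(response.status)
```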
BulkMatchingModel
Used by the bulk match pipeline. Instead of querying FHIR JSONB at comparison time, it pre-extracts data into typed PostgreSQL columns for faster batch processing.
Key differences from MatchingModel:
- column replaces variable: each column has a source SQL expression and a PostgreSQL type
- tableName specifies the flat table to create
- index defines indexes to create on the flat table after population
- In block entries, variable refers to a column name (not a variable; the field name is the same in both model types)
- In feature expressions, reference column names directly (e.g., l.dob instead of l.#dob)
{
"resourceType": "BulkMatchingModel",
"id": "patient-bulk",
"resource": "Patient",
"tableName": "mdm.patient_flat",
"thresholds": {
"certain": 25,
"probable": 16
},
"column": [
{"name": "dob", "type": "text", "source": "resource->>'birthDate'"},
{"name": "family", "type": "text", "source": "immutable_unaccent_upper(resource->'name'->0->>'family')"}
],
"index": [
{"name": "idx_dob", "column": "dob", "type": "btree"},
{"name": "idx_family", "column": "family", "type": "btree"}
],
"block": [
{"name": "dob", "variable": "dob"},
{"name": "fn", "variable": "family"}
],
"feature": [
{
"name": "dob",
"case": [
{"expression": "l.dob = r.dob", "weight": 10.59},
{"else": -10.32}
]
}
]
}
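Because the two model types share most of their structure, a MatchingModel can serve as the starting point for a bulk model. The sketch below applies the differences listed above mechanically. `to_bulk_model` is a hypothetical helper, not an MDMbox feature, and it makes several assumptions: every column is typed as text, a btree index is added for each variable block, and `#` appears in expressions only as the variable marker. Review the result by hand before using it.

```python
import copy


def to_bulk_model(model, table_name, bulk_id):
    """Derive a BulkMatchingModel draft from a MatchingModel (illustrative only)."""
    columns = [
        {
            "name": v["name"],
            "type": "text",  # assumption: adjust per column
            "source": v["expression"].replace("#.", ""),
        }
        for v in model["variable"]
    ]
    blocked = [b["variable"] for b in model["block"] if "variable" in b]
    indexes = [{"name": f"idx_{c}", "column": c, "type": "btree"} for c in blocked]

    # Copy features so the original model is left untouched.
    features = copy.deepcopy(model["feature"])
    for feature in features:
        for case in feature["case"]:
            if "expression" in case:
                # l.#dob becomes l.dob
                case["expression"] = case["expression"].replace("#", "")

    return {
        "resourceType": "BulkMatchingModel",
        "id": bulk_id,
        "resource": model["resource"],
        "tableName": table_name,
        "thresholds": dict(model["thresholds"]),
        "column": columns,
        "index": indexes,
        "block": copy.deepcopy(model["block"]),
        "feature": features,
    }


model = {
    "resourceType": "MatchingModel",
    "id": "patient-model",
    "resource": "Patient",
    "thresholds": {"certain": 25, "probable": 16},
    "variable": [{"name": "dob", "expression": "(#.resource->>'birthDate')"}],
    "block": [{"name": "dob", "variable": "dob"}],
    "feature": [{"name": "dob", "case": [
        {"expression": "l.#dob = r.#dob", "weight": 10.59},
        {"else": -10.32},
    ]}],
}

bulk = to_bulk_model(model, "mdm.patient_flat", "patient-bulk")
print(bulk["column"])
print(bulk["feature"][0]["case"][0]["expression"])
```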
Admin UI
Models can be managed through the Admin UI at /admin. The UI provides a JSON editor for creating and editing both MatchingModel and BulkMatchingModel resources.
Tuning
The example weights above are for demonstration only. Production deployments require weights calibrated against your own data. Key considerations:
- Bayes factors reflect the discriminative power of each comparison. Higher absolute values mean the feature is more decisive.
- Thresholds control the tradeoff between precision and recall. Lower thresholds catch more duplicates but increase false positives.
- Blocks determine which pairs are compared. Blocks that are too broad slow down matching; blocks that are too narrow miss duplicates.
- Database indexes are critical for performance with large datasets. Create indexes on columns used in blocks.
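Since weights are log2 Bayes factors, one common way to derive them (the Fellegi-Sunter formulation) is from two probabilities: m, the chance of seeing a comparison outcome among true matches, and u, the chance of seeing it among non-matches. The sketch below shows only the arithmetic. The m and u values are invented for illustration, and estimating them from your data is not covered here.

```python
import math


def bayes_weight(m, u):
    """Return log2(m / u) for a comparison outcome.

    m: probability of the outcome among true matches
    u: probability of the outcome among non-matches
    """
    if not (0 < m <= 1 and 0 < u <= 1):
        raise ValueError("m and u must be probabilities in (0, 1]")
    return math.log2(m / u)


# Illustrative numbers only: the outcome "values agree" is seen in 80% of
# true matches and 10% of non-matches; "values disagree" in 20% and 90%.
agree = bayes_weight(0.8, 0.1)
disagree = bayes_weight(0.2, 0.9)

print(f"agree: {agree:.2f}, disagree: {disagree:.2f}")
```

An outcome that is much more common among matches than non-matches gets a large positive weight; one that is more common among non-matches gets a negative weight.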
Professional tuning services are available. Contact Health Samurai for assistance with model calibration using machine learning and expert analysis.