Matching models

A matching model defines how MDMbox compares records to find duplicates. It specifies which fields to compare, how to compare them, and what scores to assign.

MDMbox supports two types of models:

MatchingModel — for online $match queries against individual resources
BulkMatchingModel — for batch processing across the entire dataset

Both are stored as FHIR resources and managed via the REST API or the Admin UI.

Concepts

Variables

Variables extract values from FHIR resources using SQL expressions. They are referenced by blocks and features.

{
  "variable": [
    {"name": "dob", "expression": "(#.resource->>'birthDate')"},
    {"name": "family", "expression": "immutable_unaccent_upper(#.resource->'name'->0->>'family')"}
  ]
}

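In variable expressions, # is a placeholder for the table alias and is substituted at query time: l for the left record, r for the right. For example, #.resource becomes l.resource on the left side of a comparison.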
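The substitution can be pictured as plain text replacement. The snippet below is only an illustrative sketch: bind_alias is a hypothetical helper, not an MDMbox function, and it assumes the placeholder always appears as "#." in front of a column reference.

```python
def bind_alias(expression: str, alias: str) -> str:
    """Replace the '#' placeholder with a table alias such as 'l' or 'r'."""
    # Assumption: '#' is always written as '#.' before a column reference.
    return expression.replace("#.", f"{alias}.")


dob = "(#.resource->>'birthDate')"
print(bind_alias(dob, "l"))  # (l.resource->>'birthDate')
print(bind_alias(dob, "r"))  # (r.resource->>'birthDate')
```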

Blocks

Blocks define how candidate pairs are selected before comparison. Each block is a condition that narrows the search space. Blocks are combined with OR — a pair only needs to match one block.

{
  "block": [
    {"name": "dob", "variable": "dob"},
    {"name": "fn", "variable": "family"}
  ]
}

A block with "variable": "dob" means "candidates must have the same date of birth". For custom logic, use "sql" instead of "variable":

{"name": "custom", "sql": "l.some_column = r.some_column"}
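To make the OR semantics concrete, here is a small in-memory sketch of candidate selection. It is not how MDMbox runs blocking (that happens in SQL); the records and the shares_block helper are made up for illustration, and the null check mirrors SQL equality, where a missing value never equals anything.

```python
from itertools import combinations

# Hypothetical records; the keys mirror the variables "dob" and "family".
records = [
    {"id": "a", "dob": "1980-01-02", "family": "SMITH"},
    {"id": "b", "dob": "1980-01-02", "family": "SMYTH"},
    {"id": "c", "dob": "1975-06-30", "family": "SMITH"},
    {"id": "d", "dob": "1991-11-11", "family": "JONES"},
]

block_variables = ["dob", "family"]


def shares_block(left, right, variables):
    # OR semantics: one equal, non-null blocking value is enough.
    return any(
        left[v] is not None and left[v] == right[v]
        for v in variables
    )


candidate_pairs = [
    (left["id"], right["id"])
    for left, right in combinations(records, 2)
    if shares_block(left, right, block_variables)
]
print(candidate_pairs)  # [('a', 'b'), ('a', 'c')]
```

Records a and b become candidates through the dob block, a and c through the family block. Record d shares no blocking value with anyone, so it is never compared.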

Features

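Features define the comparison logic. Each feature has a list of cases that are evaluated top to bottom; the first case whose expression holds determines the feature's weight (a log2 Bayes factor). The else entry supplies the weight when no case matches. In case expressions, l.#name and r.#name refer to the variable name evaluated for the left and right record.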

{
  "feature": [
    {
      "name": "dob",
      "case": [
        {"expression": "l.#dob = r.#dob", "weight": 10.59},
        {"expression": "levenshtein(l.#dob, r.#dob) <= 1", "weight": 3.99},
        {"else": -10.32}
      ]
    }
  ]
}

The total match score is the sum of all feature weights.
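The first-match rule and the final sum can be sketched in a few lines. This is an illustration only: the cases are written as Python predicates instead of SQL, the levenshtein function is a local stand-in for the database function of the same name, and the sex feature is borrowed from the full model example later on this page.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


# (name, ordered cases, else weight)
features = [
    ("dob", [
        (lambda l, r: l["dob"] == r["dob"], 10.59),
        (lambda l, r: levenshtein(l["dob"], r["dob"]) <= 1, 3.99),
    ], -10.32),
    ("sex", [
        (lambda l, r: l["gender"] == r["gender"], 1.85),
    ], -4.84),
]


def feature_weight(cases, else_weight, left, right):
    # Cases are tried top to bottom; the first one that holds wins.
    for predicate, weight in cases:
        if predicate(left, right):
            return weight
    return else_weight


def match_score(left, right):
    return sum(feature_weight(c, e, left, right) for _, c, e in features)


left = {"dob": "1980-01-02", "gender": "female"}
right = {"dob": "1980-01-03", "gender": "female"}
print(round(match_score(left, right), 2))  # 5.84
```

The dates differ by one character, so the dob feature falls through to its second case (3.99), and the matching gender adds 1.85.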

Thresholds

Thresholds classify match results into grades:

{
  "thresholds": {
    "certain": 25,
    "probable": 16
  }
}

Score >= certain — match grade certain (high confidence)
Score >= probable — match grade probable (review recommended)
Score < probable — match grade possible (only returned if threshold is overridden below probable)
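The grading rules above amount to two comparisons. A minimal sketch, with match_grade as a hypothetical helper name and the default thresholds taken from the example:

```python
def match_grade(score: float, certain: float = 25, probable: float = 16) -> str:
    """Map a total match score to a grade using the two thresholds."""
    if score >= certain:
        return "certain"
    if score >= probable:
        return "probable"
    return "possible"


print(match_grade(25.78))  # certain
print(match_grade(16))     # probable
print(match_grade(5.84))   # possible
```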

MatchingModel

Used by the $match operation for online queries. Works directly against FHIR resource JSONB columns.

Create a model

POST https://<mdmbox-host>/api/models
Content-Type: application/json

{
  "resourceType": "MatchingModel",
  "id": "patient-model",
  "resource": "Patient",
  "thresholds": {
    "certain": 25,
    "probable": 16
  },
  "variable": [
    {"name": "dob", "expression": "(#.resource->>'birthDate')"},
    {"name": "given", "expression": "immutable_unaccent_upper(#.resource->'name'->0->'given'->>0)"},
    {"name": "family", "expression": "immutable_unaccent_upper(#.resource->'name'->0->>'family')"},
    {"name": "gender", "expression": "(#.resource->>'gender')"}
  ],
  "block": [
    {"name": "dob", "variable": "dob"},
    {"name": "fn", "variable": "family"}
  ],
  "feature": [
    {
      "name": "dob",
      "case": [
        {"expression": "l.#dob = r.#dob", "weight": 10.59},
        {"expression": "levenshtein(l.#dob, r.#dob) <= 1", "weight": 3.99},
        {"else": -10.32}
      ]
    },
    {
      "name": "name",
      "case": [
        {"expression": "l.#given = r.#given AND l.#family = r.#family", "weight": 13.34},
        {"expression": "l.#given = r.#family AND l.#family = r.#given", "weight": 13.10},
        {"expression": "l.#family = r.#family", "weight": 2.40},
        {"else": -12.37}
      ]
    },
    {
      "name": "sex",
      "case": [
        {"expression": "l.#gender = r.#gender", "weight": 1.85},
        {"else": -4.84}
      ]
    }
  ]
}
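Before posting a model, it can help to check that every block and feature refers to a declared variable. The following is a hypothetical client-side check, not part of MDMbox; it assumes variables are referenced in feature expressions as #name, as in the example above.

```python
import re


def undefined_references(model: dict) -> list:
    """Return variable names that are used but never declared."""
    declared = {v["name"] for v in model.get("variable", [])}
    used = set()
    for block in model.get("block", []):
        if "variable" in block:  # blocks with custom "sql" are skipped
            used.add(block["variable"])
    for feature in model.get("feature", []):
        for case in feature.get("case", []):
            # The else entry has no expression, so it contributes nothing.
            used.update(re.findall(r"#(\w+)", case.get("expression", "")))
    return sorted(used - declared)


draft = {
    "variable": [
        {"name": "dob", "expression": "(#.resource->>'birthDate')"},
        {"name": "family", "expression": "(#.resource->'name'->0->>'family')"},
    ],
    "block": [{"name": "dob", "variable": "dob"}],
    "feature": [
        {"name": "name", "case": [
            {"expression": "l.#given = r.#given AND l.#family = r.#family", "weight": 13.34},
            {"else": -12.37},
        ]},
    ],
}
print(undefined_references(draft))  # ['given']
```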

Model CRUD

Method	Path	Description
`GET`	`/api/models`	List all models (optional `?resource=Patient`)
`POST`	`/api/models`	Create a model
`GET`	`/api/models/:id`	Get model by ID
`PUT`	`/api/models/:id`	Update model
`DELETE`	`/api/models/:id`	Delete model
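The endpoints in the table can be called from any HTTP client. Below is a sketch using only the Python standard library; the host name is a placeholder, model_request is a hypothetical helper, and authentication is left out because this page does not describe it.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://mdmbox.example.com"  # placeholder host


def model_request(method, model_id=None, body=None, resource=None):
    """Build (but do not send) a request for the /api/models endpoints."""
    path = "/api/models" + (f"/{model_id}" if model_id else "")
    if resource:
        path += "?" + urlencode({"resource": resource})
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


list_patients = model_request("GET", resource="Patient")
update = model_request(
    "PUT",
    model_id="patient-model",
    body={"resourceType": "MatchingModel", "id": "patient-model", "resource": "Patient"},
)
delete = model_request("DELETE", model_id="patient-model")

print(list_patients.get_method(), list_patients.full_url)
print(update.get_method(), update.full_url)
```

To send a request, pass it to urllib.request.urlopen once the host and any required credentials are in place.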

BulkMatchingModel

Used by the bulk match pipeline. Instead of querying FHIR JSONB at comparison time, it pre-extracts data into typed PostgreSQL columns for faster batch processing.

Key differences from MatchingModel:

column replaces variable — each column has a source SQL expression and a PostgreSQL type
tableName specifies the flat table to create
index defines indexes to create on the flat table after population
In block entries, the variable field holds a column name rather than a variable name (the field is called variable in both model types)
In feature expressions, reference column names directly (e.g., l.dob instead of l.#dob)

{
  "resourceType": "BulkMatchingModel",
  "id": "patient-bulk",
  "resource": "Patient",
  "tableName": "mdm.patient_flat",
  "thresholds": {
    "certain": 25,
    "probable": 16
  },
  "column": [
    {"name": "dob", "type": "text", "source": "resource->>'birthDate'"},
    {"name": "family", "type": "text", "source": "immutable_unaccent_upper(resource->'name'->0->>'family')"}
  ],
  "index": [
    {"name": "idx_dob", "column": "dob", "type": "btree"},
    {"name": "idx_family", "column": "family", "type": "btree"}
  ],
  "block": [
    {"name": "dob", "variable": "dob"},
    {"name": "fn", "variable": "family"}
  ],
  "feature": [
    {
      "name": "dob",
      "case": [
        {"expression": "l.dob = r.dob", "weight": 10.59},
        {"else": -10.32}
      ]
    }
  ]
}
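The differences listed above are mechanical enough to sketch as a conversion from a MatchingModel to a BulkMatchingModel. This is an illustration, not an MDMbox feature: to_bulk_model is a hypothetical helper, it assumes every column can be stored as text, and it creates one btree index per blocked column.

```python
import re


def to_bulk_model(model, model_id, table_name, column_type="text"):
    """Derive a BulkMatchingModel dict from a MatchingModel dict."""
    columns = [
        {
            "name": variable["name"],
            # Assumption: every extracted value uses the same column type.
            "type": column_type,
            # Column sources carry no table alias, so drop the '#.' placeholder.
            "source": variable["expression"].replace("#.", ""),
        }
        for variable in model["variable"]
    ]

    def strip_hash(case):
        if "expression" not in case:
            return dict(case)  # the else entry has no expression
        # l.#dob -> l.dob, r.#dob -> r.dob
        expression = re.sub(r"\b([lr])\.#", r"\1.", case["expression"])
        return {**case, "expression": expression}

    blocked = [b["variable"] for b in model["block"] if "variable" in b]
    return {
        "resourceType": "BulkMatchingModel",
        "id": model_id,
        "resource": model["resource"],
        "tableName": table_name,
        "thresholds": model["thresholds"],
        "column": columns,
        "index": [
            {"name": f"idx_{name}", "column": name, "type": "btree"}
            for name in blocked
        ],
        "block": model["block"],
        "feature": [
            {"name": f["name"], "case": [strip_hash(c) for c in f["case"]]}
            for f in model["feature"]
        ],
    }


online = {
    "resource": "Patient",
    "thresholds": {"certain": 25, "probable": 16},
    "variable": [{"name": "dob", "expression": "(#.resource->>'birthDate')"}],
    "block": [{"name": "dob", "variable": "dob"}],
    "feature": [{"name": "dob", "case": [
        {"expression": "l.#dob = r.#dob", "weight": 10.59},
        {"else": -10.32},
    ]}],
}
bulk = to_bulk_model(online, "patient-bulk", "mdm.patient_flat")
print(bulk["column"][0]["source"])                # (resource->>'birthDate')
print(bulk["feature"][0]["case"][0]["expression"])  # l.dob = r.dob
```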

Admin UI

Models can be managed through the Admin UI at /admin. The UI provides a JSON editor for creating and editing both MatchingModel and BulkMatchingModel resources.

Tuning

The example weights above are for demonstration only. Before deploying to production, calibrate the weights on your own data. Key considerations:

Bayes factors reflect the discriminative power of each comparison. Higher absolute values mean the feature is more decisive.
Thresholds control the tradeoff between precision and recall. Lower thresholds catch more duplicates but increase false positives.
Blocks determine which pairs are compared. Blocks that are too broad slow down matching; blocks that are too narrow miss duplicates.
Database indexes are critical for performance with large datasets. Create indexes on columns used in blocks.
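Because weights are log2 Bayes factors, they can be converted back to plain ratios when reasoning about calibration. The sketch below shows the standard arithmetic: a weight w corresponds to a Bayes factor of 2^w, and multiplying prior odds by 2^score gives posterior odds. The prior is a hypothetical input; this page does not say whether MDMbox scores already include a prior term, so treat match_probability as a thinking aid rather than a description of the product.

```python
import math


def bayes_factor(weight: float) -> float:
    """Convert a log2 Bayes factor back to a plain ratio."""
    return 2.0 ** weight


def match_probability(score: float, prior: float) -> float:
    """Posterior match probability for a total score and a prior in (0, 1)."""
    # Assumption: the score contains no prior term of its own.
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * 2.0 ** score
    return posterior_odds / (1.0 + posterior_odds)


print(bayes_factor(1))    # 2.0: the evidence doubles the odds of a match
print(bayes_factor(-1))   # 0.5: the evidence halves the odds of a match
print(math.isclose(match_probability(0, 0.01), 0.01))  # True: no evidence, prior unchanged
```

A weight of 0 leaves the odds untouched, which is why features with weights close to 0 contribute little to the final decision.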

Professional tuning services are available. Contact Health Samurai for assistance with model calibration using machine learning and expert analysis.

Next steps

Find duplicates: $match

Bulk matching

Mathematical details