---
description: Create and configure matching models that define how MDMbox compares FHIR resources to find duplicates.
---

# Matching models

A matching model defines how MDMbox compares records to find duplicates. It specifies which fields to compare, how to compare them, and what scores to assign.

MDMbox supports two types of models:

- **MatchingModel** — for online `$match` queries against individual resources
- **BulkMatchingModel** — for batch processing across the entire dataset

Both are stored as FHIR resources and managed via the REST API or the Admin UI.

## Concepts

### Variables

Variables extract values from FHIR resources using SQL expressions. They are referenced by blocks and features.

```json
{
  "variable": [
    {"name": "dob", "expression": "(#.resource->>'birthDate')"},
    {"name": "family", "expression": "immutable_unaccent_upper(#.resource->'name'->0->>'family')"}
  ]
}
```

The `#` placeholder is replaced with the table alias at query time (`l` for the left record, `r` for the right), so `#.resource` becomes `l.resource` or `r.resource`.
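
To make the substitution concrete, here is a minimal Python sketch of how a variable expression could be expanded for each side of a pair. It assumes `#` stands in for the alias itself, so `#.resource` becomes `l.resource`. The `expand_variable` helper is hypothetical and plain string replacement is an assumption; MDMbox's actual query builder may work differently.

```python
def expand_variable(expression: str, alias: str) -> str:
    """Replace the `#` placeholder with a table alias (hypothetical helper)."""
    return expression.replace("#", alias)


variables = [
    {"name": "dob", "expression": "(#.resource->>'birthDate')"},
    {"name": "family", "expression": "immutable_unaccent_upper(#.resource->'name'->0->>'family')"},
]

# Expand every variable once for the left record and once for the right.
expanded = {
    v["name"]: {side: expand_variable(v["expression"], side) for side in ("l", "r")}
    for v in variables
}

print(expanded["dob"]["l"])     # (l.resource->>'birthDate')
print(expanded["family"]["r"])  # immutable_unaccent_upper(r.resource->'name'->0->>'family')
```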

### Blocks

Blocks define how candidate pairs are selected before comparison. Each block is a condition that narrows the search space. Blocks are combined with OR — a pair only needs to match one block.

```json
{
  "block": [
    {"name": "dob", "variable": "dob"},
    {"name": "fn", "variable": "family"}
  ]
}
```

A block with `"variable": "dob"` means "candidates must have the same date of birth". For custom logic, use `"sql"` instead of `"variable"`:

```json
{"name": "custom", "sql": "l.some_column = r.some_column"}
```
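
The OR combination can be pictured as a single candidate-selection predicate. The Python sketch below builds one from a list of blocks; `block_condition` and `blocking_predicate` are hypothetical names, and the generated SQL is only an illustration of the logic, not the query MDMbox actually runs.

```python
def block_condition(block: dict, variables: dict) -> str:
    """Return the SQL condition for one block (hypothetical helper)."""
    if "sql" in block:
        # Custom blocks are used verbatim.
        return f"({block['sql']})"
    # Variable blocks require the same value on both sides of the pair.
    expr = variables[block["variable"]]
    left, right = expr.replace("#", "l"), expr.replace("#", "r")
    return f"({left} = {right})"


def blocking_predicate(blocks: list, variables: dict) -> str:
    """OR the block conditions together: a pair needs to satisfy only one."""
    return " OR ".join(block_condition(b, variables) for b in blocks)


variables = {
    "dob": "(#.resource->>'birthDate')",
    "family": "immutable_unaccent_upper(#.resource->'name'->0->>'family')",
}
blocks = [
    {"name": "dob", "variable": "dob"},
    {"name": "custom", "sql": "l.some_column = r.some_column"},
]

predicate = blocking_predicate(blocks, variables)
print(predicate)
# ((l.resource->>'birthDate') = (r.resource->>'birthDate')) OR (l.some_column = r.some_column)
```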

### Features

Features define the comparison logic. Each feature has a list of cases that are evaluated top to bottom. The first matching case determines the weight (a log2 Bayes factor); if no case matches, the `else` weight applies.

```json
{
  "feature": [
    {
      "name": "dob",
      "case": [
        {"expression": "l.#dob = r.#dob", "weight": 10.59},
        {"expression": "levenshtein(l.#dob, r.#dob) <= 1", "weight": 3.99},
        {"else": -10.32}
      ]
    }
  ]
}
```

The total match score is the sum of all feature weights.
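
The first-match rule and the final sum can be sketched in Python. Here plain Python predicates stand in for the SQL expressions, `feature_weight` is a hypothetical helper, and the `sex` feature is taken from the full Patient model later on this page. SQL `NULL` handling is not modelled.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def feature_weight(cases, left, right):
    """Return the weight of the first case whose predicate holds."""
    for case in cases:
        if "else" in case:
            return case["else"]
        if case["test"](left, right):
            return case["weight"]
    return 0.0  # assumption: a feature with no matching case contributes nothing


dob_cases = [
    {"test": lambda l, r: l["dob"] == r["dob"], "weight": 10.59},
    {"test": lambda l, r: levenshtein(l["dob"], r["dob"]) <= 1, "weight": 3.99},
    {"else": -10.32},
]
sex_cases = [
    {"test": lambda l, r: l["gender"] == r["gender"], "weight": 1.85},
    {"else": -4.84},
]

left = {"dob": "1980-03-14", "gender": "female"}
right = {"dob": "1980-03-15", "gender": "female"}  # one digit off in the date

# Total score is the sum of the weights chosen for each feature.
total = sum(feature_weight(cases, left, right) for cases in (dob_cases, sex_cases))
print(round(total, 2))  # 5.84
```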

### Thresholds

Thresholds classify match results into grades:

```json
{
  "thresholds": {
    "certain": 25,
    "probable": 16
  }
}
```

- Score >= `certain`: match grade `certain` (high confidence)
- `probable` <= score < `certain`: match grade `probable` (review recommended)
- Score < `probable`: match grade `possible` (returned only if the threshold is overridden to a value below `probable`)
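
As a quick illustration, the grading rule can be written as a small Python function. `match_grade` is a hypothetical name, and the sample scores are sums of weights from the example Patient model on this page.

```python
def match_grade(score: float, certain: float = 25, probable: float = 16) -> str:
    """Classify a total match score using the model thresholds."""
    if score >= certain:
        return "certain"
    if score >= probable:
        return "probable"
    return "possible"


# Exact dob + exact name + same sex: 10.59 + 13.34 + 1.85
print(match_grade(25.78))  # certain
# Near-miss dob + exact name + same sex: 3.99 + 13.34 + 1.85
print(match_grade(19.18))  # probable
# Exact dob + family name only + same sex: 10.59 + 2.40 + 1.85
print(match_grade(14.84))  # possible
```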

## MatchingModel

Used by the `$match` operation for online queries. Works directly against FHIR resource JSONB columns.

### Create a model

```http
POST /api/models
Content-Type: application/json
```

```json
{
  "resourceType": "MatchingModel",
  "id": "patient-model",
  "resource": "Patient",
  "thresholds": {
    "certain": 25,
    "probable": 16
  },
  "variable": [
    {"name": "dob", "expression": "(#.resource->>'birthDate')"},
    {"name": "given", "expression": "immutable_unaccent_upper(#.resource->'name'->0->'given'->>0)"},
    {"name": "family", "expression": "immutable_unaccent_upper(#.resource->'name'->0->>'family')"},
    {"name": "gender", "expression": "(#.resource->>'gender')"}
  ],
  "block": [
    {"name": "dob", "variable": "dob"},
    {"name": "fn", "variable": "family"}
  ],
  "feature": [
    {
      "name": "dob",
      "case": [
        {"expression": "l.#dob = r.#dob", "weight": 10.59},
        {"expression": "levenshtein(l.#dob, r.#dob) <= 1", "weight": 3.99},
        {"else": -10.32}
      ]
    },
    {
      "name": "name",
      "case": [
        {"expression": "l.#given = r.#given AND l.#family = r.#family", "weight": 13.34},
        {"expression": "l.#given = r.#family AND l.#family = r.#given", "weight": 13.10},
        {"expression": "l.#family = r.#family", "weight": 2.40},
        {"else": -12.37}
      ]
    },
    {
      "name": "sex",
      "case": [
        {"expression": "l.#gender = r.#gender", "weight": 1.85},
        {"else": -4.84}
      ]
    }
  ]
}
```
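
Before posting a model, it can help to check that every name used in a block or referenced as `#name` in a feature expression is declared in `variable`. The Python sketch below is one way to do that on the client side. `undeclared_references` is a hypothetical helper, and this page does not say whether MDMbox performs a similar check itself.

```python
import re


def undeclared_references(model: dict) -> set:
    """Names used in blocks or feature expressions but missing from `variable`."""
    declared = {v["name"] for v in model.get("variable", [])}
    used = {b["variable"] for b in model.get("block", []) if "variable" in b}
    for feature in model.get("feature", []):
        for case in feature.get("case", []):
            # `else` cases have no expression, so default to an empty string.
            used.update(re.findall(r"#(\w+)", case.get("expression", "")))
    return used - declared


model = {
    "resourceType": "MatchingModel",
    "variable": [
        {"name": "dob", "expression": "(#.resource->>'birthDate')"},
        {"name": "gender", "expression": "(#.resource->>'gender')"},
    ],
    "block": [{"name": "dob", "variable": "dob"}],
    "feature": [
        {"name": "sex", "case": [
            {"expression": "l.#gendr = r.#gender", "weight": 1.85},  # deliberate typo
            {"else": -4.84},
        ]},
    ],
}

print(sorted(undeclared_references(model)))  # ['gendr']
```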

### Model CRUD

| Method | Path | Description |
| --- | --- | --- |
| `GET` | `/api/models` | List all models (optional `?resource=Patient`) |
| `POST` | `/api/models` | Create a model |
| `GET` | `/api/models/:id` | Get model by ID |
| `PUT` | `/api/models/:id` | Update model |
| `DELETE` | `/api/models/:id` | Delete model |
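
The endpoints in the table can be called from any HTTP client. The following Python sketch builds the requests with the standard library without sending them. `BASE_URL` and the `model_request` helper are assumptions for illustration, and authentication is not covered on this page, so add whatever headers your deployment requires.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: replace with your MDMbox URL


def model_request(method, model_id=None, body=None, resource=None):
    """Build (but do not send) a request for the /api/models endpoints."""
    path = "/api/models" + (f"/{model_id}" if model_id else "")
    if resource:
        path += f"?resource={resource}"  # optional filter when listing models
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


create = model_request("POST", body={
    "resourceType": "MatchingModel",
    "id": "patient-model",
    "resource": "Patient",
})
listing = model_request("GET", resource="Patient")
delete = model_request("DELETE", model_id="patient-model")

for req in (create, listing, delete):
    print(req.get_method(), req.full_url)

# To actually send a request, pass it to urllib.request.urlopen(req).
```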

## BulkMatchingModel

Used by the bulk match pipeline. Instead of querying FHIR JSONB at comparison time, it pre-extracts data into typed PostgreSQL columns for faster batch processing.

Key differences from MatchingModel:

- `column` replaces `variable` — each column has a `source` SQL expression and a PostgreSQL `type`
- `tableName` specifies the flat table to create
- `index` defines indexes to create on the flat table after population
- In `block` entries, the `variable` field holds a `column` name. The field keeps the name `variable` in both model types, but in a BulkMatchingModel it points to a column rather than a variable.
- In `feature` expressions, reference column names directly (e.g., `l.dob` instead of `l.#dob`)

```json
{
  "resourceType": "BulkMatchingModel",
  "id": "patient-bulk",
  "resource": "Patient",
  "tableName": "mdm.patient_flat",
  "thresholds": {
    "certain": 25,
    "probable": 16
  },
  "column": [
    {"name": "dob", "type": "text", "source": "resource->>'birthDate'"},
    {"name": "family", "type": "text", "source": "immutable_unaccent_upper(resource->'name'->0->>'family')"}
  ],
  "index": [
    {"name": "idx_dob", "column": "dob", "type": "btree"},
    {"name": "idx_family", "column": "family", "type": "btree"}
  ],
  "block": [
    {"name": "dob", "variable": "dob"},
    {"name": "fn", "variable": "family"}
  ],
  "feature": [
    {
      "name": "dob",
      "case": [
        {"expression": "l.dob = r.dob", "weight": 10.59},
        {"else": -10.32}
      ]
    }
  ]
}
```
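
When porting a MatchingModel to a BulkMatchingModel, feature expressions need the `#` prefix removed and every block and index must point at a declared column. The Python sketch below shows both steps. `to_bulk_expression` and `unknown_columns` are hypothetical helpers, not part of MDMbox.

```python
import re


def to_bulk_expression(expression: str) -> str:
    """Turn `l.#dob` into `l.dob` by dropping the `#` variable prefix."""
    return re.sub(r"\b([lr])\.#(\w+)", r"\1.\2", expression)


def unknown_columns(model: dict) -> set:
    """Column names used by blocks or indexes but not declared in `column`."""
    declared = {c["name"] for c in model.get("column", [])}
    used = {b["variable"] for b in model.get("block", []) if "variable" in b}
    used |= {i["column"] for i in model.get("index", [])}
    return used - declared


print(to_bulk_expression("levenshtein(l.#dob, r.#dob) <= 1"))
# levenshtein(l.dob, r.dob) <= 1

bulk_model = {
    "column": [
        {"name": "dob", "type": "text", "source": "resource->>'birthDate'"},
    ],
    "index": [{"name": "idx_dob", "column": "dob", "type": "btree"}],
    "block": [
        {"name": "dob", "variable": "dob"},
        {"name": "fn", "variable": "family"},  # `family` column is missing above
    ],
}
print(sorted(unknown_columns(bulk_model)))  # ['family']
```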

## Admin UI

Models can be managed through the Admin UI at `/admin`. The UI provides a JSON editor for creating and editing both MatchingModel and BulkMatchingModel resources.

## Tuning

The example weights above are for demonstration only. Before deploying to production, calibrate the weights against your own data. Key considerations:

- **Bayes factors** reflect the discriminative power of each comparison. Weights are log2 Bayes factors, so a larger absolute weight means that case is more decisive.
- **Thresholds** control the tradeoff between precision and recall. Lower thresholds catch more duplicates but increase false positives.
- **Blocks** determine which pairs are compared. Blocks that are too broad slow down matching; blocks that are too narrow miss duplicates.
- **Database indexes** are critical for performance with large datasets. Create indexes on columns used in blocks.
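
To get a feel for what a log2 Bayes factor means, the Python sketch below converts between probabilities and weights. It assumes the common Fellegi-Sunter formulation, where a weight is log2(m/u), with m the probability of an outcome among true matches and u the probability among non-matches. The symbols `m`, `u`, and `prior` and both helper names are assumptions introduced here; see the mathematical details page for the definitions MDMbox uses.

```python
import math


def weight_from_probabilities(m: float, u: float) -> float:
    """log2 Bayes factor for an outcome with probability m among matches, u among non-matches."""
    return math.log2(m / u)


def posterior_probability(score: float, prior: float) -> float:
    """Probability that a pair is a match, given its total score and a prior match rate."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** score  # each unit of score doubles the odds
    return posterior_odds / (1 + posterior_odds)


# An outcome seen in 90% of true matches but only 0.1% of non-matches
# earns a weight of a little under 10.
print(round(weight_from_probabilities(0.9, 0.001), 2))

# How the same score reads under different assumed prior match rates.
for prior in (1e-6, 1e-4):
    print(prior, round(posterior_probability(25, prior), 4))
```

Because the posterior depends on the prior match rate, the same threshold can be strict on one dataset and lenient on another, which is one reason to calibrate on your own data.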

{% hint style="info" %}
Professional tuning services are available. Contact [Health Samurai](https://www.health-samurai.io/company#contact-form) for assistance with model calibration using machine learning and expert analysis.
{% endhint %}

## Next steps

{% content-ref %}
[Find duplicates: $match](match-operation.md)
{% endcontent-ref %}

{% content-ref %}
[Bulk matching](bulk-match.md)
{% endcontent-ref %}

{% content-ref %}
[Mathematical details](mathematical-details.md)
{% endcontent-ref %}