Bulk matching
Bulk matching finds all duplicate pairs across an entire dataset. Unlike $match, which compares one resource at a time, bulk matching compares every record against every other record in parallel.
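The all-pairs workload grows quadratically with dataset size, which is why the block-column indexes discussed under Performance considerations matter. A quick sketch of the arithmetic:

```python
def pair_count(n: int) -> int:
    """Number of unordered record pairs in a naive all-against-all
    comparison: n choose 2."""
    return n * (n - 1) // 2

# The 150,000-record dataset from the status example on this page
# yields roughly 11.2 billion candidate pairs before any blocking.
naive_pairs = pair_count(150_000)
```

Blocking on indexed columns is what keeps the comparison query from materializing all of these pairs.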
Bulk matching requires a BulkMatchingModel. See Matching models.
How it works
The bulk match pipeline has three stages:
Prepare. MDMbox creates a flat PostgreSQL table from FHIR resources using the column definitions in your BulkMatchingModel. This extracts and denormalizes the data needed for comparison, then creates indexes.
Match. Parallel workers compare records in batches. Each worker claims a batch, runs the comparison query, and writes matching pairs to the results table. Workers use FOR UPDATE SKIP LOCKED for lock-free distribution.
Download. Results are streamed as CSV using PostgreSQL's COPY protocol for efficient transfer.
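The claim-and-process loop of the Match stage can be illustrated with a small simulation. This is not MDMbox code: the names here (`BatchQueue`, `worker`) are invented for the example, and a lock-protected set stands in for the at-most-once claiming that `FOR UPDATE SKIP LOCKED` provides natively in PostgreSQL.

```python
import threading

class BatchQueue:
    """Illustrative stand-in for a batches table. In PostgreSQL the
    claim step would be a single query along the lines of:
        SELECT id FROM batches WHERE status = 'pending'
        LIMIT 1 FOR UPDATE SKIP LOCKED
    Here a lock-protected set gives the same at-most-once guarantee."""

    def __init__(self, batch_ids):
        self._pending = set(batch_ids)
        self._lock = threading.Lock()

    def claim(self):
        with self._lock:
            return self._pending.pop() if self._pending else None

def worker(queue, processed):
    # Each worker loops: claim a batch, run the comparison, write results.
    while (batch := queue.claim()) is not None:
        processed.append(batch)  # stand-in for the comparison query

queue = BatchQueue(range(100))
processed = []
threads = [threading.Thread(target=worker, args=(queue, processed))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Every batch ends up claimed by exactly one worker.
```

Because no worker ever blocks on a batch another worker holds, adding workers scales the Match stage until CPU or connection limits are reached.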
Admin UI
The Admin UI at /admin/bulk-match is the recommended way to run bulk matching. It provides a visual interface for the entire pipeline:
- Select a BulkMatchingModel from the dropdown
- View flat table status and trigger preparation
- Configure and start bulk match jobs
- Monitor worker progress in real time
- Download results as CSV
- Stop, resume, or archive jobs
API workflow
Step 1: Prepare the flat table
POST /api/bulk-match/patient-bulk/prepare
This creates the flat table, populates it from FHIR resources, and creates indexes. The operation runs asynchronously. Poll the status endpoint to track progress:
GET /api/bulk-match/patient-bulk/status
The response is an OperationOutcome. The diagnostics field carries the preparation details as an EDN map serialized to a string:
{
  "resourceType": "OperationOutcome",
  "issue": [
    {
      "severity": "information",
      "code": "informational",
      "details": {"text": "Flat table ready (150000 records)"},
      "diagnostics": "{:model-id \"patient-bulk\", :prepare-status \"ready\", :stage nil, :source-count 150000, :prepare-duration-ms 12500, :prepared-at \"2025-04-10T14:30:00Z\"}"
    }
  ]
}
Possible statuses: pending, preparing, ready, failed.
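If you need the diagnostics fields programmatically, the flat EDN map can be picked apart with a few lines of Python. This is a minimal sketch for the exact shape shown above (keywords, strings, integers, and nil only); a real EDN library would be more robust.

```python
import re

def parse_diagnostics(edn: str) -> dict:
    """Crude parser for the flat EDN map in the diagnostics field.
    Handles only :keyword "string" / integer / nil entries."""
    out = {}
    for key, val in re.findall(r':([\w-]+)\s+("(?:[^"\\]|\\.)*"|[^,}]+)', edn):
        val = val.strip()
        if val.startswith('"'):
            out[key] = val[1:-1]
        elif val == "nil":
            out[key] = None
        elif re.fullmatch(r"-?\d+", val):
            out[key] = int(val)
        else:
            out[key] = val
    return out

diag = parse_diagnostics(
    '{:model-id "patient-bulk", :prepare-status "ready", :stage nil, '
    ':source-count 150000, :prepare-duration-ms 12500, '
    ':prepared-at "2025-04-10T14:30:00Z"}'
)
```

With the example string above, `diag["prepare-status"]` is `"ready"` and `diag["source-count"]` is `150000`.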
To force re-creation of the flat table (e.g., after data changes):
POST /api/bulk-match/patient-bulk/prepare?force=true
Step 2: Start the bulk match
POST /api/bulk-match/patient-bulk/start
Content-Type: application/json
{
  "batchSize": 1000,
  "workersCount": 4
}
- batchSize — number of records per worker batch (100 to 10000)
- workersCount — number of parallel workers (1 to 16)
Response (HTTP 202):
{
  "resourceType": "OperationOutcome",
  "issue": [
    {
      "severity": "information",
      "code": "informational",
      "details": {"text": "Bulk match started, job #42"},
      "diagnostics": "{:id 42, :model-id \"patient-bulk\", :status \"in-progress\"}"
    }
  ]
}
The job ID from details.text (42 in this example, also present as :id in diagnostics) is needed for the download endpoint.
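Extracting the job ID from the response can be done against the "job #N" convention in details.text. A minimal helper, assuming only the response shape shown above:

```python
import json
import re

def extract_job_id(outcome_json: str) -> int:
    """Pull the job ID out of the OperationOutcome returned by /start,
    relying on the 'job #N' convention in details.text."""
    outcome = json.loads(outcome_json)
    text = outcome["issue"][0]["details"]["text"]
    match = re.search(r"job #(\d+)", text)
    if match is None:
        raise ValueError(f"no job id found in: {text!r}")
    return int(match.group(1))

# Shape mirrors the example response above.
outcome = {
    "resourceType": "OperationOutcome",
    "issue": [{"details": {"text": "Bulk match started, job #42"}}],
}
job_id = extract_job_id(json.dumps(outcome))
```

Parsing :id out of the diagnostics EDN string would work just as well; details.text is simply easier to reach from JSON.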
Step 3: Monitor progress
Poll the status endpoint, or use the Admin UI, which auto-refreshes every 2 seconds.
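A polling loop can be sketched as follows. `fetch_status` is a placeholder for whatever HTTP client you use (a thin wrapper around GET /api/bulk-match/{model}/status), and the terminal status names are illustrative assumptions, not a documented list.

```python
import time

def poll_until_done(fetch_status, interval=2.0, timeout=600.0,
                    done=frozenset({"completed", "stopped", "failed"})):
    """Poll until the job reaches a terminal status.

    fetch_status: any zero-argument callable returning the current
    job status string. The terminal statuses in `done` are assumed
    for illustration.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in done:
            return status
        time.sleep(interval)
    raise TimeoutError("bulk match did not finish within the timeout")

# Usage with a stubbed status source:
statuses = iter(["in-progress", "in-progress", "completed"])
final = poll_until_done(lambda: next(statuses), interval=0.0)
```

Injecting `fetch_status` keeps the loop testable and independent of any particular HTTP library.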
Step 4: Download results
Once the job completes:
GET /api/bulk-match/patient-bulk/download/{job-id}
Returns a CSV file with columns:
| Column | Description |
|---|---|
| resource_id_1 | First resource ID |
| resource_id_2 | Second resource ID |
| match_weight | Total match score |
| {feature}_w | Individual feature weight (one column per feature) |
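Since the per-feature columns follow the `{feature}_w` naming convention, the CSV is easy to post-process. A sketch, with hypothetical feature names (`given_w`, `family_w`) standing in for whatever your model defines:

```python
import csv
import io

def read_matches(csv_text: str) -> list:
    """Split each result row into the fixed columns and a dict of
    per-feature weights (every column ending in _w)."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        features = {k[:-2]: float(v)
                    for k, v in row.items() if k.endswith("_w")}
        rows.append({
            "pair": (row["resource_id_1"], row["resource_id_2"]),
            "match_weight": float(row["match_weight"]),
            "features": features,
        })
    return rows

# Hypothetical two-feature result file:
sample = (
    "resource_id_1,resource_id_2,match_weight,given_w,family_w\n"
    "pt-1,pt-2,18.5,7.2,11.3\n"
)
matches = read_matches(sample)
```

Here `matches[0]["features"]` maps each feature name to its contribution, which is useful for inspecting why a pair scored the way it did.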
Managing jobs
Stop a running job
POST /api/bulk-match/patient-bulk/stop
Workers finish their current batch and exit. For immediate cancellation:
POST /api/bulk-match/patient-bulk/stop?force=true
Resume a stopped job
POST /api/bulk-match/patient-bulk/continue
Resumes from where it left off — completed batches are not reprocessed.
Archive a job
POST /api/bulk-match/patient-bulk/archive
Moves a completed, stopped, or failed job to archived status.
Performance considerations
- Batch size affects memory usage per worker. Larger batches reduce overhead but use more memory.
- Worker count should not exceed available CPU cores or database connections.
- The flat table uses PostgreSQL unlogged tables (no WAL overhead) for faster writes.
- Indexes on block columns are critical — without them, the comparison query does a full cross-join.
Each bulk match worker holds a database connection for the duration of its work. Make sure MDMBOX_DB_MAX_POOL_SIZE is large enough to accommodate the number of workers plus normal application traffic.
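The pool-sizing arithmetic is simple but worth making explicit. A tiny helper, where the headroom default is an illustrative assumption rather than an MDMbox recommendation:

```python
def min_pool_size(workers: int, app_headroom: int = 10) -> int:
    """Lower bound for MDMBOX_DB_MAX_POOL_SIZE: one connection per
    bulk match worker, plus headroom for normal application traffic.
    The headroom default here is an assumption for illustration."""
    return workers + app_headroom

# A 4-worker job with the default headroom needs at least 14 connections.
required = min_pool_size(4)
```

If the pool is smaller than this, bulk match workers and regular request handlers end up competing for connections.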