MDMbox Docs

Bulk matching

Bulk matching finds all duplicate pairs across an entire dataset. Unlike $match, which compares one resource at a time, bulk matching compares every record against every other record in parallel.

Bulk matching requires a BulkMatchingModel. See Matching models.

How it works

The bulk match pipeline has three stages:

Prepare → Match → Download

Prepare. MDMbox creates a flat PostgreSQL table from FHIR resources using the column definitions in your BulkMatchingModel. This extracts and denormalizes the data needed for comparison, then creates indexes.

Match. Parallel workers compare records in batches. Each worker claims a batch, runs the comparison query, and writes matching pairs to the results table. Workers claim batches with FOR UPDATE SKIP LOCKED, so a worker skips batches that another worker has already locked instead of waiting for them.

Download. Results are streamed as CSV using PostgreSQL's COPY protocol for efficient transfer.
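
The batch-claiming behaviour of the Match stage can be pictured with a small in-memory analogy. This is not MDMbox code and it does not talk to PostgreSQL: a non-blocking lock per batch stands in for the row lock taken by FOR UPDATE SKIP LOCKED, and the function name run_workers is made up for this sketch.

```python
import threading


def run_workers(batch_count, workers_count):
    """Process every batch exactly once; return {batch_id: worker_id}."""
    # One lock per batch plays the role of PostgreSQL's row lock (assumption:
    # this is only an analogy, the real pipeline locks rows in a table).
    locks = [threading.Lock() for _ in range(batch_count)]
    done = [False] * batch_count
    processed_by = {}

    def worker(worker_id):
        for batch_id in range(batch_count):
            # Non-blocking acquire: if another worker holds this batch,
            # skip it instead of waiting (the SKIP LOCKED idea).
            if not locks[batch_id].acquire(blocking=False):
                continue
            try:
                if not done[batch_id]:
                    # Placeholder for "run the comparison query".
                    processed_by[batch_id] = worker_id
                    done[batch_id] = True
            finally:
                locks[batch_id].release()

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(workers_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed_by
```

Whichever worker first gets hold of a batch processes it, so no batch is handled twice and no worker ever blocks on another.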

Admin UI

The Admin UI at /admin/bulk-match is the recommended way to run bulk matching. It provides a visual interface for the entire pipeline:

  • Select a BulkMatchingModel from the dropdown
  • View flat table status and trigger preparation
  • Configure and start bulk match jobs
  • Monitor worker progress in real time
  • Download results as CSV
  • Stop, resume, or archive jobs

API workflow

Step 1: Prepare the flat table

POST /api/bulk-match/patient-bulk/prepare

This creates the flat table, populates it from FHIR resources, and creates indexes. The operation runs asynchronously. Poll the status endpoint to track progress:

GET /api/bulk-match/patient-bulk/status

The response is an OperationOutcome. The diagnostics field contains the preparation details as a single string, formatted as an EDN-style map rather than as nested JSON:

{
  "resourceType": "OperationOutcome",
  "issue": [
    {
      "severity": "information",
      "code": "informational",
      "details": {"text": "Flat table ready (150000 records)"},
      "diagnostics": "{:model-id \"patient-bulk\", :prepare-status \"ready\", :stage nil, :source-count 150000, :prepare-duration-ms 12500, :prepared-at \"2025-04-10T14:30:00Z\"}"
    }
  ]
}

Possible statuses: pending, preparing, ready, failed.
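
Because the status sits inside the diagnostics string, a client has to extract it before it can decide whether to keep polling. The following is a minimal sketch under stated assumptions: the helper names, the regular expression, and the default interval and timeout are illustrative, and fetch_status is a caller-supplied function that performs the GET request and returns the decoded JSON body.

```python
import re
import time


def parse_prepare_status(outcome):
    """Return the :prepare-status value from the first issue, or None."""
    issues = outcome.get("issue") or []
    if not issues:
        return None
    diagnostics = issues[0].get("diagnostics", "")
    # Assumption: the status always appears as :prepare-status "value".
    match = re.search(r':prepare-status\s+"([^"]*)"', diagnostics)
    return match.group(1) if match else None


def wait_until_prepared(fetch_status, interval_s=2.0, timeout_s=600.0,
                        sleep=time.sleep):
    """Poll until the flat table is ready; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = parse_prepare_status(fetch_status())
        if status == "ready":
            return status
        if status == "failed":
            raise RuntimeError("flat table preparation failed")
        if time.monotonic() >= deadline:
            raise TimeoutError("flat table not ready, last status: %r" % status)
        # pending / preparing (or an unparsable response): try again later.
        sleep(interval_s)
```

Passing the fetch function in keeps the sketch independent of any particular HTTP client and makes the loop easy to exercise with canned responses.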

To force re-creation of the flat table (e.g., after data changes):

POST /api/bulk-match/patient-bulk/prepare?force=true

Step 2: Start the bulk match

POST /api/bulk-match/patient-bulk/start
Content-Type: application/json

{
  "batchSize": 1000,
  "workersCount": 4
}

  • batchSize: number of records per worker batch (100 to 10000)
  • workersCount: number of parallel workers (1 to 16)
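
A client can check both parameters before sending the request. The sketch below builds the JSON body and rejects values outside the documented ranges; the function name build_start_body is hypothetical, and treating the bounds as inclusive is an assumption.

```python
import json

# Ranges taken from the parameter list above; inclusive bounds are assumed.
BATCH_SIZE_RANGE = (100, 10000)
WORKERS_COUNT_RANGE = (1, 16)


def build_start_body(batch_size, workers_count):
    """Return the JSON request body for the start endpoint."""
    checks = (
        ("batchSize", batch_size, BATCH_SIZE_RANGE),
        ("workersCount", workers_count, WORKERS_COUNT_RANGE),
    )
    for name, value, (low, high) in checks:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError("%s must be an integer" % name)
        if not low <= value <= high:
            raise ValueError("%s must be between %d and %d" % (name, low, high))
    return json.dumps({"batchSize": batch_size, "workersCount": workers_count})
```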

Response (HTTP 202):

{
  "resourceType": "OperationOutcome",
  "issue": [
    {
      "severity": "information",
      "code": "informational",
      "details": {"text": "Bulk match started, job #42"},
      "diagnostics": "{:id 42, :model-id \"patient-bulk\", :status \"in-progress\"}"
    }
  ]
}

The job ID (42 in this example) is needed for the download endpoint. It appears in details.text and as :id in diagnostics.
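
Since the job ID is embedded in text rather than returned as its own field, a client has to parse it out. This is a minimal sketch, assuming the "job #N" wording and the :id key shown in the example response stay stable; the function name extract_job_id is illustrative.

```python
import re


def extract_job_id(outcome):
    """Return the job ID as an int, or None if it cannot be found."""
    issues = outcome.get("issue") or []
    if not issues:
        return None
    issue = issues[0]
    # First choice: the human-readable text, e.g. "Bulk match started, job #42".
    text = (issue.get("details") or {}).get("text", "")
    match = re.search(r"job #(\d+)", text)
    if match is None:
        # Fallback: the :id key in the diagnostics string. The lookbehind
        # keeps keys such as :model-id from matching.
        diagnostics = issue.get("diagnostics", "")
        match = re.search(r"(?<![\w-]):id\s+(\d+)", diagnostics)
    return int(match.group(1)) if match else None
```

For the example response above, extract_job_id returns 42.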

Step 3: Monitor progress

Poll the status endpoint, or use the Admin UI, which auto-refreshes every 2 seconds.

Step 4: Download results

Once the job completes:

GET /api/bulk-match/patient-bulk/download/{job-id}

Returns a CSV file with columns:

Column          Description
resource_id_1   First resource ID
resource_id_2   Second resource ID
match_weight    Total match score
{feature}_w     Individual feature weight (one column per feature)
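
The downloaded file can be read with any CSV parser. The sketch below uses Python's csv module to collect pairs above a score threshold. It assumes the file starts with a header row naming the columns listed above; the feature columns name_w and dob_w, the resource IDs, and the weights in SAMPLE_CSV are invented for illustration.

```python
import csv
import io

# Hypothetical sample: feature names, IDs, and weights are made up.
SAMPLE_CSV = """resource_id_1,resource_id_2,match_weight,name_w,dob_w
pt-1,pt-2,18.5,10.0,8.5
pt-3,pt-4,4.0,4.0,0.0
"""


def read_pairs(csv_text, min_weight=None):
    """Parse result rows, optionally keeping only pairs at or above min_weight."""
    pairs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        weight = float(row["match_weight"])
        if min_weight is not None and weight < min_weight:
            continue
        # Every column ending in "_w" is treated as a per-feature weight.
        features = {
            name[:-2]: float(value)
            for name, value in row.items()
            if name.endswith("_w")
        }
        pairs.append({
            "id_1": row["resource_id_1"],
            "id_2": row["resource_id_2"],
            "match_weight": weight,
            "features": features,
        })
    return pairs
```

For large result sets, iterating over an open file object instead of an in-memory string avoids loading the whole download at once.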

Managing jobs

Stop a running job

POST /api/bulk-match/patient-bulk/stop

Workers finish their current batch and exit. For immediate cancellation:

POST /api/bulk-match/patient-bulk/stop?force=true

Resume a stopped job

POST /api/bulk-match/patient-bulk/continue

The job resumes from where it left off; completed batches are not reprocessed.

Archive a job

POST /api/bulk-match/patient-bulk/archive

Moves a completed, stopped, or failed job to archived status.
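
The stop, continue, and archive actions described above imply a small job lifecycle, summarised here as a transition table. This is a reading of the text, not MDMbox's implementation: only the "in-progress" status string appears in the example responses, and the other status spellings ("stopped", "completed", "failed", "archived") are assumptions.

```python
# action -> (statuses the action applies to, resulting status)
# Status spellings other than "in-progress" are assumed, not documented.
TRANSITIONS = {
    "stop": ({"in-progress"}, "stopped"),
    "continue": ({"stopped"}, "in-progress"),
    "archive": ({"completed", "stopped", "failed"}, "archived"),
}


def apply_action(status, action):
    """Return the status after applying an action, or raise ValueError."""
    if action not in TRANSITIONS:
        raise ValueError("unknown action: %r" % action)
    allowed_from, result = TRANSITIONS[action]
    if status not in allowed_from:
        raise ValueError("cannot %s a job that is %s" % (action, status))
    return result
```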

Performance considerations

  • Batch size affects memory usage per worker. Larger batches reduce overhead but use more memory.
  • Worker count should not exceed available CPU cores or database connections.
  • The flat table uses PostgreSQL unlogged tables (no WAL overhead) for faster writes.
  • Indexes on block columns are critical: without them, the comparison query does a full cross-join.

Each bulk match worker holds a database connection for the duration of its work. Make sure MDMBOX_DB_MAX_POOL_SIZE is large enough to accommodate the number of workers plus normal application traffic.
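
As a rough sizing aid, the worker count can be derived from the constraints above: one connection per worker, no more workers than CPU cores, and the documented maximum of 16. The function and parameter names below are hypothetical, and reserved_for_app is whatever number of connections you choose to keep free for normal application traffic.

```python
MAX_WORKERS = 16  # upper bound documented for workersCount


def max_safe_workers(cpu_cores, pool_size, reserved_for_app):
    """Estimate the largest workersCount that fits the given resources."""
    # Each worker holds one connection for the duration of its work.
    free_connections = pool_size - reserved_for_app
    return max(0, min(cpu_cores, free_connections, MAX_WORKERS))
```

For example, with 8 cores, a pool of 10 connections, and 4 connections reserved for the application, the estimate is 6 workers. A result of 0 means the pool is too small to run a bulk match without starving other traffic.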
