Mathematical details
See the fastlink paper for a detailed treatment.
The algorithm is based on comparisons between pairs of records (FHIR resources).
Comparison functions
Define a set of comparison functions over pairs of records. Each comparison function returns a single category, for example:
- null (value missing)
- significantly different
- slightly different
- exactly equal
Different comparison functions can have different sets of possible categories. In MDMbox, these correspond to the case entries within a feature definition.
An example comparison function for surnames:
- -1, if the surname is missing from either record
- 0, if the Levenshtein distance between the surnames is greater than 2
- 1, if the Levenshtein distance is 2
- 2, if the Levenshtein distance is 1
- 3, if the surnames are exactly equal
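The surname comparison function above could be sketched in Python roughly as follows (an illustrative sketch, not MDMbox code; the function names are invented for this example):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def compare_surnames(s1, s2):
    """Return the comparison category described in the text."""
    if not s1 or not s2:
        return -1          # value missing in either record
    d = levenshtein(s1.lower(), s2.lower())
    if d == 0:
        return 3           # surnames exactly equal
    if d == 1:
        return 2
    if d == 2:
        return 1
    return 0               # significantly different
```

Note that the categories are ordered: higher values indicate stronger agreement, which later maps naturally onto larger weights.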
Bayes factors
Two records match if they belong to the same entity (e.g., two records for the same patient).
In Bayesian terms, the prior probability is the probability that two randomly chosen records match.
For each comparison function value, define:
- m-probability: probability of this value given that records match
- u-probability: probability of this value given that records do not match
The ratio m/u is the Bayes factor. In MDMbox, the weight field in feature cases represents log2 of the Bayes factor.
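A weight for a single comparison outcome can be computed directly from its m- and u-probabilities. The probabilities below are invented for illustration, not calibrated values:

```python
from math import log2

def weight(m: float, u: float) -> float:
    """log2 of the Bayes factor m/u for one comparison outcome."""
    return log2(m / u)

# Hypothetical example: "surnames exactly equal" might be common among
# matches (m = 0.92) but rare among non-matches (u = 0.001), while
# "significantly different" points the other way.
w_equal = weight(0.92, 0.001)   # large positive weight: evidence for a match
w_diff  = weight(0.02, 0.85)    # negative weight: evidence against a match
```

A weight of 0 (m = u) means the outcome carries no information either way.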
Match score
The posterior odds that two records match are the prior odds multiplied by the product of all Bayes factors. In log space (as used by MDMbox), the product becomes a sum of log2 Bayes factors, with the log2 prior odds as a constant offset:
match_weight = weight_1 + weight_2 + ... + weight_n
To convert the resulting log2 odds to a probability estimate:
probability = x / (1 + x), where x = 2^match_weight (or equivalently, probability = 1 / (1 + 2^(-match_weight))).
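Combining the weights and converting to a probability can be sketched as (an illustrative sketch; it assumes any log2 prior odds are already folded into the weight list):

```python
def match_probability(weights):
    """Combine per-feature log2 Bayes factors into a match probability.

    Assumes the log2 prior odds are included in `weights` (or treated
    as one extra constant term in the list)."""
    match_weight = sum(weights)
    # probability = x / (1 + x) with x = 2**match_weight, written in the
    # equivalent 1 / (1 + 2**(-match_weight)) form.
    return 1.0 / (1.0 + 2.0 ** (-match_weight))
```

For example, a total weight of 0 yields a probability of 0.5, and each additional weight unit doubles the odds of a match.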
Independence assumption
The comparison function outcomes are assumed to be conditionally independent given the match status. In practice, the algorithm is robust to moderate violations of this assumption.
Parameter estimation
The m-probabilities and u-probabilities can be estimated from data using the EM (Expectation-Maximization) algorithm. This is discussed in detail in the fastlink paper. For MDMbox, these parameters are specified directly in the model as feature weights, typically calibrated through a combination of domain expertise and statistical analysis.
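A minimal EM sketch for estimating these parameters from comparison vectors, in the spirit of the approach discussed in the fastlink paper (the data layout, starting values, and function name are assumptions for this illustration; MDMbox takes the parameters directly in the model instead):

```python
def em(gamma, n_cats, lam=0.1, iters=50):
    """Estimate (lambda, m, u) by EM.

    gamma:  list of comparison vectors, one int category (0..n-1) per feature
    n_cats: number of categories for each feature
    Returns lam (match prior), and m[k][v], u[k][v] per feature/category."""
    # starting values tilted so m favours high (agreement) categories
    m = [[(v + 1) / sum(range(1, n + 1)) for v in range(n)] for n in n_cats]
    u = [[(n - v) / sum(range(1, n + 1)) for v in range(n)] for n in n_cats]
    for _ in range(iters):
        # E-step: posterior probability that each pair is a match
        post = []
        for g in gamma:
            pm, pu = lam, 1 - lam
            for k, v in enumerate(g):
                pm *= m[k][v]
                pu *= u[k][v]
            post.append(pm / (pm + pu))
        # M-step: re-estimate the prior and the category probabilities
        lam = sum(post) / len(post)
        total_m = sum(post)
        total_u = sum(1 - p for p in post)
        for k, n in enumerate(n_cats):
            for v in range(n):
                m[k][v] = sum(p for p, g in zip(post, gamma) if g[k] == v) / total_m
                u[k][v] = sum(1 - p for p, g in zip(post, gamma) if g[k] == v) / total_u
    return lam, m, u
```

On synthetic data where matching pairs mostly produce high agreement categories and non-matching pairs low ones, the estimates converge toward the generating m/u-probabilities; real data typically needs blocking and more careful initialization.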