Summary
Update the blocking query to optionally skip over missing blocking keys.
Acceptance Criteria
Details / Tasks
Modify the get_block_data function to optionally skip a blocking key when its value is missing from the incoming payload. If we total the log odds scores for all blocking keys in a pass, the keys still available to block on must account for at least the default/compare_minimum_percentage share of that total. Otherwise, we should abort the query and return an empty dataset, because the query would return too many records.
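The check described above could look roughly like the sketch below. This is only an illustration under stated assumptions, not the actual get_block_data code: the helper name `keys_to_block_on`, its parameters, the example field names, and the interpretation of the threshold as a fraction of the summed log odds are all hypothetical.

```python
from typing import List, Mapping, Optional


def keys_to_block_on(
    blocking_values: Mapping[str, Optional[str]],
    log_odds: Mapping[str, float],
    minimum_percentage: float,
) -> Optional[List[str]]:
    """Hypothetical helper: pick the blocking keys to use for one pass.

    Returns the keys that have a value, or None when too much of the
    pass's log odds weight is missing and the pass should be skipped.
    """
    # Total log odds weight of every blocking key configured for the pass.
    total = sum(log_odds[key] for key in blocking_values)
    if total <= 0:
        return None

    # Keys whose value is present in the incoming payload.
    available = [
        key for key, value in blocking_values.items() if value not in (None, "")
    ]
    available_score = sum(log_odds[key] for key in available)

    # Assumption: the threshold is a fraction (0.0 to 1.0) of the total.
    if available_score / total < minimum_percentage:
        return None
    return available


# Example payload with a missing ZIP value (field names are illustrative).
values = {"BIRTHDATE": "1980-01-01", "ZIP": None, "LAST_NAME": "Smith"}
scores = {"BIRTHDATE": 10.0, "ZIP": 5.0, "LAST_NAME": 5.0}

relaxed = keys_to_block_on(values, scores, 0.7)  # 15 / 20 = 0.75, pass proceeds
strict = keys_to_block_on(values, scores, 0.8)   # 0.75 < 0.8, pass is skipped
```

Returning None rather than an empty list keeps "skip this pass" distinct from "block on no keys", so the caller can return the empty dataset without building a query.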
Background / Context
When an incoming payload has missing data, specifically data elements missing in blocking keys, we only return records that also have missing data in those fields. This rewards incomplete records and penalizes records that have values in those fields. In some instances, it would be advantageous to also get back records with data in those fields. On the flip side, we need to balance this: if we skip too many blocking keys because of missing data, we risk returning too many records to analyze.
Our algorithms already store log odds values, which give us some indication as to how valuable a match is in record linkage. We can reuse those values in blocking to make decisions on whether blocking keys can be skipped. If too many values are skipped, we should just return the empty set and skip the pass altogether.
Related Issues/PRs
#223