improve blocking with missing payload keys #230

@ericbuckley

Summary

Update the blocking query to optionally skip over missing blocking keys.

Acceptance Criteria

  • A reimplementation of get_block_data
  • Add documentation to site/design.md about how blocking uses log odds
  • Add a new algorithm configuration variable, "defaults/compare_minimum_percentage", with a default of 0.7.

Details / Tasks

Modify the get_block_data function to optionally skip a blocking key when its value is missing from the incoming payload. Total the log odds scores for all blocking keys in a pass; the keys that actually have values must account for at least defaults/compare_minimum_percentage of that total. Otherwise, abort the query and return an empty dataset, because the query would return too many records.
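The threshold check described above could look roughly like the following. This is a minimal sketch, not the actual get_block_data implementation: the function name `usable_blocking_keys`, its arguments, and the field names in the example are hypothetical, and it assumes the log odds for each blocking key are available as a plain dictionary.

```python
def usable_blocking_keys(payload_values, log_odds, minimum_percentage=0.7):
    """Return the blocking keys to query on, or None if the pass should be skipped.

    payload_values: blocking key -> value from the incoming payload (hypothetical shape)
    log_odds: blocking key -> log odds score for that key (hypothetical shape)
    minimum_percentage: stands in for defaults/compare_minimum_percentage
    """
    total = sum(log_odds.values())
    if total <= 0:
        # Nothing to weigh the keys against, so skip the pass.
        return None

    # Keep only the keys whose payload value is present (not None or empty).
    present = [key for key in log_odds if payload_values.get(key)]
    available = sum(log_odds[key] for key in present)

    # Too much of the pass's log odds is missing: the query would be too broad.
    if available / total < minimum_percentage:
        return None
    return present


# Hypothetical pass with three blocking keys; the log odds total 20.
log_odds = {"BIRTHDATE": 10.0, "LAST_NAME": 6.0, "ZIP": 4.0}

# ZIP is missing: 16 / 20 = 0.8, so the pass runs on the remaining two keys.
keys_without_zip = usable_blocking_keys(
    {"BIRTHDATE": "1980-01-01", "LAST_NAME": "Smith", "ZIP": None}, log_odds
)

# BIRTHDATE is missing: 10 / 20 = 0.5, below 0.7, so the pass is skipped.
keys_without_birthdate = usable_blocking_keys(
    {"BIRTHDATE": None, "LAST_NAME": "Smith", "ZIP": "12345"}, log_odds
)
```

With this shape, a caller would return an empty dataset whenever the function returns None and otherwise build the blocking query from the returned keys only.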

Background / Context

When an incoming payload has missing data, specifically data elements missing in blocking keys, we only return records that are also missing data in those fields. This rewards incomplete records and penalizes records that have values in those fields. In some instances, it would be advantageous to also get back records with data in those fields. On the flip side, we need to balance this: if we skip too many blocking keys because of missing data, we run the risk of returning too many records to analyze.
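The difference between the two behaviors can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the real blocking step is a database query, and the `block` function, the `skip_missing` flag, and the record fields below are hypothetical names chosen for illustration.

```python
# Hypothetical stored records; None marks a missing field.
records = [
    {"id": 1, "BIRTHDATE": "1980-01-01", "ZIP": "12345"},
    {"id": 2, "BIRTHDATE": "1980-01-01", "ZIP": None},
    {"id": 3, "BIRTHDATE": "1975-05-05", "ZIP": None},
]


def block(records, payload, keys, skip_missing):
    """Return ids of records that agree with the payload on every blocking key.

    When skip_missing is True, keys with no payload value are ignored
    instead of being matched against records that are also missing them.
    """
    matches = []
    for record in records:
        agrees = True
        for key in keys:
            value = payload.get(key)
            if value is None and skip_missing:
                continue  # do not filter on this key at all
            if record.get(key) != value:
                agrees = False
                break
        if agrees:
            matches.append(record["id"])
    return matches


payload = {"BIRTHDATE": "1980-01-01", "ZIP": None}

# Current behavior: a missing ZIP only matches records that also lack a ZIP.
strict = block(records, payload, ["BIRTHDATE", "ZIP"], skip_missing=False)

# Proposed behavior: the missing ZIP is skipped, so record 1 is returned too.
relaxed = block(records, payload, ["BIRTHDATE", "ZIP"], skip_missing=True)
```

In this model the strict call returns only record 2, while the relaxed call returns records 1 and 2. Skipping every key would return all records, which is the risk the balance above is meant to guard against.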

Our algorithms already store log odds values, which indicate how valuable a match on a given field is in record linkage. We can reuse those values in blocking to decide whether blocking keys can be skipped. If the skipped keys carry too much of the total log odds, we should just return the empty set and skip the pass altogether.

Related Issues/PRs

#223

Labels

api (New API feature)
