Skip to content

Handle missing fields during evaluation#259

Merged
bamader merged 7 commits into
mainfrom
eval-missing-fields
Mar 24, 2025
Merged

Handle missing fields during evaluation#259
bamader merged 7 commits into
mainfrom
eval-missing-fields

Conversation

@bamader

@bamader bamader commented Mar 20, 2025

Copy link
Copy Markdown
Collaborator

Description

This PR adds more robust handling of comparisons between records in which one or more fields are missing data. If only a small amount of data is missing (fewer log-odds-worth than a specified percentage of the total possible), then missing comparisons are awarded partial weight to allow for data recovery and quality matches. However, if too much data is missing, the candidacy is automatically rejected.

Related Issues

Closes #235

Additional Notes

Just changed some variable names from the ticket for clarity

<--------------------- REMOVE THE LINES BELOW BEFORE MERGING --------------------->

Checklist

Please review and complete the following checklist before submitting your pull request:

  • I have ensured that the pull request is of a manageable size, allowing it to be reviewed within a single session.
  • I have reviewed my changes to ensure they are clear, concise, and well-documented.
  • I have updated the documentation, if applicable.
  • I have added or updated test cases to cover my changes, if applicable.
  • I have minimized the number of reviewers to include only those essential for the review.

Checklist for Reviewers

Please review and complete the following checklist during the review process:

  • The code follows best practices and conventions.
  • The changes implement the desired functionality or fix the reported issue.
  • The tests cover the new changes and pass successfully.
  • Any potential edge cases or error scenarios have been considered.

@codecov

codecov Bot commented Mar 20, 2025

Copy link
Copy Markdown

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.85%. Comparing base (c8a38fd) to head (4add566).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #259      +/-   ##
==========================================
+ Coverage   97.82%   97.85%   +0.02%     
==========================================
  Files          33       33              
  Lines        1748     1770      +22     
==========================================
+ Hits         1710     1732      +22     
  Misses         38       38              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@m-goggins m-goggins left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this is looking solid. Most of my comments were about making sure we use consistent naming from model/schema definitions, link.py, and matchers.py` to make it easier to understand how all the pieces fit together.

I do worry about recalculating the missing_field_weights after evaluating in compare if a user can set missing_points_proportion to 0.0. Ultimately, I think we should let users set that field to be 0.0 if they want (even though we know this punishes missing values too much) so we probably need to account for that in our function and tests.

Comment thread docs/site/reference.md Outdated
Comment thread src/recordlinker/assets/initial_algorithms.json Outdated
Comment thread src/recordlinker/linking/matchers.py Outdated
Comment thread src/recordlinker/linking/matchers.py Outdated
Comment thread src/recordlinker/linking/link.py
Comment thread src/recordlinker/linking/link.py Outdated
Comment thread src/recordlinker/linking/link.py Outdated
Comment thread src/recordlinker/linking/matchers.py Outdated
Comment thread src/recordlinker/linking/link.py Outdated
bamader and others added 3 commits March 21, 2025 14:26
Co-authored-by: Marcelle <53578688+m-goggins@users.noreply.github.com>
@bamader

bamader commented Mar 21, 2025

Copy link
Copy Markdown
Collaborator Author

Names for variables should be fully standardized, and the division by 0 case should be avoided now that we directly pass in log-odds.

@bamader bamader requested a review from m-goggins March 21, 2025 18:49

@m-goggins m-goggins left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the language changes; much easier to follow! Would you mind making one additional test where we test a missing_fields_points_proportion and/or max_allowed_field_points_proportion is something other than 0.5. Maybe both set to 0.0?

@bamader

bamader commented Mar 24, 2025

Copy link
Copy Markdown
Collaborator Author

@m-goggins No problem at all! I'll actually do you one better--your comment about testing 0's in both fields made me think of an "edge case" where the user sets missing_fields_proportion to be much larger than missingness_allowed (I outlined a scenario in the tests where this might happen). The concept is a bit the same as some of the other tests, but I thought it was worth calling out that we manage results from both the perspective of missing field points and missingness overall, regardless of which one is a larger proportion.

@bamader bamader requested a review from m-goggins March 24, 2025 18:04

@m-goggins m-goggins left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome and good thought about the additional edge case!

@bamader bamader merged commit cae7976 into main Mar 24, 2025
@bamader bamader deleted the eval-missing-fields branch March 24, 2025 19:08
incoming_record_fields = list(patient.record.feature_iter(key))
mpi_record_fields = list(record.feature_iter(key))
if len(incoming_record_fields) == 0 or len(mpi_record_fields) == 0:
return (missing_field_points_proportion * log_odds, True)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is merged, but I have an upcoming refactor on #249 to rebase these changes.

If the signature was changed to return a bool if missing data is encountered, is it necessary to calculate the missing field points proportion in the compare_* methods? It seems like this will be duplicated logic across all compare_* methods, and if we just do this on the link.compare function we can dry it up. Thoughts?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand the comment, you're saying line 155 contains a calculation that happens in two different functions (compare_exact and compare_fuzzy). In my eyes, that calculation belongs inside both compare functions and is not an instance of unnecessary code duplication (it's also just two functions). The idea of the scope of a compare function is to tell the caller how many points a candidate match should earn for a particular field. That means it is responsible for applying all properties and parameters needed to determine that value. This is why compare_fuzzy for example applies thresholding--because it's part of determining how much value a field comparison earns. While it's possible for the calling compare loop to do this by looking up the feature in question in the log-odds dictionary, I believe this is a) less clear and intuitive, b) defeats the abstraction and scope of self-contained compare functions (which would no longer be independent producers of the field value), and c) introduces unnecessary case-by-case bases in scoring, with results sometimes sometimes being produced in the compare_* and sometimes being in compare. In my opinion, these compare_* functions are correctly written as-is to make the logic the clearest and allow each function to be modular.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally get that, and re-reading my previous comment, "dry it up" was the wrong line of thought to my concern.

When I think about the two new configuration values, "max_missing_allowed_proportion" and "missing_field_points_proption", they feel inherently linked. From a readability perspective, it might make sense to handle their logic together in the link.compare function. This way, the relationship between them is more apparent to the next developer reading the code. Maybe I’m balancing readability against functional correctness?

No need to act on this right now, I'm going to experiment with both approaches in #249. I guess just keep this in mind for the conversation we have on that PR.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay, thanks for clarifying. I see a little bit better what you're saying now.

Interestingly, I actually have the opposite view. To me, the values would be linked if setting one somehow influenced or determined a range of values to set the other (they might be correlated in that way). But I actually found in thinking through implementation, I believe there to be reasonable, plausible scenarios for which users might set them at completely opposite ends of the spectrum. For example, in DIBBs default, we have only a small number of evaluation fields, each with very large log odds weight. In this context, we likely want to set our max_missing_allowed_proportion somewhere low, so that missing one field means the record isn't likely to make a comparison match. But the reverse situation is also conceivable (and we added a test case for it)--a user might have an algorithm set up in which they want to compare on many different fields, but none of their individual log-odds values are massively dominant over all others. In this use context, I could see justifications for having a high missingness allowed and low points for missing fields (if I knew my data quality was going to be really messy, which we know empirically is the case for streams like eLR), as well as a justification for having a low missingness allowed but a high points for missing fields (where I want too much missing data to disqualify a match but only a little bit to not be a hindrance if I care about finding true matches).

I bring this up because I think it underscores another point I alluded to in my comment, which is the separation between a single field comparison function and a looping comparison function. I think our code benefits a lot from the modularity of having a single function (or function type in compare_*) be solely responsible for deciding how many points a field comparison is worth. If I'm a user trying to make sense of where that value comes from, I have to look in a single place in the code, in the location where all the other field-wise parameters are accounted for. It makes intuitive sense to me. It also makes sense from the other side, thinking about the responsibility of the looping compare function. This is more like an orchestrator; it's not a function that gets in the weeds with arbitrating results on individual field comparisons. It's a function that tells subordinates it calls "you tell me what your field status is, and I'll figure out the best way to use that information." There's a cleanness there, in my opinion, a self-containment that actually makes the code more readable by not grouping the parameters.

A related example is something like thresholding. Right now, the compare_fuzzy function does its own thresholding to 0 if the jaro winkler similarity is below the configuration-specified value. That makes sense to me for the same reasons as above, namely that one function is presenting its caller with the final verdict on how many points a field is worth (and the decision to drop the MPI match candidate from consideration doesn't actually change how many points that field is worth, that's always the same number for a given algorithm config). If we pulled that up into compare, we reduce modularity and in my opinion sacrifice readability. Now, sometimes the results from the compare_fuzzy function can be trusted as final, if a field score's jaro winkler is above the threshold; but other times, it can't be, if the jaro winkler similarity is low. And the fact that we don't know this when we evaluate means the functions are now no longer independent, and in my opinion this also makes it harder for a user or future engineer to understand and debug. Curious to hear @m-goggins thoughts on this.

I think what I'm surfacing for myself is that I would rather see us go slowly on reorganizing kwargs and the algorithm config because it might have readability and intuition consequences on other parts of the code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve evaluation of missing data fields

3 participants