Handle missing fields during evaluation#259
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #259 +/- ##
==========================================
+ Coverage 97.82% 97.85% +0.02%
==========================================
Files 33 33
Lines 1748 1770 +22
==========================================
+ Hits 1710 1732 +22
Misses 38 38 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
m-goggins
left a comment
There was a problem hiding this comment.
Overall, this is looking solid. Most of my comments were about making sure we use consistent naming from model/schema definitions, link.py, and matchers.py` to make it easier to understand how all the pieces fit together.
I do worry about recalculating the missing_field_weights after evaluating in compare if a user can set missing_points_proportion to 0.0. Ultimately, I think we should let users set that field to be 0.0 if they want (even though we know this punishes missing values too much) so we probably need to account for that in our function and tests.
Co-authored-by: Marcelle <53578688+m-goggins@users.noreply.github.com>
|
Names for variables should be fully standardized, and the division by 0 case should be avoided now that we directly pass in log-odds. |
m-goggins
left a comment
There was a problem hiding this comment.
Thanks for making the language changes; much easier to follow! Would you mind making one additional test where we test a missing_fields_points_proportion and/or max_allowed_field_points_proportion is something other than 0.5. Maybe both set to 0.0?
|
@m-goggins No problem at all! I'll actually do you one better--your comment about testing 0's in both fields made me think of an "edge case" where the user sets |
m-goggins
left a comment
There was a problem hiding this comment.
Awesome and good thought about the additional edge case!
| incoming_record_fields = list(patient.record.feature_iter(key)) | ||
| mpi_record_fields = list(record.feature_iter(key)) | ||
| if len(incoming_record_fields) == 0 or len(mpi_record_fields) == 0: | ||
| return (missing_field_points_proportion * log_odds, True) |
There was a problem hiding this comment.
I know this is merged, but I have an upcoming refactor on #249 to rebase these changes.
If the signature was changed to return a bool if missing data is encountered, is it necessary to calculate the missing field points proportion in the compare_* methods? It seems like this will be duplicated logic across all compare_* methods, and if we just do this on the link.compare function we can dry it up. Thoughts?
There was a problem hiding this comment.
If I understand the comment, you're saying line 155 contains a calculation that happens in two different functions (compare_exact and compare_fuzzy). In my eyes, that calculation belongs inside both compare functions and is not an instance of unnecessary code duplication (it's also just two functions). The idea of the scope of a compare function is to tell the caller how many points a candidate match should earn for a particular field. That means it is responsible for applying all properties and parameters needed to determine that value. This is why compare_fuzzy for example applies thresholding--because it's part of determining how much value a field comparison earns. While it's possible for the calling compare loop to do this by looking up the feature in question in the log-odds dictionary, I believe this is a) less clear and intuitive, b) defeats the abstraction and scope of self-contained compare functions (which would no longer be independent producers of the field value), and c) introduces unnecessary case-by-case bases in scoring, with results sometimes sometimes being produced in the compare_* and sometimes being in compare. In my opinion, these compare_* functions are correctly written as-is to make the logic the clearest and allow each function to be modular.
There was a problem hiding this comment.
I totally get that, and re-reading my previous comment, "dry it up" was the wrong line of thought to my concern.
When I think about the two new configuration values, "max_missing_allowed_proportion" and "missing_field_points_proption", they feel inherently linked. From a readability perspective, it might make sense to handle their logic together in the link.compare function. This way, the relationship between them is more apparent to the next developer reading the code. Maybe I’m balancing readability against functional correctness?
No need to act on this right now, I'm going to experiment with both approaches in #249. I guess just keep this in mind for the conversation we have on that PR.
There was a problem hiding this comment.
Ah okay, thanks for clarifying. I see a little bit better what you're saying now.
Interestingly, I actually have the opposite view. To me, the values would be linked if setting one somehow influenced or determined a range of values to set the other (they might be correlated in that way). But I actually found in thinking through implementation, I believe there to be reasonable, plausible scenarios for which users might set them at completely opposite ends of the spectrum. For example, in DIBBs default, we have only a small number of evaluation fields, each with very large log odds weight. In this context, we likely want to set our max_missing_allowed_proportion somewhere low, so that missing one field means the record isn't likely to make a comparison match. But the reverse situation is also conceivable (and we added a test case for it)--a user might have an algorithm set up in which they want to compare on many different fields, but none of their individual log-odds values are massively dominant over all others. In this use context, I could see justifications for having a high missingness allowed and low points for missing fields (if I knew my data quality was going to be really messy, which we know empirically is the case for streams like eLR), as well as a justification for having a low missingness allowed but a high points for missing fields (where I want too much missing data to disqualify a match but only a little bit to not be a hindrance if I care about finding true matches).
I bring this up because I think it underscores another point I alluded to in my comment, which is the separation between a single field comparison function and a looping comparison function. I think our code benefits a lot from the modularity of having a single function (or function type in compare_*) be solely responsible for deciding how many points a field comparison is worth. If I'm a user trying to make sense of where that value comes from, I have to look in a single place in the code, in the location where all the other field-wise parameters are accounted for. It makes intuitive sense to me. It also makes sense from the other side, thinking about the responsibility of the looping compare function. This is more like an orchestrator; it's not a function that gets in the weeds with arbitrating results on individual field comparisons. It's a function that tells subordinates it calls "you tell me what your field status is, and I'll figure out the best way to use that information." There's a cleanness there, in my opinion, a self-containment that actually makes the code more readable by not grouping the parameters.
A related example is something like thresholding. Right now, the compare_fuzzy function does its own thresholding to 0 if the jaro winkler similarity is below the configuration-specified value. That makes sense to me for the same reasons as above, namely that one function is presenting its caller with the final verdict on how many points a field is worth (and the decision to drop the MPI match candidate from consideration doesn't actually change how many points that field is worth, that's always the same number for a given algorithm config). If we pulled that up into compare, we reduce modularity and in my opinion sacrifice readability. Now, sometimes the results from the compare_fuzzy function can be trusted as final, if a field score's jaro winkler is above the threshold; but other times, it can't be, if the jaro winkler similarity is low. And the fact that we don't know this when we evaluate means the functions are now no longer independent, and in my opinion this also makes it harder for a user or future engineer to understand and debug. Curious to hear @m-goggins thoughts on this.
I think what I'm surfacing for myself is that I would rather see us go slowly on reorganizing kwargs and the algorithm config because it might have readability and intuition consequences on other parts of the code.
Description
This PR adds more robust handling of comparisons between records in which one or more fields are missing data. If only a small amount of data is missing (fewer log-odds-worth than a specified percentage of the total possible), then missing comparisons are awarded partial weight to allow for data recovery and quality matches. However, if too much data is missing, the candidacy is automatically rejected.
Related Issues
Closes #235
Additional Notes
Just changed some variable names from the ticket for clarity
<--------------------- REMOVE THE LINES BELOW BEFORE MERGING --------------------->
Checklist
Please review and complete the following checklist before submitting your pull request:
Checklist for Reviewers
Please review and complete the following checklist during the review process: