process skip values#291
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #291 +/- ##
==========================================
+ Coverage 98.15% 98.25% +0.10%
==========================================
Files 32 33 +1
Lines 1838 1946 +108
==========================================
+ Hits 1804 1912 +108
Misses 34 34 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@m-goggins @bamader we never discussed making changes to |
There was a problem hiding this comment.
Thanks for implementing, this is looking pretty good to me!
There is some duplication between PIIRecord.feature_iter and the new clean method, but it's not obvious to me that abstracting some of that functionality is a gain over the readability cost to understanding how both methods work. I'm very much open to other options, if others have thoughts on how we could combine logic from these two methods.
I think this is okay for now; it's definitely readable as is, and we can always re-evaluate in the future if the duplicate code becomes more burdensome to maintain.
The new clean method was purposefully put into the linking module to remove the potential for a circular import between schemas.algorithm and schemas.pii
I'm fine to leave in the linking module but I the names "clean" and "matches" are confusing in the context of the rest of record linkage. I would suggest we make it more specific to skip values like remove_skip_values instead of clean and matches_skip_values instead of matches.
Using fnmatch over a regex has some real downsides when matching on "John Doe". I think our default algorithm config is going to end up have 6 values (eg `["John Doe", "John * Doe", "Jon Doe", "Jon * Doe", "Jane Doe", "Jane * Doe"]). Open to hear if people think this is a good reason to switch to a regular expression, or maybe we support both somehow?
I think we could make it easier on users and if they provide "John Doe" for NAME, we generate the wild cards for them with a helper function so that they don't have to be familiar with the wild cards, only the specific skip values. Thoughts?
Co-authored-by: Marcelle <53578688+m-goggins@users.noreply.github.com>
|
Thoughts on my other feedback?
I'm fine to leave in the linking module but I the names "clean" and "matches" are confusing in the context of the rest of record linkage. I would suggest we make it more specific to skip values like remove_skip_values instead of clean and matches_skip_values instead of matches.
I think we could make it easier on users and if they provide "John Doe" for NAME, we generate the wild cards for them with a helper function so that they don't have to be familiar with the wild cards, only the specific skip values. This might also be worth another ticket. |
bamader
left a comment
There was a problem hiding this comment.
Code looks mostly good--just the couple of things around module naming and modes, I think.
m-goggins
left a comment
There was a problem hiding this comment.
Looks good to me! I'm glad we have an approachable way to handle skip values to start and can always make changes based on user feedback.
bamader
left a comment
There was a problem hiding this comment.
Good stuff, Eric! I think this is a great middle ground starting point for us to keep things streamlined and intuitive.
Description
Add functionality to clean data of all skip values before blocking or evaluation.
Related Issues
closes #233
Additional Notes
PIIRecord.feature_iterand the new clean method, but it's not obvious to me that abstracting some of that functionality is a gain over the readability cost to understanding how both methods work. I'm very much open to other options, if others have thoughts on how we could combine logic from these two methods.cleanmethod was purposefully put into thelinkingmodule to remove the potential for a circular import betweenschemas.algorithmandschemas.piifnmatchover aregexhas some real downsides when matching on "John Doe". I think our default algorithm config is going to end up have 6 values (eg `["John Doe", "John * Doe", "Jon Doe", "Jon * Doe", "Jane Doe", "Jane * Doe"]). Open to hear if people think this is a good reason to switch to a regular expression, or maybe we support both somehow?<--------------------- REMOVE THE LINES BELOW BEFORE MERGING --------------------->
Checklist
Please review and complete the following checklist before submitting your pull request:
Checklist for Reviewers
Please review and complete the following checklist during the review process: