Skip to content

process skip values#291

Merged
ericbuckley merged 24 commits into
mainfrom
feature/233-process-skip-values
Apr 24, 2025
Merged

process skip values#291
ericbuckley merged 24 commits into
mainfrom
feature/233-process-skip-values

Conversation

@ericbuckley

@ericbuckley ericbuckley commented Apr 14, 2025

Copy link
Copy Markdown
Collaborator

Description

Add functionality to clean data of all skip values before blocking or evaluation.

Related Issues

closes #233

Additional Notes

  • There is some duplication between PIIRecord.feature_iter and the new clean method, but it's not obvious to me that abstracting some of that functionality is a gain over the readability cost to understanding how both methods work. I'm very much open to other options, if others have thoughts on how we could combine logic from these two methods.
  • The new clean method was purposefully put into the linking module to remove the potential for a circular import between schemas.algorithm and schemas.pii
  • Using fnmatch over a regex has some real downsides when matching on "John Doe". I think our default algorithm config is going to end up have 6 values (eg `["John Doe", "John * Doe", "Jon Doe", "Jon * Doe", "Jane Doe", "Jane * Doe"]). Open to hear if people think this is a good reason to switch to a regular expression, or maybe we support both somehow?

<--------------------- REMOVE THE LINES BELOW BEFORE MERGING --------------------->

Checklist

Please review and complete the following checklist before submitting your pull request:

  • I have ensured that the pull request is of a manageable size, allowing it to be reviewed within a single session.
  • I have reviewed my changes to ensure they are clear, concise, and well-documented.
  • I have updated the documentation, if applicable.
  • I have added or updated test cases to cover my changes, if applicable.
  • I have minimized the number of reviewers to include only those essential for the review.

Checklist for Reviewers

Please review and complete the following checklist during the review process:

  • The code follows best practices and conventions.
  • The changes implement the desired functionality or fix the reported issue.
  • The tests cover the new changes and pass successfully.
  • Any potential edge cases or error scenarios have been considered.

@ericbuckley ericbuckley self-assigned this Apr 14, 2025
@codecov

codecov Bot commented Apr 14, 2025

Copy link
Copy Markdown

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.25%. Comparing base (f54a090) to head (8f3a781).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #291      +/-   ##
==========================================
+ Coverage   98.15%   98.25%   +0.10%     
==========================================
  Files          32       33       +1     
  Lines        1838     1946     +108     
==========================================
+ Hits         1804     1912     +108     
  Misses         34       34              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ericbuckley ericbuckley added the api New API feature label Apr 15, 2025
@ericbuckley

Copy link
Copy Markdown
Collaborator Author

@m-goggins @bamader we never discussed making changes to dibbs-default based on this work. Should we add a default "skip_values" list in this PR, maybe in another or should we just leave it blank and let users decide?

@ericbuckley ericbuckley marked this pull request as ready for review April 15, 2025 00:19

@m-goggins m-goggins left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for implementing, this is looking pretty good to me!

There is some duplication between PIIRecord.feature_iter and the new clean method, but it's not obvious to me that abstracting some of that functionality is a gain over the readability cost to understanding how both methods work. I'm very much open to other options, if others have thoughts on how we could combine logic from these two methods.

I think this is okay for now; it's definitely readable as is, and we can always re-evaluate in the future if the duplicate code becomes more burdensome to maintain.

The new clean method was purposefully put into the linking module to remove the potential for a circular import between schemas.algorithm and schemas.pii

I'm fine to leave in the linking module but I the names "clean" and "matches" are confusing in the context of the rest of record linkage. I would suggest we make it more specific to skip values like remove_skip_values instead of clean and matches_skip_values instead of matches.

Using fnmatch over a regex has some real downsides when matching on "John Doe". I think our default algorithm config is going to end up have 6 values (eg `["John Doe", "John * Doe", "Jon Doe", "Jon * Doe", "Jane Doe", "Jane * Doe"]). Open to hear if people think this is a good reason to switch to a regular expression, or maybe we support both somehow?

I think we could make it easier on users and if they provide "John Doe" for NAME, we generate the wild cards for them with a helper function so that they don't have to be familiar with the wild cards, only the specific skip values. Thoughts?

Comment thread src/recordlinker/linking/link.py Outdated
Comment thread src/recordlinker/linking/link.py Outdated
Comment thread src/recordlinker/linking/link.py
ericbuckley and others added 2 commits April 17, 2025 12:56
Co-authored-by: Marcelle <53578688+m-goggins@users.noreply.github.com>
@ericbuckley ericbuckley requested a review from m-goggins April 18, 2025 17:19
@m-goggins

Copy link
Copy Markdown
Collaborator

Thoughts on my other feedback?

The new clean method was purposefully put into the linking module to remove the potential for a circular import between schemas.algorithm and schemas.pii

I'm fine to leave in the linking module but I the names "clean" and "matches" are confusing in the context of the rest of record linkage. I would suggest we make it more specific to skip values like remove_skip_values instead of clean and matches_skip_values instead of matches.

Using fnmatch over a regex has some real downsides when matching on "John Doe". I think our default algorithm config is going to end up have 6 values (eg `["John Doe", "John * Doe", "Jon Doe", "Jon * Doe", "Jane Doe", "Jane * Doe"]). Open to hear if people think this is a good reason to switch to a regular expression, or maybe we support both somehow?

I think we could make it easier on users and if they provide "John Doe" for NAME, we generate the wild cards for them with a helper function so that they don't have to be familiar with the wild cards, only the specific skip values. This might also be worth another ticket.

Comment thread src/recordlinker/linking/clean.py Outdated
Comment thread src/recordlinker/linking/clean.py Outdated

@bamader bamader left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks mostly good--just the couple of things around module naming and modes, I think.

Comment thread docs/developer_guide.md Outdated
Comment thread src/recordlinker/linking/clean.py Outdated

@m-goggins m-goggins left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! I'm glad we have an approachable way to handle skip values to start and can always make changes based on user feedback.

@bamader bamader left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff, Eric! I think this is a great middle ground starting point for us to keep things streamlined and intuitive.

@ericbuckley ericbuckley merged commit bf99120 into main Apr 24, 2025
@ericbuckley ericbuckley deleted the feature/233-process-skip-values branch April 24, 2025 16:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api New API feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add cleaning step to process skip_values

3 participants