process skip values by ericbuckley · Pull Request #291 · CDCgov/RecordLinker

ericbuckley · 2025-04-14T18:53:55Z

Description

Add functionality to clean data of all skip values before blocking or evaluation.
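As a rough illustration of what "cleaning a record of skip values" can mean, here is a minimal sketch that uses a plain dictionary rather than the project's PIIRecord class. The function name `blank_skip_values`, the field names, the shape of the skip-value config, and the case-insensitive exact comparison are all assumptions made for this example, not the code added in this PR.

```python
from typing import Any, Optional


def blank_skip_values(
    record: dict[str, Any], skip_values: dict[str, list[str]]
) -> dict[str, Optional[Any]]:
    """Return a copy of record with any configured skip value replaced by None.

    Hypothetical helper: matching here is exact and case-insensitive.
    """
    cleaned: dict[str, Optional[Any]] = {}
    for field, value in record.items():
        # Skip values configured for this field, normalized for comparison.
        blocked = {v.casefold() for v in skip_values.get(field, [])}
        if isinstance(value, str) and value.casefold() in blocked:
            cleaned[field] = None  # treat the placeholder as missing data
        else:
            cleaned[field] = value
    return cleaned


# Example record with a placeholder SSN that should not be used for linking.
record = {"first_name": "Maria", "last_name": "Lopez", "ssn": "999-99-9999"}
cleaned = blank_skip_values(record, {"ssn": ["999-99-9999"]})
```

The input record is left unchanged, so the cleaned copy can be passed to later steps while the original is kept for storage.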

Related Issues

closes #233

Additional Notes

There is some duplication between PIIRecord.feature_iter and the new clean method, but it's not obvious to me that abstracting some of that functionality is a gain over the readability cost to understanding how both methods work. I'm very much open to other options, if others have thoughts on how we could combine logic from these two methods.
The new clean method was purposefully put into the linking module to remove the potential for a circular import between schemas.algorithm and schemas.pii
Using fnmatch over a regex has some real downsides when matching on "John Doe". I think our default algorithm config is going to end up having 6 values (e.g. `["John Doe", "John * Doe", "Jon Doe", "Jon * Doe", "Jane Doe", "Jane * Doe"]`). Open to hearing whether people think this is a good reason to switch to a regular expression, or maybe we support both somehow?
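To make the fnmatch versus regex trade-off concrete, the sketch below compares the six glob patterns from the note above with a single regular expression that covers the same example names. The function names and the regex itself are illustrative assumptions, and the two matchers are not guaranteed to agree on every possible input.

```python
import fnmatch
import re

# The six glob patterns listed in the note above.
FNMATCH_PATTERNS = [
    "John Doe", "John * Doe",
    "Jon Doe", "Jon * Doe",
    "Jane Doe", "Jane * Doe",
]

# One regular expression intended to cover the same cases (an assumption,
# not a pattern taken from the project).
NAME_REGEX = re.compile(r"(?:Joh?n|Jane)(?: .*)? Doe", re.IGNORECASE)


def matches_fnmatch(value: str) -> bool:
    """Return True if value matches any glob pattern, ignoring case."""
    lowered = value.lower()
    return any(fnmatch.fnmatchcase(lowered, p.lower()) for p in FNMATCH_PATTERNS)


def matches_regex(value: str) -> bool:
    """Return True if the whole value matches the regular expression."""
    return NAME_REGEX.fullmatch(value) is not None
```

The glob list is easier to read for someone who does not know regex syntax, but it grows by two entries for every spelling variant, whereas the regex absorbs variants inside one pattern.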

codecov · 2025-04-14T21:41:12Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.25%. Comparing base (f54a090) to head (8f3a781).
Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #291      +/-   ##
==========================================
+ Coverage   98.15%   98.25%   +0.10%     
==========================================
  Files          32       33       +1     
  Lines        1838     1946     +108     
==========================================
+ Hits         1804     1912     +108     
  Misses         34       34

ericbuckley · 2025-04-15T00:19:50Z

@m-goggins @bamader we never discussed making changes to dibbs-default based on this work. Should we add a default "skip_values" list in this PR, add it in a separate PR, or leave it blank and let users decide?

m-goggins

Thanks for implementing, this is looking pretty good to me!

> There is some duplication between PIIRecord.feature_iter and the new clean method, but it's not obvious to me that abstracting some of that functionality is a gain over the readability cost to understanding how both methods work. I'm very much open to other options, if others have thoughts on how we could combine logic from these two methods.

I think this is okay for now; it's definitely readable as is, and we can always re-evaluate in the future if the duplicate code becomes more burdensome to maintain.

> The new clean method was purposefully put into the linking module to remove the potential for a circular import between schemas.algorithm and schemas.pii

I'm fine leaving it in the linking module, but I think the names "clean" and "matches" are confusing in the context of the rest of record linkage. I would suggest we make them more specific to skip values, like remove_skip_values instead of clean and matches_skip_values instead of matches.

> Using fnmatch over a regex has some real downsides when matching on "John Doe". I think our default algorithm config is going to end up having 6 values (e.g. `["John Doe", "John * Doe", "Jon Doe", "Jon * Doe", "Jane Doe", "Jane * Doe"]`). Open to hearing whether people think this is a good reason to switch to a regular expression, or maybe we support both somehow?

I think we could make it easier on users: if they provide "John Doe" for NAME, we generate the wildcards for them with a helper function, so that they don't have to be familiar with the wildcards, only the specific skip values. Thoughts?
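One way the suggested helper could look is sketched below. The name `expand_name_skip_value` and the rule it applies (insert a `*` between the first and last name of a two-word value, leave anything else untouched) are assumptions for illustration only and were not part of this PR.

```python
def expand_name_skip_value(value: str) -> list[str]:
    """Expand a plain "First Last" skip value into fnmatch-style patterns.

    Hypothetical helper: values that are not exactly two words are
    returned unchanged, since it is unclear where a wildcard should go.
    """
    parts = value.split()
    if len(parts) != 2:
        return [value]
    first, last = parts
    # Keep the exact value and add a variant that allows middle names.
    return [f"{first} {last}", f"{first} * {last}"]


# A user supplies only the plain names; the wildcard variants are generated.
user_values = ["John Doe", "Jane Doe"]
patterns = [p for value in user_values for p in expand_name_skip_value(value)]
```

With this approach the user-facing config stays a list of literal names, and the wildcard syntax remains an internal detail.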

Co-authored-by: Marcelle <53578688+m-goggins@users.noreply.github.com>

m-goggins · 2025-04-18T18:41:19Z

Thoughts on my other feedback?

> The new clean method was purposefully put into the linking module to remove the potential for a circular import between schemas.algorithm and schemas.pii

I'm fine leaving it in the linking module, but I think the names "clean" and "matches" are confusing in the context of the rest of record linkage. I would suggest we make them more specific to skip values, like remove_skip_values instead of clean and matches_skip_values instead of matches.

> Using fnmatch over a regex has some real downsides when matching on "John Doe". I think our default algorithm config is going to end up having 6 values (e.g. `["John Doe", "John * Doe", "Jon Doe", "Jon * Doe", "Jane Doe", "Jane * Doe"]`). Open to hearing whether people think this is a good reason to switch to a regular expression, or maybe we support both somehow?

I think we could make it easier on users: if they provide "John Doe" for NAME, we generate the wildcards for them with a helper function, so that they don't have to be familiar with the wildcards, only the specific skip values. This might also be worth another ticket.

bamader

Code looks mostly good--just the couple of things around module naming and modes, I think.

m-goggins

Looks good to me! I'm glad we have an approachable way to handle skip values to start and can always make changes based on user feedback.

bamader

Good stuff, Eric! I think this is a great middle ground starting point for us to keep things streamlined and intuitive.

ericbuckley added 5 commits April 7, 2025 19:52

adding skip values to algorithm

9ebfc97

draft version of cleaning a record with skip values

eba46ab

Merge branch 'main' into feature/233-process-skip-values

d3d0e39

test cases for cleaning

7539d7d

Merge branch 'main' into feature/233-process-skip-values

7055bf1

ericbuckley self-assigned this Apr 14, 2025

ericbuckley added 2 commits April 14, 2025 11:55

fix test case

f1c3bf0

fixing syntax

d9bd22f

ericbuckley added 7 commits April 14, 2025 16:40

adding info to developer guide on updating clean

e53a225

moving comment

0b700e1

Identifier.value should be required

8171897

revert change

0c39cdb

revert reformatting

21db405

elements in PiiRecord.race list should not be optional

db6c920

clean for feature NAME

8705c82

ericbuckley added the api New API feature label Apr 15, 2025

ericbuckley marked this pull request as ready for review April 15, 2025 00:19

ericbuckley requested review from bamader and m-goggins as code owners April 15, 2025 00:19

m-goggins reviewed Apr 17, 2025

Comment thread src/recordlinker/linking/link.py Outdated

Comment thread src/recordlinker/linking/link.py Outdated

Comment thread src/recordlinker/linking/link.py

ericbuckley and others added 2 commits April 17, 2025 12:56

Update src/recordlinker/linking/link.py

1d9db68

Co-authored-by: Marcelle <53578688+m-goggins@users.noreply.github.com>

changing the compare sign to accept two piirecords

af509f0

ericbuckley requested a review from m-goggins April 18, 2025 17:19

ericbuckley commented Apr 21, 2025

Comment thread src/recordlinker/linking/clean.py Outdated

ericbuckley commented Apr 21, 2025

Comment thread src/recordlinker/linking/clean.py Outdated

rename clean method to remove_skip_values

0f14a31

bamader reviewed Apr 21, 2025

Comment thread docs/developer_guide.md Outdated

Comment thread src/recordlinker/linking/clean.py Outdated

ericbuckley added 7 commits April 22, 2025 08:59

rename clean.matches

f4e1cac

ignoring nextjs build artifacts

a5f9de0

renaming clean module to skip_values

059ca4f

updating docs for skip values

011709f

removing ellipsis

bdc5ed5

Merge branch 'main' into feature/233-process-skip-values

a9d829c

removing fnmatch syntax from skip values

8f3a781

m-goggins approved these changes Apr 23, 2025

bamader approved these changes Apr 24, 2025

ericbuckley merged commit bf99120 into main Apr 24, 2025

ericbuckley deleted the feature/233-process-skip-values branch April 24, 2025 16:00

ericbuckley merged 24 commits into main from feature/233-process-skip-values

3 participants