Normalize strings for comparison #247

Merged
m-goggins merged 10 commits into main from bug/238-normalize-strings-for-comparison
Mar 12, 2025
Conversation

m-goggins commented Mar 11, 2025

Collaborator

Description

This PR improves how RecordLinker normalizes strings for comparison in feature_iter. It makes three changes: it adds the normalize_text utility function, which removes non-alphanumeric characters, converts to lowercase, and removes all whitespace (leading, trailing, and internal); it adds a field_validator to PIIRecord so that all string fields have leading and trailing whitespace stripped; and it updates feature_iter to apply normalization to fields more robustly.
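
For readers without the diff open, here is a minimal sketch of the behavior described above. It is not the code merged in this PR: the body of normalize_text is an assumption based only on the description, and strip_outer_whitespace is a hypothetical stand-in for the trimming that the PIIRecord validator performs.

```python
def normalize_text(value: str) -> str:
    """Lowercase `value` and keep only alphanumeric characters.

    Assumption: "alphanumeric" means whatever str.isalnum() accepts,
    which includes non-ASCII letters and digits. Whitespace and
    punctuation are dropped because they are not alphanumeric.
    """
    return "".join(ch for ch in value.lower() if ch.isalnum())


def strip_outer_whitespace(value: str) -> str:
    """Hypothetical helper: trim leading and trailing whitespace only."""
    return value.strip()


print(normalize_text("  O'Brien-Smith "))       # obriensmith
print(normalize_text("123 Main St."))           # 123mainst
print(strip_outer_whitespace("  Mary Ann  "))   # Mary Ann
```

The two steps differ on purpose in this sketch: trimming keeps internal spaces so the stored value stays readable, while the comparison form discards them.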

Related Issues

#238
#243

Additional Notes

codecov Bot commented Mar 11, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.77%. Comparing base (35b38bb) to head (e019342).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #247      +/-   ##
==========================================
+ Coverage   97.75%   97.77%   +0.02%     
==========================================
  Files          32       33       +1     
  Lines        1689     1711      +22     
==========================================
+ Hits         1651     1673      +22     
  Misses         38       38              

m-goggins marked this pull request as ready for review March 11, 2025 18:38
m-goggins self-assigned this Mar 11, 2025
Comment thread src/recordlinker/schemas/pii.py Outdated

bamader left a comment

Collaborator

Looks good overall. Just one question: did you explore Python's str.maketrans() functionality? It's usually orders of magnitude more efficient than other string replacement mechanisms. Normalization is going to get called a lot, and that might help keep things smooth.
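
To make the suggestion concrete, here is a rough sketch of what a str.maketrans() approach could look like. It is an illustration only, not code from this PR: the name normalize_with_translate and the choice to delete only ASCII punctuation and whitespace are assumptions.

```python
import string

# Build the table once at import time. The third argument to
# str.maketrans lists characters that translate() will delete.
# Assumption: only ASCII punctuation and whitespace need removing.
_DELETE_TABLE = str.maketrans("", "", string.punctuation + string.whitespace)


def normalize_with_translate(value: str) -> str:
    """Lowercase `value` and delete the characters listed in the table."""
    return value.lower().translate(_DELETE_TABLE)


print(normalize_with_translate(" O'Brien-Smith\t"))  # obriensmith
print(normalize_with_translate("123 Main St."))      # 123mainst
```

One trade-off to keep in mind: a translation table only removes the characters it lists, so non-ASCII punctuation such as a curly apostrophe would pass through unless it is added explicitly. Any speedup is worth confirming with timeit on representative inputs rather than assuming.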

Comment thread src/recordlinker/utils/normalize.py Outdated
Comment thread src/recordlinker/utils/normalize.py Outdated
Comment thread src/recordlinker/schemas/pii.py
@m-goggins m-goggins requested a review from ericbuckley March 12, 2025 20:59
Comment thread tests/unit/utils/test_normalize.py

ericbuckley left a comment

Collaborator

🚀

m-goggins merged commit 74c7d11 into main Mar 12, 2025
m-goggins deleted the bug/238-normalize-strings-for-comparison branch March 12, 2025 22:43
bamader pushed a commit that referenced this pull request Mar 19, 2025
3 participants