Normalize strings for comparison #247
Merged
…g/238-normalize-strings-for-comparison
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #247      +/-   ##
==========================================
+ Coverage   97.75%   97.77%   +0.02%
==========================================
  Files          32       33       +1
  Lines        1689     1711      +22
==========================================
+ Hits         1651     1673      +22
  Misses         38       38
ericbuckley reviewed Mar 11, 2025
bamader reviewed Mar 11, 2025
bamader (Collaborator) left a comment
Looks good overall, just one question: did you explore Python's `str.maketrans()` functionality? It's usually orders of magnitude more efficient than other string replacement mechanisms. Normalization is going to get called a lot, and that might help keep things smooth.
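For context on the suggestion above, here is a minimal sketch of what a `str.maketrans()`-based approach might look like. The helper name `normalize_with_table` and the choice of characters to delete (ASCII punctuation and whitespace only) are assumptions for illustration, not the code in this PR.

```python
import string

# Assumption: only ASCII punctuation and ASCII whitespace are deleted.
# The table is built once at import time and reused on every call.
_DELETE_TABLE = str.maketrans("", "", string.punctuation + string.whitespace)


def normalize_with_table(text: str) -> str:
    """Hypothetical helper: drop punctuation/whitespace, then lowercase."""
    return text.translate(_DELETE_TABLE).lower()


print(normalize_with_table("O'Brien-Smith, Jr."))
```

One trade-off of this sketch is that non-ASCII punctuation (for example, curly quotes) would pass through untouched unless it is added to the table explicitly.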
ericbuckley reviewed Mar 11, 2025
ericbuckley reviewed Mar 12, 2025
ericbuckley reviewed Mar 12, 2025
bamader pushed a commit that referenced this pull request Mar 19, 2025
## Description
This PR improves the way that RecordLinker normalizes strings for comparison in `feature_iter`. It adds the `normalize_text` utility function, which removes non-alphanumeric characters, converts to lowercase, and removes all whitespace (leading, trailing, and internal). It also adds a `field_validator` to `PIIRecord` so that leading and trailing whitespace is stripped from all string fields, and updates `feature_iter` to apply normalization to fields more robustly.

## Related Issues
#238 #243

## Additional Notes
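The behavior described for `normalize_text` can be sketched roughly as follows. This is an illustrative guess based only on the description above, not the merged implementation; in particular, treating "alphanumeric" as whatever `str.isalnum()` accepts (which includes non-ASCII letters and digits) is an assumption.

```python
def normalize_text(text: str) -> str:
    """Sketch: keep only alphanumeric characters, then lowercase.

    Whitespace is not alphanumeric, so leading, trailing, and internal
    whitespace is removed by the same filter.
    """
    # Assumption: str.isalnum() defines which characters survive.
    return "".join(ch for ch in text if ch.isalnum()).lower()


# Two differently formatted inputs normalize to the same comparison key.
print(normalize_text(" Mary-Ann  O'Neil ") == normalize_text("maryann oneil"))
```

With a function like this, values that differ only in case, punctuation, or spacing compare as equal, which is the stated goal of normalizing before comparison.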