normalize telecom model_validator#269

Merged
m-goggins merged 17 commits into main from feature/248-validate-and-normalize-telecom-data
Apr 1, 2025
Conversation

@m-goggins m-goggins commented Mar 31, 2025

Description

This PR adds a model validator to the Telecom class that strips leading/trailing whitespace from emails and normalizes phone numbers to E.164 strings, i.e., including the country code but excluding extensions. In feature_iter, only the national number (no country code) is used for comparison, so that Jaro-Winkler does not over-index on similarities in country codes. For emails, feature_iter applies no text normalization, which preserves special characters for comparison.
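To illustrate the country-code concern, here is a minimal sketch of a textbook Jaro-Winkler score in pure Python. It is not necessarily the implementation this project uses, and the function names and example phone numbers are made up for illustration. It shows that two different numbers score higher when both carry the same "+1" prefix than when only their national numbers are compared.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity plus a bonus for a shared prefix of up to max_prefix chars."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


# Two different (made-up) numbers, compared with and without the country code.
with_country_code = jaro_winkler("+12125550100", "+14155550199")
national_only = jaro_winkler("2125550100", "4155550199")
print(with_country_code > national_only)
```

The shared "+1" contributes both two extra matching characters and a prefix bonus, which is the inflation that comparing national numbers avoids.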

Storing:

  • email: leading/trailing whitespace removed
  • phone number: normalized to E.164 where possible; otherwise the raw number is stored

Comparing:

  • email: same as storage
  • phone number: the national number (no country code), with normalized_text() applied to remove any remaining special characters. I primarily kept normalized_text() as a fallback for numbers that could not be normalized to E.164 during ingestion.
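The storing and comparing rules above can be sketched roughly as follows. This is a stdlib-only illustration, not the code in this PR: the dataclass stands in for the real Telecom model and its validator, `to_e164` and `phone_feature` are hypothetical helpers, a regex digit filter stands in for normalized_text(), and the sketch assumes that numbers without a country code belong to country code 1. A real implementation would more likely rely on a dedicated phone-parsing library.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumption for this sketch only: numbers lacking a country code use "1".
DEFAULT_COUNTRY_CODE = "1"
# Trailing extension such as "x89", "ext. 89", or "extension 89".
_EXTENSION = re.compile(r"(?:ext\.?|extension|x)\s*\d+\s*$", re.IGNORECASE)


def to_e164(raw: str) -> Optional[str]:
    """Return a simplified E.164 string, or None if normalization fails."""
    without_ext = _EXTENSION.sub("", raw.strip())
    has_plus = without_ext.startswith("+")
    digits = re.sub(r"\D", "", without_ext)
    if has_plus and 8 <= len(digits) <= 15:
        return "+" + digits
    if len(digits) == 10:
        return "+" + DEFAULT_COUNTRY_CODE + digits
    if len(digits) == 11 and digits.startswith(DEFAULT_COUNTRY_CODE):
        return "+" + digits
    return None


@dataclass
class Telecom:
    """Stand-in for the real model; normalization runs on construction."""

    email: Optional[str] = None
    phone: Optional[str] = None

    def __post_init__(self) -> None:
        if self.email is not None:
            self.email = self.email.strip()
        if self.phone is not None:
            # Store E.164 where possible, otherwise keep the raw value.
            self.phone = to_e164(self.phone) or self.phone


def phone_feature(stored: str) -> str:
    """Comparison value: national number with special characters removed.

    Only the default country code is stripped here; other country codes
    would need a lookup table, which this sketch does not include.
    """
    if stored.startswith("+" + DEFAULT_COUNTRY_CODE) and len(stored) == 12:
        stored = stored[2:]
    return re.sub(r"\D", "", stored)


telecom = Telecom(email="  jane@example.com ", phone="(555) 123-4567 x89")
print(telecom.email, telecom.phone, phone_feature(telecom.phone))
```

The final digit filter in `phone_feature` covers the fallback case described above, where a raw number was stored because it could not be normalized.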

Related Issues

Fixes #248

@m-goggins m-goggins self-assigned this Mar 31, 2025
@m-goggins m-goggins linked an issue Mar 31, 2025 that may be closed by this pull request

codecov Bot commented Mar 31, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.83%. Comparing base (b5dc520) to head (42cbc1f).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #269      +/-   ##
==========================================
- Coverage   97.84%   97.83%   -0.01%     
==========================================
  Files          33       33              
  Lines        1807     1805       -2     
==========================================
- Hits         1768     1766       -2     
  Misses         39       39              

@m-goggins m-goggins marked this pull request as ready for review March 31, 2025 18:49
Comment thread src/recordlinker/schemas/pii.py
Comment thread src/recordlinker/schemas/pii.py Outdated
Comment thread src/recordlinker/schemas/pii.py Outdated
@m-goggins m-goggins requested a review from ericbuckley March 31, 2025 19:53
Comment thread src/recordlinker/schemas/pii.py Outdated
Comment thread src/recordlinker/schemas/pii.py
@m-goggins m-goggins requested a review from ericbuckley April 1, 2025 01:31
Comment thread src/recordlinker/schemas/pii.py
Comment thread src/recordlinker/schemas/pii.py
ericbuckley previously approved these changes Apr 1, 2025

@ericbuckley ericbuckley left a comment

🚀

…ature/248-validate-and-normalize-telecom-data
Comment thread tests/unit/schemas/test_pii.py
@m-goggins m-goggins merged commit e2f9c50 into main Apr 1, 2025
@m-goggins m-goggins deleted the feature/248-validate-and-normalize-telecom-data branch April 1, 2025 19:43
Development

Successfully merging this pull request may close these issues.

validate and normalize telecom data

2 participants