Improve email regexp on edge cases#10601

Merged
sydney-runkle merged 3 commits into pydantic:main from AlekseyLobanov:fix.email-regex
Oct 11, 2024

Conversation

@AlekseyLobanov
Contributor

@AlekseyLobanov AlekseyLobanov commented Oct 10, 2024

  • Drastically improves performance on inputs like "<" + " " * N
  • The trailing spaces in the pattern are not needed because this group is stripped later. The spaces are also matched by . anyway.

Change Summary

I found that a single change to the email regexp eliminates the slowdown on certain invalid email strings. See the related issue for details.
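The kind of slowdown described here can be sketched with the standard library alone. The two patterns below are hypothetical, simplified stand-ins for the angle-bracket part of an email regexp (they are not copied from pydantic): one wraps the capture group in `\s*`, the other leaves whitespace to `.+` and relies on stripping the group afterwards. On an input such as `"<" + " " * N`, the first pattern has three adjacent quantifiers that can all consume spaces, so the engine tries many ways of splitting the spaces between them before failing.

```python
import re
import time
from typing import Optional

# Hypothetical stand-ins for the "<address>" part of an email pattern.
# Assumption: these only illustrate the backtracking behaviour; the real
# pattern in pydantic also handles display names and may differ in detail.
PADDED_GROUP = re.compile(r'<\s*(.+)\s*>')  # whitespace matched explicitly
PLAIN_GROUP = re.compile(r'<(.+)>')         # whitespace left to `.+`


def extract(pattern: re.Pattern, value: str) -> Optional[str]:
    """Return the stripped text inside the angle brackets, or None on no match."""
    match = pattern.fullmatch(value)
    return match.group(1).strip() if match else None


def time_failure(pattern: re.Pattern, n: int) -> float:
    """Seconds spent failing to match '<' followed by n spaces."""
    value = '<' + ' ' * n
    start = time.perf_counter()
    assert pattern.fullmatch(value) is None
    return time.perf_counter() - start


if __name__ == '__main__':
    # Timings depend on the machine; the growth rate is the point.
    for n in (50, 100, 200):
        print(n, time_failure(PADDED_GROUP, n), time_failure(PLAIN_GROUP, n))
```

Because the captured group is stripped in both variants, valid inputs such as `'< a@b.c >'` yield the same address either way; only the cost of rejecting the pathological input changes.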

Related issue number

Fixes #10600

Checklist

  • The pull request title is a good summary of the changes - it will be used in the changelog
  • Unit tests for the changes exist
  • Tests pass on CI
  • Documentation reflects the changes where applicable
  • My PR is ready to review, please add a comment including the phrase "please review" to assign reviewers

Selected Reviewer: @sydney-runkle

@github-actions github-actions bot added the relnotes-fix Used for bugfixes. label Oct 10, 2024
@AlekseyLobanov
Contributor Author

please review

@codspeed-hq

codspeed-hq bot commented Oct 10, 2024

CodSpeed Performance Report

Merging #10601 will not alter performance

Comparing AlekseyLobanov:fix.email-regex (e35c507) with main (c772b43)

Summary

✅ 38 untouched benchmarks

@AlekseyLobanov
Contributor Author

AlekseyLobanov commented Oct 10, 2024

How does performance change?
I used my own POC from #10600 and ran it as /usr/bin/time python pydantic-poc.py 500

  • Before: 5.35user 0.01system 0:05.37elapsed 99%CPU (0avgtext+0avgdata 36840maxresident)k
  • After: 0.20user 0.01system 0:00.22elapsed 99%CPU (0avgtext+0avgdata 37004maxresident)k

About a 25x speedup in elapsed time.
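As a quick check of that figure, the ratio can be computed from the two `/usr/bin/time` summary lines quoted above. This is a minimal sketch; the helper name `elapsed_seconds` is made up for illustration, and it only handles the `m:ss.ss` elapsed format shown here.

```python
import re


def elapsed_seconds(time_line: str) -> float:
    """Extract wall-clock seconds from a summary line containing 'M:SS.SSelapsed'."""
    match = re.search(r'(\d+):(\d+(?:\.\d+)?)elapsed', time_line)
    if match is None:
        raise ValueError('no elapsed field found')
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


before = '5.35user 0.01system 0:05.37elapsed 99%CPU (0avgtext+0avgdata 36840maxresident)k'
after = '0.20user 0.01system 0:00.22elapsed 99%CPU (0avgtext+0avgdata 37004maxresident)k'

# 5.37 s / 0.22 s, i.e. roughly 24x, consistent with "about 25x".
speedup = elapsed_seconds(before) / elapsed_seconds(after)
print(f'{speedup:.1f}x')
```

Note that the "after" run includes interpreter start-up and imports, so the ratio for the regexp match alone is likely larger than this whole-process figure.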

@github-actions
Contributor

github-actions bot commented Oct 10, 2024

Coverage report

This PR does not seem to contain any modification to coverable code.

Member

@Viicos Viicos left a comment

Thanks, this looks reasonable, but to be extra careful we'll wait for other reviews as well.

Could you add the following to the test_address_valid test?

        ('Samuel Colvin < s@muelcolvin.com>', 'Samuel Colvin', 's@muelcolvin.com'),
        ('Samuel Colvin <s@muelcolvin.com >', 'Samuel Colvin', 's@muelcolvin.com'),
        ('Samuel Colvin < s@muelcolvin.com >', 'Samuel Colvin', 's@muelcolvin.com'),

Comment thread pydantic/networks.py Outdated
Co-authored-by: Victorien <65306057+Viicos@users.noreply.github.com>
@AlekseyLobanov
Contributor Author

> Thanks, this looks reasonable but to be extra careful we'll wait for other reviews as well.

According to Wikipedia, this is a valid DoS attack vector. And at least some of the rate limiters I know of only kick in after the validation step.

> Could you add the following to the test_address_valid test?

I think the existing tests already cover these edge cases (spaces before/after the group). Should I still add yours?

        ('foo BAR <foobar@example.com >', 'foo BAR', 'foobar@example.com'),
        ('FOO bar   <foobar@example.com> ', 'FOO bar', 'foobar@example.com'),
        ('Whatever < foobar@example.com>', 'Whatever', 'foobar@example.com'),

@sydney-runkle sydney-runkle added the relnotes-performance Used for performance improvements. label Oct 11, 2024
@sydney-runkle
Contributor

Yep let's add those extra tests and fix the lints, but otherwise LGTM.

@Viicos
Member

Viicos commented Oct 11, 2024

> According to Wikipedia, this is a valid DoS attack vector. And at least some of the rate limiters I know of only kick in after the validation step.

I agree, I just wanted to be careful, as changing a regex can be a source of breaking changes.

> I think the existing tests already cover these edge cases (spaces before/after the group). Should I still add yours?

I missed those. In that case, maybe add only these after them:

        ('Whatever <foobar@example.com >', 'Whatever', 'foobar@example.com'),
        ('Whatever < foobar@example.com >', 'Whatever', 'foobar@example.com'),

These cover the name + email case with spaces surrounding the email.
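The test tuples discussed in this thread all have the shape (input, expected name, expected email). A minimal, standard-library sketch of how such cases can be checked is shown below. The parser `parse_name_email` is a hypothetical simplified stand-in, not pydantic's implementation; the actual test would call pydantic's own validation inside test_address_valid.

```python
import re

# Hypothetical simplified pattern: optional display name, then "<address>".
# Whitespace is not matched explicitly; both groups are stripped afterwards.
NAME_EMAIL = re.compile(r'([^<>]*)<(.+)>')


def parse_name_email(value: str) -> tuple:
    """Split 'Name <address>' into (name, address), tolerating stray spaces."""
    match = NAME_EMAIL.fullmatch(value.strip())
    if match is None:
        raise ValueError(f'not a valid name/email string: {value!r}')
    name, email = match.groups()
    return name.strip(), email.strip()


# (input, expected name, expected email), as quoted in the review comments.
CASES = [
    ('foo BAR <foobar@example.com >', 'foo BAR', 'foobar@example.com'),
    ('FOO bar   <foobar@example.com> ', 'FOO bar', 'foobar@example.com'),
    ('Whatever < foobar@example.com>', 'Whatever', 'foobar@example.com'),
    ('Whatever <foobar@example.com >', 'Whatever', 'foobar@example.com'),
    ('Whatever < foobar@example.com >', 'Whatever', 'foobar@example.com'),
]

for value, expected_name, expected_email in CASES:
    assert parse_name_email(value) == (expected_name, expected_email)
```

Together the last three cases exercise a space before the address, after it, and on both sides, which is the combination the added tests are meant to pin down.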
@AlekseyLobanov
Contributor Author

> I missed those. In that case, maybe add only these after them:

Done.

Contributor

@sydney-runkle sydney-runkle left a comment

Nice, thanks for the help here! We appreciate the thorough explanations / refs :).

@sydney-runkle sydney-runkle enabled auto-merge (squash) October 11, 2024 14:15
@sydney-runkle sydney-runkle merged commit 37d98a8 into pydantic:main Oct 11, 2024
Labels

ready for review relnotes-fix Used for bugfixes. relnotes-performance Used for performance improvements.

Development

Successfully merging this pull request may close these issues.

Email parsing slowdown on edgecases

3 participants