fix: improved URL regex #230

Merged
bee-san merged 2 commits into main from improve-url
Nov 7, 2021

Conversation

@amadejpapez
Collaborator

@amadejpapez amadejpapez commented Nov 6, 2021

⚠ Pull Requests not made with this template will be automatically closed 🔥

Prerequisites

Why do we need this pull request?

This should fix a few issues we were seeing with URLs. I went through the regex and modified some parts. There may still be some unhandled cases, but with these changes I saw much better results.

I also added more examples, and https://www.google.com now matches fully.

I have written an explanation of the regex, from the start of the URL to the end, to make it easier and quicker to review. Feedback is welcome, so it can get even better. :)

(?i)(?:(?:https?|ftp):\/\/)?(?:\S+:\S+@)?(?:[a-z0-9-_~]+\.)*[a-z0-9-]{1,62}\.(?:COM|IO|BLOG|ORG|TECH)(?::\d{2,5})?(?:\/[a-z0-9-_~.]+)*(?:[?#]\S*)*\/?

  • http/https/ftp is still optional at the beginning.
  • Subdomains can contain [a-z0-9-_~] with . in-between. Previously this part was matched as a whole, which allowed something like wwww.....google.com to be valid.
  • The domain name can contain [a-z0-9-] and is 1-62 characters long.
  • The TLD must be a valid one from our list.
  • A port number (2-5 digits) can be specified.
  • The path can contain [a-z0-9-_~.] with / in-between.
  • If there is a ? or # character, it matches everything after it up to the next space or line break. I do not think these characters can be restricted any further.
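
To make the points above easier to check, here is a minimal sketch that exercises the regex with Python's `re` module. The pattern is copied from this description, so its TLD alternation is the shortened list shown here and not the project's full list. The helper name `is_whole_url` and the sample strings are made up for illustration and are not part of PyWhat; the sketch also uses `fullmatch`, which may differ from how PyWhat itself applies the pattern.

```python
import re

# Pattern copied from the PR description. The TLD alternation is the
# shortened list shown there, not the full list used by the project.
URL_PATTERN = re.compile(
    r"(?i)(?:(?:https?|ftp):\/\/)?(?:\S+:\S+@)?(?:[a-z0-9-_~]+\.)*"
    r"[a-z0-9-]{1,62}\.(?:COM|IO|BLOG|ORG|TECH)(?::\d{2,5})?"
    r"(?:\/[a-z0-9-_~.]+)*(?:[?#]\S*)*\/?"
)


def is_whole_url(text: str) -> bool:
    """Hypothetical helper: True only if the entire string matches."""
    return URL_PATTERN.fullmatch(text) is not None


# Illustrative inputs only; they are not taken from the project's tests.
samples = [
    "https://www.google.com",
    "example.io:8080/path/to/page?query=1#frag",
    "ftp://user:pass@files.example.org/",
    "wwww.....google.com",
    "google.xyz",
]

for sample in samples:
    print(f"{sample!r}: {is_whole_url(sample)}")
```

With whole-string matching, the first three samples match and the last two do not. Note that a search (as opposed to a full match) would still find the `google.com` substring inside `wwww.....google.com`; the change only stops the run of dots from being accepted as part of the URL.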

What GitHub issues does this fix?

Copy / paste of output

Please copy and paste the output of PyWhat with your new addition using an example that tests this addition below:

@codecov-commenter

codecov-commenter commented Nov 6, 2021

Codecov Report

Merging #230 (42491e8) into main (071a962) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main     #230   +/-   ##
=======================================
  Coverage   92.60%   92.60%           
=======================================
  Files          15       15           
  Lines        1217     1217           
=======================================
  Hits         1127     1127           
  Misses         90       90           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 071a962...42491e8. Read the comment docs.

@bee-san bee-san enabled auto-merge November 7, 2021 11:09
@bee-san bee-san merged commit a5a4a3b into main Nov 7, 2021
@bee-san bee-san deleted the improve-url branch November 7, 2021 11:16

@ghost ghost left a comment

Please change url generation script

@amadejpapez
Collaborator Author

Please change url generation script

What change is needed? Regex is no longer hard-coded in there.

@ghost

ghost commented Nov 7, 2021

Please change url generation script

What change is needed? Regex is no longer hard-coded in there.

Oh, that is great!

3 participants