Is there a valid reason to check data: URIs? #661

@polarathene

Description

Not sure if this is a bug. data: URIs in an HTML document are identified as links (ok), but they are only ignored at the checking stage (--dump does not ignore them, which is fine), and the reported error is "Unsupported The given URI is invalid".

I'm not sure when it would be valid to check these? A data: URI is local data AFAIK, so it is always valid. The contents might contain additional links that could be parsed, similar to a linked document (which I don't think is presently supported). In this case a w3.org link is present, which still seems to get extracted from its contents (and excluded). I assume that extraction is unrelated to being part of the URI, and that other encodings such as base64 would not be parsed to check links?
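To illustrate the encoding question, here is a rough Python sketch. It is not lychee's extractor: the `URL_RE` pattern and the `find_urls` helper are assumptions made up for this example. It only shows why a plain-text URL scan would see the w3.org namespace inside a percent-encoded data: URI but not inside a base64 one.

```python
import base64
import re

# Naive URL pattern, for illustration only (NOT what lychee uses).
URL_RE = re.compile(r"https?://[^\s'\"<>]+")

svg = "<svg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 147 40' />"

# Percent-encoded form as in the report: only '<' and '>' are escaped,
# so the namespace URL stays readable as plain text.
percent_uri = "data:image/svg+xml," + svg.replace("<", "%3c").replace(">", "%3e")

# Base64 form of the same payload: the URL is no longer visible as text.
b64_payload = base64.b64encode(svg.encode()).decode()
base64_uri = "data:image/svg+xml;base64," + b64_payload


def find_urls(text):
    """Return every http(s) URL the naive pattern finds in text."""
    return URL_RE.findall(text)


print(find_urls(percent_uri))  # the namespace URL is found
print(find_urls(base64_uri))   # nothing is found
# Only after decoding does the URL become visible again.
print(find_urls(base64.b64decode(b64_payload).decode()))
```

If the extraction really is a text scan over the document, that would explain why the w3.org link shows up as its own (excluded) entry independently of the data: URI that contains it.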

Just raising an issue to ask whether it actually makes sense to parse these URI values for checking. I can understand that it may still be useful with --dump to collect such URIs from a document, and that one can opt out with --exclude 'data:.*' 👍
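As a sketch of what "not checking" could look like, the Python below skips URIs by scheme before any request would be made, and compares that with a regex in the style of the `data:.*` exclude. The names `LOCAL_SCHEMES`, `needs_check` and `excluded_by_pattern` are hypothetical, the example.com URL is made up, and the regex behaviour shown is Python's `re.search`, which may differ from how lychee applies its exclude patterns.

```python
import re
from urllib.parse import urlsplit

# Hypothetical set of schemes that carry their data inline (no request needed).
LOCAL_SCHEMES = {"data"}

# Same pattern text as the --exclude example, applied with Python's re.search.
DATA_PATTERN = re.compile(r"data:.*")


def needs_check(uri):
    """True if the URI's scheme is not one of the inline-data schemes."""
    return urlsplit(uri).scheme.lower() not in LOCAL_SCHEMES


def excluded_by_pattern(uri):
    """True if the unanchored pattern matches anywhere in the URI."""
    return DATA_PATTERN.search(uri) is not None


links = [
    "https://www.netlify.com/",
    "data:image/svg+xml,%3csvg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 147 40' /%3e",
    "http://localhost/assets/img/bg-water.webp",
]

to_check = [u for u in links if needs_check(u)]
print(to_check)

# An unanchored search also hits "data:" inside a query string (made-up URL).
print(excluded_by_pattern("https://example.com/?next=data:x"))
```

A scheme check only looks at the part before the first colon, so it cannot accidentally match "data:" appearing later in an ordinary http(s) URL, which an unanchored pattern search can.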

Repeating the same content/URI that was just listed for the error also seems redundant?


Reproduction

I tested lychee against this 404 page (Source on GitHub). That page is intended as a 404 page (with response), however, so I wasn't sure if lychee would parse it (it does for inputs, which is fine). Below I tested with a local copy and a webserver at localhost instead:

$ lychee localhost/index.html

🔍 6 Total ✅ 5 OK 🚫 0 Errors 💤 1 Excluded

# Verbose:
$ lychee -v localhost/index.html

? [IGNORED] data:image/svg+xml,%3csvg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 147 40' /%3e | Unsupported The given URI is invalid: data:image/svg+xml,%3csvg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 147 40' /%3e
? [IGNORED] data:image/svg+xml,%3csvg viewBox='20 244 512 512' xmlns='http://www.w3.org/2000/svg'%3e%3cpath d='M122 490h172l70-27a10 10 0 0 0 6-12l-55-146a10 10 0 0 0-13-5L90 380a10 10 0 0 0-6 12z' fill='%23f3ac47'/%3e%3cpath d='m294 490 70-27a10 10 0 0 0 5-5l-149-54-46 86z' fill='%23f19a3d'/%3e%3cpath d='m84 387 150 53 75-140a10 10 0 0 0-7 0L90 380a10 10 0 0 0-6 6z' fill='%23ffd15c'/%3e%3cg%3e%3cpath d='M523 462c-1-1-31-20-59-15-7-29-33-47-35-48-4-3-9-2-13 1-2 3-20 23-16 69 1 3 0 7-2 9-1 2-4 3-7 3H36.6c-2.6 0-5.4 1-7.6 3-2 2-3 5-3 8 0 160 117 177 168 177 129 0 219-78 258-148 52-8 74-43 75-45 3-5 1-11-4-14z' fill='%23303c42'/%3e%3cpath d='M445 501c-4 1-7 3-8 6-35 65-120 142-243 142-54 0-142-20-147-147h344c9 0 17-4 23-10 6-7 9-16 8-25-2-21 1-35 4-43 8 7 19 20 19 36 0 4 1 7 5 9 3 2 6 2 10 1 12-7 30-1 42 4-9 10-27 24-57 27z' fill='%2342a5f5'/%3e%3cpath d='M445 491c-4 0-7 2-8 5-35 66-120 142-243 142-52 0-137-18-146-136h-1c5 127 93 147 147 147 123 0 208-77 243-142 1-3 4-5 8-6 30-3 48-17 57-27-3-1-5-2-8-3-10 8-26 17-49 20z' opacity='.1'/%3e%3ccircle cx='132' cy='565' r='21' fill='%23303c42'/%3e%3ccircle cx='141' cy='559' r='6.76' fill='white'/%3e%3c/g%3e%3c/svg%3e | Unsupported The given URI is invalid: data:image/svg+xml,%3csvg viewBox='20 244 512 512' xmlns='http://www.w3.org/2000/svg'%3e%3cpath d='M122 490h172l70-27a10 10 0 0 0 6-12l-55-146a10 10 0 0 0-13-5L90 380a10 10 0 0 0-6 12z' fill='%23f3ac47'/%3e%3cpath d='m294 490 70-27a10 10 0 0 0 5-5l-149-54-46 86z' fill='%23f19a3d'/%3e%3cpath d='m84 387 150 53 75-140a10 10 0 0 0-7 0L90 380a10 10 0 0 0-6 6z' fill='%23ffd15c'/%3e%3cg%3e%3cpath d='M523 462c-1-1-31-20-59-15-7-29-33-47-35-48-4-3-9-2-13 1-2 3-20 23-16 69 1 3 0 7-2 9-1 2-4 3-7 3H36.6c-2.6 0-5.4 1-7.6 3-2 2-3 5-3 8 0 160 117 177 168 177 129 0 219-78 258-148 52-8 74-43 75-45 3-5 1-11-4-14z' fill='%23303c42'/%3e%3cpath d='M445 501c-4 1-7 3-8 6-35 65-120 142-243 142-54 0-142-20-147-147h344c9 0 17-4 23-10 6-7 9-16 8-25-2-21 1-35 4-43 8 7 19 20 19 36 0 4 1 7 5 9 3 2 6 2 10 1 12-7 30-1 42 4-9 10-27 24-57 27z' fill='%2342a5f5'/%3e%3cpath d='M445 491c-4 0-7 2-8 5-35 66-120 142-243 142-52 0-137-18-146-136h-1c5 127 93 147 147 147 123 0 208-77 243-142 1-3 4-5 8-6 30-3 48-17 57-27-3-1-5-2-8-3-10 8-26 17-49 20z' opacity='.1'/%3e%3ccircle cx='132' cy='565' r='21' fill='%23303c42'/%3e%3ccircle cx='141' cy='559' r='6.76' fill='white'/%3e%3c/g%3e%3c/svg%3e
? [EXCLUDED] http://www.w3.org/2000/svg | Excluded
✔ [200] http://localhost/assets/img/bg-water.webp
✔ [200] http://localhost/assets/img/bg-water.webp
✔ [200] https://docker-mailserver.github.io/docker-mailserver/edge/
✔ [200] https://www.netlify.com/
✔ [200] https://github.com/docker-mailserver/docker-mailserver

🔍 6 Total ✅ 5 OK 🚫 0 Errors 💤 1 Excluded

--dump:

$ lychee --dump localhost/index.html

https://www.netlify.com/
https://docker-mailserver.github.io/docker-mailserver/edge/
https://github.com/docker-mailserver/docker-mailserver
data:image/svg+xml,%3csvg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 147 40' /%3e
data:image/svg+xml,%3csvg viewBox='20 244 512 512' xmlns='http://www.w3.org/2000/svg'%3e%3cpath d='M122 490h172l70-27a10 10 0 0 0 6-12l-55-146a10 10 0 0 0-13-5L90 380a10 10 0 0 0-6 12z' fill='%23f3ac47'/%3e%3cpath d='m294 490 70-27a10 10 0 0 0 5-5l-149-54-46 86z' fill='%23f19a3d'/%3e%3cpath d='m84 387 150 53 75-140a10 10 0 0 0-7 0L90 380a10 10 0 0 0-6 6z' fill='%23ffd15c'/%3e%3cg%3e%3cpath d='M523 462c-1-1-31-20-59-15-7-29-33-47-35-48-4-3-9-2-13 1-2 3-20 23-16 69 1 3 0 7-2 9-1 2-4 3-7 3H36.6c-2.6 0-5.4 1-7.6 3-2 2-3 5-3 8 0 160 117 177 168 177 129 0 219-78 258-148 52-8 74-43 75-45 3-5 1-11-4-14z' fill='%23303c42'/%3e%3cpath d='M445 501c-4 1-7 3-8 6-35 65-120 142-243 142-54 0-142-20-147-147h344c9 0 17-4 23-10 6-7 9-16 8-25-2-21 1-35 4-43 8 7 19 20 19 36 0 4 1 7 5 9 3 2 6 2 10 1 12-7 30-1 42 4-9 10-27 24-57 27z' fill='%2342a5f5'/%3e%3cpath d='M445 491c-4 0-7 2-8 5-35 66-120 142-243 142-52 0-137-18-146-136h-1c5 127 93 147 147 147 123 0 208-77 243-142 1-3 4-5 8-6 30-3 48-17 57-27-3-1-5-2-8-3-10 8-26 17-49 20z' opacity='.1'/%3e%3ccircle cx='132' cy='565' r='21' fill='%23303c42'/%3e%3ccircle cx='141' cy='559' r='6.76' fill='white'/%3e%3c/g%3e%3c/svg%3e
http://localhost/assets/img/bg-water.webp
http://localhost/assets/img/bg-water.webp

Verbose --dump (data URIs not ignored or considered invalid yet, just scraped?):

$ lychee -v --dump localhost/index.html

http://www.w3.org/2000/svg (http://localhost/index.html) [excluded]
https://docker-mailserver.github.io/docker-mailserver/edge/ (http://localhost/index.html)
http://localhost/assets/img/bg-water.webp (http://localhost/index.html)
https://www.netlify.com/ (http://localhost/index.html)
data:image/svg+xml,%3csvg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 147 40' /%3e (http://localhost/index.html)
data:image/svg+xml,%3csvg viewBox='20 244 512 512' xmlns='http://www.w3.org/2000/svg'%3e%3cpath d='M122 490h172l70-27a10 10 0 0 0 6-12l-55-146a10 10 0 0 0-13-5L90 380a10 10 0 0 0-6 12z' fill='%23f3ac47'/%3e%3cpath d='m294 490 70-27a10 10 0 0 0 5-5l-149-54-46 86z' fill='%23f19a3d'/%3e%3cpath d='m84 387 150 53 75-140a10 10 0 0 0-7 0L90 380a10 10 0 0 0-6 6z' fill='%23ffd15c'/%3e%3cg%3e%3cpath d='M523 462c-1-1-31-20-59-15-7-29-33-47-35-48-4-3-9-2-13 1-2 3-20 23-16 69 1 3 0 7-2 9-1 2-4 3-7 3H36.6c-2.6 0-5.4 1-7.6 3-2 2-3 5-3 8 0 160 117 177 168 177 129 0 219-78 258-148 52-8 74-43 75-45 3-5 1-11-4-14z' fill='%23303c42'/%3e%3cpath d='M445 501c-4 1-7 3-8 6-35 65-120 142-243 142-54 0-142-20-147-147h344c9 0 17-4 23-10 6-7 9-16 8-25-2-21 1-35 4-43 8 7 19 20 19 36 0 4 1 7 5 9 3 2 6 2 10 1 12-7 30-1 42 4-9 10-27 24-57 27z' fill='%2342a5f5'/%3e%3cpath d='M445 491c-4 0-7 2-8 5-35 66-120 142-243 142-52 0-137-18-146-136h-1c5 127 93 147 147 147 123 0 208-77 243-142 1-3 4-5 8-6 30-3 48-17 57-27-3-1-5-2-8-3-10 8-26 17-49 20z' opacity='.1'/%3e%3ccircle cx='132' cy='565' r='21' fill='%23303c42'/%3e%3ccircle cx='141' cy='559' r='6.76' fill='white'/%3e%3c/g%3e%3c/svg%3e (http://localhost/index.html)
http://localhost/assets/img/bg-water.webp (http://localhost/index.html)
https://github.com/docker-mailserver/docker-mailserver (http://localhost/index.html)

Listing/checking the same link multiple times

In both the actual link checking and with the --dump option, I noticed that lychee outputs the link http://localhost/assets/img/bg-water.webp twice. Although the output shows an absolute URL, the link in the document itself is relative (assets/img/bg-water.webp); I'm not sure if that affects any of the logic involved.

I can understand that some users may want to have duplicate links listed (perhaps more so with --dump, e.g. for statistics), but it seemed a bit odd for the link to appear twice in the report 😅
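For reference, a small Python sketch of resolving the relative link against the page URL and then collapsing duplicates while keeping first-seen order. The `hrefs` list is a hypothetical stand-in for what the page contains, and none of this reflects how lychee handles it internally.

```python
from collections import Counter
from urllib.parse import urljoin

base = "http://localhost/index.html"

# Hypothetical: the same relative reference appears twice in the document.
hrefs = [
    "assets/img/bg-water.webp",
    "https://www.netlify.com/",
    "assets/img/bg-water.webp",
]

# Resolve every reference against the document URL.
resolved = [urljoin(base, h) for h in hrefs]

# Order-preserving de-duplication: dict keys keep insertion order.
unique = list(dict.fromkeys(resolved))

# Occurrence counts remain available for anyone who wants statistics.
counts = Counter(resolved)

print(unique)
print(counts["http://localhost/assets/img/bg-water.webp"])
```

Keeping the counts separate from the de-duplicated list would let a report show each link once while a dump-style listing could still expose how often it occurs.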

I assume that without --cache the request is made twice. However, the option currently seems to be broken, so I could not verify whether it changes behaviour within the same run in which the cache is built, or whether it alters the output at all. The first run with the option still listed the same link twice.

Metadata

Labels: enhancement (New feature or request)