Not sure if this is a bug. data: URIs in an HTML document are identified as links (ok) and are ignored only when checking (--dump does not ignore them, which is fine), yet the message reported for them is "Unsupported The given URI is invalid".
I'm not sure when it would be valid to check them? A data: URI is local data AFAIK, so always valid. The contents might contain additional links that could be parsed, similar to a linked document (which I don't think is presently supported). In this case a w3.org link is present, which still seems to get extracted from its contents (and then excluded). I assume that extraction is unrelated to it being part of the URI, and that other encodings such as base64 would not be parsed to check links?
Just raising an issue to ask whether it actually makes sense to parse these URI values for checking. I can understand that it may still be useful with --dump to collect such URIs from a document, and that one can opt out with --exclude 'data:.*' 👍
Repeating the same content/URI that was just listed for the error also seems redundant?
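To make the data: URI question above concrete, here is a minimal Python sketch of what "parsing the payload for links" could look like. It is not lychee's implementation; the helper name `links_in_data_uri` and the simple URL regex are assumptions made purely for illustration.

```python
import base64
import re
from urllib.parse import unquote

# Deliberately simple URL pattern (an assumption, not lychee's extractor):
# stop at whitespace, quotes, and angle brackets.
URL_RE = re.compile(r"https?://[^\s'\"<>]+")


def links_in_data_uri(uri: str) -> list[str]:
    """Return http(s) URLs found in the decoded payload of a data: URI.

    Hypothetical helper for illustration only.
    """
    if not uri.startswith("data:"):
        return []
    # A data: URI has the shape data:[<mediatype>][;base64],<payload>
    header, sep, payload = uri[len("data:"):].partition(",")
    if not sep:
        return []  # malformed: no comma between header and payload
    if header.endswith(";base64"):
        try:
            text = base64.b64decode(payload).decode("utf-8", errors="replace")
        except ValueError:  # binascii.Error is a subclass of ValueError
            return []
    else:
        # Percent-encoded payload, as in the SVG examples in this report
        text = unquote(payload)
    return URL_RE.findall(text)
```

With this sketch, the short SVG data: URI from the report yields `http://www.w3.org/2000/svg`. A naive scan of the raw, undecoded text finds the same URL, since it is not percent-encoded, but it would miss anything inside a base64 payload, which is the distinction I was asking about.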
Reproduction
I tested lychee against this 404 page (Source on GitHub). It is intended to be served as a 404 page (with response) however, so I wasn't sure if lychee would parse it (it does for inputs, which is fine). Below I tested with a local copy and a webserver at localhost instead:
$ lychee localhost/index.html
🔍 6 Total ✅ 5 OK 🚫 0 Errors 💤 1 Excluded
# Verbose:
$ lychee -v localhost/index.html
? [IGNORED] data:image/svg+xml,%3csvg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 147 40' /%3e | Unsupported The given URI is invalid: data:image/svg+xml,%3csvg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 147 40' /%3e
? [IGNORED] data:image/svg+xml,%3csvg viewBox='20 244 512 512' xmlns='http://www.w3.org/2000/svg'%3e%3cpath d='M122 490h172l70-27a10 10 0 0 0 6-12l-55-146a10 10 0 0 0-13-5L90 380a10 10 0 0 0-6 12z' fill='%23f3ac47'/%3e%3cpath d='m294 490 70-27a10 10 0 0 0 5-5l-149-54-46 86z' fill='%23f19a3d'/%3e%3cpath d='m84 387 150 53 75-140a10 10 0 0 0-7 0L90 380a10 10 0 0 0-6 6z' fill='%23ffd15c'/%3e%3cg%3e%3cpath d='M523 462c-1-1-31-20-59-15-7-29-33-47-35-48-4-3-9-2-13 1-2 3-20 23-16 69 1 3 0 7-2 9-1 2-4 3-7 3H36.6c-2.6 0-5.4 1-7.6 3-2 2-3 5-3 8 0 160 117 177 168 177 129 0 219-78 258-148 52-8 74-43 75-45 3-5 1-11-4-14z' fill='%23303c42'/%3e%3cpath d='M445 501c-4 1-7 3-8 6-35 65-120 142-243 142-54 0-142-20-147-147h344c9 0 17-4 23-10 6-7 9-16 8-25-2-21 1-35 4-43 8 7 19 20 19 36 0 4 1 7 5 9 3 2 6 2 10 1 12-7 30-1 42 4-9 10-27 24-57 27z' fill='%2342a5f5'/%3e%3cpath d='M445 491c-4 0-7 2-8 5-35 66-120 142-243 142-52 0-137-18-146-136h-1c5 127 93 147 147 147 123 0 208-77 243-142 1-3 4-5 8-6 30-3 48-17 57-27-3-1-5-2-8-3-10 8-26 17-49 20z' opacity='.1'/%3e%3ccircle cx='132' cy='565' r='21' fill='%23303c42'/%3e%3ccircle cx='141' cy='559' r='6.76' fill='white'/%3e%3c/g%3e%3c/svg%3e | Unsupported The given URI is invalid: data:image/svg+xml,%3csvg viewBox='20 244 512 512' xmlns='http://www.w3.org/2000/svg'%3e%3cpath d='M122 490h172l70-27a10 10 0 0 0 6-12l-55-146a10 10 0 0 0-13-5L90 380a10 10 0 0 0-6 12z' fill='%23f3ac47'/%3e%3cpath d='m294 490 70-27a10 10 0 0 0 5-5l-149-54-46 86z' fill='%23f19a3d'/%3e%3cpath d='m84 387 150 53 75-140a10 10 0 0 0-7 0L90 380a10 10 0 0 0-6 6z' fill='%23ffd15c'/%3e%3cg%3e%3cpath d='M523 462c-1-1-31-20-59-15-7-29-33-47-35-48-4-3-9-2-13 1-2 3-20 23-16 69 1 3 0 7-2 9-1 2-4 3-7 3H36.6c-2.6 0-5.4 1-7.6 3-2 2-3 5-3 8 0 160 117 177 168 177 129 0 219-78 258-148 52-8 74-43 75-45 3-5 1-11-4-14z' fill='%23303c42'/%3e%3cpath d='M445 501c-4 1-7 3-8 6-35 65-120 142-243 142-54 0-142-20-147-147h344c9 0 17-4 23-10 6-7 9-16 8-25-2-21 1-35 4-43 8 7 19 20 19 36 0 4 1 7 5 9 3 2 6 2 
10 1 12-7 30-1 42 4-9 10-27 24-57 27z' fill='%2342a5f5'/%3e%3cpath d='M445 491c-4 0-7 2-8 5-35 66-120 142-243 142-52 0-137-18-146-136h-1c5 127 93 147 147 147 123 0 208-77 243-142 1-3 4-5 8-6 30-3 48-17 57-27-3-1-5-2-8-3-10 8-26 17-49 20z' opacity='.1'/%3e%3ccircle cx='132' cy='565' r='21' fill='%23303c42'/%3e%3ccircle cx='141' cy='559' r='6.76' fill='white'/%3e%3c/g%3e%3c/svg%3e
? [EXCLUDED] http://www.w3.org/2000/svg | Excluded
✔ [200] http://localhost/assets/img/bg-water.webp
✔ [200] http://localhost/assets/img/bg-water.webp
✔ [200] https://docker-mailserver.github.io/docker-mailserver/edge/
✔ [200] https://www.netlify.com/
✔ [200] https://github.com/docker-mailserver/docker-mailserver
🔍 6 Total ✅ 5 OK 🚫 0 Errors 💤 1 Excluded
--dump:
$ lychee --dump localhost/index.html
https://www.netlify.com/
https://docker-mailserver.github.io/docker-mailserver/edge/
https://github.com/docker-mailserver/docker-mailserver
data:image/svg+xml,%3csvg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 147 40' /%3e
data:image/svg+xml,%3csvg viewBox='20 244 512 512' xmlns='http://www.w3.org/2000/svg'%3e%3cpath d='M122 490h172l70-27a10 10 0 0 0 6-12l-55-146a10 10 0 0 0-13-5L90 380a10 10 0 0 0-6 12z' fill='%23f3ac47'/%3e%3cpath d='m294 490 70-27a10 10 0 0 0 5-5l-149-54-46 86z' fill='%23f19a3d'/%3e%3cpath d='m84 387 150 53 75-140a10 10 0 0 0-7 0L90 380a10 10 0 0 0-6 6z' fill='%23ffd15c'/%3e%3cg%3e%3cpath d='M523 462c-1-1-31-20-59-15-7-29-33-47-35-48-4-3-9-2-13 1-2 3-20 23-16 69 1 3 0 7-2 9-1 2-4 3-7 3H36.6c-2.6 0-5.4 1-7.6 3-2 2-3 5-3 8 0 160 117 177 168 177 129 0 219-78 258-148 52-8 74-43 75-45 3-5 1-11-4-14z' fill='%23303c42'/%3e%3cpath d='M445 501c-4 1-7 3-8 6-35 65-120 142-243 142-54 0-142-20-147-147h344c9 0 17-4 23-10 6-7 9-16 8-25-2-21 1-35 4-43 8 7 19 20 19 36 0 4 1 7 5 9 3 2 6 2 10 1 12-7 30-1 42 4-9 10-27 24-57 27z' fill='%2342a5f5'/%3e%3cpath d='M445 491c-4 0-7 2-8 5-35 66-120 142-243 142-52 0-137-18-146-136h-1c5 127 93 147 147 147 123 0 208-77 243-142 1-3 4-5 8-6 30-3 48-17 57-27-3-1-5-2-8-3-10 8-26 17-49 20z' opacity='.1'/%3e%3ccircle cx='132' cy='565' r='21' fill='%23303c42'/%3e%3ccircle cx='141' cy='559' r='6.76' fill='white'/%3e%3c/g%3e%3c/svg%3e
http://localhost/assets/img/bg-water.webp
http://localhost/assets/img/bg-water.webp
Verbose --dump (data URIs not ignored or considered invalid yet, just scraped?):
$ lychee -v --dump localhost/index.html
http://www.w3.org/2000/svg (http://localhost/index.html) [excluded]
https://docker-mailserver.github.io/docker-mailserver/edge/ (http://localhost/index.html)
http://localhost/assets/img/bg-water.webp (http://localhost/index.html)
https://www.netlify.com/ (http://localhost/index.html)
data:image/svg+xml,%3csvg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 147 40' /%3e (http://localhost/index.html)
data:image/svg+xml,%3csvg viewBox='20 244 512 512' xmlns='http://www.w3.org/2000/svg'%3e%3cpath d='M122 490h172l70-27a10 10 0 0 0 6-12l-55-146a10 10 0 0 0-13-5L90 380a10 10 0 0 0-6 12z' fill='%23f3ac47'/%3e%3cpath d='m294 490 70-27a10 10 0 0 0 5-5l-149-54-46 86z' fill='%23f19a3d'/%3e%3cpath d='m84 387 150 53 75-140a10 10 0 0 0-7 0L90 380a10 10 0 0 0-6 6z' fill='%23ffd15c'/%3e%3cg%3e%3cpath d='M523 462c-1-1-31-20-59-15-7-29-33-47-35-48-4-3-9-2-13 1-2 3-20 23-16 69 1 3 0 7-2 9-1 2-4 3-7 3H36.6c-2.6 0-5.4 1-7.6 3-2 2-3 5-3 8 0 160 117 177 168 177 129 0 219-78 258-148 52-8 74-43 75-45 3-5 1-11-4-14z' fill='%23303c42'/%3e%3cpath d='M445 501c-4 1-7 3-8 6-35 65-120 142-243 142-54 0-142-20-147-147h344c9 0 17-4 23-10 6-7 9-16 8-25-2-21 1-35 4-43 8 7 19 20 19 36 0 4 1 7 5 9 3 2 6 2 10 1 12-7 30-1 42 4-9 10-27 24-57 27z' fill='%2342a5f5'/%3e%3cpath d='M445 491c-4 0-7 2-8 5-35 66-120 142-243 142-52 0-137-18-146-136h-1c5 127 93 147 147 147 123 0 208-77 243-142 1-3 4-5 8-6 30-3 48-17 57-27-3-1-5-2-8-3-10 8-26 17-49 20z' opacity='.1'/%3e%3ccircle cx='132' cy='565' r='21' fill='%23303c42'/%3e%3ccircle cx='141' cy='559' r='6.76' fill='white'/%3e%3c/g%3e%3c/svg%3e (http://localhost/index.html)
http://localhost/assets/img/bg-water.webp (http://localhost/index.html)
https://github.com/docker-mailserver/docker-mailserver (http://localhost/index.html)
Listing/checking the same link multiple times
In both the actual link checking and with the --dump option, I noticed that lychee outputs the link http://localhost/assets/img/bg-water.webp twice. Despite what the output shows, the link is relative within that same document (assets/img/bg-water.webp). I'm not sure if that affects any of the logic involved.
I can understand that some users may want to have duplicate links listed (perhaps more so with --dump, e.g. for statistics), but it seemed a bit odd for it to appear twice in the report 😅
I assume that without --cache the request is made twice. That option currently seems to be broken though, so I couldn't see whether it changes behaviour within the same run as the cache is built, or whether it alters the output at all. The first run with the option still listed the same link twice.
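For reference, this is roughly the behaviour I expected in the report: resolve each extracted link against the document URL, then list each resulting URL once. A minimal Python sketch, assuming a hypothetical helper `unique_links` (not how lychee collects links):

```python
from urllib.parse import urljoin


def unique_links(base: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against the document URL and drop repeats.

    Hypothetical helper for illustration; keeps first-seen order.
    """
    resolved = (urljoin(base, href) for href in hrefs)
    # dict preserves insertion order, so this de-duplicates stably
    return list(dict.fromkeys(resolved))


links = unique_links(
    "http://localhost/index.html",
    [
        "assets/img/bg-water.webp",
        "https://www.netlify.com/",
        "assets/img/bg-water.webp",
    ],
)
print(links)
```

Here the relative link appears twice in the input but only once in the result, as http://localhost/assets/img/bg-water.webp. Keeping duplicates could still make sense as an opt-in for --dump.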