Hi, I ran into this same issue today.
It doesn't seem to be only about recursive lookup of sub-sitemap.xml links; it looks like lychee can't extract links from sitemap.xml files fetched over HTTP at all?
Apple's sitemap (https://www.apple.com/sitemap.xml), for example, has no sub-sitemap XMLs; it lists the actual links directly.
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://www.apple.com/</loc></url>
<url><loc>https://www.apple.com/accessibility/</loc></url>
<url><loc>https://www.apple.com/accessibility/assistive-technologies/</loc></url>
<url><loc>https://www.apple.com/accessibility/designed-for-students/</loc></url>
...
</urlset>
Running lychee on this sitemap URL does not yield any links:
lychee --version && lychee https://www.apple.com/sitemap.xml
lychee 0.23.0
0/0 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links 🔍 0 Total (in 0s) ✅ 0 OK 🚫 0 Errors
But downloading the same sitemap XML and running lychee on the downloaded file works fine:
lychee --version && lychee ./sitemap.xml
lychee 0.23.0
102/360 ━━━━━ ━━━━━━━━━━━━━━ [200] https://www.apple.com/feedback/calendar/
^C
Originally posted by @mattr-rsolomon in #1819
I did some quick investigation into https://www.apple.com/sitemap.xml, and it looks like the issue is that lychee assumes that websites are HTML. The XML appears to have no links because the links are placed where ordinary non-link text would appear in HTML.
I also note that downloading the file isn't a perfect workaround, because lychee then treats the file as plaintext, so XML entities won't be interpreted. You can see this in the output of lychee --dump:
https://www.apple.com/us/search/product-red?f=ipadpro_12_9_2&fh=48e4&sel=accessories
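To illustrate the entity problem, here is a small Python sketch (not lychee's code) comparing a plaintext-style scan with an XML-aware parse. The `SAMPLE` fragment is a hypothetical one-entry sitemap I built around the URL above, assuming the `&` characters are escaped as `&amp;` the way any valid XML file has to escape them.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical one-entry sitemap; in valid XML every "&" in a URL is written "&amp;".
SAMPLE = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://www.apple.com/us/search/product-red"
    "?f=ipadpro_12_9_2&amp;fh=48e4&amp;sel=accessories</loc></url>"
    "</urlset>"
)

# Plaintext-style scan: collects raw characters, so entities stay encoded.
raw_urls = re.findall(r"https?://[^\s<\"']+", SAMPLE)
raw_url = raw_urls[-1]  # the first match is the xmlns namespace URI

# XML-aware parse: the parser decodes "&amp;" back into "&".
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SAMPLE)
parsed_url = root.find("sm:url/sm:loc", NS).text

print(raw_url)
print(parsed_url)
```

The first printed URL still contains `&amp;`, while the second is the URL a browser would actually request.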
I think that we should add an XML file type, which would help a lot with sitemaps. However, there are lots of different XML schemas, so I'm not sure how we should handle them.
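As a rough idea of what sitemap-only handling could look like, here is a Python sketch (again, not lychee's implementation; `extract_sitemap_locs` is a name I made up). It assumes that for the sitemaps.org schema it is enough to collect the text of every `loc` element, whether the root is a `urlset` or a `sitemapindex`, and it ignores namespaces, so `loc` elements from extension schemas would be picked up too.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the "{namespace}" prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def extract_sitemap_locs(xml_text):
    """Return (root_kind, urls) for a sitemap document.

    root_kind is "urlset" for a list of pages or "sitemapindex" for a
    list of sub-sitemaps that a caller would need to fetch recursively.
    """
    root = ET.fromstring(xml_text)
    urls = [
        el.text.strip()
        for el in root.iter()
        if local_name(el.tag) == "loc" and el.text
    ]
    return local_name(root.tag), urls


SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://www.apple.com/</loc></url>
<url><loc>https://www.apple.com/accessibility/</loc></url>
</urlset>"""

kind, urls = extract_sitemap_locs(SITEMAP)
print(kind, urls)
```

Other XML schemas (RSS, Atom, SVG, ...) put links in different elements and attributes, so they would each need their own rules; this only covers the sitemap case.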
Originally posted by @mattr-rsolomon in #1819