Remote XML files are treated as HTML (including sitemap.xml) #2062

@katrinafyi

Hi, I ran into this same issue today.
It seems the cause is not only the recursive lookup of sub-sitemap.xml links; lychee apparently can't extract links from sitemap.xml files accessed over HTTP at all.

Apple's sitemap (https://www.apple.com/sitemap.xml), for example, does not reference any sub-sitemap XML files; it lists the actual links directly.

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://www.apple.com/</loc></url>
<url><loc>https://www.apple.com/accessibility/</loc></url>
<url><loc>https://www.apple.com/accessibility/assistive-technologies/</loc></url>
<url><loc>https://www.apple.com/accessibility/designed-for-students/</loc></url>
...
</urlset>

Running lychee on this sitemap URL does not yield any links:

lychee --version && lychee https://www.apple.com/sitemap.xml
lychee 0.23.0
0/0 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links
🔍 0 Total (in 0s) ✅ 0 OK 🚫 0 Errors

But downloading the same sitemap XML and running lychee on the downloaded file works fine:

lychee --version && lychee ./sitemap.xml 
lychee 0.23.0
102/360 ━━━━━ ━━━━━━━━━━━━━━      [200] https://www.apple.com/feedback/calendar/
^C

Originally posted by @mattr-rsolomon in #1819

I did some quick investigation into https://www.apple.com/sitemap.xml, and it looks like the issue is that lychee assumes remote documents are HTML. The XML appears to have no links because the links sit in element text, where ordinary non-link text would appear in HTML.
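To illustrate the point, here is a minimal Python sketch (not lychee's actual extractor, which is written in Rust) of an HTML-style link collector that only looks at `href`/`src` attributes. The `AttributeLinkCollector` class and `html_style_links` helper are hypothetical names made up for this example. Fed a shortened copy of the sitemap above, it finds nothing, because every URL lives in the text of a `<loc>` element rather than in an attribute.

```python
from html.parser import HTMLParser

# Shortened copy of the sitemap sample quoted in this issue.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://www.apple.com/</loc></url>
<url><loc>https://www.apple.com/accessibility/</loc></url>
</urlset>"""


class AttributeLinkCollector(HTMLParser):
    """Hypothetical HTML-style extractor: only inspects href/src attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; element text is never visited.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def html_style_links(document: str) -> list[str]:
    collector = AttributeLinkCollector()
    collector.feed(document)
    collector.close()
    return collector.links


if __name__ == "__main__":
    print(html_style_links(SITEMAP))
    print(html_style_links('<a href="https://example.com/">home</a>'))
```

Under that assumption, the sitemap looks like a page with text but no links, which would match the `0 Total` result above.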

I also note that downloading the file isn't a perfect workaround, because lychee then treats the file as plaintext and XML entities aren't decoded. You can see this in the output of lychee --dump:

https://www.apple.com/us/search/product-red?f=ipadpro_12_9_2&amp;fh=48e4&amp;sel=accessories
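The difference between plaintext and XML handling can be sketched in Python. This is only an illustration: the `<loc>` wrapper around the URL is my assumption about how the entry appears in the sitemap, and `plaintext_url` / `xml_url` are hypothetical helpers, not lychee functions. A plaintext scan keeps the raw `&amp;`, while an XML parser decodes it to `&`.

```python
import re
import xml.etree.ElementTree as ET

# Assumed sitemap entry: the dumped URL above wrapped in a <loc> element.
LOC = (
    "<loc>https://www.apple.com/us/search/product-red"
    "?f=ipadpro_12_9_2&amp;fh=48e4&amp;sel=accessories</loc>"
)


def plaintext_url(fragment: str):
    # Naive plaintext scan: take everything from the scheme up to
    # whitespace or the next "<". Entities are left untouched.
    match = re.search(r"https?://[^\s<]+", fragment)
    return match.group(0) if match else None


def xml_url(fragment: str):
    # An XML parser decodes entities such as &amp; while reading element text.
    return ET.fromstring(fragment).text


if __name__ == "__main__":
    print(plaintext_url(LOC))
    print(xml_url(LOC))
```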

I think we should add an XML file type, which would help a lot with sitemaps. However, there are lots of different XML schemas, so I don't know how we should handle them.
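As one possible starting point, here is a rough Python sketch of schema-specific handling for sitemaps only. It is not a proposal for lychee's actual design; `extract_sitemap_links` is a hypothetical name. It assumes the sitemaps.org 0.9 namespace shown in the sample above, reads `<loc>` entries from a `<urlset>`, and separately reports `<loc>` entries from a `<sitemapindex>` so a caller could decide whether to recurse. Any other root element is treated as an unknown schema and yields nothing.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute of the sitemap sample in this issue.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_sitemap_links(xml_text: str):
    """Return (page_urls, sub_sitemap_urls) for a sitemaps.org document."""
    root = ET.fromstring(xml_text)
    pages, sub_sitemaps = [], []

    if root.tag == SITEMAP_NS + "urlset":
        target, child = pages, "url"
    elif root.tag == SITEMAP_NS + "sitemapindex":
        target, child = sub_sitemaps, "sitemap"
    else:
        # Unknown schema: leave it to some other (generic) extractor.
        return pages, sub_sitemaps

    for entry in root.findall(SITEMAP_NS + child):
        loc = entry.find(SITEMAP_NS + "loc")
        if loc is not None and loc.text:
            target.append(loc.text.strip())
    return pages, sub_sitemaps


if __name__ == "__main__":
    sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://www.apple.com/</loc></url>
<url><loc>https://www.apple.com/accessibility/</loc></url>
</urlset>"""
    print(extract_sitemap_links(sample))
```

Other XML schemas would still need their own rules or a generic fallback, which is the open question.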
