feat: Support sitemap.xml #2071

Merged: thomas-zahner merged 16 commits into lycheeverse:master from cristiklein:support-sitemap-xml on Mar 13, 2026
@cristiklein

Contributor

Fixes #2062

This PR adds support for extracting links from <loc> tags in sitemap.xml. The implementation is kept minimal and relies on a regex, similar to how links are extracted from CSS.

In the future, the XML extractor could be extended with support for links in SVGs and perhaps a proper XML parser. However, those are out of scope for this PR.
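
As a rough illustration of the regex approach described above, here is a minimal Python sketch. It is not lychee's implementation (lychee is written in Rust); the pattern, the `extract_loc_links` name, and the sample sitemap are all assumptions made for this example.

```python
import re
from html import unescape

# Assumed pattern: the text between <loc> and </loc>, matched non-greedily
# and across line breaks, with surrounding whitespace trimmed.
LOC_PATTERN = re.compile(r"<loc>\s*(.*?)\s*</loc>", re.DOTALL | re.IGNORECASE)


def extract_loc_links(xml_text: str) -> list[str]:
    """Return the URLs found in <loc> elements, in document order."""
    # XML escapes such as &amp; are decoded so the URL can be requested as-is.
    return [unescape(raw) for raw in LOC_PATTERN.findall(xml_text)]


# Hypothetical sitemap used only to exercise the function.
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>
    https://example.com/docs?a=1&amp;b=2
  </loc></url>
</urlset>"""

print(extract_loc_links(sitemap))
# ['https://example.com/', 'https://example.com/docs?a=1&b=2']
```

A regex like this ignores namespaced tags (for example `<ns:loc>`), CDATA sections, and comments, which is the kind of gap a proper XML parser would close.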

@mre left a comment

Member

You're on a roll! Great work! 🚀

I've added a few comments. 😊

Comment thread lychee-lib/src/extract/xml.rs
Comment thread lychee-lib/src/types/file.rs
Comment thread lychee-lib/src/types/file.rs
Comment thread lychee-lib/src/extract/xml.rs
@thomas-zahner

thomas-zahner commented Mar 11, 2026

Member

@cristiklein Regarding the clippy warnings/errors, by rebasing or merging master you should be able to resolve the issues. (edit: I did a rebase)

@thomas-zahner left a comment

Member

Thank you for the PR!

Comment thread lychee-lib/src/extract/xml.rs
Comment thread lychee-lib/src/extract/xml.rs Outdated
Comment thread lychee-lib/src/extract/xml.rs Outdated
@cristiklein

Contributor Author

@mre @thomas-zahner Thanks for the feedback on this PR. I believe I addressed all your comments. Otherwise, please let me know. 😄

@thomas-zahner merged commit 556d424 into lycheeverse:master on Mar 13, 2026
7 checks passed
@thomas-zahner

thomas-zahner commented Mar 13, 2026

Member

@cristiklein Thank you for this cool addition to lychee! 🚀

@cristiklein

Contributor Author

> @cristiklein Thank you for this cool addition to lychee! 🚀

And thank you for the kind words and for shepherding this PR. ❤️

donbeave added a commit to jackin-project/jackin that referenced this pull request Apr 25, 2026
The first Docs workflow run on main after #173 (commit f3f3e5e) failed
in deploy → "Check deployed docs links" with "No files found for this
input source". Root cause: the previous step ran lychee --dump on the
deployed sitemap URL, but lychee 0.23.0 (the lycheeverse/lychee-action
v2 default) only extracts <a href> from HTML and matching patterns from
markdown — it does not parse <loc> entries from XML sitemaps. The dump
produced an empty list and the follow-up --files-from step had nothing
to read.
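
The failure mode above (an empty dump feeding a later step) can be caught early with a small guard. The following is a minimal sketch, assuming the dump output holds one link per line; `links_from_dump` is a hypothetical helper, not part of lychee or lychee-action.

```python
def links_from_dump(dump_output: str) -> list[str]:
    """Split dump-style output into links, rejecting an empty result."""
    # Keep only non-blank lines; each is assumed to be a single link.
    links = [line.strip() for line in dump_output.splitlines() if line.strip()]
    if not links:
        # Fail here with a clear message instead of letting a later step
        # report that it found no files for its input source.
        raise ValueError("dump produced no links; nothing to pass on")
    return links


# Hypothetical sample output used only to exercise the helper.
sample = "https://example.com/\nhttps://example.com/docs/\n"
print(len(links_from_dump(sample)))  # 2
```

Failing at the dump step points directly at the extractor rather than at the step that consumes the list.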

Upstream already fixed this. lycheeverse/lychee#2071 (merged
2026-03-13, tagged in v0.24.0 on 2026-04-24) adds <loc> extraction
from sitemap.xml, closing lycheeverse/lychee#2062 and #1819. Verified
locally on 0.24.0:

  $ lychee --version
  lychee 0.24.0
  $ lychee --dump https://jackin.tailrocks.com/sitemap-0.xml | wc -l
  45

Pin LYCHEE_VERSION at the workflow env level and reference it from
every lychee-action call so future bumps are one-line. v0.24.0's
breaking changes are in lychee-lib (the Rust API consumers); the CLI
surface we use is unchanged.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
donbeave added a commit to jackin-project/jackin that referenced this pull request Apr 25, 2026
Replace the previous v0.24.0 bump with the only combination that
actually works against the current lychee release pipeline:

  - lycheeverse/lychee-action SHA 8646ba3 (tagged v2.8.0) → faea714
    (post-v2.8.0 master). Adds subfolder-aware install needed for any
    lychee 0.24.x tarball.
  - LYCHEE_VERSION 'v0.24.0' → 'v0.24.1'.

Why both moves:

* lychee 0.24.0 added <loc> extraction from XML sitemaps
  (lycheeverse/lychee#2071), which is what the deploy and check-deployed
  jobs need to feed --files-from. lychee 0.23.0 dumps zero links from a
  sitemap, which is what produced the "No files found for this input
  source" failure on f3f3e5e.
* lychee 0.24.0's release tarball was repackaged with a top-level
  subfolder AND the asset filename was renamed to
  lychee-lychee-v0.24.0-{arch}-... — both incompatible with
  lychee-action v2.8.0's hardcoded download URL and flat-extract logic.
* lychee 0.24.1 (released the same day) reverted to the original asset
  filename but kept the subfolder layout AND kept the sitemap fix.
* lychee-action faea714 (unreleased; current HEAD of master) bumps the
  default to 0.24.1 and adds subfolder-aware install. Pinning the SHA
  is the same security model we already use for v2.8.0.

The combinations 8646ba3 + 'latest' and 8646ba3 + 'v0.24.x' both fail.
The combination faea714 + 'v0.24.1' works.
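
To make the flat versus subfolder layouts concrete, here is a minimal sketch of a subfolder-aware lookup. It is not lychee-action's install logic; `find_binary` and the directory name below are hypothetical, and the lookup only assumes that the binary sits either at the top level of the extracted archive or one directory down.

```python
import tempfile
from pathlib import Path


def find_binary(extract_dir: Path, name: str = "lychee") -> Path:
    """Locate `name` in an extracted archive, flat or nested one level deep."""
    # Older layout (assumed): the binary sits directly in the extract directory.
    flat = extract_dir / name
    if flat.is_file():
        return flat
    # Newer layout (assumed): the archive unpacks into a top-level subfolder.
    nested = sorted(p for p in extract_dir.glob(f"*/{name}") if p.is_file())
    if nested:
        return nested[0]
    raise FileNotFoundError(f"{name} not found in {extract_dir}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "lychee-subfolder").mkdir()  # hypothetical subfolder name
    (root / "lychee-subfolder" / "lychee").write_text("")
    print(find_binary(root).relative_to(root))
```

A flat-extract installer corresponds to the first branch only, which is why it would miss a binary that moved into a subfolder.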

Verified locally:

  $ lychee-v0.24.1/lychee --version
  lychee 0.24.1
  $ lychee-v0.24.1/lychee --dump https://jackin.tailrocks.com/sitemap-0.xml | wc -l
  45

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
donbeave added a commit to jackin-project/jackin that referenced this pull request Apr 25, 2026
* ci(docs): bump lychee to v0.24.0 to fix sitemap URL extraction

* ci(docs): bump lychee-action and lychee for sitemap URL extraction

* ci(docs): add TODO(lychee-action-sha-pin) marker

Companion to #179, which establishes the convention. Mark the spot
where the SHA pin needs to be reverted once lycheeverse/lychee-action
cuts a tagged release at or after faea714, with a back-link to the
tracked entry in TODO.md so a single grep finds both ends.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>

---------

Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
Co-authored-by: Claude <noreply@anthropic.com>
donbeave added a commit to jackin-project/jackin that referenced this pull request May 6, 2026
donbeave added a commit to jackin-project/jackin that referenced this pull request May 7, 2026
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Alexey Zhokhov <alexey@zhokhov.com>
Co-authored-by: Codex <codex@openai.com>
donbeave added a commit to jackin-project/jackin that referenced this pull request May 7, 2026
donbeave added a commit to jackin-project/jackin that referenced this pull request May 7, 2026
Development

Successfully merging this pull request may close these issues.

Remote XML files are treated as HTML (including sitemap.xml)

3 participants