Skip to content

Fix skill frontmatter loading multi-codepoint UTF-8 graphemes at chunk boundary #57466

Merged
rtfeldman merged 3 commits into
zed-industries:mainfrom
roboticsdude60:fix-skill-chunk-boundary-utf8
May 22, 2026
Merged

Fix skill frontmatter loading multi-codepoint UTF-8 graphemes at chunk boundary #57466
rtfeldman merged 3 commits into
zed-industries:mainfrom
roboticsdude60:fix-skill-chunk-boundary-utf8

Conversation

@roboticsdude60

Copy link
Copy Markdown
Contributor

Self-Review Checklist:

  • I've reviewed my own diff for quality, security, and reliability
  • Unsafe blocks (if any) have justifying comments
  • The content is consistent with the UI/UX checklist
  • Tests cover the new/changed behavior
  • Performance impact has been considered and is acceptable

Closes #57463

Release Notes:

  • Fixed utf-8 parsing issues when loading skill frontmatter with multi-codepoint graphemes (such as an emoji) crossing file chunk load boundaries.

Adds a test for `load_skill_frontmatter` which reproduces a bug when a
multi-byte grapheme in the skill body crosses the 4096-byte chunk read
boundary. As commited, the test fails with message `"SKILL.md is not
valid UTF-8: incomplete utf-8 byte sequence from index 4095"`.
…anges

the idea with extracting this constant is to keep the test relevant if the chunk size ever changes in the future.
@cla-bot cla-bot Bot added the cla-signed The user has signed the Contributor License Agreement label May 22, 2026
@zed-community-bot zed-community-bot Bot added the first contribution the author's first pull request to Zed. NOTE: the label application is automated via github actions label May 22, 2026
@roboticsdude60 roboticsdude60 changed the title Fix skill chunk boundary utf8 Fix skill frontmatter loading multi-codepoint utf-8 graphemes at chunk boundary May 22, 2026
@maxdeviant maxdeviant changed the title Fix skill frontmatter loading multi-codepoint utf-8 graphemes at chunk boundary Fix skill frontmatter loading multi-codepoint UTF-8 graphemes at chunk boundary May 22, 2026
@rtfeldman rtfeldman enabled auto-merge May 22, 2026 15:15

@rtfeldman rtfeldman left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @roboticsdude60! 🎉

@rtfeldman rtfeldman added this pull request to the merge queue May 22, 2026
Merged via the queue into zed-industries:main with commit d3e27ab May 22, 2026
46 checks passed
pull Bot pushed a commit to kp-forks/zed that referenced this pull request May 27, 2026
…7510)

Follow-up to zed-industries#57466 simplifying `load_skill_frontmatter` rather than
fixing a separate bug.

zed-industries#57466 fixed the chunked-read loop so it truncates accumulated bytes at
the frontmatter boundary, preventing `str::from_utf8` from failing on a
multi-byte grapheme split across chunks. That fix is correct, but the
chunked read it's working around isn't pulling its weight:
`MAX_SKILL_FILE_SIZE` is 100KB and there's already a metadata pre-check,
so "stop reading early once we've seen the closing `---`" saves at most
~25 pages per file while forcing the use of `open_sync`, a hand-rolled
loop, and a long comment about `smol::unblock` vs the GPUI test
scheduler.

This PR:

- replaces the chunked `open_sync` + read loop with a single
`fs.load(...)` call (the same primitive `read_skill_body` already uses);
- deletes `closing_delimiter_end`, `SKILL_READ_CHUNK_SIZE`, the
`std::io::{self, Read}` import, and the `Parking forbidden` paragraph;
- tightens the metadata pre-check so a metadata error (permissions, I/O,
etc.) bails out instead of silently falling through to a blind read;
- removes the now-obsolete
`test_load_skill_frontmatter_with_emoji_at_chunk_boundary` test — by
construction, a single full read can't split a UTF-8 sequence at any
boundary.

Closes AI-303

Release Notes:

- N/A
MartinYe1234 pushed a commit that referenced this pull request May 27, 2026
…pick to preview) (#57831)

Cherry-pick of #57510 to preview

----
Follow-up to #57466 simplifying `load_skill_frontmatter` rather than
fixing a separate bug.

the frontmatter boundary, preventing `str::from_utf8` from failing on a
multi-byte grapheme split across chunks. That fix is correct, but the
chunked read it's working around isn't pulling its weight:
`MAX_SKILL_FILE_SIZE` is 100KB and there's already a metadata pre-check,
so "stop reading early once we've seen the closing `---`" saves at most
~25 pages per file while forcing the use of `open_sync`, a hand-rolled
loop, and a long comment about `smol::unblock` vs the GPUI test
scheduler.

This PR:

- replaces the chunked `open_sync` + read loop with a single
`fs.load(...)` call (the same primitive `read_skill_body` already uses);
- deletes `closing_delimiter_end`, `SKILL_READ_CHUNK_SIZE`, the
`std::io::{self, Read}` import, and the `Parking forbidden` paragraph;
- tightens the metadata pre-check so a metadata error (permissions, I/O,
etc.) bails out instead of silently falling through to a blind read;
- removes the now-obsolete
`test_load_skill_frontmatter_with_emoji_at_chunk_boundary` test — by
construction, a single full read can't split a UTF-8 sequence at any
boundary.

Closes AI-303

Release Notes:

- N/A
@roboticsdude60 roboticsdude60 deleted the fix-skill-chunk-boundary-utf8 branch May 28, 2026 16:34
TomPlanche pushed a commit to TomPlanche/zed that referenced this pull request Jun 2, 2026
…k boundary (zed-industries#57466)

Self-Review Checklist:

- [X] I've reviewed my own diff for quality, security, and reliability
- [X] Unsafe blocks (if any) have justifying comments
- [X] The content is consistent with the [UI/UX
checklist](https://github.com/zed-industries/zed/blob/main/CONTRIBUTING.md#uiux-checklist)
- [X] Tests cover the new/changed behavior
- [X] Performance impact has been considered and is acceptable

Closes zed-industries#57463

Release Notes:

- Fixed utf-8 parsing issues when loading skill frontmatter with
multi-codepoint graphemes (such as an emoji) crossing file chunk load
boundaries.
TomPlanche pushed a commit to TomPlanche/zed that referenced this pull request Jun 2, 2026
…7510)

Follow-up to zed-industries#57466 simplifying `load_skill_frontmatter` rather than
fixing a separate bug.

zed-industries#57466 fixed the chunked-read loop so it truncates accumulated bytes at
the frontmatter boundary, preventing `str::from_utf8` from failing on a
multi-byte grapheme split across chunks. That fix is correct, but the
chunked read it's working around isn't pulling its weight:
`MAX_SKILL_FILE_SIZE` is 100KB and there's already a metadata pre-check,
so "stop reading early once we've seen the closing `---`" saves at most
~25 pages per file while forcing the use of `open_sync`, a hand-rolled
loop, and a long comment about `smol::unblock` vs the GPUI test
scheduler.

This PR:

- replaces the chunked `open_sync` + read loop with a single
`fs.load(...)` call (the same primitive `read_skill_body` already uses);
- deletes `closing_delimiter_end`, `SKILL_READ_CHUNK_SIZE`, the
`std::io::{self, Read}` import, and the `Parking forbidden` paragraph;
- tightens the metadata pre-check so a metadata error (permissions, I/O,
etc.) bails out instead of silently falling through to a blind read;
- removes the now-obsolete
`test_load_skill_frontmatter_with_emoji_at_chunk_boundary` test — by
construction, a single full read can't split a UTF-8 sequence at any
boundary.

Closes AI-303

Release Notes:

- N/A
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-signed The user has signed the Contributor License Agreement first contribution the author's first pull request to Zed. NOTE: the label application is automated via github actions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SKILL.md with multi-codepoint grapheme at chunk boundary fails to load

2 participants