perf(utils): avoid allocation in default_sanitize_file_name for clean names #9928
Merged
Pull request overview
Optimizes `rolldown_utils`'s filename sanitization hot path by avoiding an unconditional allocation/copy for already-valid output filenames, which are the common case during chunk/asset emission.

Changes:
- Change `default_sanitize_file_name` to return `Cow<'_, str>` and early-return `Cow::Borrowed` when no invalid characters are present.
- Replace the previous per-`char` rebuild with a fast byte scan to find the first invalid ASCII byte, then bulk-copy the valid prefix and rewrite only the remainder.
- Add targeted tests covering the borrowed/owned split, Windows drive-letter `:` semantics, and Unicode (multi-byte UTF-8) correctness on both clean and rewrite paths.
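The borrowed/owned split and the byte scan described in the overview can be sketched roughly as follows. This is a minimal illustration, not the code merged in the PR: `sanitize_sketch`, `is_invalid_byte`, and `drive_prefix_len` are hypothetical names, and the invalid-character set is an assumption, since the PR text only says that every invalid character is ASCII.

```rust
use std::borrow::Cow;

/// Hypothetical invalid set. The PR text does not list the characters,
/// so this set is an assumption made purely for illustration.
fn is_invalid_byte(b: u8) -> bool {
    matches!(
        b,
        0x00..=0x1F
            | 0x7F
            | b'"' | b'#' | b'$' | b'%' | b'&' | b'*' | b'+' | b',' | b':' | b';'
            | b'<' | b'=' | b'>' | b'?' | b'[' | b']' | b'^' | b'`' | b'{' | b'|' | b'}'
    )
}

/// Length of a leading Windows drive prefix such as `C:`, or 0 if there is none.
fn drive_prefix_len(name: &str) -> usize {
    let bytes = name.as_bytes();
    if bytes.len() >= 2 && bytes[0].is_ascii_alphabetic() && bytes[1] == b':' {
        2
    } else {
        0
    }
}

/// Sketch of a sanitizer that borrows clean names and allocates only for dirty ones.
fn sanitize_sketch(name: &str) -> Cow<'_, str> {
    let start = drive_prefix_len(name);
    let bytes = name.as_bytes();

    // Clean path: scan bytes (not chars) after the drive prefix.
    // If nothing is invalid, return the input without allocating.
    let Some(offset) = bytes[start..].iter().position(|&b| is_invalid_byte(b)) else {
        return Cow::Borrowed(name);
    };
    let first_bad = start + offset;

    // Dirty path: bulk-copy the valid prefix (drive letter included).
    // `first_bad` points at an ASCII byte, so it is a char boundary.
    let mut out = String::with_capacity(name.len());
    out.push_str(&name[..first_bad]);

    // Rewrite only the remainder, replacing invalid characters with `_`.
    for ch in name[first_bad..].chars() {
        match u8::try_from(ch) {
            Ok(b) if is_invalid_byte(b) => out.push('_'),
            _ => out.push(ch),
        }
    }
    Cow::Owned(out)
}
```

Under these assumptions, `sanitize_sketch("index.js")` returns `Cow::Borrowed`, while `sanitize_sketch("C:/a:b.js")` returns an owned `C:/a_b.js` with the drive-letter colon preserved.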
Merging this PR will not alter performance
IWANABETHATGUY approved these changes Jun 23, 2026
… names (#9928)

## Summary

`default_sanitize_file_name` (called once per output chunk/asset filename) unconditionally allocated `String::with_capacity(str.len())` and rebuilt the name char-by-char, even though the common case is a filename that contains **no** invalid characters (`index.js`, `react.production.min.js`, any path without shell/NTFS-unsafe chars) and needs no rewriting at all. This is the same wasted-allocation pattern recently fixed in `legitimize_identifier_name` (#9926).

The function now returns `Cow<str>`:

- **Clean path** (common): scan for the first invalid char; if none, return `Cow::Borrowed(str)`: zero allocation, zero copy.
- **Dirty path**: allocate only when a replacement is actually needed, and bulk-copy the valid prefix (drive letter included) with `push_str` instead of one `char` at a time.

The scan is done over **bytes, not chars**. Every invalid character is ASCII (≤ 0x7F), and UTF-8 guarantees that no byte of a multi-byte character is < 0x80, so a byte scan finds exactly the same positions as a char scan without per-char UTF-8 decoding, and every match lands on a char boundary. The dirty-path rewrite uses `u8::try_from(char)` rather than `char as u8` so a non-ASCII char (e.g. `😀`, whose low byte is `0x00`) is never truncated into a false match.

Both call sites in `rolldown_common` already do `.into()` into `ArcStr` (which implements `From<Cow<str>>`), so they are unchanged.

## Measured impact

Measured locally with a Criterion microbench (not committed), original `String` version vs this change:

| input | before | after | change |
|---|---|---|---|
| `clean_short` (`index.js`) | 21.4 ns | 6.7 ns | **−69%** |
| `clean_long` (78-char ASCII path) | 122 ns | 46 ns | **−63%** |
| `clean_unicode` (30-char Cyrillic path) | 85 ns | 32 ns | **−63%** |
| `dirty` (needs rewriting) | 60 ns | 49 ns | **−18%** |

All changes statistically significant (p < 0.05). The byte scan is what unlocks the gains on longer and non-ASCII paths (no per-char decoding).

## Correctness

Output is byte-for-byte identical to the previous implementation, including Windows drive-letter semantics (`C:/foo.js` preserved, later `:` still replaced: `C:/a:b.js` → `C:/a_b.js`). All `rolldown_utils` tests pass.

- `test_sanitize_file_name`: the borrowed/owned split, empty string, and the Windows-drive paths.
- `test_sanitize_unicode`: clean multi-byte names (2-, 3-, and 4-byte sequences: `café.js`, `日本語.js`, `компоненты/Кнопка.js`, `emoji_😀.js`) returned borrowed, and multi-byte chars surviving the rewrite path verbatim (`a?é` → `a_é`, `a?😀` → `a_😀`, `日本?語` → `日本_語`, `café:dir.js` → `café_dir.js`).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
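The two encoding facts the description relies on (that `char as u8` truncates while `u8::try_from(char)` does not, and that every byte of a multi-byte UTF-8 character is at least 0x80) can be checked with a small standalone sketch. The helper names below are hypothetical and exist only for this illustration; they are not part of `rolldown_utils`.

```rust
/// Truncating comparison: `ch as u8` keeps only the low 8 bits of the code point,
/// so unrelated characters can collide with an ASCII byte.
fn truncating_match(ch: char, target: u8) -> bool {
    ch as u8 == target
}

/// Checked comparison: `u8::try_from` fails for code points above U+00FF,
/// so only a character that really is `target` can match.
fn checked_match(ch: char, target: u8) -> bool {
    matches!(u8::try_from(ch), Ok(b) if b == target)
}

/// True when every byte of `s` is >= 0x80. This holds for text made only of
/// multi-byte UTF-8 characters, which is why an ASCII byte scan cannot
/// produce a match in the middle of such a character.
fn all_bytes_high(s: &str) -> bool {
    s.bytes().all(|b| b >= 0x80)
}
```

For example, `truncating_match('😀', 0x00)` is true because U+1F600 ends in `0x00`, and `truncating_match('\u{013A}', b':')` is true because U+013A ends in `0x3A`. The checked version rejects both while still accepting a real `:`.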
shulaoda added a commit that referenced this pull request Jun 24, 2026
## [1.1.3] - 2026-06-24

### 🐛 Bug Fixes

- `defer_drop` crashes the browser main thread (#9942) by @shulaoda
- camel-case: correct camel case for nested values (#9933) by @kb019
- cli: display --help options in camelCase (#9941) by @IWANABETHATGUY
- preserve used re-exports under preserveModules (#9122) (#9934) by @IWANABETHATGUY
- watch: make close reentrant in event callbacks (#9904) by @hyf0
- git for windows treats symlink files as regular files (#9915) by @AliceLanniste
- dev: cancel pending full reload on build error (#9903) by @h-a-n-a
- chunking: pass plugin meta to codeSplitting groups name function (#9267) by @Kyujenius
- dev: serve assets emitted during HMR/lazy compile (vite#22596) (#9815) by @h-a-n-a
- release: dry-run step no longer publishes binding packages (#9866) by @Boshen

### 🚜 Refactor

- rolldown_common: model ModuleId as a classified Path/Virtual/Bare enum (#9927) by @Boshen
- remove unused LegacyModuleIdx (#9872) by @shulaoda
- remove unused StmtInfos::get_namespace_stmt_info (#9870) by @shulaoda
- remove unused Module::as_external_mut (#9871) by @shulaoda
- remove unused EcmaAst::is_body_empty (#9869) by @shulaoda
- drop dead is_css_module handling in resolve_dependencies (#9867) by @shulaoda
- drop redundant with_commonjs on cjs source type (#9868) by @shulaoda

### 📚 Documentation

- clarify on drafting PRs (#9952) by @h-a-n-a
- update contribution guidelines (#9944) by @fubhy
- note Rust crates don't follow semver in AGENTS.md (#9905) by @IWANABETHATGUY
- add feedback form (#9159) by @TheAlexLichter

### ⚡ Performance

- utils: avoid allocation in default_sanitize_file_name for clean names (#9928) by @Boshen
- binding: box once-per-build futures before spawn_future (#9864) by @Boshen
- utils: avoid wasted allocation in legitimize_identifier_name (#9926) by @Boshen
- rolldown: fuse the canonical-name dedup and insert in the renamer (#9900) by @Boshen
- rolldown: probe the name map once in ConflictResolver::resolve (#9899) by @Boshen
- cut two heap allocations from wrapped ESM init finalize (#9901) by @Boshen
- rolldown_plugin_vite_reporter: hoist invariant out_dir prefix out of reporter loop (#9873) by @shulaoda
- drop throwaway Vec in wrapped esm init stmt (#9878) by @shulaoda
- borrow owner_filename in build-import-analysis AddDeps (#9874) by @shulaoda

### 🧪 Testing

- cover preserveModules named export via namespace re-export (#6010) (#9937) by @IWANABETHATGUY

### ⚙️ Miscellaneous Tasks

- deps: update napi to v3.9.4 (#9954) by @shulaoda
- reduce noise from CODEOWNERS for trival changes (#9953) by @h-a-n-a
- deps: update mimalloc-safe to 0.1.64 (#9950) by @shulaoda
- deps: update rollup submodule for tests to v4.62.2 (#9931) by @rolldown-guard[bot]
- deps: test mimalloc-safe upstream-mimalloc switch in CI (#9930) by @shulaoda
- rolldown_plugin_vite_build_import_analysis: remove unused v2 code path (#9917) by @shulaoda
- rolldown_plugin_vite_manifest: remove unused is_enable_v2 code path (#9916) by @shulaoda
- rolldown_plugin_vite_asset_import_meta_url: remove unexposed native vite plugin (#9896) by @shulaoda
- rolldown_plugin_vite_asset: remove unexposed native vite plugin (#9895) by @shulaoda
- rolldown_plugin_vite_css_post: remove unexposed native vite plugin (#9894) by @shulaoda
- rolldown_plugin_vite_css: remove unexposed native vite plugin (#9893) by @shulaoda
- rolldown_plugin_vite_html_inline_proxy: remove unexposed native vite plugin (#9892) by @shulaoda
- rolldown_plugin_vite_html: remove unexposed native vite plugin (#9891) by @shulaoda
- deps: update github actions (#9909) by @renovate[bot]
- deps: update rust crate oxc_sourcemap to v8.0.2 (#9910) by @renovate[bot]
- deps: update npm packages (#9912) by @renovate[bot]
- deps: update github actions to v7 (#9913) by @renovate[bot]
- deps: update rolldown-plugin-dts to ^0.26.0 (#9897) by @renovate[bot]
- remove rolldown_filter_analyzer crate (#9865) by @Boshen

### ❤️ New Contributors

* @fubhy made their first contribution in [#9944](#9944)

Co-authored-by: shulaoda <165626830+shulaoda@users.noreply.github.com>