fix: allow for [u8] as filename#775
Conversation
Its-Just-Nans
commented
Apr 11, 2026
- The PR title must conform to Conventional Commits and start with one of the types specified by the Angular convention.
There was a problem hiding this comment.
The PR changes the internal representation of file names from Box<str> to Box<[u8]> to support non-UTF-8 filenames. However, there's a critical defect in finish_into_readable() where expect("invalid UTF-8") will cause runtime panics for ZIP archives containing non-UTF-8 filenames. This breaks existing functionality since ZIP files can legally contain such filenames using encodings like CP437.
You can now have the agent implement changes and create commits directly on your pull request's source branch. Simply comment with /q followed by your request in natural language to ask the agent to make changes.
There was a problem hiding this comment.
Code Review
This pull request updates the internal file mapping to use Box<[u8]> instead of Box for filenames, facilitating support for non-UTF-8 paths. However, the implementation introduces a potential panic in the finish method when converting keys back to UTF-8, which contradicts the goal of supporting arbitrary bytes. Additionally, there are opportunities to optimize memory allocations during file insertion by using existing raw filename fields instead of re-allocating byte slices.
| if self.files.contains_key(file.file_name.as_bytes()) { | ||
| return Err(invalid!("Duplicate filename: {}", file.file_name)); | ||
| } | ||
| let (index, _) = self.files.insert_full(file.file_name.clone(), file); | ||
| let (index, _) = self | ||
| .files | ||
| .insert_full(file.file_name.as_bytes().to_vec().into_boxed_slice(), file); |
There was a problem hiding this comment.
The implementation of insert_file_data uses file.file_name.as_bytes() to generate the map key, which is inconsistent with the goal of supporting raw byte filenames. Additionally, the sequence .to_vec().into_boxed_slice() performs redundant allocations. It is more efficient and consistent to use file.file_name_raw.clone().
if self.files.contains_key(file.file_name_raw.as_ref()) {
return Err(invalid!("Duplicate filename: {}", file.file_name));
}
let (index, _) = self.files.insert_full(file.file_name_raw.clone(), file);References
- To skip bytes from a reader, use std::io::copy with std::io::sink() or a stack-allocated buffer instead of allocating a Vec on the heap to avoid unnecessary allocations.
- Avoid exposing internal data structures in public APIs. Instead, accept primitive types (e.g., &[u8]) and perform validation and construction internally.
Pr0methean
left a comment
There was a problem hiding this comment.
Mostly looks good, but some of the details are unclear to me.
| /// Returns an iterator over all the file and directory names in this archive. | ||
| pub fn file_names(&self) -> impl Iterator<Item = &str> { | ||
| self.shared.files.keys().map(std::convert::AsRef::as_ref) | ||
| self.shared.files.values().map(|f| f.file_name.as_ref()) |
There was a problem hiding this comment.
What purpose does this change in implementation serve?
There was a problem hiding this comment.
If the key is [u8] item cannot be str
Previously we used the key which was str
Now the key is [u8] so we cannot
And we use the file name inside the data
The thing is, to remove the triple allocation of file name, this method will be a breaking change (in another MR)
There was a problem hiding this comment.
Why can't we just convert the key back to str? Granted, it doesn't handle the case where the name is an unknown encoding; but this PR doesn't handle that case either.
There was a problem hiding this comment.
If this change lets us piggyback on some existing handling of non-UTF8, then that may be a valid reason but should be documented in a comment.
There was a problem hiding this comment.
Yes this is the goal of the MR.
I'm not sure how a good idea it is to save the file name as a string. String are UTF-8, the zip specs doesn't enforce that.
I also think that a futur MR would be needed to change the way we are handling the file_name in ZipFileData
Two points:
- the file name raw is duplicated between the key and the ZipFileData, I think this is removable
- the file name (non-raw) is also saved in zipfiledata. I don't think that's a good way to do it. But this type of change would mean a breaking change (method using file name will return a results, and the UTF-8 string will be lazy computed on call, not saved in the struct)
There was a problem hiding this comment.
I'll probably have to take another look at this situation on the weekend.
There was a problem hiding this comment.
Removing the raw filename would probably require the values in the files-by-name map to borrow the key, assuming that's even possible (although if we went back to an IndexMap we could store the index instead), so that a reverse lookup would be possible. Is that what you're suggesting?
There was a problem hiding this comment.
Removing the raw filename would probably require the values in the files-by-name map to borrow the key, assuming that's even possible (although if we went back to an IndexMap we could store the index instead), so that a reverse lookup would be possible. Is that what you're suggesting?
Yes kinda that will be this https://github.com/zip-rs/zip2/compare/only-file-name?expand=1 which is a MR on top of this current branch
ZipFileData will become (file_name_raw, ZipFileDataWithout_file_name_raw)