Skip to content

fix: allow for [u8] as filename#775

Merged
Pr0methean merged 6 commits into
masterfrom
test-internal-index-bytes
Apr 23, 2026
Merged

fix: allow for [u8] as filename#775
Pr0methean merged 6 commits into
masterfrom
test-internal-index-bytes

Conversation

@Its-Just-Nans

Copy link
Copy Markdown
Member

@Its-Just-Nans Its-Just-Nans marked this pull request as draft April 11, 2026 21:34

@amazon-q-developer amazon-q-developer Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR changes the internal representation of file names from Box<str> to Box<[u8]> to support non-UTF-8 filenames. However, there's a critical defect in finish_into_readable() where expect("invalid UTF-8") will cause runtime panics for ZIP archives containing non-UTF-8 filenames. This breaks existing functionality since ZIP files can legally contain such filenames using encodings like CP437.


You can now have the agent implement changes and create commits directly on your pull request's source branch. Simply comment with /q followed by your request in natural language to ask the agent to make changes.

Comment thread src/write.rs Outdated
Comment thread src/write.rs Fixed
Comment thread src/write.rs Fixed

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the internal file mapping to use Box<[u8]> instead of Box for filenames, facilitating support for non-UTF-8 paths. However, the implementation introduces a potential panic in the finish method when converting keys back to UTF-8, which contradicts the goal of supporting arbitrary bytes. Additionally, there are opportunities to optimize memory allocations during file insertion by using existing raw filename fields instead of re-allocating byte slices.

Comment thread src/write.rs Outdated
Comment thread src/write.rs Outdated
Comment on lines +1395 to +1400
if self.files.contains_key(file.file_name.as_bytes()) {
return Err(invalid!("Duplicate filename: {}", file.file_name));
}
let (index, _) = self.files.insert_full(file.file_name.clone(), file);
let (index, _) = self
.files
.insert_full(file.file_name.as_bytes().to_vec().into_boxed_slice(), file);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The implementation of insert_file_data uses file.file_name.as_bytes() to generate the map key, which is inconsistent with the goal of supporting raw byte filenames. Additionally, the sequence .to_vec().into_boxed_slice() performs redundant allocations. It is more efficient and consistent to use file.file_name_raw.clone().

        if self.files.contains_key(file.file_name_raw.as_ref()) {
            return Err(invalid!("Duplicate filename: {}", file.file_name));
        }
        let (index, _) = self.files.insert_full(file.file_name_raw.clone(), file);
References
  1. To skip bytes from a reader, use std::io::copy with std::io::sink() or a stack-allocated buffer instead of allocating a Vec on the heap to avoid unnecessary allocations.
  2. Avoid exposing internal data structures in public APIs. Instead, accept primitive types (e.g., &[u8]) and perform validation and construction internally.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@Its-Just-Nans Its-Just-Nans mentioned this pull request Apr 12, 2026
2 tasks
@Its-Just-Nans Its-Just-Nans marked this pull request as ready for review April 12, 2026 06:02

@Pr0methean Pr0methean left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good, but some of the details are unclear to me.

Comment thread src/read/zip_archive.rs
/// Returns an iterator over all the file and directory names in this archive.
pub fn file_names(&self) -> impl Iterator<Item = &str> {
self.shared.files.keys().map(std::convert::AsRef::as_ref)
self.shared.files.values().map(|f| f.file_name.as_ref())

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What purpose does this change in implementation serve?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the key is [u8] item cannot be str

Previously we used the key which was str
Now the key is [u8] so we cannot
And we use the file name inside the data

The thing is, to remove the triple allocation of file name, this method will be a breaking change (in another MR)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we just convert the key back to str? Granted, it doesn't handle the case where the name is an unknown encoding; but this PR doesn't handle that case either.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this change lets us piggyback on some existing handling of non-UTF8, then that may be a valid reason but should be documented in a comment.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is the goal of the MR.

I'm not sure how a good idea it is to save the file name as a string. String are UTF-8, the zip specs doesn't enforce that.

I also think that a futur MR would be needed to change the way we are handling the file_name in ZipFileData

Two points:

  • the file name raw is duplicated between the key and the ZipFileData, I think this is removable
  • the file name (non-raw) is also saved in zipfiledata. I don't think that's a good way to do it. But this type of change would mean a breaking change (method using file name will return a results, and the UTF-8 string will be lazy computed on call, not saved in the struct)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll probably have to take another look at this situation on the weekend.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the raw filename would probably require the values in the files-by-name map to borrow the key, assuming that's even possible (although if we went back to an IndexMap we could store the index instead), so that a reverse lookup would be possible. Is that what you're suggesting?

@Its-Just-Nans Its-Just-Nans Apr 21, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the raw filename would probably require the values in the files-by-name map to borrow the key, assuming that's even possible (although if we went back to an IndexMap we could store the index instead), so that a reverse lookup would be possible. Is that what you're suggesting?

Yes kinda that will be this https://github.com/zip-rs/zip2/compare/only-file-name?expand=1 which is a MR on top of this current branch

ZipFileData will become (file_name_raw, ZipFileDataWithout_file_name_raw)

Comment thread src/read/zip_archive.rs
@Pr0methean Pr0methean added this pull request to the merge queue Apr 23, 2026
Merged via the queue into master with commit 0329d9d Apr 23, 2026
133 checks passed
@Pr0methean Pr0methean deleted the test-internal-index-bytes branch April 23, 2026 23:46
@Pr0methean Pr0methean mentioned this pull request Apr 23, 2026
@Its-Just-Nans Its-Just-Nans self-assigned this May 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants