fix: avoid scanning through all local file headers when opening an archive#281
Conversation
|
|
||
| let data_start = find_data_start(&file, reader)?; | ||
|
|
||
| if data_start > central_directory.directory_start { |
There was a problem hiding this comment.
Shouldn't we still check for this eventually?
There was a problem hiding this comment.
The question is what the ultimate purpose of this check is. It looks a bit like a conservative check of some specification requirement.
But is it require for correctness / ruling out weird edge cases? After all no tests fail after removing this check.
In some way, being able to do this check runs counter to the idea of this PR of not having to scan through the file for random access. I.e. even if we defer, e.g. to the below ZipFileData.data_start or move the check into find_data_start, the random access use case I have in mind here will never execute the check.
There was a problem hiding this comment.
IIRC it does relate to edge cases that have come up in fuzzing, such as when archives are concatenated or nested or the magic bytes occur in filenames. Even if the spec is ambiguous in those cases about which one to extract, which I'm pretty sure is true for concatenation when the second one is under 64KiB, we should still consistently choose one or the other.
Pr0methean
left a comment
There was a problem hiding this comment.
Sounds good in principle, but I don't like the idea of totally removing the validation when we could just defer it.
Signed-off-by: Chris Hennick <4961925+Pr0methean@users.noreply.github.com>
|
@Pr0methean Just a suggestion: If you can't find a way to add the check back without incurring a performance penalty, maybe it could be made optional with a parameter on the |
Fixes #280
The idea is to make
ZipFileData.data_startbe calculated lazy to avoid accessing all local file headers already when opening a file. This required a change to the signature ofZipFileData.data_start()(which seems to be non-public after all).