Feature/zip archive forward read #126646
Conversation
Adds a new ForwardRead mode to ZipArchive that enables forward-only sequential reading of ZIP entries from non-seekable streams, using local file headers instead of the central directory.

Changes:
- ZipArchiveMode.cs: Add `ForwardRead = 3` enum value
- ZipCustomStreams.cs: Add `BoundedReadOnlyStream` and `ReadAheadStream` helper stream classes
- ZipArchive.cs: Add `GetNextEntry()`/`GetNextEntryAsync()` methods, ForwardRead constructor case, ValidateMode/DecideArchiveStream support, data descriptor parsing, and property guards
- ZipArchive.Async.cs: Add ForwardRead cases to `CreateAsync` and `DisposeAsyncCore`
- ZipArchiveEntry.cs: Add forward-read constructor, `ForwardReadDataStream` property, `UpdateFromDataDescriptor` method, `OpenInForwardReadMode`, and property setter guards
- Strings.resx: Add ForwardRead error message strings
- ref/System.IO.Compression.cs: Add public API surface
- Tests: Add comprehensive zip_ForwardReadTests covering deflate, stored, data descriptors, non-seekable streams, empty archives, partial reads, error cases, and async operations

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR introduces an experimental forward-only streaming read mode for System.IO.Compression.ZipArchive to support sequential iteration over ZIP entries via local file headers (instead of loading the central directory up-front), including async iteration and new unit tests.
Changes:
- Add `ZipArchiveMode.ForwardRead` plus public `ZipArchive.GetNextEntry()`/`GetNextEntryAsync(...)` APIs for forward-only entry iteration.
- Implement forward-read parsing and stream wrappers (`ReadAheadStream`, `BoundedReadOnlyStream`) to enable non-seekable streaming scenarios.
- Add new unit tests validating basic ForwardRead behavior and unsupported operations.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| src/libraries/System.IO.Compression/tests/ZipArchive/zip_ForwardReadTests.cs | Adds new test coverage for ForwardRead iteration, sync/async behavior, and unsupported APIs. |
| src/libraries/System.IO.Compression/tests/System.IO.Compression.Tests.csproj | Includes the new ForwardRead test file in the test project. |
| src/libraries/System.IO.Compression/src/System/IO/Compression/ZipCustomStreams.cs | Adds BoundedReadOnlyStream and ReadAheadStream used by ForwardRead mode. |
| src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchiveMode.cs | Introduces ZipArchiveMode.ForwardRead enum value with documentation. |
| src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchiveEntry.cs | Adds ForwardRead entry initialization and Open() support for ForwardRead mode. |
| src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs | Implements ForwardRead iteration, local header parsing, data-descriptor handling, and non-seekable wrapping. |
| src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.Async.cs | Enables ForwardRead setup for async factory and disposal paths. |
| src/libraries/System.IO.Compression/src/Resources/Strings.resx | Adds new SR strings for ForwardRead error messages. |
| src/libraries/System.IO.Compression/ref/System.IO.Compression.cs | Updates public surface area (new mode + new ZipArchive methods). |
```csharp
private static Stream CreateForwardReadDecompressor(Stream source, ZipCompressionMethod compressionMethod, long uncompressedSize, bool leaveOpen)
{
    return compressionMethod switch
    {
        ZipCompressionMethod.Deflate when leaveOpen => new DeflateStream(source, CompressionMode.Decompress, leaveOpen: true),
        ZipCompressionMethod.Deflate => new DeflateStream(source, CompressionMode.Decompress, uncompressedSize),
        ZipCompressionMethod.Deflate64 => new DeflateManagedStream(source, ZipCompressionMethod.Deflate64, uncompressedSize),
```
CreateForwardReadDecompressor ignores the leaveOpen intent for Deflate64: it always constructs DeflateManagedStream(source, ...), and DeflateManagedStream.Dispose() always disposes its underlying stream. In the data-descriptor path you pass _archiveStream with leaveOpen: true, so draining/disposing an entry will close the archive stream and prevent reading subsequent entries. Consider wrapping source in a non-disposing wrapper when leaveOpen is true (or adding a leaveOpen option to DeflateManagedStream).
Suggested change:

```csharp
private sealed class NonDisposingStream : Stream
{
    private readonly Stream _innerStream;

    internal NonDisposingStream(Stream innerStream)
    {
        _innerStream = innerStream;
    }

    public override bool CanRead => _innerStream.CanRead;
    public override bool CanSeek => _innerStream.CanSeek;
    public override bool CanWrite => _innerStream.CanWrite;
    public override long Length => _innerStream.Length;
    public override long Position
    {
        get => _innerStream.Position;
        set => _innerStream.Position = value;
    }

    public override void Flush() => _innerStream.Flush();
    public override int Read(byte[] buffer, int offset, int count) => _innerStream.Read(buffer, offset, count);
    public override int Read(Span<byte> buffer) => _innerStream.Read(buffer);
    public override ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken cancellationToken = default) =>
        _innerStream.ReadAsync(buffer, cancellationToken);
    public override long Seek(long offset, SeekOrigin origin) => _innerStream.Seek(offset, origin);
    public override void SetLength(long value) => _innerStream.SetLength(value);
    public override void Write(byte[] buffer, int offset, int count) => _innerStream.Write(buffer, offset, count);
    public override void Write(ReadOnlySpan<byte> buffer) => _innerStream.Write(buffer);
    public override ValueTask WriteAsync(ReadOnlyMemory<byte> buffer, CancellationToken cancellationToken = default) =>
        _innerStream.WriteAsync(buffer, cancellationToken);

    protected override void Dispose(bool disposing)
    {
    }

    public override ValueTask DisposeAsync() => default;
}

private static Stream CreateForwardReadDecompressor(Stream source, ZipCompressionMethod compressionMethod, long uncompressedSize, bool leaveOpen)
{
    return compressionMethod switch
    {
        ZipCompressionMethod.Deflate when leaveOpen => new DeflateStream(source, CompressionMode.Decompress, leaveOpen: true),
        ZipCompressionMethod.Deflate => new DeflateStream(source, CompressionMode.Decompress, uncompressedSize),
        ZipCompressionMethod.Deflate64 => new DeflateManagedStream(leaveOpen ? new NonDisposingStream(source) : source, ZipCompressionMethod.Deflate64, uncompressedSize),
```
```csharp
_forwardReadReachedEnd = true;
return null;
```
ReadNextLocalFileHeader treats any short read (< local header size) as end-of-archive and returns null. If the stream ends mid-header (bytesRead > 0 but < HeaderSize), that indicates a truncated/corrupt zip and should throw InvalidDataException rather than silently stopping, otherwise corruption can be misinterpreted as a clean end.
Suggested change:

```csharp
if (bytesRead == 0)
{
    _forwardReadReachedEnd = true;
    return null;
}

throw new InvalidDataException(SR.ForwardReadInvalidLocalFileHeader);
```
```csharp
private async ValueTask<ZipArchiveEntry?> ReadNextLocalFileHeaderAsync(CancellationToken cancellationToken)
{
    const int HeaderSize = ZipLocalFileHeader.SizeOfLocalHeader;
    byte[] header = new byte[HeaderSize];

    int bytesRead = await _archiveStream.ReadAtLeastAsync(header, HeaderSize, throwOnEndOfStream: false, cancellationToken).ConfigureAwait(false);
    if (bytesRead < HeaderSize)
    {
        _forwardReadReachedEnd = true;
        return null;
    }
```
Same as the sync path: if ReadAtLeastAsync returns a partial local header (0 < bytesRead < HeaderSize), that should be treated as corrupted/truncated data and throw, not as an empty tail that returns null.
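A standalone sketch of that distinction for a forward-only stream (illustrative only, not the PR's code; the helper name and exception message here are hypothetical):

```csharp
using System;
using System.IO;

// Distinguish a clean end-of-archive (zero bytes available) from a
// truncated header (some, but not enough, bytes available).
static byte[]? ReadHeaderOrNull(Stream stream, int headerSize)
{
    byte[] header = new byte[headerSize];
    int total = 0;
    while (total < headerSize)
    {
        int read = stream.Read(header, total, headerSize - total);
        if (read == 0)
            break;
        total += read;
    }

    if (total == 0)
        return null; // no bytes at all: clean end of archive
    if (total < headerSize)
        throw new InvalidDataException("Truncated local file header."); // corrupt/truncated zip
    return header;
}
```

The key point is that only `total == 0` is a legitimate end; any partial header surfaces as `InvalidDataException` instead of silently ending iteration.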
```csharp
// Build the data stream
Stream? dataStream = null;

bool isDirectory = fullName.Length > 0 &&
    (fullName[^1] == '/' || fullName[^1] == '\\');
bool isEmptyEntry = !hasDataDescriptor && compressedSize == 0 && uncompressedSize == 0;

if (!isDirectory && !isEmptyEntry)
{
    if (hasDataDescriptor)
    {
        // Data descriptor: unknown size, let DeflateStream detect end.
        // Because ReadAheadStream.CanSeek returns true, DeflateStream will
        // automatically rewind unconsumed bytes after decompression finishes,
        // leaving the archive stream positioned right after the compressed data.
        Stream decompressor = CreateForwardReadDecompressor(_archiveStream, compressionMethod, -1, leaveOpen: true);
        dataStream = new CrcValidatingReadStream(decompressor, expectedCrc: 0, expectedLength: long.MaxValue);
    }
    else if (isEncrypted)
    {
        // Encrypted without data descriptor: return bounded raw stream (no decryption)
        dataStream = new BoundedReadOnlyStream(_archiveStream, compressedSize);
    }
    else
    {
        // Known size, not encrypted
        Stream bounded = new BoundedReadOnlyStream(_archiveStream, compressedSize);
        Stream decompressor = CreateForwardReadDecompressor(bounded, compressionMethod, uncompressedSize, leaveOpen: false);
        dataStream = new CrcValidatingReadStream(decompressor, crc32, uncompressedSize);
    }
}
```
isEmptyEntry causes dataStream to remain null, which makes ZipArchiveEntry.Open() throw for valid zero-length file entries (it hits _forwardReadDataStream is null). ForwardRead should allow opening an empty file and return an empty readable stream (e.g., a 0-length bounded stream / Stream.Null) so callers can handle empty entries consistently.
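A minimal standalone sketch of the suggested behavior: a zero-length file entry hands back a valid, immediately-empty readable stream (`Stream.Null` here; the PR could equally use a 0-length `BoundedReadOnlyStream`):

```csharp
using System;
using System.IO;

// Hypothetical shape of the fix: empty file entries get a valid,
// immediately-exhausted stream instead of a null data stream.
static Stream CreateEmptyEntryStream()
{
    // Stream.Null is readable and returns 0 bytes on every Read,
    // so Open() on a zero-length entry behaves like any other entry.
    return Stream.Null;
}

Stream s = CreateEmptyEntryStream();
byte[] buffer = new byte[16];
int read = s.Read(buffer, 0, buffer.Length);
Console.WriteLine(read); // 0: the entry is empty but openable
```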
```csharp
Stream decompressor = CreateForwardReadDecompressor(_archiveStream, compressionMethod, -1, leaveOpen: true);
dataStream = new CrcValidatingReadStream(decompressor, expectedCrc: 0, expectedLength: long.MaxValue);
```
In the data-descriptor path you wrap the decompressor with new CrcValidatingReadStream(decompressor, expectedCrc: 0, expectedLength: long.MaxValue), but later ReadDataDescriptor only updates fields on ZipArchiveEntry — it doesn’t (and can’t) update the already-constructed CrcValidatingReadStream. As a result, CRC/length are effectively never validated for data-descriptor entries in ForwardRead. Consider a stream that defers validation until after the descriptor is parsed (or validates against the parsed descriptor during draining).
Suggested change:

```csharp
// Do not wrap this in CrcValidatingReadStream with placeholder values:
// the actual CRC and length are only available after the descriptor is parsed.
Stream decompressor = CreateForwardReadDecompressor(_archiveStream, compressionMethod, -1, leaveOpen: true);
dataStream = decompressor;
```
```csharp
private void DrainPreviousEntry()
{
    if (_forwardReadPreviousEntry is not { } prev)
        return;

    Stream? dataStream = prev.ForwardReadDataStream;
    if (dataStream != null)
    {
        dataStream.CopyTo(Stream.Null);
        dataStream.Dispose();
    }

    if (prev.HasDataDescriptor)
    {
        ReadDataDescriptor(prev);
    }
```
DrainPreviousEntry always drains by copying prev.ForwardReadDataStream to Stream.Null. For known-size deflate entries this forces full decompression just to advance to the next entry, even when the consumer intended to skip reading the entry. In ForwardRead mode, skipping should be able to discard the compressed bytes directly when the compressed size is known (and the entry stream was never opened), to avoid O(n) decompression overhead.
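A hedged sketch of skipping without decompression, assuming the compressed size is known from the local header and the entry stream was never opened (the helper name is illustrative, not the PR's):

```csharp
using System;
using System.IO;

// Discard exactly `compressedSize` raw bytes from a forward-only stream,
// without routing them through a decompressor.
static void SkipCompressedBytes(Stream archiveStream, long compressedSize)
{
    byte[] buffer = new byte[8192];
    long remaining = compressedSize;
    while (remaining > 0)
    {
        int toRead = (int)Math.Min(buffer.Length, remaining);
        int read = archiveStream.Read(buffer, 0, toRead);
        if (read == 0)
            throw new InvalidDataException("Archive truncated inside entry data.");
        remaining -= read;
    }
}
```

Entries with data descriptors still need the decompression drain (the compressed size is unknown until the descriptor), but the common known-size case avoids the CPU cost entirely.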
```csharp
private WrappedStream OpenInForwardReadMode()
{
    if (_forwardReadDataStream is null)
        throw new InvalidDataException(SR.LocalFileHeaderCorrupt);
    if (_forwardReadStreamOpened)
        throw new IOException(SR.ForwardReadOnly);

    _forwardReadStreamOpened = true;

    // Wrap so user disposal does not close our internal data stream.
    // DrainPreviousEntry will drain and dispose _forwardReadDataStream itself.
    return new WrappedStream(_forwardReadDataStream, closeBaseStream: false);
```
OpenInForwardReadMode throws IOException(SR.ForwardReadOnly) when Open() is called twice. That resource string reads like a mode-level restriction (“operation not supported in ForwardRead mode”), which is misleading for the actual failure (the entry stream can only be opened once). Consider using a more specific message (and/or a distinct SR) so callers can diagnose the issue correctly.
```csharp
Assert.NotNull(entry);
Assert.Equal("empty.txt", entry.FullName);
Assert.Equal(0, entry.CompressedLength);
```
ZeroLengthEntry_ReturnsEntryWithEmptyStream doesn’t actually open the entry stream or verify it’s empty, despite the test name. Adding an Open() + read assertion would catch regressions where empty entries aren’t openable in ForwardRead mode.
Suggested change:

```csharp
using (Stream dataStream = entry.Open())
{
    byte[] decompressed = await ReadStreamFully(dataStream, async);
    Assert.Empty(decompressed);
}
```
```csharp
public async Task Dispose_WhileEntryPartiallyRead_DoesNotThrow(bool async)
{
    byte[] zipBytes = CreateZipWithEntries(CompressionLevel.Optimal, seekable: true);

    using MemoryStream archiveStream = new(zipBytes);
    ZipArchive archive = new(archiveStream, ZipArchiveMode.ForwardRead, leaveOpen: true);

    ZipArchiveEntry? entry = await GetNextEntry(archive, async);
    Assert.NotNull(entry);

    // Partially read via Open()
    Stream ds = entry.Open();
    byte[] partial = new byte[5];
    await ReadStream(ds, partial, async);

    // Dispose should not throw
    archive.Dispose();
}
```
Dispose_WhileEntryPartiallyRead_DoesNotThrow leaves the entry stream (ds) undisposed. Even though this is a test, it can lead to nondeterministic resource cleanup and makes failures harder to diagnose. Consider disposing ds (e.g., using / DisposeAsync) while still keeping the intent of the test (disposing the archive while an entry stream is/was active).
```csharp
    using Stream dataStream = entry.Open();
    byte[] decompressed = await ReadStreamFully(dataStream, async);
    Assert.Equal(expectedContents[i], decompressed);
}
```
Read_StoredWithKnownSize_ReturnsUncompressedData iterates expected entries but doesn’t assert that GetNextEntry returns null afterward. Adding an end-of-archive assertion (as in the deflate test) would make the test stricter and help catch accidental extra entries / incorrect boundary handling.
Suggested change:

```csharp
    }
}

ZipArchiveEntry? end = await GetNextEntry(archive, async);
Assert.Null(end);
```
```csharp
@@ -147,6 +150,8 @@ protected virtual async ValueTask DisposeAsyncCore()
        {
            case ZipArchiveMode.Read:
                break;
```

```csharp
    return GetNextEntryAsyncCore(cancellationToken);
}

private async ValueTask<ZipArchiveEntry?> GetNextEntryAsyncCore(CancellationToken cancellationToken)
{
    await DrainPreviousEntryAsync(cancellationToken).ConfigureAwait(false);
    return await ReadNextLocalFileHeaderAsync(cancellationToken).ConfigureAwait(false);
}
```
nit: if the function is only ever called from that one place, we prefer local functions
Suggested change:

```csharp
    return GetNextEntryAsyncCore(cancellationToken);

    async ValueTask<ZipArchiveEntry?> GetNextEntryAsyncCore(CancellationToken cancellationToken)
    {
        await DrainPreviousEntryAsync(cancellationToken).ConfigureAwait(false);
        return await ReadNextLocalFileHeaderAsync(cancellationToken).ConfigureAwait(false);
    }
}
```
```csharp
Stream? dataStream = prev.ForwardReadDataStream;
if (dataStream != null)
{
    dataStream.CopyTo(Stream.Null);
```
This would decompress the compressed data, can we avoid the unnecessary CPU work?
```csharp
if (prev.HasDataDescriptor)
{
    ReadDataDescriptor(prev);
}
```
I thought we don't support entries with DataDescriptors
```csharp
// 32-bit no sig: crc(4) + comp(4) + uncomp(4) = 12 bytes
// 64-bit with sig: sig(4) + crc(4) + comp(8) + uncomp(8) = 24 bytes
// 64-bit no sig: crc(4) + comp(8) + uncomp(8) = 20 bytes
// Read the maximum (24 bytes), determine format, seek back the unused portion.
```
The whole point of forward-only zip APIs is to consume ZIPs from unseekable streams, so we can't really seek here.
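A strictly forward variant is possible if the zip64-ness is known up front from the local header's extra field: read 4 bytes, decide whether they are the optional signature, then read exactly the remaining fields. This is only a hedged sketch (not the PR's code), and it inherits the format's intrinsic ambiguity that a CRC value coincidentally equal to the signature is indistinguishable from a real signature:

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

// Parse a data descriptor strictly forward: no seeking back, no over-read.
// `isZip64` must be determined beforehand from the local header extra field.
static (uint Crc, long Compressed, long Uncompressed) ReadDataDescriptorForward(Stream s, bool isZip64)
{
    const uint DataDescriptorSignature = 0x08074B50;

    Span<byte> four = stackalloc byte[4];
    s.ReadExactly(four);
    uint first = BinaryPrimitives.ReadUInt32LittleEndian(four);

    uint crc;
    if (first == DataDescriptorSignature)
    {
        s.ReadExactly(four); // signature present: the next 4 bytes are the CRC
        crc = BinaryPrimitives.ReadUInt32LittleEndian(four);
    }
    else
    {
        crc = first; // no signature: the first 4 bytes were the CRC
    }

    // Sizes are 2x4 bytes (32-bit) or 2x8 bytes (zip64); read exactly that many.
    Span<byte> sizes = stackalloc byte[isZip64 ? 16 : 8];
    s.ReadExactly(sizes);
    long compressed = isZip64
        ? BinaryPrimitives.ReadInt64LittleEndian(sizes)
        : BinaryPrimitives.ReadUInt32LittleEndian(sizes);
    long uncompressed = isZip64
        ? BinaryPrimitives.ReadInt64LittleEndian(sizes.Slice(8))
        : BinaryPrimitives.ReadUInt32LittleEndian(sizes.Slice(4));
    return (crc, compressed, uncompressed);
}
```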
```csharp
{
    if (hasDataDescriptor)
    {
        // Data descriptor: unknown size, let DeflateStream detect end.
```
This assumes that reading will use DeflateStream, but the file may be stored without compression, or with a different compression (Deflate64, or something else in the future)
```csharp
else if (isEncrypted)
{
    // Encrypted without data descriptor: return bounded raw stream (no decryption)
    dataStream = new BoundedReadOnlyStream(_archiveStream, compressedSize);
}
```
Does this match the current behavior for encrypted entries?
FWIW, I would ignore encryption for now, and either revisit it once Password support lands in main, or (if this gets merged first) address it in the Password support PR
```csharp
// Known size, not encrypted
Stream bounded = new BoundedReadOnlyStream(_archiveStream, compressedSize);
Stream decompressor = CreateForwardReadDecompressor(bounded, compressionMethod, uncompressedSize, leaveOpen: false);
dataStream = new CrcValidatingReadStream(decompressor, crc32, uncompressedSize);
```
I wonder if we should postpone DataStream creation and do it lazily like we do for the other entries.
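If the lazy direction is pursued, the shape might look like this (a standalone, illustrative sketch with hypothetical names, not the entry machinery the other modes actually use):

```csharp
using System;
using System.IO;

// Defer stream construction until Open() is actually called, so entries
// that are skipped never pay for building the decompression pipeline.
sealed class LazyEntryStream
{
    private readonly Func<Stream> _factory;
    private Stream? _stream;

    public LazyEntryStream(Func<Stream> factory) => _factory = factory;

    // Only the first Open() invokes the factory; later calls reuse the stream.
    public Stream Open() => _stream ??= _factory();
}
```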
```csharp
internal void UpdateFromDataDescriptor(uint crc32, long compressedSize, long uncompressedSize)
{
    _crc32 = crc32;
    _compressedSize = compressedSize;
    _uncompressedSize = uncompressedSize;
}
```
The behavior that some properties only update after the DataStream is completely drained seems weird; to me it feels like another signal that we should not support DataDescriptor entries in forward-only mode, unless that really makes the feature unusable
```csharp
/// Entries are read from local file headers instead of the central directory, enabling reading from non-seekable streams
/// without buffering the entire archive. The underlying stream must be readable but need not be seekable.
/// </summary>
ForwardRead = 3
```
Suggested change:

```csharp
ForwardRead
```
Fixes #1550
Experiment with a new forward-only reading mode for ZipArchive that relies on the local headers of entries instead of the central directory, avoiding loading everything into memory at once