
Feature/zip archive forward read #126646

Draft
alinpahontu2912 wants to merge 3 commits into dotnet:main from alinpahontu2912:feature/zip-archive-forward-read

Conversation

@alinpahontu2912
Member

Fixes #1550

Experiment with a new forward-only reading mode for ZipArchive that relies on the local file headers of entries instead of the central directory, avoiding loading everything into memory at once.
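For reference, consumption would look roughly like this (a sketch against the PR's proposed, unreleased API surface; `ZipArchiveMode.ForwardRead` and `GetNextEntry` are from this diff, while the extraction helper and its path handling are purely illustrative and do no zip-slip sanitization):

```csharp
using System.IO;
using System.IO.Compression;

// Sketch against the PR's proposed API (not in any shipped release):
// entries arrive strictly in stream order, and each entry must be consumed
// (or abandoned for the archive to drain) before the next one is available.
static void ExtractForwardOnly(Stream source, string destinationDir)
{
    using var archive = new ZipArchive(source, ZipArchiveMode.ForwardRead, leaveOpen: false);
    while (archive.GetNextEntry() is ZipArchiveEntry entry)
    {
        if (entry.FullName.EndsWith("/")) continue; // directory entry
        using Stream data = entry.Open();           // openable at most once per entry
        using FileStream dst = File.Create(Path.Combine(destinationDir, entry.Name));
        data.CopyTo(dst);
    }
}
```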

alinpahontu2912 and others added 3 commits April 7, 2026 16:01
Adds a new ForwardRead mode to ZipArchive that enables forward-only
sequential reading of ZIP entries from non-seekable streams, using
local file headers instead of the central directory.

Changes:
- ZipArchiveMode.cs: Add ForwardRead = 3 enum value
- ZipCustomStreams.cs: Add BoundedReadOnlyStream and ReadAheadStream
  helper stream classes
- ZipArchive.cs: Add GetNextEntry()/GetNextEntryAsync() methods,
  ForwardRead constructor case, ValidateMode/DecideArchiveStream
  support, data descriptor parsing, and property guards
- ZipArchive.Async.cs: Add ForwardRead cases to CreateAsync and
  DisposeAsyncCore
- ZipArchiveEntry.cs: Add forward-read constructor, ForwardReadDataStream
  property, UpdateFromDataDescriptor method, OpenInForwardReadMode,
  and property setter guards
- Strings.resx: Add ForwardRead error message strings
- ref/System.IO.Compression.cs: Add public API surface
- Tests: Add comprehensive zip_ForwardReadTests covering deflate,
  stored, data descriptors, non-seekable streams, empty archives,
  partial reads, error cases, and async operations

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

This PR introduces an experimental forward-only streaming read mode for System.IO.Compression.ZipArchive to support sequential iteration over ZIP entries via local file headers (instead of loading the central directory up-front), including async iteration and new unit tests.

Changes:

  • Add ZipArchiveMode.ForwardRead plus public ZipArchive.GetNextEntry() / GetNextEntryAsync(...) APIs for forward-only entry iteration.
  • Implement forward-read parsing and stream wrappers (ReadAheadStream, BoundedReadOnlyStream) to enable non-seekable streaming scenarios.
  • Add new unit tests validating basic ForwardRead behavior and unsupported operations.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
src/libraries/System.IO.Compression/tests/ZipArchive/zip_ForwardReadTests.cs Adds new test coverage for ForwardRead iteration, sync/async behavior, and unsupported APIs.
src/libraries/System.IO.Compression/tests/System.IO.Compression.Tests.csproj Includes the new ForwardRead test file in the test project.
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipCustomStreams.cs Adds BoundedReadOnlyStream and ReadAheadStream used by ForwardRead mode.
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchiveMode.cs Introduces ZipArchiveMode.ForwardRead enum value with documentation.
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchiveEntry.cs Adds ForwardRead entry initialization and Open() support for ForwardRead mode.
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs Implements ForwardRead iteration, local header parsing, data-descriptor handling, and non-seekable wrapping.
src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.Async.cs Enables ForwardRead setup for async factory and disposal paths.
src/libraries/System.IO.Compression/src/Resources/Strings.resx Adds new SR strings for ForwardRead error messages.
src/libraries/System.IO.Compression/ref/System.IO.Compression.cs Updates public surface area (new mode + new ZipArchive methods).

Comment on lines +546 to +552
private static Stream CreateForwardReadDecompressor(Stream source, ZipCompressionMethod compressionMethod, long uncompressedSize, bool leaveOpen)
{
    return compressionMethod switch
    {
        ZipCompressionMethod.Deflate when leaveOpen => new DeflateStream(source, CompressionMode.Decompress, leaveOpen: true),
        ZipCompressionMethod.Deflate => new DeflateStream(source, CompressionMode.Decompress, uncompressedSize),
        ZipCompressionMethod.Deflate64 => new DeflateManagedStream(source, ZipCompressionMethod.Deflate64, uncompressedSize),

Copilot AI Apr 8, 2026


CreateForwardReadDecompressor ignores the leaveOpen intent for Deflate64: it always constructs DeflateManagedStream(source, ...), and DeflateManagedStream.Dispose() always disposes its underlying stream. In the data-descriptor path you pass _archiveStream with leaveOpen: true, so draining/disposing an entry will close the archive stream and prevent reading subsequent entries. Consider wrapping source in a non-disposing wrapper when leaveOpen is true (or adding a leaveOpen option to DeflateManagedStream).

Suggested change
private static Stream CreateForwardReadDecompressor(Stream source, ZipCompressionMethod compressionMethod, long uncompressedSize, bool leaveOpen)
{
    return compressionMethod switch
    {
        ZipCompressionMethod.Deflate when leaveOpen => new DeflateStream(source, CompressionMode.Decompress, leaveOpen: true),
        ZipCompressionMethod.Deflate => new DeflateStream(source, CompressionMode.Decompress, uncompressedSize),
        ZipCompressionMethod.Deflate64 => new DeflateManagedStream(source, ZipCompressionMethod.Deflate64, uncompressedSize),

private sealed class NonDisposingStream : Stream
{
    private readonly Stream _innerStream;

    internal NonDisposingStream(Stream innerStream)
    {
        _innerStream = innerStream;
    }

    public override bool CanRead => _innerStream.CanRead;
    public override bool CanSeek => _innerStream.CanSeek;
    public override bool CanWrite => _innerStream.CanWrite;
    public override long Length => _innerStream.Length;
    public override long Position
    {
        get => _innerStream.Position;
        set => _innerStream.Position = value;
    }

    public override void Flush() => _innerStream.Flush();
    public override int Read(byte[] buffer, int offset, int count) => _innerStream.Read(buffer, offset, count);
    public override int Read(Span<byte> buffer) => _innerStream.Read(buffer);
    public override ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken cancellationToken = default) =>
        _innerStream.ReadAsync(buffer, cancellationToken);
    public override long Seek(long offset, SeekOrigin origin) => _innerStream.Seek(offset, origin);
    public override void SetLength(long value) => _innerStream.SetLength(value);
    public override void Write(byte[] buffer, int offset, int count) => _innerStream.Write(buffer, offset, count);
    public override void Write(ReadOnlySpan<byte> buffer) => _innerStream.Write(buffer);
    public override ValueTask WriteAsync(ReadOnlyMemory<byte> buffer, CancellationToken cancellationToken = default) =>
        _innerStream.WriteAsync(buffer, cancellationToken);

    protected override void Dispose(bool disposing)
    {
    }

    public override ValueTask DisposeAsync() => default;
}

private static Stream CreateForwardReadDecompressor(Stream source, ZipCompressionMethod compressionMethod, long uncompressedSize, bool leaveOpen)
{
    return compressionMethod switch
    {
        ZipCompressionMethod.Deflate when leaveOpen => new DeflateStream(source, CompressionMode.Decompress, leaveOpen: true),
        ZipCompressionMethod.Deflate => new DeflateStream(source, CompressionMode.Decompress, uncompressedSize),
        ZipCompressionMethod.Deflate64 => new DeflateManagedStream(leaveOpen ? new NonDisposingStream(source) : source, ZipCompressionMethod.Deflate64, uncompressedSize),

Comment on lines +565 to +566
_forwardReadReachedEnd = true;
return null;

Copilot AI Apr 8, 2026


ReadNextLocalFileHeader treats any short read (< local header size) as end-of-archive and returns null. If the stream ends mid-header (bytesRead > 0 but < HeaderSize), that indicates a truncated/corrupt zip and should throw InvalidDataException rather than silently stopping, otherwise corruption can be misinterpreted as a clean end.

Suggested change
_forwardReadReachedEnd = true;
return null;
if (bytesRead == 0)
{
_forwardReadReachedEnd = true;
return null;
}
throw new InvalidDataException(SR.ForwardReadInvalidLocalFileHeader);

Comment on lines +584 to +594
private async ValueTask<ZipArchiveEntry?> ReadNextLocalFileHeaderAsync(CancellationToken cancellationToken)
{
    const int HeaderSize = ZipLocalFileHeader.SizeOfLocalHeader;
    byte[] header = new byte[HeaderSize];

    int bytesRead = await _archiveStream.ReadAtLeastAsync(header, HeaderSize, throwOnEndOfStream: false, cancellationToken).ConfigureAwait(false);
    if (bytesRead < HeaderSize)
    {
        _forwardReadReachedEnd = true;
        return null;
    }

Copilot AI Apr 8, 2026


Same as the sync path: if ReadAtLeastAsync returns a partial local header (0 < bytesRead < HeaderSize), that should be treated as corrupted/truncated data and throw, not as an empty tail that returns null.

Comment on lines +670 to +700
// Build the data stream
Stream? dataStream = null;

bool isDirectory = fullName.Length > 0 &&
    (fullName[^1] == '/' || fullName[^1] == '\\');
bool isEmptyEntry = !hasDataDescriptor && compressedSize == 0 && uncompressedSize == 0;

if (!isDirectory && !isEmptyEntry)
{
    if (hasDataDescriptor)
    {
        // Data descriptor: unknown size, let DeflateStream detect end.
        // Because ReadAheadStream.CanSeek returns true, DeflateStream will
        // automatically rewind unconsumed bytes after decompression finishes,
        // leaving the archive stream positioned right after the compressed data.
        Stream decompressor = CreateForwardReadDecompressor(_archiveStream, compressionMethod, -1, leaveOpen: true);
        dataStream = new CrcValidatingReadStream(decompressor, expectedCrc: 0, expectedLength: long.MaxValue);
    }
    else if (isEncrypted)
    {
        // Encrypted without data descriptor: return bounded raw stream (no decryption)
        dataStream = new BoundedReadOnlyStream(_archiveStream, compressedSize);
    }
    else
    {
        // Known size, not encrypted
        Stream bounded = new BoundedReadOnlyStream(_archiveStream, compressedSize);
        Stream decompressor = CreateForwardReadDecompressor(bounded, compressionMethod, uncompressedSize, leaveOpen: false);
        dataStream = new CrcValidatingReadStream(decompressor, crc32, uncompressedSize);
    }
}

Copilot AI Apr 8, 2026


isEmptyEntry causes dataStream to remain null, which makes ZipArchiveEntry.Open() throw for valid zero-length file entries (it hits _forwardReadDataStream is null). ForwardRead should allow opening an empty file and return an empty readable stream (e.g., a 0-length bounded stream / Stream.Null) so callers can handle empty entries consistently.
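A cheap way to satisfy that (illustrative only, not from the PR) is to hand back `Stream.Null`, or a zero-length `BoundedReadOnlyStream`, for empty file entries, since `Stream.Null` already behaves as a clean, disposable, empty readable stream:

```csharp
using System;
using System.IO;

// Stream.Null is a valid empty readable stream a zero-length ForwardRead
// entry could hand back from Open(): reads report end-of-stream immediately
// and Dispose is a no-op.
Stream empty = Stream.Null;
byte[] buffer = new byte[16];
int read = empty.Read(buffer, 0, buffer.Length);
Console.WriteLine(read); // prints 0: clean end-of-stream, no exception
```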

Comment on lines +685 to +686
Stream decompressor = CreateForwardReadDecompressor(_archiveStream, compressionMethod, -1, leaveOpen: true);
dataStream = new CrcValidatingReadStream(decompressor, expectedCrc: 0, expectedLength: long.MaxValue);

Copilot AI Apr 8, 2026


In the data-descriptor path you wrap the decompressor with new CrcValidatingReadStream(decompressor, expectedCrc: 0, expectedLength: long.MaxValue), but later ReadDataDescriptor only updates fields on ZipArchiveEntry — it doesn’t (and can’t) update the already-constructed CrcValidatingReadStream. As a result, CRC/length are effectively never validated for data-descriptor entries in ForwardRead. Consider a stream that defers validation until after the descriptor is parsed (or validates against the parsed descriptor during draining).

Suggested change
Stream decompressor = CreateForwardReadDecompressor(_archiveStream, compressionMethod, -1, leaveOpen: true);
dataStream = new CrcValidatingReadStream(decompressor, expectedCrc: 0, expectedLength: long.MaxValue);
// Do not wrap this in CrcValidatingReadStream with placeholder values:
// the actual CRC and length are only available after the descriptor is parsed.
Stream decompressor = CreateForwardReadDecompressor(_archiveStream, compressionMethod, -1, leaveOpen: true);
dataStream = decompressor;
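In the spirit of that comment, validation can be deferred by having the wrapper merely accumulate state while reading, with the archive checking the totals only after the descriptor has been parsed. A minimal length-only sketch (hypothetical `CountingReadStream`, not part of the PR; a full version would also fold the bytes through CRC-32):

```csharp
using System;
using System.IO;

var counting = new CountingReadStream(new MemoryStream(new byte[123]));
counting.CopyTo(Stream.Null); // drain, as DrainPreviousEntry would
Console.WriteLine(counting.BytesRead); // prints 123

// After parsing the data descriptor, the archive could then validate:
//   if (counting.BytesRead != descriptorUncompressedSize)
//       throw new InvalidDataException(...);

// Read-only wrapper that counts bytes; validation happens later, outside it.
sealed class CountingReadStream : Stream
{
    private readonly Stream _inner;
    public long BytesRead { get; private set; }

    public CountingReadStream(Stream inner) => _inner = inner;

    public override int Read(byte[] buffer, int offset, int count)
    {
        int n = _inner.Read(buffer, offset, count);
        BytesRead += n;
        return n;
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position { get => BytesRead; set => throw new NotSupportedException(); }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}
```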

Comment on lines +420 to +435
private void DrainPreviousEntry()
{
    if (_forwardReadPreviousEntry is not { } prev)
        return;

    Stream? dataStream = prev.ForwardReadDataStream;
    if (dataStream != null)
    {
        dataStream.CopyTo(Stream.Null);
        dataStream.Dispose();
    }

    if (prev.HasDataDescriptor)
    {
        ReadDataDescriptor(prev);
    }

Copilot AI Apr 8, 2026


DrainPreviousEntry always drains by copying prev.ForwardReadDataStream to Stream.Null. For known-size deflate entries this forces full decompression just to advance to the next entry, even when the consumer intended to skip reading the entry. In ForwardRead mode, skipping should be able to discard the compressed bytes directly when the compressed size is known (and the entry stream was never opened), to avoid O(n) decompression overhead.
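A possible shape for that fast path (hypothetical helper, assuming the compressed size came from the local header and the entry stream was never handed out):

```csharp
using System;
using System.IO;

// Skip forward by discarding raw compressed bytes rather than running them
// through the decompressor. Buffer size is arbitrary.
static void SkipRawBytes(Stream source, long count)
{
    byte[] buffer = new byte[8192];
    while (count > 0)
    {
        int n = source.Read(buffer, 0, (int)Math.Min(buffer.Length, count));
        if (n == 0)
            throw new EndOfStreamException("archive truncated while skipping entry data");
        count -= n;
    }
}

// Demo: skip 15,000 of 20,000 bytes without interpreting them.
var stream = new MemoryStream(new byte[20_000]);
SkipRawBytes(stream, 15_000);
Console.WriteLine(stream.Position); // prints 15000
```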

Comment on lines +919 to +930
private WrappedStream OpenInForwardReadMode()
{
    if (_forwardReadDataStream is null)
        throw new InvalidDataException(SR.LocalFileHeaderCorrupt);
    if (_forwardReadStreamOpened)
        throw new IOException(SR.ForwardReadOnly);

    _forwardReadStreamOpened = true;

    // Wrap so user disposal does not close our internal data stream.
    // DrainPreviousEntry will drain and dispose _forwardReadDataStream itself.
    return new WrappedStream(_forwardReadDataStream, closeBaseStream: false);

Copilot AI Apr 8, 2026


OpenInForwardReadMode throws IOException(SR.ForwardReadOnly) when Open() is called twice. That resource string reads like a mode-level restriction (“operation not supported in ForwardRead mode”), which is misleading for the actual failure (the entry stream can only be opened once). Consider using a more specific message (and/or a distinct SR) so callers can diagnose the issue correctly.

Assert.NotNull(entry);
Assert.Equal("empty.txt", entry.FullName);
Assert.Equal(0, entry.CompressedLength);


Copilot AI Apr 8, 2026


ZeroLengthEntry_ReturnsEntryWithEmptyStream doesn’t actually open the entry stream or verify it’s empty, despite the test name. Adding an Open() + read assertion would catch regressions where empty entries aren’t openable in ForwardRead mode.

Suggested change
using (Stream dataStream = entry.Open())
{
byte[] decompressed = await ReadStreamFully(dataStream, async);
Assert.Empty(decompressed);
}

Comment on lines +382 to +399
public async Task Dispose_WhileEntryPartiallyRead_DoesNotThrow(bool async)
{
    byte[] zipBytes = CreateZipWithEntries(CompressionLevel.Optimal, seekable: true);

    using MemoryStream archiveStream = new(zipBytes);
    ZipArchive archive = new(archiveStream, ZipArchiveMode.ForwardRead, leaveOpen: true);

    ZipArchiveEntry? entry = await GetNextEntry(archive, async);
    Assert.NotNull(entry);

    // Partially read via Open()
    Stream ds = entry.Open();
    byte[] partial = new byte[5];
    await ReadStream(ds, partial, async);

    // Dispose should not throw
    archive.Dispose();
}

Copilot AI Apr 8, 2026


Dispose_WhileEntryPartiallyRead_DoesNotThrow leaves the entry stream (ds) undisposed. Even though this is a test, it can lead to nondeterministic resource cleanup and makes failures harder to diagnose. Consider disposing ds (e.g., using / DisposeAsync) while still keeping the intent of the test (disposing the archive while an entry stream is/was active).

using Stream dataStream = entry.Open();
byte[] decompressed = await ReadStreamFully(dataStream, async);
Assert.Equal(expectedContents[i], decompressed);
}

Copilot AI Apr 8, 2026


Read_StoredWithKnownSize_ReturnsUncompressedData iterates expected entries but doesn’t assert that GetNextEntry returns null afterward. Adding an end-of-archive assertion (as in the deflate test) would make the test stricter and help catch accidental extra entries / incorrect boundary handling.

Suggested change
}
}
ZipArchiveEntry? end = await GetNextEntry(archive, async);
Assert.Null(end);

@@ -147,6 +150,8 @@ protected virtual async ValueTask DisposeAsyncCore()
{
case ZipArchiveMode.Read:
break;
Member


Suggested change
break;

Comment on lines +411 to +418
    return GetNextEntryAsyncCore(cancellationToken);
}

private async ValueTask<ZipArchiveEntry?> GetNextEntryAsyncCore(CancellationToken cancellationToken)
{
    await DrainPreviousEntryAsync(cancellationToken).ConfigureAwait(false);
    return await ReadNextLocalFileHeaderAsync(cancellationToken).ConfigureAwait(false);
}
Member


nit: when a function is only ever called from that one place, we prefer a local function

Suggested change
return GetNextEntryAsyncCore(cancellationToken);
}

private async ValueTask<ZipArchiveEntry?> GetNextEntryAsyncCore(CancellationToken cancellationToken)
{
    await DrainPreviousEntryAsync(cancellationToken).ConfigureAwait(false);
    return await ReadNextLocalFileHeaderAsync(cancellationToken).ConfigureAwait(false);
}

    return GetNextEntryAsyncCore(cancellationToken);

    async ValueTask<ZipArchiveEntry?> GetNextEntryAsyncCore(CancellationToken cancellationToken)
    {
        await DrainPreviousEntryAsync(cancellationToken).ConfigureAwait(false);
        return await ReadNextLocalFileHeaderAsync(cancellationToken).ConfigureAwait(false);
    }
}

Stream? dataStream = prev.ForwardReadDataStream;
if (dataStream != null)
{
dataStream.CopyTo(Stream.Null);
Member


This would decompress the compressed data, can we avoid the unnecessary CPU work?

Comment on lines +432 to +435
if (prev.HasDataDescriptor)
{
ReadDataDescriptor(prev);
}
Member


I thought we don't support entries with DataDescriptors

// 32-bit no sig: crc(4) + comp(4) + uncomp(4) = 12 bytes
// 64-bit with sig: sig(4) + crc(4) + comp(8) + uncomp(8) = 24 bytes
// 64-bit no sig: crc(4) + comp(8) + uncomp(8) = 20 bytes
// Read the maximum (24 bytes), determine format, seek back the unused portion.
Member


The whole point of forward-only zip APIs is to consume ZIPs from unseekable streams, so we can't really seek here.
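One forward-only approach for the 32-bit case is to read a fixed 16-byte window and detect the optional signature by inspection, rather than over-reading and seeking back (sketch only; the Zip64 variants with 8-byte sizes would additionally need a look-ahead heuristic, e.g. checking whether the bytes after a candidate descriptor form a valid next-record signature, since the format carries no explicit size marker):

```csharp
using System;
using System.Buffers.Binary;

// Parse a 32-bit data descriptor without seeking: the leading signature
// "PK\x07\x08" is optional per APPNOTE 4.3.9.3, so probe for it first.
static (uint Crc, uint CompressedSize, uint UncompressedSize) ParseDescriptor32(ReadOnlySpan<byte> buf)
{
    const uint DataDescriptorSignature = 0x08074B50;
    int offset = BinaryPrimitives.ReadUInt32LittleEndian(buf) == DataDescriptorSignature ? 4 : 0;
    return (BinaryPrimitives.ReadUInt32LittleEndian(buf.Slice(offset)),
            BinaryPrimitives.ReadUInt32LittleEndian(buf.Slice(offset + 4)),
            BinaryPrimitives.ReadUInt32LittleEndian(buf.Slice(offset + 8)));
}

// Demo: descriptor with the optional signature present.
byte[] withSig = new byte[16];
BinaryPrimitives.WriteUInt32LittleEndian(withSig, 0x08074B50);           // signature
BinaryPrimitives.WriteUInt32LittleEndian(withSig.AsSpan(4), 0xCAFEBABE); // crc
BinaryPrimitives.WriteUInt32LittleEndian(withSig.AsSpan(8), 100);        // compressed size
BinaryPrimitives.WriteUInt32LittleEndian(withSig.AsSpan(12), 250);       // uncompressed size
var d = ParseDescriptor32(withSig);
Console.WriteLine($"{d.CompressedSize}/{d.UncompressedSize}"); // prints 100/250
```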

{
if (hasDataDescriptor)
{
// Data descriptor: unknown size, let DeflateStream detect end.
Member


This assumes that reading will use DeflateStream, but the file may be stored without compression, or with a different compression (Deflate64, or something else in the future)

Comment on lines +688 to +692
else if (isEncrypted)
{
// Encrypted without data descriptor: return bounded raw stream (no decryption)
dataStream = new BoundedReadOnlyStream(_archiveStream, compressedSize);
}
Member


Does this match the current behavior for encrypted entries?

FWIW, I would ignore encryption for now, and either revisit it once Password support lands in main, or (if this gets merged first) address it in the Password support PR

Comment on lines +695 to +698
// Known size, not encrypted
Stream bounded = new BoundedReadOnlyStream(_archiveStream, compressedSize);
Stream decompressor = CreateForwardReadDecompressor(bounded, compressionMethod, uncompressedSize, leaveOpen: false);
dataStream = new CrcValidatingReadStream(decompressor, crc32, uncompressedSize);
Member


I wonder if we should postpone DataStream creation and do it lazily like we do for the other entries.

Comment on lines +228 to +233
internal void UpdateFromDataDescriptor(uint crc32, long compressedSize, long uncompressedSize)
{
    _crc32 = crc32;
    _compressedSize = compressedSize;
    _uncompressedSize = uncompressedSize;
}
Member


The behavior that some properties update only after we completely drain the DataStream seems weird, to me it feels like another signal we should not support DataDescriptor entries in forward only mode, unless it really makes the feature unusable

/// Entries are read from local file headers instead of the central directory, enabling reading from non-seekable streams
/// without buffering the entire archive. The underlying stream must be readable but need not be seekable.
/// </summary>
ForwardRead = 3
Member


Suggested change
ForwardRead = 3
ForwardRead


Successfully merging this pull request may close these issues.

Add a Forward-only API for System.IO.Compression

3 participants