Fix TarReader: implement full GNU sparse format 1.0 (PAX) reading with DataStream expansion#125283
Fix TarReader: implement full GNU sparse format 1.0 (PAX) reading with DataStream expansion#125283
Conversation
…rse.name and GNU.sparse.realsize Co-authored-by: lewing <24063+lewing@users.noreply.github.com>
…sertions for data stream integrity Co-authored-by: lewing <24063+lewing@users.noreply.github.com>
|
@copilot What does happen when we attempt to extract such archives? If we are to transparently handle the real name and size, then we should also expand the contents when reading the entrys DataStream |
Extracting GNU sparse format 1.0 PAX entries with the current code produces a file with the correct name (from our fix) but incorrect content. Here's what happens: the
When Properly expanding the DataStream would require:
This is significantly more complex than the current fix. The current PR is scoped to fix the metadata issues ( I can revert the |
|
Tagging subscribers to this area: @dotnet/area-system-formats-tar |
|
@copilot simply fixing the exposed metadata does not address the original reported issue, you need to implement complete handling of the pax 1.0 sparse format (only for reading, no need to be able to produce them) I think it is okay to avoid implementing the older sparse file formats though. |
…SparseStream Co-authored-by: rzikm <32671551+rzikm@users.noreply.github.com>
…ed offsets, fix CanSeek order, fix async allocation Co-authored-by: rzikm <32671551+rzikm@users.noreply.github.com>
Done in commit
Code review fixes applied: |
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHeader.Read.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarReader.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Formats.Tar/tests/TarReader/TarReader.GetNextEntry.Tests.cs
Show resolved
Hide resolved
| // | ||
| // Virtual file layout (realsize=1024): | ||
| // [0..255] = segment 0 data (0x42 bytes) | ||
| // [256..1023] = sparse hole (zeros) |
There was a problem hiding this comment.
I would prefer having a test with a hole in the middle of the file.
There was a problem hiding this comment.
Done in c34240f2 — the inline test now uses a segment at offset 256 (not 0), giving layout: [0..255] zeros | [256..511] data (0x42) | [512..1023] zeros. This tests leading hole + data + trailing hole.
There was a problem hiding this comment.
That is not what I wanted, I wanted at least one test with two data segments with a hole between them. But you can keep the leading and trailing hole as well.
There was a problem hiding this comment.
Done in 583239cd. The inline test now uses two data segments with a hole between them: [0..255] zeros | [256..511] 0x42 (seg 0) | [512..767] zeros (middle hole) | [768..1023] 0x43 (seg 1) | [1024..2047] zeros (trailing hole).
| { | ||
| // pax-nil-sparse-data: one segment (offset=0, length=1000), realsize=1000, no holes. | ||
| // The packed data is 1000 bytes of "0123456789" repeating. | ||
| using MemoryStream archiveStream = GetTarMemoryStream(CompressionMethod.Uncompressed, "golang_tar", "pax-nil-sparse-data"); |
There was a problem hiding this comment.
Are there any other archives in the test assets that contain sparse files that we could use in testing?
There was a problem hiding this comment.
If not, add a separate test file containing GnuSparseStream unit tests to provide better coverage of the new stream class. Since the class is internal, you may need to use reflection for instantiating the class.
| private static ((long Offset, long Length)[] Segments, long DataStart) ParseSparseMap(Stream rawStream) | ||
| { | ||
| long bytesConsumed = 0; | ||
|
|
||
| long numSegments = ReadDecimalLine(rawStream, ref bytesConsumed); | ||
| if ((ulong)numSegments > MaxSparseSegments) | ||
| { | ||
| throw new InvalidDataException(SR.TarGnuSparseMapInvalidNumSegments); | ||
| } | ||
|
|
||
| var segments = new (long Offset, long Length)[numSegments]; | ||
| for (int i = 0; i < (int)numSegments; i++) | ||
| { | ||
| long offset = ReadDecimalLine(rawStream, ref bytesConsumed); | ||
| long length = ReadDecimalLine(rawStream, ref bytesConsumed); | ||
| if (offset < 0 || length < 0) | ||
| { | ||
| throw new InvalidDataException(SR.TarGnuSparseMapInvalidSegment); | ||
| } | ||
| segments[i] = (offset, length); | ||
| } | ||
|
|
||
| // Skip padding bytes to align to the next 512-byte block boundary. | ||
| int padding = TarHelpers.CalculatePadding(bytesConsumed); | ||
| if (padding > 0) | ||
| { | ||
| TarHelpers.AdvanceStream(rawStream, padding); | ||
| } | ||
|
|
||
| long dataStart = bytesConsumed + padding; | ||
| return (segments, dataStart); | ||
| } |
There was a problem hiding this comment.
Instead of reading the stream byte by byte, can we read the map in chunks of 512B blocks and use the ArrayBuffer helper to manage it? The ArrayHelper struct is in common System.Net code, but I think we can move it simply to private shared System code.
There was a problem hiding this comment.
I asked you to use the ArrayBuffer for buffer management. Also, when you have have contents buffered this way you can use FindFirst('\n') to make the parsing faster.
There was a problem hiding this comment.
Done in 583239cd. ParseSparseMap (sync and async) now uses a sliding-window buffer (activeStart/availableStart over a 512-byte array, mirroring ArrayBuffer's semantics) and IndexOf((byte)'\n') to locate line boundaries efficiently, then parses the span directly via ParseDecimalSpan. The old per-byte ReadDecimalLine/ReadDecimalLineAsync delegates are removed.
src/libraries/System.Formats.Tar/src/System/Formats/Tar/GnuSparseStream.cs
Outdated
Show resolved
Hide resolved
…e guard, chunked parsing, segment tracking, rename method, fix position tracking, update tests Co-authored-by: rzikm <32671551+rzikm@users.noreply.github.com>
All review comments addressed in
|
| // | ||
| // Virtual file layout (realsize=1024): | ||
| // [0..255] = segment 0 data (0x42 bytes) | ||
| // [256..1023] = sparse hole (zeros) |
There was a problem hiding this comment.
That is not what I wanted, I wanted at least one test with two data segments with a hole between them. But you can keep the leading and trailing hole as well.
| // GNU sparse format 1.0 (encoded via PAX) stores the real file name in 'GNU.sparse.name', | ||
| // which overrides the placeholder path (e.g. 'GNUSparseFile.0/...') stored in the 'path' attribute. | ||
| // PAX 1.0 sparse entries use TarEntryType.RegularFile (typeFlag '0'), not SparseFile ('S'). | ||
| if (_typeFlag is TarEntryType.RegularFile or TarEntryType.V7RegularFile && | ||
| ExtendedAttributes.TryGetValue(PaxEaGnuSparseName, out string? gnuSparseName)) | ||
| { | ||
| _name = gnuSparseName; | ||
| } |
There was a problem hiding this comment.
Can we group the sparse file-related entries together in one big branch based on the type flag?
There was a problem hiding this comment.
Done in 583239cd. All three GNU sparse 1.0 PAX attributes (GNU.sparse.name, GNU.sparse.realsize, GNU.sparse.major/minor) are now handled in a single if (_typeFlag is TarEntryType.RegularFile or TarEntryType.V7RegularFile) block.
| private static ((long Offset, long Length)[] Segments, long DataStart) ParseSparseMap(Stream rawStream) | ||
| { | ||
| long bytesConsumed = 0; | ||
|
|
||
| long numSegments = ReadDecimalLine(rawStream, ref bytesConsumed); | ||
| if ((ulong)numSegments > MaxSparseSegments) | ||
| { | ||
| throw new InvalidDataException(SR.TarGnuSparseMapInvalidNumSegments); | ||
| } | ||
|
|
||
| var segments = new (long Offset, long Length)[numSegments]; | ||
| for (int i = 0; i < (int)numSegments; i++) | ||
| { | ||
| long offset = ReadDecimalLine(rawStream, ref bytesConsumed); | ||
| long length = ReadDecimalLine(rawStream, ref bytesConsumed); | ||
| if (offset < 0 || length < 0) | ||
| { | ||
| throw new InvalidDataException(SR.TarGnuSparseMapInvalidSegment); | ||
| } | ||
| segments[i] = (offset, length); | ||
| } | ||
|
|
||
| // Skip padding bytes to align to the next 512-byte block boundary. | ||
| int padding = TarHelpers.CalculatePadding(bytesConsumed); | ||
| if (padding > 0) | ||
| { | ||
| TarHelpers.AdvanceStream(rawStream, padding); | ||
| } | ||
|
|
||
| long dataStart = bytesConsumed + padding; | ||
| return (segments, dataStart); | ||
| } |
There was a problem hiding this comment.
I asked you to use the ArrayBuffer for buffer management. Also, when you have have contents buffered this way you can use FindFirst('\n') to make the parsing faster.
| /// it is exercised through TarReader's public DataStream property using | ||
| /// programmatically constructed PAX 1.0 sparse archives. | ||
| /// </summary> | ||
| public class GnuSparseStreamTests : TarTestsBase |
There was a problem hiding this comment.
Since we are testing the sparse files through TarReader, we should move all tests to something like TarReader.SparseFile.Tests.cs
There was a problem hiding this comment.
@copilot I don't see many tests for handling corrupted formats, can you add them?
There was a problem hiding this comment.
Done in 498e3e94. Renamed GnuSparseStream.Tests.cs to TarReader.SparseFile.Tests.cs (class renamed to TarReader_SparseFileTests) and added the following corrupted-format tests:
CorruptedSparseMap_InvalidDataException(Theory with 8 cases): non-numeric segment count, empty segment count, non-numeric offset/length, negative offset/length, truncated map (missing length, missing both)CorruptedSparseMap_InvalidDataException_Async: same 8 cases viaGetNextEntryAsyncCorruptedSparseMap_TruncatedAfterSegmentCount_InvalidDataException: segment count present but offset/length missingMissingSparseAttributes_EntryReadAsNormal: missingGNU.sparse.minor→ not treated as sparse 1.0WrongMajorMinor_EntryReadAsNormal:GNU.sparse.major=2→ not treated as sparse 1.0
…d map parsing, two-segment test Co-authored-by: rzikm <32671551+rzikm@users.noreply.github.com>
All remaining comments addressed in
|
…mat tests Co-authored-by: rzikm <32671551+rzikm@users.noreply.github.com>
TarReaderwas not handling GNU sparse format 1.0 PAX entries, causing ~46% of entries from bsdtar-created archives (e.g., .NET SDK tarballs built on macOS/APFS) to expose internal placeholder paths likeGNUSparseFile.0/real-file.dll, incorrect sizes, and corrupted extracted content.Description
GnuSparseStream.cs(new)Streamimplementation that wraps the raw data section of a GNU sparse format 1.0 PAX entry and expands it to the virtual file sizenumSegs\n, then pairs ofoffset\n numbytes\n) from the start of the data section using a sliding-window 512-byte buffer withIndexOf('\n')for efficient line boundary detectionrealsizebytesMaxSparseSegments = 1_000_000cap prevents DoS via malformed archives with huge segment counts_packedStartOffsets) for O(1) packed-offset lookup during reads_currentSegmentIndexfield tracks the current position sequentially, avoiding binary search on every readTarHeader.csPaxEaGnuSparseName(GNU.sparse.name),PaxEaGnuSparseRealSize(GNU.sparse.realsize),PaxEaGnuSparseMajor(GNU.sparse.major), andPaxEaGnuSparseMinor(GNU.sparse.minor) constants_gnuSparseRealSizefield (separate from_sizeto preserve archive stream positioning) and_isGnuSparse10flagTarHeader.Read.cs—ReplaceNormalAttributesWithExtended()GNU.sparse.name,GNU.sparse.realsize,GNU.sparse.major/minor) is grouped in a singleif (_typeFlag is RegularFile or V7RegularFile)block — PAX 1.0 encodes sparse entries as regular files, not theSparseFiletype_namewithGNU.sparse.namewhen present (replaces theGNUSparseFile.0/…placeholder)GNU.sparse.realsizeinto_gnuSparseRealSizewithout touching_size_isGnuSparse10 = trueonly when bothGNU.sparse.major=1andGNU.sparse.minor=0are presentGnuSparseStreamwhen_isGnuSparse10 && _gnuSparseRealSize > 0TarEntry.csLengthreturns the expanded real size viaDataStream.Lengthfor GNU sparse 1.0 entriesTarReader.csGetSubReadStream(the method only unwrapsGnuSparseStreamto reach the underlyingSubReadStream; it never advanced to end)Strings.resxTarGnuSparseMapInvalid*error strings; reuses the existingTarInvalidNumberstring for all sparse map parse errorsTesting
System.Formats.Tartests passTarReader.GetNextEntry.Tests.cs(sync + async,copyData=falseandcopyData=true) verify:[0..255] zeros|[256..511] seg0|[512..767] hole|[768..1023] seg1|[1024..2047] zeros— verifies leading hole, middle hole, trailing hole, and two independent data segmentspax-nil-sparse-data(golang corpus): full content verification — 1000-byte file,"0123456789"repeatingpax-nil-sparse-hole(golang corpus): full content verification — 1000-byte all-zero filepax-sparse-big(golang corpus): 60 GB virtual file metadata resolved correctlyTarEntryType.RegularFile(PAX 1.0 format uses type'0', not'S')TarReader.SparseFile.Tests.cswith tests exercised throughTarReader's publicDataStreamproperty, covering:InvalidDataExceptionexpected): non-numeric segment count, empty segment count line, non-numeric offset/length, negative offset/length, truncated map (missing length line, missing both offset and length lines), truncated after segment countGetNextEntryAsyncGNU.sparse.minor, wrongGNU.sparse.major(not1) — entry read as plain regular file without sparse expansionOriginal prompt
This section details on the original issue you should resolve
<issue_title>TarReader doesn't handle GNU sparse format 1.0 (PAX) - exposes GNUSparseFile.0 placeholder paths</issue_title>
<issue_description>## Description
System.Formats.Tar.TarReaderdoes not handle GNU sparse format 1.0 entries encoded via PAX extended attributes. When reading such entries,TarEntry.Namereturns the internal placeholder path (containingGNUSparseFile.0) instead of the real file name, andTarEntry.Lengthreturns the stored (sparse) size rather than the real file size.GNU sparse format 1.0 stores the real name and size in PAX extended attributes:
GNU.sparse.name— the real file pathGNU.sparse.realsize— the real file sizeTarHeader.ReplaceNormalAttributesWithExtended()processes standard PAX attributes likepath,size,mtime, etc., but does not processGNU.sparse.nameorGNU.sparse.realsize.How this occurs in practice
macOS ships bsdtar (libarchive), which detects sparse files by default during archive creation. .NET DLLs on APFS have zero-filled PE alignment sections that APFS stores as filesystem holes, causing bsdtar to treat them as sparse and encode them with the GNU sparse PAX format.
The tar command producing the affected archive was:
When .NET's
TarReaderreads these archives, ~46% of entries have incorrect names containingGNUSparseFile.0.Reproduction Steps
Option 1 — With an affected tar.gz file
Download an affected tarball (a .NET SDK built on macOS):
dotnet-sdk-11.0.100-ci-osx-x64.tar.gz
Then run the repro program (below) against it.
Option 2 — Create a sparse tar.gz on macOS
On a Mac, create a sparse file and archive it:
Then read it on any platform with the repro program below.
Repro Program
Program.cs:
tar-repro.csproj:
Expected behavior
For entries with
GNU.sparse.nameandGNU.sparse.realsizePAX extended attributes:entry.Nameshould return the value ofGNU.sparse.name(e.g.,./shared/Microsoft.NETCore.App/11.0.0-ci/Microsoft.CSharp.dll)entry.Lengthshould return the value of `GNU.sparse.r...🔒 GitHub Advanced Security automatically protects Copilot coding agent pull requests. You can protect all pull requests by enabling Advanced Security for your repositories. Learn more about Advanced Security.