TarReader doesn't handle GNU sparse format 1.0 (PAX) - exposes GNUSparseFile.0 placeholder paths #125281

@mmitche

Description

System.Formats.Tar.TarReader does not handle GNU sparse format 1.0 entries encoded via PAX extended attributes. When reading such entries, TarEntry.Name returns the internal placeholder path (containing GNUSparseFile.0) instead of the real file name, and TarEntry.Length returns the stored (sparse) size rather than the real file size.

GNU sparse format 1.0 stores the real name and size in PAX extended attributes:

  • GNU.sparse.name — the real file path
  • GNU.sparse.realsize — the real file size

TarHeader.ReplaceNormalAttributesWithExtended() processes standard PAX attributes like path, size, mtime, etc., but does not process GNU.sparse.name or GNU.sparse.realsize.
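For context, PAX extended header records have the form `<length> <key>=<value>\n`, where `<length>` is the decimal byte count of the whole record, including the length digits and the trailing newline. The sketch below parses such records into a dictionary to show where GNU.sparse.name and GNU.sparse.realsize come from. The class and method names are hypothetical, and this is not a claim about how System.Formats.Tar parses them internally.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PaxRecords
{
    // Parses PAX extended header records of the form "<length> <key>=<value>\n".
    // Stops at the first malformed record instead of throwing.
    public static Dictionary<string, string> Parse(ReadOnlySpan<byte> data)
    {
        var result = new Dictionary<string, string>(StringComparer.Ordinal);
        while (!data.IsEmpty)
        {
            int space = data.IndexOf((byte)' ');
            if (space <= 0)
            {
                break;
            }

            // <length> counts every byte of the record, including itself.
            if (!int.TryParse(Encoding.ASCII.GetString(data[..space]), out int length)
                || length <= space + 1
                || length > data.Length)
            {
                break;
            }

            // Body is everything between the space and the trailing '\n'.
            ReadOnlySpan<byte> body = data.Slice(space + 1, length - space - 2);
            int eq = body.IndexOf((byte)'=');
            if (eq > 0)
            {
                string key = Encoding.UTF8.GetString(body[..eq]);
                string value = Encoding.UTF8.GetString(body[(eq + 1)..]);
                result[key] = value;
            }

            data = data[length..];
        }
        return result;
    }
}
```

For example, parsing the illustrative bytes `"30 GNU.sparse.name=sparse.bin\n31 GNU.sparse.realsize=1048582\n"` yields the two keys discussed above.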

How this occurs in practice

macOS ships bsdtar (libarchive), which detects sparse files by default during archive creation. .NET DLLs on APFS have zero-filled PE alignment sections that APFS stores as filesystem holes, causing bsdtar to treat them as sparse and encode them with the GNU sparse PAX format.

The tar command producing the affected archive was:

tar -cf - . | pigz > output.tar.gz

When .NET's TarReader reads the resulting archive, about 46% of entries (91 of 199 in the example output below) have incorrect names containing GNUSparseFile.0.

Reproduction Steps

Option 1 — With an affected tar.gz file

Download an affected tarball (a .NET SDK built on macOS):
dotnet-sdk-11.0.100-ci-osx-x64.tar.gz

Then run the repro program (below) against it.

Option 2 — Create a sparse tar.gz on macOS

On a Mac, create a sparse file and archive it:

# Create a file with sparse holes
dd if=/dev/zero of=sparse.bin bs=1 count=0 seek=1048576
echo "hello" >> sparse.bin

# Archive it (bsdtar detects sparse by default)
tar -czf sparse.tar.gz sparse.bin

Then read it on any platform with the repro program below.

Repro Program

Program.cs:

using System.Formats.Tar;
using System.IO.Compression;

if (args.Length == 0)
{
    Console.Error.WriteLine("Usage: dotnet run -- <path-to-tarball.tar.gz>");
    return 1;
}

string path = args[0];
if (!File.Exists(path))
{
    Console.Error.WriteLine($"File not found: {path}");
    return 1;
}

Console.WriteLine($"Reading: {path}");
Console.WriteLine();

int totalEntries = 0;
int sparseEntries = 0;

using FileStream fs = File.OpenRead(path);
using GZipStream gz = new(fs, CompressionMode.Decompress);
using TarReader reader = new(gz);

while (reader.GetNextEntry() is TarEntry entry)
{
    totalEntries++;

    if (entry is PaxTarEntry pax
        && pax.ExtendedAttributes.TryGetValue("GNU.sparse.name", out string? realName))
    {
        sparseEntries++;

        if (sparseEntries <= 5)
        {
            Console.WriteLine($"Entry #{totalEntries}:");
            Console.WriteLine($"  entry.Name (WRONG): {entry.Name}");
            Console.WriteLine($"  GNU.sparse.name   : {realName}");

            if (pax.ExtendedAttributes.TryGetValue("GNU.sparse.realsize", out string? realSize))
            {
                Console.WriteLine($"  entry.Length       : {entry.Length}");
                Console.WriteLine($"  GNU.sparse.realsize: {realSize}");
            }
            Console.WriteLine();
        }
    }
}

Console.WriteLine($"Total entries : {totalEntries}");
Console.WriteLine($"Sparse entries: {sparseEntries}");

if (sparseEntries > 0)
{
    Console.WriteLine();
    Console.WriteLine("BUG: TarReader exposes internal 'GNUSparseFile.0' placeholder paths");
    Console.WriteLine("     instead of using the real name from GNU.sparse.name.");
}

return sparseEntries > 0 ? 1 : 0;

tar-repro.csproj:

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net9.0</TargetFramework>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
  </PropertyGroup>
</Project>

Expected behavior

For entries with GNU.sparse.name and GNU.sparse.realsize PAX extended attributes:

  • entry.Name should return the value of GNU.sparse.name (e.g., ./shared/Microsoft.NETCore.App/11.0.0-ci/Microsoft.CSharp.dll)
  • entry.Length should return the value of GNU.sparse.realsize (e.g., 1115136)
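Until TarReader does this itself, a caller can approximate the expected values from the extended attributes. The following is a minimal sketch of that lookup. `GnuSparseMetadata` and `Resolve` are hypothetical names, not part of System.Formats.Tar, and the returned length is only the reported real size: it does not change how many bytes the entry's data stream actually yields.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class GnuSparseMetadata
{
    // Returns the name and size a caller would expect for an entry,
    // preferring the GNU sparse PAX attributes when they are present and valid.
    public static (string Name, long Length) Resolve(
        string storedName,
        long storedLength,
        IReadOnlyDictionary<string, string> extendedAttributes)
    {
        string name = storedName;
        long length = storedLength;

        if (extendedAttributes.TryGetValue("GNU.sparse.name", out string? realName)
            && !string.IsNullOrEmpty(realName))
        {
            name = realName;
        }

        // realsize is a plain decimal number; reject signs, spaces, and separators.
        if (extendedAttributes.TryGetValue("GNU.sparse.realsize", out string? realSize)
            && long.TryParse(realSize, NumberStyles.None, CultureInfo.InvariantCulture, out long parsed))
        {
            length = parsed;
        }

        return (name, length);
    }
}
```

In the repro program this could be called as `GnuSparseMetadata.Resolve(entry.Name, entry.Length, pax.ExtendedAttributes)` for each PaxTarEntry. Entries without the GNU sparse attributes come back unchanged.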

Actual behavior

  • entry.Name returns the internal placeholder path (e.g., ./shared/Microsoft.NETCore.App/11.0.0-ci/GNUSparseFile.0/Microsoft.CSharp.dll)
  • entry.Length returns the stored/sparse size (e.g., 791040)

Example output from the repro against the linked tarball:

Reading: dotnet-sdk-11.0.100-ci-osx-x64.tar.gz

Entry #9:
  entry.Name (WRONG): ./shared/Microsoft.NETCore.App/11.0.0-ci/GNUSparseFile.0/Microsoft.CSharp.dll
  GNU.sparse.name   : ./shared/Microsoft.NETCore.App/11.0.0-ci/Microsoft.CSharp.dll
  entry.Length       : 791040
  GNU.sparse.realsize: 1115136

Total entries : 199
Sparse entries: 91

BUG: TarReader exposes internal 'GNUSparseFile.0' placeholder paths
     instead of using the real name from GNU.sparse.name.

Suggested Fix

In TarHeader.ReplaceNormalAttributesWithExtended(), add handling for the GNU sparse PAX attributes after the existing standard attribute processing:

// GNU sparse format 1.0 stores the real name and size in extended attributes.
// The header's name field contains an internal placeholder like ".../GNUSparseFile.0/<file name>".
if (ExtendedAttributes.TryGetValue("GNU.sparse.name", out string? gnuSparseName))
{
    _name = gnuSparseName;
}

if (TarHelpers.TryGetStringAsBaseTenLong(ExtendedAttributes, "GNU.sparse.realsize", out long gnuSparseRealSize))
{
    _size = gnuSparseRealSize;
}
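One caveat, as far as I understand the GNU tar documentation for sparse format 1.0: the entry's stored data begins with the sparse map itself, a series of newline-terminated decimal numbers (an entry count followed by offset/size pairs) padded with NULs to a 512-byte boundary, with the condensed file data after it. If that reading is right, reporting the real size would presumably also require expanding the data stream, which the snippet above does not attempt. The sketch below only parses the map. `GnuSparseMap` is a hypothetical name and is not part of System.Formats.Tar.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class GnuSparseMap
{
    private const int BlockSize = 512;

    // Reads the sparse map that precedes the condensed data: a count, then
    // (offset, size) pairs, each a decimal number terminated by '\n'.
    // On return, the stream is positioned at the next 512-byte boundary.
    public static List<(long Offset, long Size)> Read(Stream data)
    {
        long consumed = 0;

        long ReadNumber()
        {
            long value = 0;
            int digits = 0;
            while (true)
            {
                int b = data.ReadByte();
                if (b < 0) throw new EndOfStreamException("Truncated sparse map.");
                consumed++;
                if (b == '\n') break;
                if (b < '0' || b > '9') throw new InvalidDataException("Non-decimal byte in sparse map.");
                value = checked(value * 10 + (b - '0'));
                digits++;
            }
            if (digits == 0) throw new InvalidDataException("Empty number in sparse map.");
            return value;
        }

        long count = ReadNumber();
        var segments = new List<(long Offset, long Size)>();
        for (long i = 0; i < count; i++)
        {
            long offset = ReadNumber();
            long size = ReadNumber();
            segments.Add((offset, size));
        }

        // Skip the NUL padding up to the next block boundary.
        long padding = (BlockSize - consumed % BlockSize) % BlockSize;
        for (long i = 0; i < padding; i++)
        {
            if (data.ReadByte() < 0) throw new EndOfStreamException("Truncated sparse map padding.");
        }

        return segments;
    }
}
```

Each returned pair describes where one run of stored bytes belongs in the expanded file. The gaps between runs, and any tail up to GNU.sparse.realsize, are the holes a reader would fill with zeros.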

Configuration

  • Affects all .NET versions with System.Formats.Tar (net7.0+)
  • All platforms when reading archives created on macOS (or any system using bsdtar/libarchive with sparse detection)
  • The archive creation side can work around this with tar --no-read-sparse, but TarReader should handle this format correctly regardless

Impact

This is a real-world issue affecting .NET CI/CD infrastructure. Archives produced by macOS build agents contain GNU sparse PAX entries for .NET DLLs, and downstream tools using TarReader to process these archives (e.g., for code signing) encounter incorrect paths, leading to build failures.
