Ambiguous version parsing in parse_sdist_filename #527

@woodruffw

Hi there! First of all, thanks for continuing to maintain this package -- it's extremely useful 🙂

I'm one of the maintainers of pip-audit, and we had a user report some strange dependency resolution behavior: pypa/pip-audit#248

We were able to trace the bug to a release of cffi (1.0.2-2) that uses the implicit post release syntax (a bare -N suffix) for specifying the post-release number, rather than the canonicalized .postN format. This release of cffi is published on PyPI without canonicalization, so it's likely that it was uploaded before PyPI began normalizing versions.

Because 1.0.2-2 contains a dash, the assumption made in the following body of packaging.utils.parse_sdist_filename (that a PEP 440 version cannot contain dashes) does not hold, and the source distribution filename is parsed incorrectly:

def parse_sdist_filename(filename: str) -> Tuple[NormalizedName, Version]:
    if filename.endswith(".tar.gz"):
        file_stem = filename[: -len(".tar.gz")]
    elif filename.endswith(".zip"):
        file_stem = filename[: -len(".zip")]
    else:
        raise InvalidSdistFilename(
            f"Invalid sdist filename (extension must be '.tar.gz' or '.zip'):"
            f" {filename}"
        )

    # We are requiring a PEP 440 version, which cannot contain dashes,
    # so we split on the last dash.
    name_part, sep, version_part = file_stem.rpartition("-")
    if not sep:
        raise InvalidSdistFilename(f"Invalid sdist filename: {filename}")

    name = canonicalize_name(name_part)
    version = Version(version_part)
    return (name, version)

yielding:

>>> from packaging.utils import parse_sdist_filename
>>> parse_sdist_filename("cffi-1.0.2-2.tar.gz")
('cffi-1-0-2', <Version('2')>)

whereas we expected:

>>> from packaging.utils import parse_sdist_filename
>>> parse_sdist_filename("cffi-1.0.2-2.tar.gz")
('cffi', <Version('1.0.2.post2')>)

TL;DR: parse_sdist_filename shouldn't rely on the last dash as the separator between the distribution name and the version, since PEP 440 allows dashes in non-normalized versions. Parsing this correctly is a bit of a challenge, since distribution names can also contain dashes and digits, sometimes in pathological ways, such as:

# package foo3, version 1.0.0.post1
foo3-1.0.0-1.tar.gz

# package foo-3, version 1.0.0.post1
foo-3-1.0.0-1.tar.gz

# package 3_3, version 1.0.0.post1
# I'm not sure this one is valid, but I can't find a spec that rules it out in any of the packaging PEPs
3-3-1.0.0-1.tar.gz
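
To make the ambiguity concrete, here is a rough sketch that lists every way a file stem can be split into a name and a version. It is only an illustration, not a proposed patch: candidate_splits, _normalize_name and _SIMPLE_VERSION are hypothetical names, and the version pattern is a deliberate simplification (dotted numeric release plus an optional post-release), not the full PEP 440 grammar that packaging.version.Version implements.

```python
import re
from typing import List, Tuple

# Hypothetical, simplified version pattern: a dotted numeric release with an
# optional post-release written as "-N", ".postN" or "postN".
# This is NOT the full PEP 440 grammar.
_SIMPLE_VERSION = re.compile(
    r"^(?P<release>\d+(?:\.\d+)*)"
    r"(?:-(?P<implicit>\d+)|\.?post(?P<explicit>\d+))?$"
)


def _normalize_name(name: str) -> str:
    # Collapse runs of "-", "_" and "." into a single "-" and lowercase,
    # following the usual name normalization rule.
    return re.sub(r"[-_.]+", "-", name).lower()


def candidate_splits(file_stem: str) -> List[Tuple[str, str]]:
    """Return every (name, version) reading of file_stem, leftmost dash first."""
    candidates = []
    for dash in re.finditer("-", file_stem):
        name_part = file_stem[: dash.start()]
        version_part = file_stem[dash.end():]
        parsed = _SIMPLE_VERSION.match(version_part)
        if not name_part or parsed is None:
            # Skip splits with an empty name or an unparseable version.
            continue
        version = parsed.group("release")
        post = parsed.group("implicit") or parsed.group("explicit")
        if post is not None:
            # Rewrite the post-release in the canonical ".postN" form.
            version += f".post{post}"
        candidates.append((_normalize_name(name_part), version))
    return candidates
```

For the stem cffi-1.0.2-2 this returns both ('cffi', '1.0.2.post2') and ('cffi-1-0-2', '2'): the first is the reading we expected, the second is what splitting on the last dash produces. Preferring the leftmost valid split would pick the expected result for every example above, but both readings are syntactically possible, so the filename alone probably cannot settle it in general.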
