feat: option to validate compressed tag set sort order in parse_wheel_filename #1150

Merged
henryiii merged 11 commits into pypa:main from r266-tech:fix/sorted-compressed-tag-sets-909
Apr 9, 2026

Conversation

@r266-tech
Contributor

Summary

Fixes #909.

parse_wheel_filename accepted wheel filenames where compressed tag sets had components in unsorted order, even though PEP 425 explicitly requires them to be sorted:

"each tag in a filename can instead be a '.'-separated, sorted, set of tags"

For example, pyvirtualcam-0.13.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl was accepted silently, but the correct filename is pyvirtualcam-0.13.0-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (since '2' < '_' in ASCII, manylinux2014_x86_64 sorts before manylinux_2_17_x86_64).
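The ordering claim can be checked with plain Python string comparison, which compares code points. This is only an illustrative sketch and not code from packaging:

```python
# Platform tag set from the filename that was accepted silently.
unsorted_set = "manylinux_2_17_x86_64.manylinux2014_x86_64"
parts = unsorted_set.split(".")

# '2' (code point 50) sorts before '_' (code point 95), so the
# manylinux2014 tag comes first under plain string ordering.
assert ord("2") < ord("_")

canonical = ".".join(sorted(parts))
print(canonical)                # manylinux2014_x86_64.manylinux_2_17_x86_64
print(parts == sorted(parts))   # False
```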

This was surfaced by a real-world interaction: auditwheel repair produces incorrectly-ordered tag strings (tracked at pypa/auditwheel#583), and packaging accepted them without error, masking the upstream bug.

Changes

  • src/packaging/utils.py: In parse_wheel_filename, before calling parse_tag, validate that the '.'-separated parts of each '-'-separated tag component are in lexicographic sorted order. Raises InvalidWheelFilename if not.

  • tests/test_utils.py:

    • Added two invalid cases to test_parse_wheel_invalid_filename:
      • pyvirtualcam-0.13.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (platform tags unsorted)
      • foo-1.0-py3.py2-none-any.whl (interpreter tags unsorted)
    • Added one valid case to test_parse_wheel_filename:
      • pyvirtualcam-0.13.0-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (correctly sorted)
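
The check described in the changes above might look roughly like the following. This is a minimal sketch, not the code from the PR: check_compressed_tag_order is a hypothetical helper, the InvalidWheelFilename class is a local stand-in for the one in packaging.utils, and the filename is assumed to follow the usual name-version[-build]-interpreter-abi-platform.whl layout.

```python
class InvalidWheelFilename(ValueError):
    """Local stand-in for packaging.utils.InvalidWheelFilename."""


def check_compressed_tag_order(filename: str) -> None:
    """Raise if any compressed tag set in a wheel filename is unsorted.

    Hypothetical helper written for illustration only.
    """
    if not filename.endswith(".whl"):
        raise InvalidWheelFilename(f"Not a wheel filename: {filename!r}")

    # The last three '-'-separated components are the tag triple.
    components = filename[: -len(".whl")].split("-")
    if len(components) not in (5, 6):
        raise InvalidWheelFilename(f"Wrong number of parts: {filename!r}")

    for component in components[-3:]:
        tags = component.split(".")
        if tags != sorted(tags):
            raise InvalidWheelFilename(
                f"Compressed tag set {component!r} is not sorted: {filename!r}"
            )


# The correctly sorted filename passes silently.
check_compressed_tag_order(
    "pyvirtualcam-0.13.0-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl"
)

# The two invalid test cases are rejected.
for bad in (
    "pyvirtualcam-0.13.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
    "foo-1.0-py3.py2-none-any.whl",
):
    try:
        check_compressed_tag_order(bad)
    except InvalidWheelFilename as exc:
        print(exc)
```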

Design note

The check is placed in parse_wheel_filename (not parse_tag) so that parse_tag remains a general-purpose utility without new validation semantics. Per brettcannon's comment on #909, the fix touches the tag-parsing path in utils.py.

@pradyunsg pradyunsg changed the title from "fix(utils): validate compressed tag set sort order in parse_wheel_filename (fixes #909)" to "Validate compressed tag set sort order in parse_wheel_filename" on Apr 7, 2026
@notatallshaw
Member

I would want to understand the real-world impact of this change. In particular, is this going to reject a lot of wheels uploaded to PyPI?

@r266-tech
Contributor Author

Good question. The validation only triggers on compressed tag sets — filenames where a single component contains multiple dot-separated values (e.g. cp39.cp310-none-manylinux_2_17_x86_64.manylinux2014_x86_64). Simple tags like py3-none-any or cp311-cp311-linux_x86_64 are unaffected since there's nothing to sort-check.
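
To make "compressed tag set" concrete: each component of the tag triple is a '.'-separated list, and the triple stands for every combination of its parts. The sketch below expands a triple with itertools.product; expand_compressed_tag is a hypothetical name, and this is only an approximation of what a tag parser does, not the packaging implementation.

```python
from __future__ import annotations

from itertools import product


def expand_compressed_tag(tag: str) -> list[tuple[str, ...]]:
    """Expand 'interp-abi-platform' into every individual tag triple."""
    interpreters, abis, platforms = (part.split(".") for part in tag.split("-"))
    return list(product(interpreters, abis, platforms))


simple = expand_compressed_tag("py3-none-any")
compressed = expand_compressed_tag(
    "cp39.cp310-none-manylinux_2_17_x86_64.manylinux2014_x86_64"
)

print(len(simple))      # 1: every component has a single part, nothing to order
print(len(compressed))  # 4: two interpreters x one ABI x two platforms
```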

All major build tools I'm aware of (wheel, setuptools, hatchling, flit, poetry, maturin) generate sorted compressed tags, so I'd expect the real-world impact to be very low. That said, I don't have data on what's actually on PyPI.

If this is a concern, I could scope the check to only parse_wheel_filename (not parse_tag) and make it a warning rather than an error, or we could gather data first by scanning PyPI's Simple API for filenames with unsorted compressed tags. Happy to do either — what approach would you prefer?

@notatallshaw
Member

Sorry, I posted that before I realized a similar discussion was taking place in #909 (comment). Let's move the discussion on what we want to do about compliance/backwards compatibility there.

@r266-tech
Contributor Author

CI failures highlight an important edge case: lexicographic string sorting doesn't work for version-embedded platform tags.

Failing examples:

  • macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64 — lexicographically 10_10 < 10_9 (char '1' < '9'), but the intended order is by macOS version
  • manylinux_2_17_x86_64.manylinux2014_x86_64: manylinux2014 < manylinux_2_17 lexicographically ('2' < '_'), but both are valid real-world tags (numpy uses this exact filename on PyPI)
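
A quick way to see what plain string sorting does to the macOS tag set from the first example (illustrative only, using nothing beyond the built-in sorted):

```python
# Tags in the order they appear in the failing test filename.
tags = [
    "macosx_10_6_intel",
    "macosx_10_9_intel",
    "macosx_10_9_x86_64",
    "macosx_10_10_intel",
    "macosx_10_10_x86_64",
]

lexicographic = sorted(tags)
print(lexicographic)
# ['macosx_10_10_intel', 'macosx_10_10_x86_64', 'macosx_10_6_intel',
#  'macosx_10_9_intel', 'macosx_10_9_x86_64']

# The version-ordered input is not in lexicographic order.
print(tags == lexicographic)  # False
```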

This confirms @notatallshaw's concern about real-world impact. The sorted-order check needs either:

  1. A version-aware comparator instead of plain string sort, or
  2. Narrower scope (only reject clearly pathological cases)

Happy to implement whichever direction you prefer — will wait for the discussion on #909 to settle before pushing a fix.

@henryiii
Contributor

henryiii commented Apr 8, 2026

The only reason to sort these is to have a completely deterministic filename. There isn't an "intended" order; the sorting is not there to make things easier for humans. Only simple sorting, which is easy to implement in other languages too (like Rust), makes sense. Anything other than lexicographic sorting would be much harder to implement, slower, and would serve no purpose.

Contributor

@henryiii henryiii left a comment

I think this would be fine as an opt-in parameter. I don't think we can add it to a legacy function without the opt-in.

@henryiii
Contributor

henryiii commented Apr 8, 2026

I think we must make this opt-in for a legacy API. There are at least 370,069 wheels uploaded to PyPI (out of the 10,236,875 wheels I checked with a compressed tag set) that have unsorted compressed tags. We need to be able to parse those old wheel filenames.

@r266-tech
Contributor Author

Thanks for the data, @henryiii — 370K out of 10M+ is significant. I'll update this to add a validate=False parameter (opt-in) for the legacy parse_wheel_filename so existing wheels parse cleanly by default, and also fix the exception chaining (from None). Will push shortly.

@henryiii henryiii mentioned this pull request Apr 8, 2026
@henryiii
Contributor

henryiii commented Apr 8, 2026

Great, thanks! Auditwheel produced unsorted tag sets for many years; it got fixed only a few years ago. It didn't really matter originally, as it didn't produce multiple tags, but when it started making manylinux tags in both old and new styles, it did so unsorted, and it was at least a few years before sorting was added. I wouldn't be surprised if multi-tagged universal macOS wheels were sometimes wrong, too.

@r266-tech
Contributor Author

Updated per review feedback:

  • parse_tag() and parse_wheel_filename() now accept validate_order=False (opt-in)
  • Unsorted compressed tags parse fine by default (backwards compatible)
  • Only rejected when validate_order=True is explicitly passed
  • Fixed exception chaining (raise ... from None)
  • Tests updated: unsorted tags valid by default, rejected with validate_order=True
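
The opt-in behaviour and the exception chaining fix listed above can be sketched as follows. The function names, the keyword-only signature, and the return type are assumptions made for illustration; only the validate_order parameter name and the raise ... from None pattern come from this thread. InvalidWheelFilename is a local stand-in for the class in packaging.utils.

```python
from __future__ import annotations


class InvalidWheelFilename(ValueError):
    """Local stand-in for packaging.utils.InvalidWheelFilename."""


def parse_tag_sketch(tag: str, *, validate_order: bool = False) -> list[list[str]]:
    """Split 'interp-abi-platform' into three lists of tags.

    Hypothetical simplification: returns split components, not Tag objects.
    """
    components = [part.split(".") for part in tag.split("-")]
    if len(components) != 3:
        raise ValueError(f"Expected three tag components: {tag!r}")
    if validate_order:
        for parts in components:
            if parts != sorted(parts):
                raise ValueError(f"Tag set {'.'.join(parts)!r} is not sorted")
    return components


def parse_wheel_tags_sketch(
    filename: str, *, validate_order: bool = False
) -> list[list[str]]:
    """Parse only the tag triple of a wheel filename (hypothetical helper)."""
    stem = filename[: -len(".whl")] if filename.endswith(".whl") else filename
    tag = "-".join(stem.split("-")[-3:])
    try:
        return parse_tag_sketch(tag, validate_order=validate_order)
    except ValueError:
        # 'from None' suppresses the inner ValueError in the traceback.
        raise InvalidWheelFilename(f"Invalid wheel filename: {filename!r}") from None


name = "pyvirtualcam-0.13.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"

# Default: unsorted compressed tags still parse.
print(parse_wheel_tags_sketch(name))

# Opt-in: the same filename is rejected.
try:
    parse_wheel_tags_sketch(name, validate_order=True)
except InvalidWheelFilename as exc:
    print(exc.__cause__ is None, exc.__suppress_context__)  # True True
```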

@henryiii
Contributor

henryiii commented Apr 8, 2026

Mind running prek -a to clean up the new lines and such (or nox -s <something> should also be available)? I can if you prefer.

@r266-tech
Contributor Author

Done — pushed a fix for the trailing newline and line length. Thanks for the heads-up on prek!

@henryiii henryiii changed the title from "Validate compressed tag set sort order in parse_wheel_filename" to "feat: option to validate compressed tag set sort order in parse_wheel_filename" on Apr 8, 2026
@henryiii
Contributor

henryiii commented Apr 8, 2026

Wow, that's a strange test failure! The failure happens if you run the full test suite, but not if you run just tests/test_utils.py! Claude-haiku-4.5 in VSCode's copilot found the issue pretty quickly. There's an importlib.reload of tags.py in test_tags.py, which then causes this to be a different exception object than the one utils.py captures.
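
The general mechanism can be reproduced in isolation. The sketch below is not the packaging test code: it writes a throwaway module (reload_demo_errors, a made-up name) to a temporary directory, keeps a reference to its exception class, reloads the module, and shows that an except clause using the old reference no longer matches an exception raised from the reloaded class.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_reload_identity() -> tuple:
    """Return (same_class, caught) after reloading a throwaway module."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "reload_demo_errors.py").write_text(
            "class DemoError(ValueError):\n    pass\n"
        )
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            module = importlib.import_module("reload_demo_errors")
            captured = module.DemoError   # reference bound before the reload
            importlib.reload(module)      # re-executes the class statement
            reloaded = module.DemoError   # a brand-new class object

            try:
                raise reloaded("boom")
            except captured:
                caught = True             # not reached: the classes differ
            except ValueError:
                caught = False
            return captured is reloaded, caught
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("reload_demo_errors", None)


same_class, caught = demo_reload_identity()
print(same_class, caught)  # False False
```

Code that imported the class before the reload keeps comparing against the old object, which is why the outcome can depend on which tests ran earlier in the same process.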

@henryiii henryiii force-pushed the fix/sorted-compressed-tag-sets-909 branch from 8820a00 to 4e0731c on April 8, 2026 22:10
@henryiii henryiii force-pushed the fix/sorted-compressed-tag-sets-909 branch from 4e0731c to 484cf52 on April 8, 2026 22:16
@henryiii
Contributor

henryiii commented Apr 8, 2026

I pulled the test fix out into #1152; this should work when that goes in.

@henryiii henryiii force-pushed the fix/sorted-compressed-tag-sets-909 branch from 484cf52 to 57ff8d9 on April 8, 2026 22:35
Per review suggestion from pradyunsg: the PEP 425 sorted-order invariant
is now enforced at the parse_tag level, raising ValueError when a
compressed tag set component is not sorted. parse_wheel_filename catches
this and re-raises as InvalidWheelFilename with the full filename for
better error context.
r266-tech and others added 5 commits April 9, 2026 16:10
Per review feedback: 370K+ wheels on PyPI have unsorted compressed
tag sets, so the check must be opt-in for backwards compatibility.

- Add validate_order=False to parse_tag() and parse_wheel_filename()
- Only check sorted order when validate_order=True
- Fix exception chaining (raise ... from None)
- Update tests: unsorted tags parse by default, rejected with validate
Signed-off-by: Henry Schreiner <henryfs@princeton.edu>
@r266-tech r266-tech force-pushed the fix/sorted-compressed-tag-sets-909 branch from 46d06a8 to 5884358 on April 9, 2026 08:10
@henryiii
Contributor

henryiii commented Apr 9, 2026

@woodruffw, @brettcannon, and/or @dstufft, since you engaged on #909, it would be nice to get an opinion here from at least one of you. I think opt-in is the best way to go for this legacy API, this looks good to me.

@brettcannon
Member

"I think opt-in is the best way to go for this legacy API, this looks good to me."

I agree with that.

@woodruffw
Member

Yeah, opt-in seems reasonable given that it's a legacy API 🙂

@henryiii henryiii merged commit 905c90c into pypa:main Apr 9, 2026
57 checks passed
@henryiii
Contributor

henryiii commented Apr 9, 2026

Thanks for these!

Development

Successfully merging this pull request may close these issues.

Bug: parse_wheel_filenames accepts wheel filenames with unsorted compressed tag sets

6 participants