Skip to content

Fix installation of wheel with non-ASCII entries#8223

Merged
pradyunsg merged 9 commits into
pypa:masterfrom
uranusjr:unicode-wheel
May 19, 2020
Merged

Fix installation of wheel with non-ASCII entries#8223
pradyunsg merged 9 commits into
pypa:masterfrom
uranusjr:unicode-wheel

Conversation

@uranusjr

@uranusjr uranusjr commented May 12, 2020

Copy link
Copy Markdown
Member

This mainly deals with correctly recording the wheel content in the RECORD metadata. This metadata file must be written in UTF-8, but the actual files need to be installed to the filesystem, the encoding of which is not (always) UTF-8. So we need to carefully handle file name encoding/decoding when comparing RECORD entries to the actual file.

The fix here makes sure we always use the correct encoding by adding strict type hints. The entries in RECORD is decoded/encoded with UTF-8 on the read/write boundaries to make sure we always deal with text types. A type-hint-only type RecordPath is introduced to make sure this is enforced (because Python 2 "helpfully" coerces str to unicode with the wrong encoding).

Close #7138, close #7574.

@uranusjr uranusjr added OS: windows Windows specific C: encoding Related to text encoding and likely, UnicodeErrors type: bugfix labels May 12, 2020
@uranusjr

Copy link
Copy Markdown
Member Author

Oh wow, it seems like flake8 just released a new version, and the new rules borked all our linting jobs.

@pfmoore

pfmoore commented May 12, 2020

Copy link
Copy Markdown
Member

We should scream at them, so they can tell us we should have pinned our dependency and done a more controlled rollout of the new version 🙂

Seriously we should pin for now, and discuss the new rules. I'm not sure I want to just knee-jerk "fix" the failures here (why shouldn't we call a variable in a comprehension "l" for a line?).

@pradyunsg

Copy link
Copy Markdown
Member

Fix merged. :)

Rebasing on master will make the warnings vanish! :)

@pradyunsg

Copy link
Copy Markdown
Member

Seriously we should pin for now, and discuss the new rules. I'm not sure I want to just knee-jerk "fix" the failures here (why shouldn't we call a variable in a comprehension "l" for a line?).

So... The issue was the flake8 started providing it's pre-commit hook properly, and we needed to update to use those instead of a deprecated hook. #8227 did that. :)

Comment thread tests/unit/test_wheel.py
@pradyunsg pradyunsg mentioned this pull request May 12, 2020
@pradyunsg pradyunsg self-requested a review May 12, 2020 22:58
Comment thread src/pip/_internal/operations/install/wheel.py Outdated
@julienmalard

Copy link
Copy Markdown

Non-ASCII names are not weird

Thank you, thank you for saying that!

Regarding checks, one possible thing to check for would be that abugidas can be well represented. Because of a bug in re, many Python programs, even ones not using the dreaded [A-Za-z_] regex, still fail with vowel diactrics.

I am not sure how the new tests were implemented here, but would it be possible to include testing for modules with the following names? Even if re was not the culprit here, I have a feeling that, since it is often used in pip, it may end up causing problems elsewhere if we don't explicitly test:

  1. اسالام # Often works
  2. 你好 # Often works
  3. வணக்கம் # Often fails because of the diactrics

Thanks!

@uranusjr

Copy link
Copy Markdown
Member Author

Let me use pytest.mark.parametrize on the test so we can add more cases when we know about them.

@pradyunsg pradyunsg added the needs rebase or merge PR has conflicts with current master label May 19, 2020
@pradyunsg

Copy link
Copy Markdown
Member

@uranusjr FYI: This needs a rebase.

uranusjr added 9 commits May 19, 2020 16:03
This mainly deals with correctly recording the wheel content in the
RECORD metadata. This metadata file must be written in UTF-8, but the
actual files need to be installed to the filesystem, the encoding of
which is not (always) UTF-8. So we need to carefully handle file name
encoding/decoding when comparing RECORD entries to the actual file.

The fix here makes sure we always use the correct encoding by adding
strict type hints. The entries in RECORD is decoded/encoded with UTF-8
on the read/write boundaries to make sure we always deal with text
types. A type-hint-only type RecordPath is introduced to make sure this
is enforced (because Python 2 "helpfully" coerces str to unicode with
the wrong encoding).
@pypa-bot pypa-bot removed the needs rebase or merge PR has conflicts with current master label May 19, 2020
@uranusjr uranusjr requested a review from pradyunsg May 19, 2020 08:05
@pradyunsg pradyunsg merged commit 15f0863 into pypa:master May 19, 2020
@uranusjr uranusjr deleted the unicode-wheel branch May 21, 2020 05:12
@github-actions github-actions Bot locked as resolved and limited conversation to collaborators Oct 14, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

C: encoding Related to text encoding and likely, UnicodeErrors OS: windows Windows specific

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants