Skip to content

BUG: Use whole file for encoding checks with charset_normalizer.#22872

Merged
charris merged 7 commits intonumpy:mainfrom
HaoZeke:fixEncoding
Dec 25, 2022
Merged

BUG: Use whole file for encoding checks with charset_normalizer.#22872
charris merged 7 commits intonumpy:mainfrom
HaoZeke:fixEncoding

Conversation

@HaoZeke
Copy link
Copy Markdown
Member

@HaoZeke HaoZeke commented Dec 22, 2022

Closes #22871.

The issue is that the current behavior uses only the first 32 bytes. The current variation (reading in the whole file) should be fine, fortran files are rarely large enough for this to be a practical bottleneck (tentatively).

EDIT: Now this has a slightly larger changelog

  • Switch from chardet to charset_normalizer
  • Rework to keep old behavior (bytes read) without charset_normalizer

The rationale here is that if there are encoding errors, attempting to determine the encoding with startswith doesn't need more than the old number of bytes anyway.

charset_normalizer will use the whole file to check the encoding.

@charris
Copy link
Copy Markdown
Member

charris commented Dec 22, 2022

@HaoZeke Could you add a test for this?

The travis failures can be ignored, the cause is known and will be fixed soon.

@charris charris added the 09 - Backport-Candidate PRs tagged should be backported label Dec 22, 2022
@charris charris added this to the 1.24.1 release milestone Dec 22, 2022
@HaoZeke
Copy link
Copy Markdown
Member Author

HaoZeke commented Dec 23, 2022

@HaoZeke Could you add a test for this?

Sure, in a bit, I'll have to add chardet to the CI though, since it is needed to pass the test. Actually, maybe it should be an optional requirement of numpy now?

The travis failures can be ignored, the cause is known and will be fixed soon.

Excellent.

@melissawm
Copy link
Copy Markdown
Member

fortran files are rarely large enough for this to be a practical bottleneck (tentatively).

As a nitpick, should we also add this as a comment either on the code or the test you are going to add? Other than that, looks fine to me :)

@charris
Copy link
Copy Markdown
Member

charris commented Dec 24, 2022

Hmm. Looks like chardet is under the LGPL. It is already an optional dependency of crackfortran.py, so I guess making it an optional dependency for testing doesn't change anything, but I do wonder if we have any related license requirements because it is in the code, i.e., mentioning the license, even when chardet use is optional. IANAL and have trouble understanding the license text, but since use is optional perhaps the requirements don't apply. @rgommers Thoughts?

@charris
Copy link
Copy Markdown
Member

charris commented Dec 24, 2022

@HaoZeke If we keep the dependency, it should also be added to test_requirements.txt.

@mattip
Copy link
Copy Markdown
Member

mattip commented Dec 25, 2022

Maybe we can switch from chardet to the MIT-licensed charset_normalizer, there is a FAQ about compatibility with chardet

@HaoZeke
Copy link
Copy Markdown
Member Author

HaoZeke commented Dec 25, 2022

charset_normalizer seems to be better maintained and works just as well. Also there's a convenience function from_path which cleans up the original logic a bit too, so it seems like a win-win.

@HaoZeke HaoZeke changed the title BUG: Use whole file for encoding checks [f2py] BUG: Use whole file for encoding checks with charset_normalizer [f2py] Dec 25, 2022
@HaoZeke
Copy link
Copy Markdown
Member Author

HaoZeke commented Dec 25, 2022

Some of the newest change disallows previously allowed (but highly unlikely to be present) code-paths, namely having fixed form F77 code in a fortran 90 file (with .f90).

Does this maybe need a release note perhaps?

EDIT: I'm moving this change out of this PR to #22885, the tests can be fixed without it.

@charris
Copy link
Copy Markdown
Member

charris commented Dec 25, 2022

EDIT: I'm moving this change out of this PR to #22885, the tests can be fixed without it.

Thanks, I was going to ask for that.

@charris charris merged commit fe73a84 into numpy:main Dec 25, 2022
@charris
Copy link
Copy Markdown
Member

charris commented Dec 25, 2022

Thanks @HaoZeke .

charris pushed a commit to charris/numpy that referenced this pull request Dec 25, 2022
…py] (numpy#22872)

* BUG: Use whole file for encoding checks [f2py]

* DOC: Add a code comment

Co-authored-by: melissawm <melissawm@gmail.com>

* TST: Add a conditional unicode f2py test

* MAINT: Add chardet as a test requirement

* ENH: Cleanup and switch f2py to charset_normalizer

* MAINT: Remove chardet for charset_normalizer

* TST: Simplify UTF-8 encoding [f2py]

Co-authored-by: melissawm <melissawm@gmail.com>
@charris charris changed the title BUG: Use whole file for encoding checks with charset_normalizer [f2py] BUG: Use whole file for encoding checks with `charset_normalizer ` Dec 25, 2022
@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label Dec 25, 2022
@charris charris removed this from the 1.24.1 release milestone Dec 25, 2022
@charris charris changed the title BUG: Use whole file for encoding checks with `charset_normalizer ` BUG: Use whole file for encoding checks with `charset_normalizer `. Dec 25, 2022
@charris charris changed the title BUG: Use whole file for encoding checks with `charset_normalizer `. BUG: Use whole file for encoding checks with charset_normalizer. Dec 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: f2py cannot handle non-ascii characters in comments since 1.24

4 participants