BUG: Use whole file for encoding checks with ``charset_normalizer``. by HaoZeke · Pull Request #22872 · numpy/numpy

HaoZeke · 2022-12-22T18:06:55Z

The issue is that the current behavior uses only the first 32 bytes. The current variation (reading in the whole file) should be fine; Fortran files are rarely large enough for this to be a practical bottleneck (tentatively).
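To illustrate the failure mode described above, here is a minimal, stdlib-only sketch. It is not the f2py code: the sample source text and the helper names `prefix_is_ascii` and `whole_file_is_ascii` are made up for illustration. A non-ASCII character that first appears after byte 32 is invisible to a 32-byte prefix check, so the file looks like ASCII but then fails to decode as ASCII.

```python
import os
import tempfile

# Hypothetical fixed-form Fortran source: the only non-ASCII characters
# (in a comment) appear well after the first 32 bytes.
SOURCE = (
    "      subroutine foo(a)\n"
    "      integer a\n"
    "C     non-ASCII comment: \u00e5ngstr\u00f6m\n"
    "      end\n"
)


def prefix_is_ascii(path, nbytes=32):
    """Return True if the first `nbytes` bytes of the file are pure ASCII."""
    with open(path, "rb") as fh:
        return fh.read(nbytes).isascii()


def whole_file_is_ascii(path):
    """Return True if every byte of the file is ASCII."""
    with open(path, "rb") as fh:
        return fh.read().isascii()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "late_unicode.f")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(SOURCE)

    prefix_ok = prefix_is_ascii(path)     # only inspects the first 32 bytes
    whole_ok = whole_file_is_ascii(path)  # inspects the entire file

    # Trusting the prefix check and opening as ASCII breaks on the comment.
    try:
        with open(path, encoding="ascii") as fh:
            fh.read()
        decode_failed = False
    except UnicodeDecodeError:
        decode_failed = True

print(prefix_ok, whole_ok, decode_failed)
```

The prefix check and the whole-file check disagree here, which is the motivation for looking at the entire file.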

EDIT: Now this has a slightly larger changelog:

- Switch from chardet to charset_normalizer
- Rework to keep old behavior (bytes read) without charset_normalizer

The rationale here is that, without charset_normalizer, determining the encoding with startswith doesn't need more than the old number of bytes anyway.

charset_normalizer will use the whole file to check the encoding.
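The point about `startswith` can be sketched as follows. This is an illustrative, stdlib-only sketch of prefix-based sniffing via byte-order marks, and is not claimed to be the code in this PR: the function name `sniff_encoding_from_bom`, the returned codec names, and the ASCII default are assumptions; the 32-byte limit is taken from the description above. A byte-order mark is at most four bytes long, so a short prefix is enough for this kind of check.

```python
import codecs
import os
import tempfile


def sniff_encoding_from_bom(path, max_bytes=32):
    """Guess an encoding from a leading byte-order mark, defaulting to ASCII."""
    nbytes = min(max_bytes, os.path.getsize(path))
    with open(path, "rb") as fh:
        raw = fh.read(nbytes)
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # UTF-32 must be checked before UTF-16: the UTF-32 LE mark begins with
    # the same two bytes as the UTF-16 LE mark.
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No byte-order mark found: assume plain ASCII (an assumption of this sketch).
    return "ascii"


# Write the same small program in several encodings and sniff each file.
guesses = {}
with tempfile.TemporaryDirectory() as tmp:
    for enc in ("ascii", "utf-8-sig", "utf-16", "utf-32"):
        path = os.path.join(tmp, f"sample_{enc}.f90")
        with open(path, "w", encoding=enc) as fh:
            fh.write("program hello\nend program hello\n")
        guesses[enc] = sniff_encoding_from_bom(path)

print(guesses)
```

A prefix check like this cannot see non-ASCII bytes in a file without a byte-order mark, which is where a whole-file detector helps.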

charris · 2022-12-22T19:18:31Z

@HaoZeke Could you add a test for this?

The travis failures can be ignored, the cause is known and will be fixed soon.

HaoZeke · 2022-12-23T09:28:18Z

> @HaoZeke Could you add a test for this?

Sure, in a bit. I'll have to add chardet to the CI though, since the test needs it to pass. Actually, maybe it should be an optional requirement of numpy now?

> The travis failures can be ignored, the cause is known and will be fixed soon.

Excellent.
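For reference, a test that depends on an optional package can be made conditional so that it is skipped, not failed, when the package is missing. The sketch below uses the standard-library `unittest` only to stay self-contained; the class and test names are hypothetical, and this is not the test that was added in this PR. It only checks that `chardet.detect` returns a result with an `encoding` entry, without assuming which encoding is guessed.

```python
import importlib.util
import unittest

# True only when the optional detector can be imported.
HAS_CHARDET = importlib.util.find_spec("chardet") is not None


class TestOptionalEncodingDetection(unittest.TestCase):
    @unittest.skipUnless(HAS_CHARDET, "chardet is not installed")
    def test_detect_reports_an_encoding(self):
        import chardet

        raw = "C     non-ASCII comment: \u00e5ngstr\u00f6m\n".encode("utf-8")
        guess = chardet.detect(raw)
        # chardet.detect returns a dict that includes an "encoding" entry.
        self.assertIn("encoding", guess)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(
    TestOptionalEncodingDetection
)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

With this shape, the run counts as successful whether the test passes or is skipped, so CI only exercises the detector where it is installed.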

melissawm · 2022-12-23T19:06:34Z

> fortran files are rarely large enough for this to be a practical bottleneck (tentatively).

As a nitpick, should we also add this as a comment, either in the code or in the test you are going to add? Other than that, looks fine to me :)

Co-authored-by: melissawm <melissawm@gmail.com>

charris · 2022-12-24T20:09:23Z

Hmm. Looks like chardet is under the LGPL. It is already an optional dependency of crackfortran.py, so I guess making it an optional dependency for testing doesn't change anything, but I do wonder if we have any related license requirements because it is in the code, i.e., mentioning the license, even when chardet use is optional. IANAL and have trouble understanding the license text, but since use is optional perhaps the requirements don't apply. @rgommers Thoughts?

charris · 2022-12-24T20:13:30Z

@HaoZeke If we keep the dependency, it should also be added to test_requirements.txt.

mattip · 2022-12-25T07:56:52Z

Maybe we can switch from chardet to the MIT-licensed charset_normalizer. There is a FAQ about compatibility with chardet.

HaoZeke · 2022-12-25T10:38:00Z

charset_normalizer seems to be better maintained and works just as well. Also there's a convenience function from_path which cleans up the original logic a bit too, so it seems like a win-win.
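A minimal sketch of how `from_path` might be used with an optional import is shown below. The wrapper name `detect_encoding` and the ASCII default are assumptions for illustration, not necessarily what the PR does. `from_path(...)` reads the file and returns the candidate matches, and `.best()` returns the most likely one, or `None` if nothing plausible was found.

```python
import os
import tempfile

try:
    from charset_normalizer import from_path
except ImportError:  # charset_normalizer is treated as optional in this sketch
    from_path = None


def detect_encoding(path, default="ascii"):
    """Return a best-guess encoding name for the file at `path`."""
    if from_path is None:
        # Detector not installed: fall back to the caller's default.
        return default
    best = from_path(path).best()  # None when no plausible encoding is found
    return best.encoding if best is not None else default


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "unicode_comment.f90")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("! non-ASCII comment: \u00e5ngstr\u00f6m\n")
        fh.write("program hello\nend program hello\n")
    detected = detect_encoding(path)

print(detected)
```

The returned name can be passed to `open(path, encoding=...)`. The exact name reported for a given file depends on the detector and on whether it is installed, so the sketch does not assume a particular result.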

HaoZeke · 2022-12-25T20:06:57Z

One of the newest changes disallows previously allowed (but highly unlikely) code paths, namely fixed-form F77 code in a Fortran 90 file (with the .f90 extension).

Does this need a release note?

EDIT: I'm moving this change out of this PR to #22885, the tests can be fixed without it.

charris · 2022-12-25T21:17:13Z

> EDIT: I'm moving this change out of this PR to #22885, the tests can be fixed without it.

Thanks, I was going to ask for that.

numpy/f2py/tests/test_crackfortran.py

charris · 2022-12-25T23:22:02Z

Thanks @HaoZeke .

…py] (numpy#22872)

* BUG: Use whole file for encoding checks [f2py]
* DOC: Add a code comment
  Co-authored-by: melissawm <melissawm@gmail.com>
* TST: Add a conditional unicode f2py test
* MAINT: Add chardet as a test requirement
* ENH: Cleanup and switch f2py to charset_normalizer
* MAINT: Remove chardet for charset_normalizer
* TST: Simplify UTF-8 encoding [f2py]

Co-authored-by: melissawm <melissawm@gmail.com>

BUG: Use whole file for encoding checks [f2py]

3fc7dbc

HaoZeke added 00 - Bug component: numpy.f2py labels Dec 22, 2022

HaoZeke requested review from melissawm and pearu December 22, 2022 18:06

charris added the 09 - Backport-Candidate PRs tagged should be backported label Dec 22, 2022

charris added this to the 1.24.1 release milestone Dec 22, 2022

HaoZeke and others added 3 commits December 24, 2022 02:20

DOC: Add a code comment

fd6f961

Co-authored-by: melissawm <melissawm@gmail.com>

TST: Add a conditional unicode f2py test

583f20c

MAINT: Add chardet as a test requirement

a60cd0d

HaoZeke added 2 commits December 25, 2022 16:01

ENH: Cleanup and switch f2py to charset_normalizer

ccc38ee

MAINT: Remove chardet for charset_normalizer

ccc8fa9

HaoZeke changed the title ~~BUG: Use whole file for encoding checks [f2py]~~ BUG: Use whole file for encoding checks with charset_normalizer [f2py] Dec 25, 2022

HaoZeke force-pushed the fixEncoding branch from 635f8f4 to 45b2480 Compare December 25, 2022 18:24

HaoZeke force-pushed the fixEncoding branch from fb8b96d to e94861c Compare December 25, 2022 20:21

charris reviewed Dec 25, 2022

numpy/f2py/tests/test_crackfortran.py (outdated)

TST: Simplify UTF-8 encoding [f2py]

8534e43

HaoZeke force-pushed the fixEncoding branch from e94861c to 8534e43 Compare December 25, 2022 21:59

charris approved these changes Dec 25, 2022

charris merged commit fe73a84 into numpy:main Dec 25, 2022

charris mentioned this pull request Dec 25, 2022

BUG: Use whole file for encoding checks with charset_normalizer. #22887

Merged

charris changed the title ~~BUG: Use whole file for encoding checks with charset_normalizer [f2py]~~ BUG: Use whole file for encoding checks with `charset_normalizer ` Dec 25, 2022

charris removed the 09 - Backport-Candidate PRs tagged should be backported label Dec 25, 2022

charris removed this from the 1.24.1 release milestone Dec 25, 2022

charris changed the title ~~BUG: Use whole file for encoding checks with `charset_normalizer `~~ BUG: Use whole file for encoding checks with `charset_normalizer `. Dec 25, 2022

charris changed the title ~~BUG: Use whole file for encoding checks with `charset_normalizer `.~~ BUG: Use whole file for encoding checks with charset_normalizer. Dec 25, 2022

charris merged 7 commits into numpy:main from HaoZeke:fixEncoding

4 participants
