BUG: Use whole file for encoding checks with charset_normalizer. #22872
charris merged 7 commits into numpy:main
@HaoZeke Could you add a test for this? The Travis failures can be ignored; the cause is known and will be fixed soon.
Sure, in a bit, I'll have to add
Excellent.
As a nitpick, should we also add this as a comment, either in the code or in the test you are going to add? Other than that, looks fine to me :)
Co-authored-by: melissawm <melissawm@gmail.com>
Hmm. Looks like
@HaoZeke If we keep the dependency, it should also be added to
Maybe we can switch from chardet to the MIT-licensed charset_normalizer; there is a FAQ about its compatibility with chardet.
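For reference, charset_normalizer ships a chardet-style `detect()` helper, so the switch can be close to a drop-in replacement. A minimal sketch, with a naive UTF-8-only fallback stub (an assumption for illustration, not part of either library) so it runs even when charset_normalizer is not installed:

```python
# Bytes containing non-ASCII characters (UTF-8 encoded).
data = "program héllo\nend program héllo\n".encode("utf-8")

try:
    # charset_normalizer mimics chardet's detect(): it returns a dict
    # with "encoding", "confidence", and "language" keys.
    from charset_normalizer import detect
except ImportError:
    # Hypothetical fallback stub so this sketch stays self-contained;
    # it only distinguishes valid UTF-8 from everything else.
    def detect(raw):
        try:
            raw.decode("utf-8")
            return {"encoding": "utf-8", "confidence": 1.0, "language": ""}
        except UnicodeDecodeError:
            return {"encoding": None, "confidence": 0.0, "language": ""}

result = detect(data)
print(result["encoding"])
```

Since the return shape matches chardet's, call sites need little more than an import change.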
Some of the newest changes disallow previously allowed (but highly unlikely to be present) code paths, namely having fixed-form F77 code in a Fortran 90 file. Does this maybe need a release note? EDIT: I'm moving this change out of this PR to #22885; the tests can be fixed without it.
Thanks, I was going to ask for that.
Thanks @HaoZeke . |
BUG: Use whole file for encoding checks with charset_normalizer [f2py] (numpy#22872)

* BUG: Use whole file for encoding checks [f2py]
* DOC: Add a code comment
* TST: Add a conditional unicode f2py test
* MAINT: Add chardet as a test requirement
* ENH: Cleanup and switch f2py to charset_normalizer
* MAINT: Remove chardet for charset_normalizer
* TST: Simplify UTF-8 encoding [f2py]

Co-authored-by: melissawm <melissawm@gmail.com>
Closes #22871.
The issue is that the current behavior uses only the first 32 bytes. The current variation (reading in the whole file) should be fine; Fortran files are rarely large enough for this to be a practical bottleneck (tentatively).

EDIT: Now this has a slightly larger changelog: chardet to charset_normalizer. The rationale here is that if there are encoding errors, attempting to determine the encoding with `startswith` doesn't need more than the old number of bytes anyway; `charset_normalizer` will use the whole file to check the encoding.
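To illustrate why whole-file detection matters: a non-ASCII character past the first 32 bytes was invisible to the old sniffing. A minimal sketch of whole-file encoding detection, preferring `charset_normalizer.from_path` when installed and otherwise falling back to a naive UTF-8/Latin-1 attempt (the fallback and the helper name `guess_encoding` are illustrative assumptions, not f2py's actual code):

```python
import os
import tempfile

def guess_encoding(path):
    """Guess a file's encoding from its *entire* contents, not a prefix."""
    try:
        from charset_normalizer import from_path
        best = from_path(path).best()  # best-matching CharsetMatch, or None
        return best.encoding if best is not None else "latin-1"
    except ImportError:
        # Simplified stand-in: valid UTF-8 wins, else assume Latin-1.
        with open(path, "rb") as fh:
            raw = fh.read()
        try:
            raw.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            return "latin-1"

# A Fortran file whose only non-ASCII byte sits well past byte 32,
# so a 32-byte prefix check would never see it.
src = "! " + "x" * 64 + " é in a comment\nprogram p\nend program p\n"
with tempfile.NamedTemporaryFile("wb", suffix=".f90", delete=False) as f:
    f.write(src.encode("utf-8"))
enc = guess_encoding(f.name)
print(enc)
os.unlink(f.name)
```

Reading the whole file is cheap here because Fortran sources are small, which is the trade-off the PR description accepts.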