fix(cmd): improve character encoding detection for sub-commands#545
fix(cmd): improve character encoding detection for sub-commands#545jenstroeger wants to merge 1 commit intocommitizen-tools:masterfrom
Conversation
|
@woile I’m unsure why that one test failed when locally it worked on @gpongelli you might be interested in this change. |
| return bytes_.decode("utf-8") | ||
| except UnicodeDecodeError: | ||
| result = chardet.detect(bytes_) | ||
| return bytes_.decode(result["encoding"] or "utf-8") |
There was a problem hiding this comment.
Is the “or” part needed? Doing decode(“utf-8”) here could raise unhandled exception.
There was a problem hiding this comment.
Well, in case of
{'encoding': None, 'confidence': 0.0, 'language': None}
the decode() function should receive the "utf-8" encoding arg. Incidentally, that’s the default encoding too (docs) for all Python versions that commitizen supports (3.6 and later).
Leaving it there or removing it both work, I’m happy with either way.
Doing
decode(“utf-8”)here could raise unhandled exception.
And any other encoding could also raise an exception, too. Which is why @woile mentioned to propagate an error and I agree with that.
There was a problem hiding this comment.
Should we have a minimum threshold? If the confidence is low, what's the point? I would only return if it makes sense, something like:
try:
return bytes_.decode("utf-8")
except UnicodeDecodeError as e:
result = chardet.detect(bytes_)
if result["confidence"] > 0.8:
return bytes_.decode(result["encoding"] or "utf-8")
raise eThoughts?
There was a problem hiding this comment.
I agree with checking the confidence value, the interesting part will be to pick the threshold here 😉
@woile do you want to reraise the original UnicodeDecodeError if the confidence is too low, or define a new custom exception (e.g. CharacterSetDecodeError which carries the misidentified charset)?
There’s also the case where a high confidence can still fail to decode, which we should consider.
There was a problem hiding this comment.
Well, in case of
{'encoding': None, 'confidence': 0.0, 'language': None}the
decode()function should receive the"utf-8"encoding arg.
Since I’m using chardet, I’ve never received such dict as answer, but ok to keep that.
Doing
decode(“utf-8”)here could raise unhandled exception.And any other encoding could also raise an exception, too. Which is why @woile mentioned to propagate an error and I agree with that.
I did not see that issue, I agree with that comment.
There was a problem hiding this comment.
@jenstroeger I like the idea of a custom exception. And you are right that it can still fail. So maybe we can skip the threshold directly and return the failure:
try:
bytes_.decode("utf-8")
except UnicodeDecodeError as e:
result = chardet.detect(bytes_)
try:
return bytes_.decode(result["encoding"]) # here we don't need to try utf-8 again
except UnicodeDecodeError as e:
raise CharacterSetDecodeError("Could not decode this char", e)There was a problem hiding this comment.
Looking for chardet alternatives, charset_normalizer seems promising (py versions, supported languages and also emoji), I’ll try it doing some checks in next future.
I cannot try it at the moment, but if you can try the same strings that led to your errors and if it works correctly, I support its use instead of chardet.
There was a problem hiding this comment.
@jenstroeger do you have some time to update the code? otherwise let me know and I can take over 👍🏻
There was a problem hiding this comment.
I’ll do that tonight or tomorrow, it’s been busy today…
There was a problem hiding this comment.
I did this PR using charset_normalizer instead of chardet , can you try that branch and see if the issue still happens ?
Codecov Report
@@ Coverage Diff @@
## master #545 +/- ##
==========================================
- Coverage 98.26% 98.07% -0.19%
==========================================
Files 39 39
Lines 1556 1562 +6
==========================================
+ Hits 1529 1532 +3
- Misses 27 30 +3
Flags with carried forward coverage won't be shown. Click here to find out more.
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
Description
Fixes #544
Checklist
./scripts/formatand./scripts/testlocally to ensure this change passes linter check and testExpected behavior
See issue #544:
cznow creates the changelog without failing