fix(cmd): improve character encoding detection for sub-commands by jenstroeger · Pull Request #545 · commitizen-tools/commitizen

jenstroeger · 2022-08-03T12:35:21Z

Description

Fixes #544

Checklist

Add test cases to all the changes you introduce
Run ./scripts/format and ./scripts/test locally to ensure this change passes linter check and test
Test the changes on the local machine manually
Update the documentation for the changes

Expected behavior

See issue #544: cz now creates the changelog without failing

jenstroeger · 2022-08-03T13:50:37Z

@woile I’m unsure why that one test failed when locally it worked on git push. Unless something jumps out at you right now, I’ll look into it later.

@gpongelli you might be interested in this change.

gpongelli · 2022-08-03T21:42:33Z

+        return bytes_.decode("utf-8")
+    except UnicodeDecodeError:
+        result = chardet.detect(bytes_)
+        return bytes_.decode(result["encoding"] or "utf-8")


Is the “or” part needed? Doing decode(“utf-8”) here could raise unhandled exception.

Well, in case of

{'encoding': None, 'confidence': 0.0, 'language': None}

the decode() function should receive the "utf-8" encoding arg. Incidentally, that’s the default encoding too (docs) for all Python versions that commitizen supports (3.6 and later).

Leaving it there or removing it both work, I’m happy with either way.

Doing decode(“utf-8”) here could raise unhandled exception.

And any other encoding could also raise an exception, too. Which is why @woile mentioned to propagate an error and I agree with that.

Should we have a minimum threshold? If the confidence is low, what's the point? I would only return if it makes sense, something like:

try: return bytes_.decode("utf-8") except UnicodeDecodeError as e: result = chardet.detect(bytes_) if result["confidence"] > 0.8: return bytes_.decode(result["encoding"] or "utf-8") raise e

Thoughts?

I agree with checking the confidence value, the interesting part will be to pick the threshold here 😉

@woile do you want to reraise the original UnicodeDecodeError if the confidence is too low, or define a new custom exception (e.g. CharacterSetDecodeError which carries the misidentified charset)?

There’s also the case where a high confidence can still fail to decode, which we should consider.

Well, in case of

{'encoding': None, 'confidence': 0.0, 'language': None}

the decode() function should receive the "utf-8" encoding arg.

Since I’m using chardet, I’ve never received such dict as answer, but ok to keep that.

Doing decode(“utf-8”) here could raise unhandled exception.

And any other encoding could also raise an exception, too. Which is why @woile mentioned to propagate an error and I agree with that.

I did not see that issue, I agree with that comment.

@jenstroeger I like the idea of a custom exception. And you are right that it can still fail. So maybe we can skip the threshold directly and return the failure:

try: bytes_.decode("utf-8") except UnicodeDecodeError as e: result = chardet.detect(bytes_) try: return bytes_.decode(result["encoding"]) # here we don't need to try utf-8 again except UnicodeDecodeError as e: raise CharacterSetDecodeError("Could not decode this char", e)

Looking for chardet alternatives, charset_normalizer seems promising (py versions, supported languages and also emoji), I’ll try it doing some checks in next future.

I cannot try it at the moment, but if you can try the same strings that led to your errors and if it works correctly, I support its use instead of chardet.

@jenstroeger do you have some time to update the code? otherwise let me know and I can take over 👍🏻

I’ll do that tonight or tomorrow, it’s been busy today…

I did this PR using charset_normalizer instead of chardet , can you try that branch and see if the issue still happens ?

codecov · 2022-08-04T06:13:35Z

Codecov Report

Merging #545 (9436617) into master (ed636ba) will decrease coverage by 0.18%.
The diff coverage is 57.14%.

@@            Coverage Diff             @@
##           master     #545      +/-   ##
==========================================
- Coverage   98.26%   98.07%   -0.19%     
==========================================
  Files          39       39              
  Lines        1556     1562       +6     
==========================================
+ Hits         1529     1532       +3     
- Misses         27       30       +3

Flag	Coverage Δ
unittests	`98.07% <57.14%> (-0.19%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
commitizen/cmd.py	`85.00% <50.00%> (-15.00%)`	⬇️
commitizen/__version__.py	`100.00% <100.00%> (ø)`

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

gpongelli reviewed Aug 3, 2022

View reviewed changes

woile reviewed Aug 4, 2022

View reviewed changes

Comment thread commitizen/cmd.py Outdated

fix(cmd): improve character encoding detection for sub-commands

9436617

jenstroeger force-pushed the issue-544 branch from 2f36ff9 to 9436617 Compare August 4, 2022 07:00

jenstroeger mentioned this pull request Aug 5, 2022

UnicodeDecodeError if commit messages contain Unicode characters #544

Closed

gpongelli mentioned this pull request Aug 5, 2022

refactor(tool): use charset_normalizer instead of chardet #550

Merged

4 tasks

jenstroeger closed this Aug 5, 2022

jenstroeger mentioned this pull request Aug 9, 2022

fix(cmd): improve character encoding detection for sub-commands #552

Merged

4 tasks

Uh oh!

Conversation

jenstroeger commented Aug 3, 2022

Description

Checklist

Expected behavior

Uh oh!

jenstroeger commented Aug 3, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jenstroeger Aug 3, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov bot commented Aug 4, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jenstroeger commented Aug 3, 2022 •

edited

Loading

jenstroeger Aug 3, 2022 •

edited

Loading

codecov bot commented Aug 4, 2022 •

edited

Loading