Skip to content

fix(cmd): improve character encoding detection for sub-commands#545

Closed
jenstroeger wants to merge 1 commit intocommitizen-tools:masterfrom
jenstroeger:issue-544
Closed

fix(cmd): improve character encoding detection for sub-commands#545
jenstroeger wants to merge 1 commit intocommitizen-tools:masterfrom
jenstroeger:issue-544

Conversation

@jenstroeger
Copy link
Copy Markdown
Contributor

Description

Fixes #544

Checklist

  • Add test cases to all the changes you introduce
  • Run ./scripts/format and ./scripts/test locally to ensure this change passes linter check and test
  • Test the changes on the local machine manually
  • Update the documentation for the changes

Expected behavior

See issue #544: cz now creates the changelog without failing

@jenstroeger
Copy link
Copy Markdown
Contributor Author

jenstroeger commented Aug 3, 2022

@woile I’m unsure why that one test failed when locally it worked on git push. Unless something jumps out at you right now, I’ll look into it later.

@gpongelli you might be interested in this change.

Comment thread commitizen/cmd.py
return bytes_.decode("utf-8")
except UnicodeDecodeError:
result = chardet.detect(bytes_)
return bytes_.decode(result["encoding"] or "utf-8")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the “or” part needed? Doing decode(“utf-8”) here could raise unhandled exception.

Copy link
Copy Markdown
Contributor Author

@jenstroeger jenstroeger Aug 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, in case of

{'encoding': None, 'confidence': 0.0, 'language': None}

the decode() function should receive the "utf-8" encoding arg. Incidentally, that’s the default encoding too (docs) for all Python versions that commitizen supports (3.6 and later).

Leaving it there or removing it both work, I’m happy with either way.

Doing decode(“utf-8”) here could raise unhandled exception.

And any other encoding could also raise an exception, too. Which is why @woile mentioned to propagate an error and I agree with that.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a minimum threshold? If the confidence is low, what's the point? I would only return if it makes sense, something like:

    try:
        return bytes_.decode("utf-8")
    except UnicodeDecodeError as e:
        result = chardet.detect(bytes_)
        if result["confidence"] > 0.8:
            return bytes_.decode(result["encoding"] or "utf-8")
        raise e

Thoughts?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with checking the confidence value, the interesting part will be to pick the threshold here 😉

@woile do you want to reraise the original UnicodeDecodeError if the confidence is too low, or define a new custom exception (e.g. CharacterSetDecodeError which carries the misidentified charset)?

There’s also the case where a high confidence can still fail to decode, which we should consider.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, in case of

{'encoding': None, 'confidence': 0.0, 'language': None}

the decode() function should receive the "utf-8" encoding arg.

Since I’m using chardet, I’ve never received such dict as answer, but ok to keep that.

Doing decode(“utf-8”) here could raise unhandled exception.

And any other encoding could also raise an exception, too. Which is why @woile mentioned to propagate an error and I agree with that.

I did not see that issue, I agree with that comment.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jenstroeger I like the idea of a custom exception. And you are right that it can still fail. So maybe we can skip the threshold directly and return the failure:

try:
     bytes_.decode("utf-8")
except UnicodeDecodeError as e:
    result = chardet.detect(bytes_)
    try:
        return bytes_.decode(result["encoding"])  # here we don't need to try utf-8 again
    except UnicodeDecodeError as e:
        raise CharacterSetDecodeError("Could not decode this char", e)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking for chardet alternatives, charset_normalizer seems promising (py versions, supported languages and also emoji), I’ll try it doing some checks in next future.

I cannot try it at the moment, but if you can try the same strings that led to your errors and if it works correctly, I support its use instead of chardet.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jenstroeger do you have some time to update the code? otherwise let me know and I can take over 👍🏻

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll do that tonight or tomorrow, it’s been busy today…

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this PR using charset_normalizer instead of chardet , can you try that branch and see if the issue still happens ?

Comment thread commitizen/cmd.py Outdated
@codecov
Copy link
Copy Markdown

codecov bot commented Aug 4, 2022

Codecov Report

Merging #545 (9436617) into master (ed636ba) will decrease coverage by 0.18%.
The diff coverage is 57.14%.

@@            Coverage Diff             @@
##           master     #545      +/-   ##
==========================================
- Coverage   98.26%   98.07%   -0.19%     
==========================================
  Files          39       39              
  Lines        1556     1562       +6     
==========================================
+ Hits         1529     1532       +3     
- Misses         27       30       +3     
Flag Coverage Δ
unittests 98.07% <57.14%> (-0.19%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
commitizen/cmd.py 85.00% <50.00%> (-15.00%) ⬇️
commitizen/__version__.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

UnicodeDecodeError if commit messages contain Unicode characters

3 participants