Description
If I run
and the commit messages contain Unicode characters like 🤦🏻♂️ (which is an eight-byte utf-8 sequence: \xf0\x9f\xa4\xa6 \xf0\x9f\x8f\xbb) then I get the following traceback
Traceback (most recent call last):
File "/.../.venv/bin/cz", line 8, in <module>
sys.exit(main())
File "/.../.venv/lib/python3.10/site-packages/commitizen/cli.py", line 389, in main
args.func(conf, vars(args))()
File "/.../.venv/lib/python3.10/site-packages/commitizen/commands/changelog.py", line 143, in __call__
commits = git.get_commits(
File "/.../.venv/lib/python3.10/site-packages/commitizen/git.py", line 98, in get_commits
c = cmd.run(command)
File "/.../.venv/lib/python3.10/site-packages/commitizen/cmd.py", line 32, in run
stdout.decode(chardet.detect(stdout)["encoding"] or "utf-8"),
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/encodings/cp1254.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 1689: character maps to <undefined>
The result of chardet.detect() here
|
stdout.decode(chardet.detect(stdout)["encoding"] or "utf-8"), |
is:
{'encoding': 'Windows-1254', 'confidence': 0.6864215607255395, 'language': 'Turkish'}
An interesting character encoding prediction with a low confidence, which in turn picks the incorrect codec and then decoding the bytes fails. Using decode("utf-8") works fine. It looks like issue chardet/chardet#148 is related to this.
I think the fix would be something like this to replace these lines of code:
stdout, stderr = process.communicate()
return_code = process.returncode
try:
stdout_s = stdout.decode("utf-8") # Try this one first.
except UnicodeDecodeError:
result = chardet.detect(stdout) # Final result of the UniversalDetector’s prediction.
# Consider checking confidence value of the result?
stdout_s = stdout.decode(result["encoding"])
try:
stderr_s = stderr.decode("utf-8") # Try this one first.
except UnicodeDecodeError:
result = chardet.detect(stderr) # Final result of the UniversalDetector’s prediction.
# Consider checking confidence value of the result?
stderr_s = stderr.decode(result["encoding"])
return Command(stdout_s, stderr_s, stdout, stderr, return_code)
Steps to reproduce
Well I suppose you can add a few commits to a local branch an go crazy with much text and funky unicode characters (emojis with skin tones, flags, etc.), and then attempt to create a changelog.
Current behavior
cz throws an exception.
Desired behavior
cz creates a changelog.
Screenshots
No response
Environment
> cz version
2.29.3
> python --version
Python 3.10.5
> uname -a
Darwin pooh 18.7.0 Darwin Kernel Version 18.7.0: Mon Feb 10 21:08:45 PST 2020; root:xnu-4903.278.28~1/RELEASE_X86_64 x86_64 i386 Darwin
Description
If I run
and the commit messages contain Unicode characters like 🤦🏻♂️ (which is an eight-byte utf-8 sequence:
\xf0\x9f\xa4\xa6 \xf0\x9f\x8f\xbb) then I get the following tracebackThe result of
chardet.detect()herecommitizen/commitizen/cmd.py
Line 26 in 2ff9f15
is:
An interesting character encoding prediction with a low confidence, which in turn picks the incorrect codec and then decoding the
bytesfails. Usingdecode("utf-8")works fine. It looks like issue chardet/chardet#148 is related to this.I think the fix would be something like this to replace these lines of code:
Steps to reproduce
Well I suppose you can add a few commits to a local branch an go crazy with much text and funky unicode characters (emojis with skin tones, flags, etc.), and then attempt to create a changelog.
Current behavior
czthrows an exception.Desired behavior
czcreates a changelog.Screenshots
No response
Environment