UnicodeDecodeError if commit messages contain Unicode characters #544

@jenstroeger

Description

If I run

cz changelog

and the commit messages contain Unicode characters like 🤦🏻‍♂️ (whose first two code points alone form an eight-byte UTF-8 sequence: \xf0\x9f\xa4\xa6 \xf0\x9f\x8f\xbb), then I get the following traceback:

Traceback (most recent call last):
  File "/.../.venv/bin/cz", line 8, in <module>
    sys.exit(main())
  File "/.../.venv/lib/python3.10/site-packages/commitizen/cli.py", line 389, in main
    args.func(conf, vars(args))()
  File "/.../.venv/lib/python3.10/site-packages/commitizen/commands/changelog.py", line 143, in __call__
    commits = git.get_commits(
  File "/.../.venv/lib/python3.10/site-packages/commitizen/git.py", line 98, in get_commits
    c = cmd.run(command)
  File "/.../.venv/lib/python3.10/site-packages/commitizen/cmd.py", line 32, in run
    stdout.decode(chardet.detect(stdout)["encoding"] or "utf-8"),
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/encodings/cp1254.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 1689: character maps to <undefined>

The result of chardet.detect() here

stdout.decode(chardet.detect(stdout)["encoding"] or "utf-8"),

is:

{'encoding': 'Windows-1254', 'confidence': 0.6864215607255395, 'language': 'Turkish'}

That is an odd encoding prediction, and a low-confidence one. Because cmd.py uses the detected encoding regardless of confidence, it picks the wrong codec (cp1254) and decoding fails on byte 0x8f, which that codec does not define. Using decode("utf-8") works fine. Issue chardet/chardet#148 looks related.
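To illustrate the failure without involving cz or chardet, here is a minimal sketch using only the standard library. The sample bytes are the eight-byte sequence quoted above with a made-up commit prefix, and try_decode is a hypothetical helper for this demo, not part of commitizen.

```python
# Sample commit text ending in the eight bytes quoted above
# (U+1F926 followed by U+1F3FB). The "fix: ..." prefix is made up.
raw = b"fix: handle emoji \xf0\x9f\xa4\xa6\xf0\x9f\x8f\xbb"


def try_decode(data, encoding):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# UTF-8 decodes the bytes cleanly.
as_utf8 = try_decode(raw, "utf-8")

# cp1254 is the codec Python uses for Windows-1254; it has no mapping
# for byte 0x8f, so decoding fails as in the traceback.
as_cp1254 = try_decode(raw, "cp1254")

print("utf-8 :", repr(as_utf8))
print("cp1254:", repr(as_cp1254))
```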

I think the fix would be to replace the decoding lines in cmd.py with something like this:

stdout, stderr = process.communicate()
return_code = process.returncode
try:
    stdout_s = stdout.decode("utf-8")  # Try this one first.
except UnicodeDecodeError:
    result = chardet.detect(stdout)  # Final result of the UniversalDetector’s prediction.
    # Consider checking confidence value of the result?
    stdout_s = stdout.decode(result["encoding"])
try:
    stderr_s = stderr.decode("utf-8")  # Try this one first.
except UnicodeDecodeError:
    result = chardet.detect(stderr)  # Final result of the UniversalDetector’s prediction.
    # Consider checking confidence value of the result?
    stderr_s = stderr.decode(result["encoding"])
return Command(stdout_s, stderr_s, stdout, stderr, return_code)
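The same idea could be factored into one helper so that stdout and stderr share the logic. What follows is only a sketch under assumptions: decode_output, its detect parameter, and the 0.9 confidence threshold are hypothetical names and values, not commitizen code. The detect argument is expected to behave like chardet.detect, that is, to return a dict with "encoding" and "confidence" keys as in the result shown earlier. The sketch also covers two cases the snippet above leaves open: a detector that returns no encoding, and a detected codec that still fails.

```python
from typing import Callable, Optional


def decode_output(
    raw: bytes,
    detect: Optional[Callable[[bytes], dict]] = None,
    min_confidence: float = 0.9,  # assumed threshold, tune as needed
) -> str:
    """Decode subprocess output, preferring UTF-8 over detection."""
    # 1. Try UTF-8 first; git output is usually UTF-8.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # 2. Fall back to the detector, but only trust confident guesses.
    if detect is not None:
        result = detect(raw)
        encoding = result.get("encoding")
        confidence = result.get("confidence") or 0.0
        if encoding and confidence >= min_confidence:
            try:
                return raw.decode(encoding)
            except (UnicodeDecodeError, LookupError):
                pass  # wrong or unknown codec; fall through

    # 3. Last resort: never raise, mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")
```

With chardet installed, the call would presumably look like decode_output(stdout, detect=chardet.detect). Whether silently replacing bytes in step 3 is acceptable for a changelog is a design choice; raising a clearer error there is the alternative.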

Steps to reproduce

Well, I suppose you can add a few commits to a local branch and go crazy with lots of text and funky Unicode characters (emojis with skin tones, flags, etc.), and then attempt to create a changelog.

Current behavior

cz throws an exception.

Desired behavior

cz creates a changelog.

Screenshots

No response

Environment

> cz version
2.29.3
> python --version
Python 3.10.5
> uname -a
Darwin pooh 18.7.0 Darwin Kernel Version 18.7.0: Mon Feb 10 21:08:45 PST 2020; root:xnu-4903.278.28~1/RELEASE_X86_64 x86_64 i386 Darwin
