Update Unicode handling for Python 3 #214

Merged
skjerns merged 1 commit into holgern:master from DimitriPapadopoulos:unicode
Jun 27, 2023

@DimitriPapadopoulos
Contributor

No description provided.

     return s.decode("latin")
 else:
-    return s.decode("utf-8", "strict")
+    return s.decode("utf_8", "strict")
Collaborator

should this not be equivalent?

Contributor Author

@DimitriPapadopoulos commented Jun 24, 2023

Sure, but `utf_8` is the documented base name for the UTF-8 codec:
https://docs.python.org/3/library/codecs.html#standard-encodings

The rest are aliases. While `utf-8` is not in the explicit list of aliases, `-` is equivalent to `_`:

> Python comes with a number of codecs built-in, either implemented as C functions or with dictionaries as mapping tables. The following table lists the codecs by name, together with a few common aliases, and the languages for which the encoding is likely used. Neither the list of aliases nor the list of languages is meant to be exhaustive. Notice that spelling alternatives that only differ in case or use a hyphen instead of an underscore are also valid aliases; therefore, e.g. 'utf-8' is a valid alias for the 'utf_8' codec.
>
> CPython implementation detail: Some common encodings can bypass the codecs lookup machinery to improve performance. These optimization opportunities are only recognized by CPython for a limited set of (case insensitive) aliases: utf-8, utf8, latin-1, latin1, iso-8859-1, iso8859-1, mbcs (Windows only), ascii, us-ascii, utf-16, utf16, utf-32, utf32, and the same using underscores instead of dashes. Using alternative aliases for these encodings may result in slower execution.
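
For anyone who wants to check the alias equivalence themselves, here is a minimal sketch using only the standard library. The sample string and the variable names are illustrative assumptions, not part of this PR.

```python
import codecs

# Spellings that differ only in case or in hyphen vs. underscore.
spellings = ["utf_8", "utf-8", "UTF-8"]

# Look each spelling up in the codec registry and collect the
# canonical names it reports; equivalent aliases collapse to one entry.
resolved = {codecs.lookup(name).name for name in spellings}

# Decode the same bytes with every spelling (sample text is arbitrary).
data = "naïve café".encode("utf_8")
decoded = {data.decode(name, "strict") for name in spellings}

print(resolved)
print(decoded)
```

Both sets should contain a single element, which is consistent with the change being purely a naming cleanup rather than a behavior change.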

@skjerns
Collaborator

skjerns commented Jun 23, 2023

Thanks! Looks good :) Could you squash the commits into a single commit with a meaningful message? That helps keep the commit history a bit cleaner.

- Get rid of unicode(). In Python 3, `unicode` is an alias of `str`.
  No need to cast a `str` to a `str`.

- Consistently use the base name `utf_8` for the UTF-8 codec.
  https://docs.python.org/3/library/codecs.html#standard-encodings

- Remove a piece of code copied from
  https://cython.readthedocs.io/en/latest/src/tutorial/strings.html
  Replace with the relevant code from the overhauled Python 3 doc:
  https://github.com/minrk/cython-docs/blob/master/src/tutorial/strings.rst
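
As a rough illustration of what the points above amount to on Python 3, here is a pure-Python sketch. It is not the Cython code merged in this PR: `to_text` is a hypothetical helper name, and the choice to accept `bytearray` is an assumption made for the example.

```python
def to_text(s, encoding="utf_8"):
    """Return `s` as `str` (hypothetical helper, not the PR's code)."""
    if isinstance(s, str):
        # Already text on Python 3: no unicode() cast is needed.
        return s
    if isinstance(s, (bytes, bytearray)):
        # Decode strictly, so invalid input raises UnicodeDecodeError.
        return bytes(s).decode(encoding, "strict")
    raise TypeError(f"expected str or bytes, got {type(s).__name__}")


text_from_str = to_text("Grüße")
text_from_bytes = to_text("Grüße".encode("utf_8"))
```

Passing a `str` returns it unchanged, while `bytes` are decoded with the `utf_8` base name; anything else raises `TypeError`.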
@DimitriPapadopoulos
Contributor Author

Done.

@skjerns skjerns merged commit fca95ce into holgern:master Jun 27, 2023
@DimitriPapadopoulos DimitriPapadopoulos deleted the unicode branch June 27, 2023 14:11