from prezutils import setup


setup(
    name="Python & Unicode Primer",
    author="Yann Kaiser",
    author_extra=[
        Twitter("@YannKsr"),
        Employer("Criteo", is_hiring=True, at="Palo Alto"),
        PyPI("clize", purpose="Turn functions into CLIs"),
    ],
)
          

Terminology

Unicode

Ordinal Description Example
U+0041 LATIN CAPITAL LETTER A A
U+00C9 LATIN CAPITAL LETTER E WITH ACUTE É
U+516b CJK UNIFIED IDEOGRAPH-5168
U+1f40d SNAKE 🐍

Codec

Coder-decoder

Encoder + Decoder

Decoder

Restitutes some data in intelligible form

some data ⇒ decoder ⇒ Unicode data

Encoder

Takes intelligible data into some other format

Unicode ⇒ codec ⇒ some data

Name Scope
ascii English alphanumerics and punctuation
latin-1 Western Europe
utf-8 All of Unicode
utf-16 All of Unicode

(See the docs for codecs)

A É 🐍
41 C3 89 E5 85 AB F0 9F 90 8D

Recap

  • All your characters are belong to Unicode
  • Ethernet wire doesn't speak Unicode, need codec
  • export = encode, import = decode
  • UTF-8 and UTF-16 are codecs that handles all of Unicode

What Python has

bytes and unicode

str: bytes on Python 2

str: unicode on Python 3

(bytes).decode(codec)

(text).encode(codec)

Things that you can use text with:

  • json
  • requests
  • io.open

Recap

  • Two types: for bytes, for text
  • Python 2: str, unicode
  • Python 3: bytes, str
  • decode and encode methods
  • Some objects and functions do the conversions for you

Commmon issues

Trying to decode/encode at random

>>> u'Élodie'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'

Trying to decode/encode at random on Python 2

>>> u'Élodie'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc9' in position 0: ordinal not in range(128)

Trying to decode/encode at random on Python 2 with ASCII data

>>> u'Quick brown fox'.decode('utf-8')
u'Quick brown fox'

Mixing unicode and bytes

>>> u'Élodie' + ' et ' + 'Étienne'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Solution: The Unicode Sandwich

Decode when you acquire data
Unicode throughout your app
Encode when you send data out

Your keystrokes are also subject to encoding, and so is displaying the text.

>>> len(u'É')
1
>>> len('É')
2
>>> 'É'
'\xc3\x89'

(This is most obvious in Python 2, but the same happens in 3)

Python 3 tries to display unicode in your terminal

>>> '\N{SNOWMAN}'
'☃'
$ PYTHONIOENCODING=latin1 python3
>>> print('\N{SNOWMAN}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2603' in position 0: ordinal not in range(256)
>>> '\N{SNOWMAN}'
'\u2603'
>>> '\xc9tienne'
'ienne'
>>> print(ascii('\xc9tienne'))
'\xc9tienne'

Recap

  • Python 2 tries to autoconvert. This hides mistakes until it is too late.
  • The Unicode sandwich:
  • Decode data as you get it
  • Encode data before you hand it off to IO
  • Python 2: Don't mix string types

$ tail -n +4 present.py | pygmentize
setup(
    name="Python & Unicode Primer",
    author="Yann Kaiser",
    author_extra=[
        Twitter("@YannKsr"),
        Employer("Criteo", is_hiring=True, at="Palo Alto"),
        PyPI("clize", purpose="Turn functions into CLIs"),
    ],
)
$ ./present.py
Presentation complete.