from prezutils import setup
setup(
    name="Python & Unicode Primer",
    author="Yann Kaiser",
    author_extra=[
        Twitter("@YannKsr"),
        Employer("Criteo", is_hiring=True, at="Palo Alto"),
        PyPI("clize", purpose="Turn functions into CLIs"),
    ],
)
| Ordinal | Description | Example |
|---|---|---|
| U+0041 | LATIN CAPITAL LETTER A | A |
| U+00C9 | LATIN CAPITAL LETTER E WITH ACUTE | É |
| U+516B | CJK UNIFIED IDEOGRAPH-516B | 八 |
| U+1F40D | SNAKE | 🐍 |
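The table can be reproduced with the standard library. This is a minimal sketch assuming Python 3; the helper name `describe` is made up for illustration and is not part of the talk.

```python
import unicodedata

def describe(char):
    """Return the 'U+XXXX' ordinal and the official Unicode name of one character."""
    return "U+{:04X}".format(ord(char)), unicodedata.name(char)

for char in "AÉ八🐍":
    ordinal, name = describe(char)
    # Only the ordinal and name are printed, so the output stays ASCII
    print(ordinal, name)
```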
Codec: coder-decoder
Encoder + Decoder
Decoder: restores some data to an intelligible form
some data ⇒ decoder ⇒ Unicode data
Encoder: converts intelligible data into some other format
Unicode data ⇒ encoder ⇒ some data
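One way to see that a codec is an encoder paired with a decoder is to look one up in the standard `codecs` module. A sketch assuming Python 3; the variable names are illustrative.

```python
import codecs

codec = codecs.lookup("utf-8")  # one codec object carrying both directions

# Encoder: Unicode text -> bytes (also reports how many characters were consumed)
data, chars_consumed = codec.encode("É")

# Decoder: bytes -> Unicode text (also reports how many bytes were consumed)
text, bytes_consumed = codec.decode(data)

print(data, chars_consumed, bytes_consumed)
```

In everyday code the `str.encode` and `bytes.decode` methods do the same work; the lookup is only shown to make the two halves visible.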
| Name | Scope |
|---|---|
| ascii | English alphanumerics and punctuation |
| latin-1 | Western Europe |
| utf-8 | All of Unicode |
| utf-16 | All of Unicode |
(See the docs for codecs)
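The scope of each codec can be checked by trying to encode the four sample characters from earlier. A sketch assuming Python 3; `can_encode` and `coverage` are hypothetical names.

```python
SAMPLES = ["A", "É", "八", "🐍"]
CODECS = ["ascii", "latin-1", "utf-8", "utf-16"]

def can_encode(char, codec):
    """Return True if the codec is able to represent this character."""
    try:
        char.encode(codec)
    except UnicodeEncodeError:
        return False
    return True

# For each codec, the sample characters that fall within its scope
coverage = {
    codec: [char for char in SAMPLES if can_encode(char, codec)]
    for codec in CODECS
}

for codec in CODECS:
    print(codec, len(coverage[codec]), "of", len(SAMPLES))
```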
| Character | A | É | 八 | 🐍 |
|---|---|---|---|---|
| UTF-8 bytes | 41 | C3 89 | E5 85 AB | F0 9F 90 8D |
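These byte values can be recomputed by encoding each character as UTF-8. A sketch assuming Python 3.5 or later (for `bytes.hex`); `utf8_hex` is an illustrative name.

```python
# Map each sample character to the hex form of its UTF-8 encoding
utf8_hex = {char: char.encode("utf-8").hex().upper() for char in "AÉ八🐍"}

for char, hex_bytes in utf8_hex.items():
    # Two hex digits per byte, so the byte count is half the string length
    print("U+{:04X}".format(ord(char)), hex_bytes, len(hex_bytes) // 2, "byte(s)")
```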
bytes and unicode
str: bytes on Python 2
str: unicode on Python 3
(bytes).decode(codec)
(text).encode(codec)
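On Python 3 the two methods look like this in practice. A minimal sketch; the byte string stands in for data that might come from a file or a socket.

```python
raw = b"\xc3\x89lodie"       # bytes: 'Élodie' as UTF-8
text = raw.decode("utf-8")   # bytes -> str (text)
back = text.encode("utf-8")  # str (text) -> bytes

# The text has 6 characters, the UTF-8 data has 7 bytes
print(type(raw).__name__, len(raw), type(text).__name__, len(text))
```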
Things that you can use text with:
json
requests
io.open
str, unicode
bytes, str
decode and encode methods
Trying to decode/encode at random on Python 3
>>> u'Élodie'.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
Trying to decode/encode at random on Python 2
>>> u'Élodie'.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc9' in position 0: ordinal not in range(128)
Trying to decode/encode at random on Python 2 with ASCII data
>>> u'Quick brown fox'.decode('utf-8')
u'Quick brown fox'
Mixing unicode and bytes
>>> u'Élodie' + ' et ' + 'Étienne'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
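For comparison, Python 3 never attempts the implicit ASCII conversion: mixing text and bytes fails immediately with a `TypeError`, whatever the data contains. A sketch assuming Python 3; the variable names are illustrative.

```python
name = "Élodie"
other = "Étienne".encode("utf-8")  # bytes that were never decoded

error = None
try:
    greeting = name + " et " + other  # str + bytes: refused on Python 3
except TypeError as exc:
    error = exc
    # Decode first, then combine text with text
    greeting = name + " et " + other.decode("utf-8")

print(type(error).__name__)
```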
| Decode when you acquire data |
| Unicode throughout your app |
| Encode when you send data out |
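The three steps can be sketched as follows on Python 3. The in-memory streams and the choice of latin-1 input and UTF-8 output are assumptions made for the example; real code would use whatever encodings its sources and destinations actually declare.

```python
import io

# Stand-ins for a real input and output (assumption: input arrives as latin-1)
incoming = io.BytesIO("élodie\nétienne\n".encode("latin-1"))
outgoing = io.BytesIO()

# 1. Decode when you acquire data
text = incoming.read().decode("latin-1")

# 2. Unicode throughout your app
names = [line.capitalize() for line in text.splitlines()]

# 3. Encode when you send data out
outgoing.write(" et ".join(names).encode("utf-8"))

print(len(names), len(outgoing.getvalue()))
```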
Your keystrokes are also subject to encoding, and so is displaying the text.
>>> len(u'É')
1
>>> len('É')
2
>>> 'É'
'\xc3\x89'
(This is most obvious in Python 2, but the same happens in 3)
Python 3 tries to display unicode in your terminal
>>> '\N{SNOWMAN}'
'☃'
$ PYTHONIOENCODING=latin1 python3
>>> print('\N{SNOWMAN}')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2603' in position 0: ordinal not in range(256)
>>> '\N{SNOWMAN}'
'\u2603'
>>> '\xc9tienne'
'ienne'
>>> print(ascii('\xc9tienne'))
'\xc9tienne'
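The same behaviour can be reproduced without touching the terminal by writing through a text stream with a chosen encoding. A sketch assuming Python 3; `write_as` is a hypothetical helper, and `backslashreplace` is one of the standard error handlers.

```python
import io

def write_as(text, encoding, errors="strict"):
    """Write text through a stream using the given encoding; return the raw bytes."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()

snowman = "\N{SNOWMAN}"

try:
    write_as(snowman, "latin-1")  # latin-1 has no snowman
    failed = False
except UnicodeEncodeError:
    failed = True

# Escaping instead of failing, similar in spirit to what ascii() does
escaped = write_as(snowman, "latin-1", errors="backslashreplace")

print(failed, escaped, ascii("\xc9tienne"))
```

Writing `'\xc9tienne'` the same way yields the single byte `C9` followed by `tienne`, which a terminal expecting UTF-8 cannot display correctly.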
$ tail -n +4 present.py | pygmentize
setup(
    name="Python & Unicode Primer",
    author="Yann Kaiser",
    author_extra=[
        Twitter("@YannKsr"),
        Employer("Criteo", is_hiring=True, at="Palo Alto"),
        PyPI("clize", purpose="Turn functions into CLIs"),
    ],
)
$ ./present.py
Presentation complete.