Unicode subscripts and superscripts in identifiers: why does Python consider Xu == Xᵘ == Xᵤ?

Python allows Unicode identifiers. I defined Xᵘ = 42, expecting Xu and Xᵤ to result in a NameError. But in reality, when I define Xᵘ, Python (silently?) turns Xᵘ into Xu, which strikes me as a somewhat unpythonic thing to do. Why is this happening?

>>> Xᵘ = 42
>>> print((Xu, Xᵘ, Xᵤ))
(42, 42, 42)

Solution:

Python converts all identifiers to their NFKC normal form; from the Identifiers section of the reference documentation:

All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.

The NFKC form of both the super and subscript characters is the lowercase u:

>>> import unicodedata
>>> unicodedata.normalize('NFKC', 'Xᵘ Xᵤ')
'Xu Xu'

So in the end, all you have is a single identifier, Xu:

>>> import dis
>>> dis.dis(compile('Xᵘ = 42\nprint((Xu, Xᵘ, Xᵤ))', '', 'exec'))
  1           0 LOAD_CONST               0 (42)
              2 STORE_NAME               0 (Xu)

  2           4 LOAD_NAME                1 (print)
              6 LOAD_NAME                0 (Xu)
              8 LOAD_NAME                0 (Xu)
             10 LOAD_NAME                0 (Xu)
             12 BUILD_TUPLE              3
             14 CALL_FUNCTION            1
             16 POP_TOP
             18 LOAD_CONST               1 (None)
             20 RETURN_VALUE

The disassembly above shows that the identifiers were already normalized by the time the bytecode was produced. This happens during parsing: identifiers are normalized when the AST (Abstract Syntax Tree) is created, and the compiler then produces bytecode from that tree.
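
You can also check this without reading bytecode by inspecting the AST directly. This is a minimal sketch assuming CPython 3; the variable names (source, tree, names) are mine and not part of the question:

```python
import ast

# Same snippet as in the question: one assignment, three differently spelled reads.
source = "Xᵘ = 42\nprint((Xu, Xᵘ, Xᵤ))"
tree = ast.parse(source)

# Collect every identifier that appears as a Name node in the tree.
names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
print(names)
```

The only identifiers left in the tree are Xu and print, so the three spellings are indistinguishable before the compiler ever runs.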

Identifiers are normalized to avoid ‘look-alike’ bugs, where you could otherwise end up using both ﬁnd() (written with the U+FB01 LATIN SMALL LIGATURE FI character followed by the ASCII characters nd) and find(), and wonder why your code has a bug.
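
To make the look-alike case concrete, here is a small sketch assuming Python 3. The ligature is written as an escape so the two spellings are easy to tell apart, and the names ligature_name, ascii_name and namespace are illustrative only:

```python
import unicodedata

ligature_name = "\ufb01nd"  # U+FB01 LATIN SMALL LIGATURE FI followed by "nd"
ascii_name = "find"

# As plain strings the two spellings differ...
print(ligature_name == ascii_name)
# ...but they are identical after NFKC normalization, which the parser applies.
print(unicodedata.normalize("NFKC", ligature_name) == ascii_name)

# Define a function under the ligature spelling, then call the ASCII spelling.
namespace = {}
code = "def " + ligature_name + "():\n    return 'found'\nresult = find()"
exec(code, namespace)
print(namespace["result"])
```

Because both spellings resolve to the same name, there is only ever one function, rather than two that look the same on screen.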

Is __repr__ supposed to return bytes or unicode?

In Python 3 and Python 2, is __repr__ supposed to return bytes or unicode? A reference and quote would be ideal.

Here’s some information about 2-3 compatibility, but I don’t see the answer.

Solution:

The type is str in both Python 2.x and Python 3.x (the output below is from Python 3; Python 2 prints <type 'str'>):

>>> type(repr(object()))
<class 'str'>

This has to be the case because __str__ defaults to calling __repr__ if the former is not present, but __str__ has to return a str.

For those not aware: in Python 3.x, str is the type that represents Unicode text, while in Python 2.x, str is the type that represents bytes. So __repr__ returns Unicode in Python 3 and bytes in Python 2.
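
A quick way to see the rule enforced is this sketch, which assumes Python 3. The classes Point and BrokenPoint are hypothetical examples, not from the question:

```python
class Point:
    def __repr__(self):
        return "Point()"  # a str, as required


class BrokenPoint:
    def __repr__(self):
        return b"BrokenPoint()"  # bytes: rejected by repr() in Python 3


good = repr(Point())
fallback = str(Point())  # no __str__ defined, so __repr__ is used

try:
    repr(BrokenPoint())
    error = None
except TypeError as exc:
    error = exc

print(good, fallback, type(error).__name__)
```

Returning bytes from __repr__ raises a TypeError in Python 3, and str() on an object without __str__ falls back to the same __repr__ result.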

Weird behaviour of non-ASCII Python identifiers

I have learnt from PEP 3131 that non-ASCII identifiers are supported in Python, though using them is not considered best practice.

However, I get this strange behaviour, where my 𝜏 identifier (U+1D70F) seems to be automatically converted to τ (U+03C4).

class Base(object):
    def __init__(self):
        self.𝜏 = 5 # defined with U+1D70F

a = Base()
print(a.𝜏)     # 5             # (U+1D70F)
print(a.τ)     # 5 as well     # (U+03C4) ? another way to access it?
d = a.__dict__ # {'τ':  5}     # (U+03C4) ? seems converted
print(d['τ'])  # 5             # (U+03C4) ? consistent with the conversion
print(d['𝜏'])  # KeyError: '𝜏' # (U+1D70F) ?! unexpected!

Is that expected behaviour? Why does this silent conversion occur? Does it have anything to do with NFKC normalization? I thought this was only for canonically ordering Unicode character sequences.

Solution:

Per the documentation on identifiers:

All identifiers are converted into the normal form NFKC while parsing;
comparison of identifiers is based on NFKC.

You can see that U+03C4 is the appropriate result using unicodedata:

>>> import unicodedata
>>> unicodedata.normalize('NFKC', '𝜏')
'τ'
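
On the question of whether normalization is only about canonical ordering: NFKC also applies compatibility mappings, and that is the part that matters here. A small sketch comparing the two forms (the variable names are mine):

```python
import unicodedata

math_tau = "\U0001D70F"  # MATHEMATICAL ITALIC SMALL TAU
greek_tau = "\u03C4"     # GREEK SMALL LETTER TAU

# Canonical normalization (NFC) leaves the mathematical letter untouched.
nfc = unicodedata.normalize("NFC", math_tau)
# Compatibility normalization (NFKC) folds it into the plain Greek letter.
nfkc = unicodedata.normalize("NFKC", math_tau)

# The Unicode database records this as a "<font>" compatibility mapping.
mapping = unicodedata.decomposition(math_tau)

print(nfc == math_tau, nfkc == greek_tau, mapping)
```

So the conversion comes from the K (compatibility) part of NFKC, not from canonical reordering.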

However, this conversion doesn’t apply to string literals, like the one you’re using as a dictionary key, hence it’s looking for the unconverted character in a dictionary that only contains the converted character.

self.𝜏 = 5  # implicitly converted to "self.τ = 5"
a.𝜏  # implicitly converted to "a.τ"
d['𝜏']  # not converted

You can see similar problems with e.g. string literals used with getattr:

>>> getattr(a, '𝜏')
Traceback (most recent call last):
  File "python", line 1, in <module>
AttributeError: 'Base' object has no attribute '𝜏'
>>> getattr(a, unicodedata.normalize('NFKC', '𝜏'))
5
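
If attribute names arrive as strings (for example from user input or a config file), one option is to normalize them yourself before the lookup. This is only a sketch: getattr_nfkc is a hypothetical helper, not a standard library function, and it assumes Python 3:

```python
import unicodedata


class Base:
    def __init__(self):
        self.𝜏 = 5  # written with U+1D70F, stored under U+03C4


def getattr_nfkc(obj, name):
    """Look up `name` using the spelling the parser would have produced."""
    return getattr(obj, unicodedata.normalize("NFKC", name))


a = Base()
math_tau = "\U0001D70F"

value = getattr_nfkc(a, math_tau)  # normalized first, so the lookup succeeds
keys = list(vars(a))               # the instance dict holds only the NFKC form
print(value, keys, hasattr(a, math_tau))
```

The same normalization step applies to keys used against __dict__, since that dictionary only ever contains the NFKC spelling.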

Why does \R behave differently in regular expressions between Java 8 and Java 9?

The following code compiles in both Java 8 & 9, but behaves differently.

class Simple {
    static String sample = "\nEn un lugar\r\nde la Mancha\nde cuyo nombre\r\nno quiero acordame";

    public static void main(String args[]){
        String[] chunks = sample.split("\\R\\R");
        for (String chunk: chunks) {
            System.out.println("Chunk : "+chunk);
        }
    }
}

When I run it with Java 8 it returns:

Chunk : 
En un lugar
de la Mancha
de cuyo nombre
no quiero acordame

But when I run it with Java 9 the output is different:

Chunk : 
En un lugar
Chunk : de la Mancha
de cuyo nombre
Chunk : no quiero acordame

Why?

Solution:

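It was a bug in Java 8, and it was fixed in Java 9: JDK-8176029, “Linebreak matcher is not equivalent to the pattern as stated in javadoc”. The Javadoc documents \R as equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]. Under that definition, \R\R can match a single \r\n (the first \R takes \r, the second takes \n), which is what Java 9 does and why the sample is now split at each \r\n. Java 8 did not match \r\n that way, so no split occurred.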
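
To see what the documented pattern does with the sample, here is a sketch in Python rather than Java. Python's re module has no \R, so LINEBREAK below is my transliteration of the Javadoc's stated equivalent. It illustrates the pattern's semantics and is not a reproduction of the Java regex engine:

```python
import re

# The Javadoc's stated equivalent of \R, written as a Python regex.
LINEBREAK = r"(?:\r\n|[\n\x0b\x0c\r\x85\u2028\u2029])"

sample = "\nEn un lugar\r\nde la Mancha\nde cuyo nombre\r\nno quiero acordame"

# Two consecutive linebreak matchers, as in split("\\R\\R").
chunks = re.split(LINEBREAK + LINEBREAK, sample)
for chunk in chunks:
    print("Chunk : " + chunk)
```

With the documented pattern, each \r\n can be consumed as \r followed by \n, so the split yields three chunks, matching the Java 9 output in the question.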

Also see: Java-8 regex negative lookbehind with `\R`