Skip to content

Use UTF-8 internally for strings. #684

@markshannon

Description

@markshannon

I'm sure this has been discussed elsewhere and I don't know when, or if, we'll have time to implement it, but I think it's worth adding here.

Currently Python strs (PyUnicodeObject in C) are implemented as arrays of 1, 2 or 4 byte unicode code points, plus a header
I propose that we implement them as a utf8 encoded array of bytes, plus a header.

The advantages are numerous:

  • Non-ASCII strings will generally be more compact, without making ASCII strings any bigger.
  • Strings can be joined easily by simple concatenation of the data
  • The internet is utf8, so the vast majority of encoding and decoding operations should be fast
  • There need only be one implmentation of each str method and each PyUnicode_ C function, saving considerable code size and simplifying the code
  • Algorithms for fast operations on utf8 are well known, so many operations will fast, despite the variable length encoding.
  • The C struct is clearer, as we don't need an awkward union of uint8_t, uint16_t and uint32_t for character data.

However there are two problems:

  • Indexing into strings is supposed to be O(1).
  • Some of the C API exposes the internal encoding

Keeping indexing O(1)

To maintain this properly we will have to lazily create an offset table for larger, non-ASCII strings.
This is won't be a problem in practice because:

  • Creating the index is no more expensive than creating the current UCS1/2/4 strings.
  • We only allocate indexes if we need them, which should be relatively rare.

The C API

We will have to deprecate, and then remove, the C API that exposes the implementation details.
We should probably deprecate for 3.14, so that we can implement utf8 strings in 3.16, allowing a proper deprecation period.

Implementation

We will probably want to embed the string data directly into the object, so the struct will look something like this:

typedef struct {
    PyObject_HEAD
    uintptr_t interned: 2;   
    uintptr_t ascii: 1;   
    uintptr_t valid_utf8: 1;   
    uintptr_t length: (WORD_SIZE-4);    /* Number of code points in the string */
    Py_hash_t hash;             /* Hash value; -1 if not set */
    PyUnicodeIndex *index;    /* NULL unless needed */
    uint8_t data[1];
} PyUnicodeObject;

The valid_utf8 bit helps fast encoding. It is false if the string contains half surrogate pairs or any other code point not allowed in legal utf-8.

Indexing operations

setitem(self, index) would be implemented something like this:

def getitem(self, index):
     if self.ascii:
         return self.data[index]
    if self.index is NULL:
        self.index = make_index(self)
    offset = offset_from_index(self.index, index)
    return read_one_char(self.data, offset)

The index table would be composed of len(s)/64 entries, each entry being:

struct index_entry {
     uintptr_t base_offset;
     uint8_t additional_offset[64];
};

With the offset being computed as base_offset[index/64] + additional_offset[index%64].

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions