-
Notifications
You must be signed in to change notification settings - Fork 53
Description
I'm sure this has been discussed elsewhere and I don't know when, or if, we'll have time to implement it, but I think it's worth adding here.
Currently Python strs (PyUnicodeObject in C) are implemented as arrays of 1, 2 or 4 byte unicode code points, plus a header
I propose that we implement them as a utf8 encoded array of bytes, plus a header.
The advantages are numerous:
- Non-ASCII strings will generally be more compact, without making ASCII strings any bigger.
- Strings can be joined easily by simple concatenation of the data
- The internet is utf8, so the vast majority of encoding and decoding operations should be fast
- There need only be one implmentation of each
strmethod and eachPyUnicode_C function, saving considerable code size and simplifying the code - Algorithms for fast operations on utf8 are well known, so many operations will fast, despite the variable length encoding.
- The C struct is clearer, as we don't need an awkward union of
uint8_t,uint16_tanduint32_tfor character data.
However there are two problems:
- Indexing into strings is supposed to be O(1).
- Some of the C API exposes the internal encoding
Keeping indexing O(1)
To maintain this properly we will have to lazily create an offset table for larger, non-ASCII strings.
This is won't be a problem in practice because:
- Creating the index is no more expensive than creating the current UCS1/2/4 strings.
- We only allocate indexes if we need them, which should be relatively rare.
The C API
We will have to deprecate, and then remove, the C API that exposes the implementation details.
We should probably deprecate for 3.14, so that we can implement utf8 strings in 3.16, allowing a proper deprecation period.
Implementation
We will probably want to embed the string data directly into the object, so the struct will look something like this:
typedef struct {
PyObject_HEAD
uintptr_t interned: 2;
uintptr_t ascii: 1;
uintptr_t valid_utf8: 1;
uintptr_t length: (WORD_SIZE-4); /* Number of code points in the string */
Py_hash_t hash; /* Hash value; -1 if not set */
PyUnicodeIndex *index; /* NULL unless needed */
uint8_t data[1];
} PyUnicodeObject;The valid_utf8 bit helps fast encoding. It is false if the string contains half surrogate pairs or any other code point not allowed in legal utf-8.
Indexing operations
setitem(self, index) would be implemented something like this:
def getitem(self, index):
if self.ascii:
return self.data[index]
if self.index is NULL:
self.index = make_index(self)
offset = offset_from_index(self.index, index)
return read_one_char(self.data, offset)The index table would be composed of len(s)/64 entries, each entry being:
struct index_entry {
uintptr_t base_offset;
uint8_t additional_offset[64];
};With the offset being computed as base_offset[index/64] + additional_offset[index%64].