Use UTF-8 internally for strings.

I'm sure this has been discussed elsewhere and I don't know when, or if, we'll have time to implement it, but I think it's worth adding here.

Currently Python `str`s (`PyUnicodeObject` in C) are implemented as arrays of 1, 2 or 4 byte unicode code points, plus a header
I propose that we implement them as a [utf8](https://en.wikipedia.org/wiki/UTF-8) encoded array of bytes, plus a header.

The advantages are numerous:
* Non-ASCII strings will generally be more compact, without making ASCII strings any bigger.
* Strings can be joined easily by simple concatenation of the data
* The internet is utf8, so the vast majority of encoding and decoding operations should be fast
* There need only be one implmentation of each `str` method and each `PyUnicode_` C function, saving considerable code size and simplifying the code
* Algorithms for fast operations on utf8 are well known, so many operations will fast, despite the variable length encoding.
* The C struct is clearer, as we don't need an awkward union of `uint8_t`, `uint16_t` and `uint32_t` for character data.

However there are two problems:
* Indexing into strings is supposed to be O(1).
* Some of the C API exposes the internal encoding

### Keeping indexing O(1)

To maintain this properly we will have to lazily create an offset table for larger, non-ASCII strings.
This is won't be a problem in practice because:
* Creating the index is no more expensive than creating the current UCS1/2/4 strings.
* We only allocate indexes if we need them, which should be relatively rare.

### The C API

We will have to deprecate, and then remove, the C API that exposes the implementation details.
We should probably deprecate for 3.14, so that we can implement utf8 strings in 3.16, allowing a proper deprecation period.

## Implementation

We will probably want to embed the string data directly into the object, so the struct will look something like this:
```C
typedef struct {
    PyObject_HEAD
    uintptr_t interned: 2;   
    uintptr_t ascii: 1;   
    uintptr_t valid_utf8: 1;   
    uintptr_t length: (WORD_SIZE-4);    /* Number of code points in the string */
    Py_hash_t hash;             /* Hash value; -1 if not set */
    PyUnicodeIndex *index;    /* NULL unless needed */
    uint8_t data[1];
} PyUnicodeObject;
```
The `valid_utf8` bit helps fast encoding. It is false if the string contains half surrogate pairs or any other code point not allowed in legal utf-8.

### Indexing operations

`setitem(self, index)` would be implemented something like this:

```Py
def getitem(self, index):
     if self.ascii:
         return self.data[index]
    if self.index is NULL:
        self.index = make_index(self)
    offset = offset_from_index(self.index, index)
    return read_one_char(self.data, offset)
```

The index table would be composed of len(s)/64 entries, each entry being:
```C
struct index_entry {
     uintptr_t base_offset;
     uint8_t additional_offset[64];
};
```
With the offset being computed as `base_offset[index/64] + additional_offset[index%64]`.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use UTF-8 internally for strings. #684

Keeping indexing O(1)

The C API

Implementation

Indexing operations

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Use UTF-8 internally for strings. #684

Description

Keeping indexing O(1)

The C API

Implementation

Indexing operations

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions