Support for Unicode Fuzzy suggestions #9
Description
Right now, the trie and the Levenshtein automaton that handle prefix searches operate on chars. This means that while we can store and search for prefixes using UTF-8 non-Latin strings (or any encoding, for that matter), fuzzy matching will not work, because Levenshtein distance is calculated in bytes and not in codepoints/letters.
We need to operate on Unicode runes, not on chars. However, doing this with a variable-length encoding like UTF-8 is difficult in the current implementation of the trie, and would probably reduce performance considerably.
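To see the problem concretely, here is a minimal sketch (not the module's actual code, and in Python rather than the module's language) of classic edit distance, run once on codepoints and once on the raw UTF-8 bytes. A single-letter substitution involving a multi-byte letter counts as more than one edit at the byte level:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance over any two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Substituting Greek gamma with Latin 'g' is one edit per letter,
# but gamma is two bytes in UTF-8, so the byte-level distance differs:
print(levenshtein("γδ", "gδ"))                    # -> 1 (codepoints)
print(levenshtein("γδ".encode(), "gδ".encode()))  # -> 2 (bytes)
```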
In order to support this, the following steps will be needed:
- The trie and Levenshtein automaton will be converted to operate on Unicode wide characters, or runes.
- However, to avoid holding 32 bits per letter in the trie, which would consume tons of memory, we'll use 16-bit runes, ignoring anything not contained in the Unicode BMP. This will work for all modern languages.
- The Levenshtein automaton will be converted to use 16-bit runes as well, so evaluating any input string will give correct results in terms of Levenshtein distance. (This means converting the DFA from using a 255-entry array of pointers to the next-state node to a sparse array of pointers. This will slow down searches a bit, but can be sped up by sorting the array and using binary search.)
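The sparse-transition idea from the last step can be sketched like this (a hypothetical illustration, not the module's code): each state keeps its outgoing rune labels in a sorted array, and lookup is a binary search instead of a direct index into a dense per-byte table.

```python
import bisect

class SparseState:
    """Sketch of a DFA state with sparse transitions: two parallel
    lists (sorted rune labels, next states), searched with bisect
    instead of indexing a dense 255-entry pointer array."""

    def __init__(self):
        self.runes = []   # sorted 16-bit rune labels
        self.nexts = []   # nexts[i] is the state reached on runes[i]

    def add_transition(self, rune, state):
        i = bisect.bisect_left(self.runes, rune)
        self.runes.insert(i, rune)
        self.nexts.insert(i, state)

    def step(self, rune):
        i = bisect.bisect_left(self.runes, rune)   # O(log n) lookup
        if i < len(self.runes) and self.runes[i] == rune:
            return self.nexts[i]
        return None   # no transition: dead state

# Usage: a transition on a BMP rune (Hebrew shin, U+05E9)
start, nxt = SparseState(), SparseState()
start.add_transition(0x05E9, nxt)
print(start.step(0x05E9) is nxt)   # -> True
print(start.step(ord("a")))        # -> None
```

The trade-off is exactly as described above: a dense table gives O(1) steps but wastes memory on mostly-empty slots; the sorted sparse array costs O(log n) per step but stores only the transitions that exist.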
Converting text
- We assume all input to FT.SUG* commands is valid UTF-8.
- We convert the input strings to 32-bit Unicode, optionally normalizing, case-folding, and removing accents along the way. If the conversion fails, it's because the input is not valid UTF-8.
- We trim the 32-bit runes to 16-bit runes by taking the lower 16 bits. These can be used for insertion, deletion, and search.
- We convert the output of searches back to UTF-8.
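The conversion steps above can be sketched as follows (a simplified illustration: only case-folding stands in for the normalization/accent-removal step, and the function names are made up for this sketch):

```python
def utf8_to_runes16(data: bytes) -> list[int]:
    """Decode UTF-8 input to 16-bit runes: decode (raises on invalid
    UTF-8), case-fold, then keep only the lower 16 bits of each
    codepoint, discarding anything outside the BMP."""
    text = data.decode("utf-8").lower()
    return [ord(ch) & 0xFFFF for ch in text]

def runes16_to_utf8(runes: list[int]) -> bytes:
    """Encode 16-bit runes back to UTF-8 for returning search results."""
    return "".join(chr(r) for r in runes).encode("utf-8")

print(utf8_to_runes16("Abc".encode("utf-8")))   # -> [97, 98, 99]
print(runes16_to_utf8([97, 98, 99]))            # -> b'abc'
```

Note what the 16-bit trim means for non-BMP input: U+1D11E (a musical symbol) becomes 0xD11E, an unrelated value, which is the cost of ignoring everything outside the BMP.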
Memory Penalty
The above will make the module consume more memory. However, the penalty will not be too bad:
1. If all the strings in the trie are non-Latin UTF-8, there won't be much change, since these letters are already expressed as 2 or 3 bytes in UTF-8.
2. If all strings are Latin ASCII, the text representations in the trie will take exactly 2x more memory, but in reality the penalty will be a lot less: depending on the fragmentation of the tree, anywhere from 20% to 80% of the memory is pointers and metadata, so doubling the rest of the data will not double total memory consumption.
3. Any mix of 1 and 2 will yield a result in between.
- The Levenshtein automaton will take up more memory, but this will be negligible.
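A quick back-of-envelope check of the ASCII case: if a fraction of trie memory is pointers and metadata (the 20%-80% range is the estimate from above, not a measurement), doubling only the text payload grows the total far less than 2x.

```python
def doubling_penalty(overhead_fraction: float) -> float:
    """Relative memory growth when only the text payload doubles and
    `overhead_fraction` of memory (pointers/metadata) stays fixed."""
    text = 1.0 - overhead_fraction
    return overhead_fraction + 2.0 * text   # new size relative to old

print(round(doubling_penalty(0.2), 2))   # -> 1.8 (20% overhead)
print(round(doubling_penalty(0.8), 2))   # -> 1.2 (80% overhead)
```

So across the estimated range, total memory grows by roughly 1.2x to 1.8x rather than doubling.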