Support for Unicode Fuzzy suggestions #9
Description
Right now, the trie and the Levenshtein automaton that handle prefix searches operate on chars. This means that while we can store and search for prefixes using UTF-8 non-Latin strings (or any encoding, for that matter), fuzzy matching will not work, because Levenshtein distance is calculated in bytes and not in codepoints/letters.
We need to operate on Unicode runes, not on chars. However, doing this with a variable-length encoding like UTF-8 is difficult in the current implementation of the trie, and would probably reduce performance considerably.
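To see the problem concretely, here is a minimal sketch (not the module's actual code, and in Python rather than the module's language) of classic edit distance, run once on codepoints and once on the raw UTF-8 bytes. A single-letter substitution involving a multi-byte letter counts as more than one edit at the byte level:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance over any two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Substituting Greek gamma with Latin 'g' is one edit per letter,
# but gamma is two bytes in UTF-8, so the byte-level distance differs:
print(levenshtein("γδ", "gδ"))                    # -> 1 (codepoints)
print(levenshtein("γδ".encode(), "gδ".encode()))  # -> 2 (bytes)
```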
In order to support this, the following steps will be needed:
- The trie and Levenshtein automaton will be converted to operate on Unicode wide characters, or runes.
- However, to avoid holding 32 bits per letter in the trie, which would consume tons of memory, we'll use 16-bit runes, ignoring anything not contained in the Unicode BMP. This will work for all modern languages.
- The Levenshtein automaton will be converted to use 16-bit runes as well, so evaluating any input string will give correct results in terms of Levenshtein distance. (This means converting the DFA from using a 255-entry array of pointers to the next-state node to a sparse array of pointers. This will slow down searches a bit, but can be sped up by sorting the array and using binary search.)
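The sparse-transition idea from the last step can be sketched like this (a hypothetical illustration, not the module's code): each state keeps its outgoing rune labels in a sorted array, and lookup is a binary search instead of a direct index into a dense per-byte table.

```python
import bisect

class SparseState:
    """Sketch of a DFA state with sparse transitions: two parallel
    lists (sorted rune labels, next states), searched with bisect
    instead of indexing a dense 255-entry pointer array."""

    def __init__(self):
        self.runes = []   # sorted 16-bit rune labels
        self.nexts = []   # nexts[i] is the state reached on runes[i]

    def add_transition(self, rune, state):
        i = bisect.bisect_left(self.runes, rune)
        self.runes.insert(i, rune)
        self.nexts.insert(i, state)

    def step(self, rune):
        i = bisect.bisect_left(self.runes, rune)   # O(log n) lookup
        if i < len(self.runes) and self.runes[i] == rune:
            return self.nexts[i]
        return None   # no transition: dead state

# Usage: a transition on a BMP rune (Hebrew shin, U+05E9)
start, nxt = SparseState(), SparseState()
start.add_transition(0x05E9, nxt)
print(start.step(0x05E9) is nxt)   # -> True
print(start.step(ord("a")))        # -> None
```

The trade-off is exactly as described above: a dense table gives O(1) steps but wastes memory on mostly-empty slots; the sorted sparse array costs O(log n) per step but stores only the transitions that exist.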
Converting text
- We assume all input to FT.SUG* commands is valid UTF-8.
- We convert the input strings to 32-bit Unicode, optionally normalizing, case-folding, and removing accents along the way. If the conversion fails, it's because the input is not valid UTF-8.
- We trim the 32-bit runes to 16-bit runes by taking the lower 16 bits. These can be used for insertion, deletion, and search.
- We convert the output of searches back to UTF-8.
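The conversion steps above can be sketched as follows (a simplified illustration: only case-folding stands in for the normalization/accent-removal step, and the function names are made up for this sketch):

```python
def utf8_to_runes16(data: bytes) -> list[int]:
    """Decode UTF-8 input to 16-bit runes: decode (raises on invalid
    UTF-8), case-fold, then keep only the lower 16 bits of each
    codepoint, discarding anything outside the BMP."""
    text = data.decode("utf-8").lower()
    return [ord(ch) & 0xFFFF for ch in text]

def runes16_to_utf8(runes: list[int]) -> bytes:
    """Encode 16-bit runes back to UTF-8 for returning search results."""
    return "".join(chr(r) for r in runes).encode("utf-8")

print(utf8_to_runes16("Abc".encode("utf-8")))   # -> [97, 98, 99]
print(runes16_to_utf8([97, 98, 99]))            # -> b'abc'
```

Note what the 16-bit trim means for non-BMP input: U+1D11E (a musical symbol) becomes 0xD11E, an unrelated value, which is the cost of ignoring everything outside the BMP.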
Memory Penalty
The above will make the module consume more memory. However, the penalty will not be too bad:
1. If all the strings in the trie are non-Latin UTF-8, there won't be much change, since these letters are already expressed as 2 or 3 bytes in UTF-8.
2. If all strings are Latin ASCII, the text representations in the trie will take exactly 2x more memory, but in reality the penalty will be a lot less: depending on the fragmentation of the tree, anywhere from 20% to 80% of the memory is pointers and metadata, so doubling the rest of the data will not double total memory consumption.
3. Any mix of 1 and 2 will yield a result in between.
- The Levenshtein automaton will take up more memory, but this will be negligible.
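A quick back-of-envelope check of the ASCII case: if a fraction of trie memory is pointers and metadata (the 20%-80% range is the estimate from above, not a measurement), doubling only the text payload grows the total far less than 2x.

```python
def doubling_penalty(overhead_fraction: float) -> float:
    """Relative memory growth when only the text payload doubles and
    `overhead_fraction` of memory (pointers/metadata) stays fixed."""
    text = 1.0 - overhead_fraction
    return overhead_fraction + 2.0 * text   # new size relative to old

print(round(doubling_penalty(0.2), 2))   # -> 1.8 (20% overhead)
print(round(doubling_penalty(0.8), 2))   # -> 1.2 (80% overhead)
```

So across the estimated range, total memory grows by roughly 1.2x to 1.8x rather than doubling.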