Conversation
```diff
 func tokenize(text: String) -> [String] {
-    let splitTokens = text.folding(options: .diacriticInsensitive, locale: nil)
-        .components(separatedBy: NSCharacterSet.whitespaces)
+    let splitTokens = text.components(separatedBy: NSCharacterSet.whitespaces)
```
Lowercase normalization should have happened before reaching this point.
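To illustrate the intended ordering, here's a minimal Python sketch (hypothetical helper names; the actual code is Swift): normalization, including lowercasing and accent stripping, runs as an earlier pass, so the tokenizer itself only splits on whitespace:

```python
import unicodedata

def basic_clean(text: str, lowercase: bool = True) -> str:
    # Hypothetical normalization pass: lowercase, then drop combining
    # diacritical marks (U+0300-U+036F) from the NFD decomposition.
    if lowercase:
        text = text.lower()
        text = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in text if not (0x0300 <= ord(ch) <= 0x036F))
    return text

def tokenize(text: str) -> list[str]:
    # Whitespace splitting only; normalization already happened upstream.
    return text.split()

print(tokenize(basic_clean("Héllo Wörld")))  # ['hello', 'world']
```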
```diff
 self.shouldHandleChineseChars = config.handleChineseChars?.boolValue ?? true
-self.shouldStripAccents = config.stripAccents?.boolValue
 self.shouldLowercase = config.lowercase?.boolValue ?? true
+self.shouldStripAccents = config.stripAccents?.boolValue ?? shouldLowercase
```
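The defaulting rule in the diff above can be sketched in Python (a hypothetical helper, mirroring `config.stripAccents?.boolValue ?? shouldLowercase`): when the config doesn't specify accent stripping, it follows the lowercase setting.

```python
def resolve_strip_accents(strip_accents, lowercase=True):
    # If the config doesn't set strip_accents explicitly, fall back to
    # the lowercase flag, like `stripAccents?.boolValue ?? shouldLowercase`.
    return strip_accents if strip_accents is not None else lowercase

print(resolve_strip_accents(None, lowercase=True))   # True
print(resolve_strip_accents(False, lowercase=True))  # False
```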
```diff
-text.decomposedStringWithCanonicalMapping
-    .filter {
-        $0.unicodeScalars.allSatisfy { scalar in
-            !(0x0300 <= scalar.value && scalar.value <= 0x036F)
-        }
-    }
+// This might be the same as `text.folding(options: .diacriticInsensitive, locale: nil)`
+String(text.decomposedStringWithCanonicalMapping.unicodeScalars.filter { scalar in
+    !(0x0300 <= scalar.value && scalar.value <= 0x036F)
+})
```
This was one of the main problems: `stripAccents` was removing the accented characters entirely, when it should only remove the diacritic marks. Note that `stripAccents` is also performed by default when lowercasing is enabled.
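A rough Python equivalent of the fixed behavior (a sketch, not the shipped Swift code): decompose to NFD, then drop only the combining marks in U+0300-U+036F, keeping the base letters.

```python
import unicodedata

def strip_accents(text: str) -> str:
    # Decompose accented characters (NFD), then remove only the combining
    # diacritical marks in U+0300-U+036F, keeping the base letters.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not (0x0300 <= ord(ch) <= 0x036F))

# The buggy version filtered out whole characters that carried a mark,
# so "élève" lost its accented letters instead of just the accents.
print(strip_accents("élève"))  # "eleve"
```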
Thanks for reporting #156, @piotrkowalczuk! There were several inconsistencies that should be resolved now, and I added much more comprehensive tests. I double-checked the test cases against the Python and Rust implementations, and some of the most confusing ones against the original codebase.

Note there's still an inconsistency with respect to the "fast" Rust-based transformers tokenizers when decoding tokenized data.

Fast tokenizer:

```python
>>> from transformers import AutoTokenizer
>>> t = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
>>> t.decode(t.encode("l'eure"), skip_special_tokens=True)
"l ' eure"
```

Slow (Python-based) tokenizer:

```python
>>> t = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased", use_fast=False)
>>> t.decode(t.encode("l'eure"), skip_special_tokens=True)
"l'eure"
```

We believe the "slow" interpretation to be correct, and it's the same approach @xenova took for transformers.js. There's no decoding code in the reference implementation, but I believe the intention was to strip those characters.

Could you please give it a final look / test drive before merging, @piotrkowalczuk? cc @Vaibhavs10 because this was fun.
LGTM. I was able to integrate your branch into my app. All the regression cases worked fine. Thanks @pcuenca.
Cool, merging then!
Fixes #156