Bert fixes#157

Merged
pcuenca merged 15 commits into main from bert-fixes
Jan 17, 2025

Conversation


@pcuenca pcuenca commented Jan 16, 2025

Fixes #156

@pcuenca pcuenca marked this pull request as draft January 16, 2025 20:46
  func tokenize(text: String) -> [String] {
-     let splitTokens = text.folding(options: .diacriticInsensitive, locale: nil)
-         .components(separatedBy: NSCharacterSet.whitespaces)
+     let splitTokens = text.components(separatedBy: NSCharacterSet.whitespaces)
Lowercase normalization should have happened before reaching this point.
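The intended ordering can be sketched in Python (function names `normalize` and `whitespace_tokenize` are illustrative, not from the codebase): lowercasing and accent stripping happen during normalization, so the whitespace split itself no longer needs any diacritic folding.

```python
import unicodedata

def normalize(text, lowercase=True, strip_accents=True):
    # Runs before tokenization: lowercase first, then strip diacritic marks.
    if lowercase:
        text = text.lower()
    if strip_accents:
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(c for c in decomposed if not (0x0300 <= ord(c) <= 0x036F))
    return text

def whitespace_tokenize(text):
    # By this point the text is already normalized; just split on whitespace.
    return text.split()

print(whitespace_tokenize(normalize("Déjà Vu")))  # -> ['deja', 'vu']
```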

  self.shouldHandleChineseChars = config.handleChineseChars?.boolValue ?? true
- self.shouldStripAccents = config.stripAccents?.boolValue
  self.shouldLowercase = config.lowercase?.boolValue ?? true
+ self.shouldStripAccents = config.stripAccents?.boolValue ?? shouldLowercase
Comment on lines -222 to +225
- text.decomposedStringWithCanonicalMapping
-     .filter {
-         $0.unicodeScalars.allSatisfy { scalar in
-             !(0x0300 <= scalar.value && scalar.value <= 0x036F)
-         }
-     }
+ // This might be the same as `text.folding(options: .diacriticInsensitive, locale: nil)`
+ String(text.decomposedStringWithCanonicalMapping.unicodeScalars.filter { scalar in
+     !(0x0300 <= scalar.value && scalar.value <= 0x036F)
+ })
This was one of the main problems: stripAccents was removing accented characters entirely, when it should only remove the diacritic marks.
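The distinction can be illustrated with a Python sketch mirroring the corrected Swift code (the helper name is made up): decompose to NFD, then drop only scalars in the combining diacritical marks block (U+0300 to U+036F), so base characters like the "e" in "é" survive.

```python
import unicodedata

def strip_diacritic_marks(text):
    # Decompose so accents become separate combining scalars, then remove
    # only those marks; the base characters are kept.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not (0x0300 <= ord(c) <= 0x036F))

print(strip_diacritic_marks("café"))  # -> cafe
```

The buggy behavior filtered whole grapheme clusters instead of individual scalars, so a decomposed "é" (base letter plus combining mark) was dropped entirely.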


pcuenca commented Jan 17, 2025

#156 (comment)

@pcuenca pcuenca marked this pull request as ready for review January 17, 2025 16:20

pcuenca commented Jan 17, 2025

Thanks for reporting #156, @piotrkowalczuk! There were several inconsistencies that should be resolved now, and I added much more comprehensive tests. I double-checked the test cases against the Python and Rust implementations, and some of the most confusing ones against the original codebase.

Note there's still an inconsistency with respect to the "fast" Rust-based transformers tokenizers when decoding tokenized data:

Fast tokenizer:

>>> from transformers import AutoTokenizer
>>> t = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
>>> t.decode(t.encode("l'eure"), skip_special_tokens=True)
"l ' eure"

Slow (Python based) tokenizer:

>>> t = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased", use_fast=False)
>>> t.decode(t.encode("l'eure"), skip_special_tokens=True)
"l'eure"

We believe the "slow" interpretation to be correct, and it's the same approach @xenova took for transformers.js. There's no decoding code in the reference implementation, but I believe the intention was to strip those characters.
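The "slow" behavior comes from the cleanup pass transformers applies after joining decoded tokens, which removes the spaces that token-by-token decoding inserts around punctuation. A simplified Python sketch (replacement list abridged from the library's `clean_up_tokenization`; treat it as an approximation, not the exact implementation):

```python
def clean_up_tokenization(out_string):
    # Remove the spaces that token-by-token decoding inserts around
    # punctuation and English contractions (abridged replacement list).
    for pair in [(" .", "."), (" ,", ","), (" ?", "?"), (" !", "!"),
                 (" ' ", "'"), (" n't", "n't"), (" 'm", "'m"),
                 (" 's", "'s"), (" 've", "'ve"), (" 're", "'re")]:
        out_string = out_string.replace(*pair)
    return out_string

print(clean_up_tokenization("l ' eure"))  # -> l'eure
```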


Could you please give it a final look / test drive before merging @piotrkowalczuk?

cc @Vaibhavs10 because this was fun.

@piotrkowalczuk
Contributor

LGTM. I was able to integrate your branch into my app. All the regression cases worked fine. Thanks @pcuenca.


pcuenca commented Jan 17, 2025

Cool, merging then!

@pcuenca pcuenca merged commit 313fbd7 into main Jan 17, 2025
1 check passed
@pcuenca pcuenca deleted the bert-fixes branch January 17, 2025 17:20
@pcuenca pcuenca mentioned this pull request Jan 26, 2025

Development

Successfully merging this pull request may close these issues.

BertTokenizer outputs differ between cased and uncased pretrained tokenizers
