System Info
The pre_tokenizer Whitespace() has a misleading name that sent me on a multi-hour treasure hunt troubleshooting vocabulary issues. As it turns out, it additionally splits on non-whitespace characters such as hyphens. Troubleshooting this required reading the Rust source and deciphering the cryptic regex in the Whitespace documentation. A new pre-tokenizer name and better, explicit documentation is highly recommended. Thank you.
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- Create a Hugging Face tokenizer
- Add the Whitespace pre-tokenizer from tokenizers.pre_tokenizers
- See that it splits on non-whitespace like hyphens
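The behavior above matches the regex documented for Whitespace(), `\w+|[^\w\s]+`, which matches runs of word characters or runs of punctuation rather than splitting purely on whitespace. A minimal stdlib sketch reproducing the observed splitting (the helper name is mine, not part of the library):

```python
import re

# Regex documented for the Whitespace pre-tokenizer: runs of word
# characters OR runs of non-word, non-space characters (punctuation).
PATTERN = re.compile(r"\w+|[^\w\s]+")

def whitespace_pretokenize(text: str) -> list[str]:
    """Approximate Whitespace() output using the documented regex."""
    return PATTERN.findall(text)

# A hyphenated word is split into three pieces, not kept whole:
print(whitespace_pretokenize("pre-tokenizer splits here"))
# → ['pre', '-', 'tokenizer', 'splits', 'here']
```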
Expected behavior
If it's called Whitespace, it should only split on whitespace; otherwise, the name should change. The documentation is also scarce and should explain the behavior rather than give only a regex.
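For comparison, the library's WhitespaceSplit pre-tokenizer does split on whitespace only. A stdlib sketch of that behavior, assuming it is equivalent to matching `\S+` runs (helper name is mine):

```python
import re

def whitespace_split(text: str) -> list[str]:
    """Split on whitespace only, keeping hyphenated words intact."""
    return re.findall(r"\S+", text)

# The hyphenated word survives as a single token:
print(whitespace_split("pre-tokenizer splits here"))
# → ['pre-tokenizer', 'splits', 'here']
```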