Whitespace naming misleading

### System Info

The `Whitespace()` pre-tokenizer has a misleading name that sent me on a multi-hour treasure hunt while troubleshooting vocabulary issues. As it turns out, it also splits on non-whitespace characters such as hyphens. I only figured this out by reading the Rust code and deciphering the cryptic regex in the Whitespace documentation. A new pre-tokenizer name and better, more explicit documentation are highly recommended. Thank you.

### Who can help?

_No response_

### Information

- [ ] The official example scripts
- [ ] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction

- Create a Hugging Face tokenizer
- Add the `Whitespace` pre-tokenizer from `tokenizers.pre_tokenizers`
- Observe that it also splits on non-whitespace characters such as hyphens
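
The behavior can be illustrated without installing anything. The sketch below is not the library code: it assumes the pattern quoted in the Whitespace documentation is `\w+|[^\w\s]+` and emulates it with Python's `re` module, which may differ from the Rust regex engine in edge cases. The helper names `whitespace_like` and `whitespace_only` are made up for this illustration.

```python
import re

# Pattern as quoted in the Whitespace pre-tokenizer documentation (assumed here):
# a run of word characters, OR a run of characters that are neither
# word characters nor whitespace (i.e. punctuation such as hyphens).
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")


def whitespace_like(text):
    """Approximate the documented pattern; punctuation becomes its own piece."""
    return WHITESPACE_PATTERN.findall(text)


def whitespace_only(text):
    """What the name 'Whitespace' suggests: split on whitespace and nothing else."""
    return text.split()


sample = "state-of-the-art model"
print(whitespace_like(sample))  # ['state', '-', 'of', '-', 'the', '-', 'art', 'model']
print(whitespace_only(sample))  # ['state-of-the-art', 'model']
```

With the real library, the same comparison should be reproducible by calling `pre_tokenize_str` on `Whitespace()` and on `WhitespaceSplit()` from `tokenizers.pre_tokenizers`; the latter appears to be the pre-tokenizer that splits on whitespace only.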

### Expected behavior

If it's called Whitespace, it should split only on whitespace. Otherwise, change the name. The documentation is also sparse and should include more than a regular expression.

Whitespace naming misleading #38180
