Skip to content

Whitespace naming misleading #38180

@phillipeloher

Description

@phillipeloher

System Info

The pre_tokenizer function Whitespace() has a misleading name that sent me on a multi-hour treasure hunt trying to troubleshoot vocabulary issues. As it turns out, it additionally splits on non-Whitespace characters like e.g. Hyphens. It took looking at Rust code and/or following cryptic RegEx expressions in the Whitespace documentation to troubleshoot this. A new pre-tokenizer name and better/explicit documentation is highly recommended. Thank you.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  • Create a hugging face tokenizer
  • Add the Whitespace tokenizer from tokenizers.pre_tokenizers
  • See that it splits on non-whitespace like hyphens

Expected behavior

If it's called Whitespace, only split on Whitespace. Or change the name. Documentation is also scarce, should include more than a RegEx expression.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions