Skip to content

SPMConverter does not always add the user defined symbol -> slow fast is thus not equivalent #30824

@ArthurZucker

Description

@ArthurZucker

The SPM converter (and all that inherit / use it) does not take into account the user defined symbols, leading to issues like this one: https://huggingface.co/01-ai/Yi-9B/discussions/11#6643844f6ac3fe108e3d5190

It also does not really take into account the prefix space param which can and should be extracted from the proto:

split_by_unicode_script: true
split_by_number: true
split_by_whitespace: true
treat_whitespace_as_suffix: false
allow_whitespace_only_pieces: true
split_digits: true

and

normalizer_spec {
  name: "identity"
  precompiled_charsmap: ""
  add_dummy_prefix: false
  remove_extra_whitespaces: false
  normalization_rule_tsv: ""
}

cc @itazap, on a more general converter!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions