The SPM converter (and all that inherit / use it) does not take into account the user defined symbols, leading to issues like this one: https://huggingface.co/01-ai/Yi-9B/discussions/11#6643844f6ac3fe108e3d5190
It also does not really take into account the prefix space param which can and should be extracted from the proto:
split_by_unicode_script: true
split_by_number: true
split_by_whitespace: true
treat_whitespace_as_suffix: false
allow_whitespace_only_pieces: true
split_digits: true
and
normalizer_spec {
name: "identity"
precompiled_charsmap: ""
add_dummy_prefix: false
remove_extra_whitespaces: false
normalization_rule_tsv: ""
}
cc @itazap, on a more general converter!
The
SPMconverter (and all that inherit / use it) does not take into account the user defined symbols, leading to issues like this one: https://huggingface.co/01-ai/Yi-9B/discussions/11#6643844f6ac3fe108e3d5190It also does not really take into account the prefix space param which can and should be extracted from the proto:
and
cc @itazap, on a more general converter!