SPLIT PR: add user defined symbols and control symbols #31305
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Force-pushed 97d0d35 to befadd1 ("…en, so a space is not added"), then befadd1 to 904121e.
ArthurZucker left a comment
LGTM! I think we should check the type of the defined symbols; see this comment from our offline discussion regarding trainer_spec.user_defined_symbols not giving the same results:

    if piece.type == 4:
        tokens_to_add += [piece.piece]
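The point of the suggestion above is to collect tokens from the actual model pieces (filtered by type) rather than from trainer_spec.user_defined_symbols. A minimal runnable sketch of that filter, using a stand-in dataclass in place of sentencepiece's real proto pieces (the Piece class and the example symbols are hypothetical):

```python
from dataclasses import dataclass

# Stand-in for sentencepiece's ModelProto.SentencePiece entries; in the
# real converter the list comes from self.proto.pieces. Type 4 marks
# USER_DEFINED pieces in sentencepiece's piece-type enum.
@dataclass
class Piece:
    piece: str
    type: int

pieces = [
    Piece("<s>", 3),        # CONTROL
    Piece("hello", 1),      # NORMAL
    Piece("<extra_0>", 4),  # USER_DEFINED
    Piece("<extra_1>", 4),  # USER_DEFINED
]

# The reviewer's suggestion: keep only pieces whose type is USER_DEFINED.
tokens_to_add = []
for piece in pieces:
    if piece.type == 4:
        tokens_to_add += [piece.piece]

print(tokens_to_add)  # ['<extra_0>', '<extra_1>']
```

This sidesteps the mismatch mentioned above: trainer_spec.user_defined_symbols records what was requested at training time, while the pieces list records what actually ended up in the model.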
        AddedToken(token, normalized=False, special=True) for token in self.proto.trainer_spec.control_symbols
    ]

    tokenizer.add_tokens(user_defined_symbols)
Cool! Related fix to only add once: huggingface/tokenizers@f2ec3b2
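The linked tokenizers fix is about "only add once": re-adding a token that is already in the added vocabulary should be a no-op rather than create a duplicate entry. A toy sketch of that semantics (ToyAddedVocab is hypothetical, not the real tokenizers API):

```python
# Hypothetical minimal added-vocabulary showing add-once semantics:
# re-adding an existing token must not allocate a new id.
class ToyAddedVocab:
    def __init__(self):
        self.token_to_id = {}

    def add_tokens(self, tokens):
        added = 0
        for token in tokens:
            if token not in self.token_to_id:  # skip tokens already present
                self.token_to_id[token] = len(self.token_to_id)
                added += 1
        return added  # number of genuinely new tokens

vocab = ToyAddedVocab()
print(vocab.add_tokens(["<extra_0>", "<extra_1>"]))  # 2
print(vocab.add_tokens(["<extra_0>"]))               # 0: already added
```

The real fix lives in the Rust tokenizers library; this only illustrates the behavior being guaranteed.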
maybe using map will be faster? but anyway it's good enough for what we want!
why did you change the encoded sequence here?
Force-pushed ("…e instead of trainer_spec for user_defined_symbols").
ArthurZucker left a comment
Thanks for iterating! Last nit and good to go
    # Add user defined symbols
    user_defined_symbols = [
        AddedToken(token, normalized=False, special=False)
        for token in [p.piece for p in self.proto.pieces if p.type == 4]
let's add a comment that references the documentation of sentencepiece as to why we use 4 (arbitrary number)
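For that comment: the piece types come from the SentencePiece.Type enum in sentencepiece's sentencepiece_model.proto, where 4 is USER_DEFINED. A small sketch of the mapping (values as I understand them from the proto; worth double-checking against the proto file when writing the comment):

```python
from enum import IntEnum

# Piece types from sentencepiece's sentencepiece_model.proto
# (SentencePiece.Type). The magic number 4 used in the converter
# corresponds to USER_DEFINED.
class PieceType(IntEnum):
    NORMAL = 1        # regular subword piece
    UNKNOWN = 2       # the <unk> piece
    CONTROL = 3       # control symbols, e.g. <s>, </s>
    USER_DEFINED = 4  # user defined symbols passed at training time
    UNUSED = 5
    BYTE = 6          # byte-fallback pieces

print(PieceType(4).name)  # USER_DEFINED
```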
        AddedToken(token, normalized=False, special=True) for token in self.proto.trainer_spec.control_symbols
    ]

    tokenizer.add_tokens(user_defined_symbols + control_symbols)
Ok this will work with the release of tokenizers, so fine for me! Good job
we can also propagate to gemma! (it only has this for now:

    user_defined_symbols = [
        AddedToken(token, normalized=True, special=False) for token in proto.trainer_spec.user_defined_symbols
    ]
    tokenizer.add_tokens(user_defined_symbols)

)
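Propagating this PR's change to gemma would mean also collecting control symbols and adding both lists, as in the diff above. A runnable sketch of the shape of that change (the AddedToken stand-in and the symbol names here are hypothetical, not the real tokenizers API):

```python
from dataclasses import dataclass

# Stand-in for tokenizers.AddedToken, just to make the sketch runnable.
@dataclass(frozen=True)
class AddedToken:
    content: str
    normalized: bool
    special: bool

# What the gemma converter has today: user defined symbols only.
user_defined_symbols = [
    AddedToken(tok, normalized=True, special=False) for tok in ["<extra_0>"]
]
# What this PR would add on top: control symbols, marked special and
# not normalized, mirroring the converter diff above.
control_symbols = [
    AddedToken(tok, normalized=False, special=True) for tok in ["<ctrl_0>"]
]

all_added = user_defined_symbols + control_symbols
print([t.content for t in all_added])  # ['<extra_0>', '<ctrl_0>']
```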
Fixes portion of #30824 -> user_defined_symbols and control_symbols. @ArthurZucker thoughts on this?

Changed '.' in test cases' texts because it is a user added token for fast tokenizers, so the spacing around it differs from slow.

add_special_tokens and add_tokens have the same effect for now - but a fix is already merged in Tokenizers by @ArthurZucker: remove enforcement of non special when adding tokens (tokenizers#1521)

TODO in new PR: SPMConverter does not always add the user defined symbol -> slow/fast is thus not equivalent (#30824) -> PR open: SPLIT PR: add_prefix_space fix #31315