[Tokenizer] Fix slow and fast serialization#26570
Merged
ArthurZucker merged 114 commits into huggingface:main on Oct 18, 2023
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
LysandreJik (Member) approved these changes on Oct 18, 2023 and left a comment:

Feel free to merge when ready, as seen offline with you.
I ran into the error below, so I added some prints and got these intermediate values. Based on the output, I made a fix, which seemed to work:

```python
# Passing AddedTokens and not strings to the class to prevent it from casting the string to a different AddedToken
for key in cls.SPECIAL_TOKENS_ATTRIBUTES & init_kwargs.keys():
    if added_tokens_map != {} and init_kwargs[key] is not None:
        if key != "additional_special_tokens":
            # >>> debug
            def print_info(name, obj):
                print(f"{name}: ({type(obj).__name__}){obj}")

            print_info("cls.SPECIAL_TOKENS_ATTRIBUTES", cls.SPECIAL_TOKENS_ATTRIBUTES)
            print_info("added_tokens_map", added_tokens_map)
            print_info("init_kwargs", init_kwargs)
            print_info("key", key)
            print_info("init_kwargs[key]", init_kwargs[key])
            # <<< debug
-           init_kwargs[key] = added_tokens_map.get(init_kwargs[key], init_kwargs[key])
+           init_kwargs[key] = added_tokens_map.get(key, init_kwargs[key])  # fix
```
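For readers following along, a minimal, self-contained sketch of the failure mode this comment points at. Everything below is hypothetical: the stand-in `AddedToken` class and the example values are not the real Llemma-7B config. It only shows why a `dict.get` keyed on the token *value* can blow up when that value is not a plain content string, while keying on the attribute name (as the fix does) always hands `.get` a hashable string:

```python
class AddedToken:
    """Stand-in for tokenizers.AddedToken, reduced to what this sketch needs."""

    def __init__(self, content, lstrip=False, rstrip=False):
        self.content, self.lstrip, self.rstrip = content, lstrip, rstrip

    def __repr__(self):
        return f"AddedToken({self.content!r})"


# added_tokens_map is keyed by the token's string content.
added_tokens_map = {"<s>": AddedToken("<s>")}

# Happy path: init_kwargs holds a plain string, so lookup by value works
# and the string is upgraded to an AddedToken.
init_kwargs = {"bos_token": "<s>"}
print(added_tokens_map.get(init_kwargs["bos_token"], init_kwargs["bos_token"]))
# -> AddedToken('<s>')

# Unhappy path: some configs deserialize the special token as a dict.
# A dict is unhashable, so .get() raises instead of falling back to the default.
init_kwargs = {"bos_token": {"content": "<s>", "lstrip": False}}
try:
    added_tokens_map.get(init_kwargs["bos_token"], init_kwargs["bos_token"])
except TypeError as err:
    print(f"TypeError: {err}")  # unhashable type: 'dict'
```

Looking up by `key` (the attribute name, e.g. `"bos_token"`) sidesteps the crash because the key is always a hashable string; whether it then returns the intended token depends on how `added_tokens_map` is keyed, which is exactly what a reproducer would settle.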
ArthurZucker (Collaborator, Author) replied:

Could you share a reproducer? It would help me a lot as well!
Sorry, I'm too busy to do that right now 😭 But this only happened when I loaded the tokenizer of Llemma-7B. I hope this description helps you reproduce the error.
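A reproducer along the lines of that description would be as minimal as the sketch below. The Hub id `EleutherAI/llemma_7b` is an assumption for "the tokenizer of Llemma-7B", and the snippet is untested:

```python
from transformers import AutoTokenizer

# Hypothetical reproducer: simply loading the Llemma-7B tokenizer was
# reported to trigger the error above. The Hub id is an assumption.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/llemma_7b")
print(tokenizer.bos_token, tokenizer.eos_token)
```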
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request on Nov 19, 2023:
* fix
* last attempt
* current work
* fix forward compatibility
* save all special tokens
* current state
* revert additional changes
* updates
* remove tokenizer.model
* add a test and the fix
* nit
* revert one more break
* fix typefield issue
* quality
* more tests
* fix fields for FC
* more nits?
* new additional changes
* how
* some updates
* simplify all
* more nits
* revert some things to original
* nice
* nits
* a small hack
* more nits
* ahhaha
* fixup
* update
* make test run on ci
* use subtesting
* update
* Update .circleci/create_circleci_config.py
* updates
* fixup
* nits
* replace typo
* fix the test
* nits
* update
* None max dif pls
* a partial fix
* had to revert one thing
* test the fast
* updates
* fixup
* and more nits
* more fixes
* update
* Oupsy 👁️
* nits
* fix marian
* on our way to heaven
* Update src/transformers/models/t5/tokenization_t5.py (Co-authored-by: Lysandre Debut <hi@lysand.re>)
* fixup
* Update src/transformers/tokenization_utils_fast.py (Co-authored-by: Leo Tronchon <leo.tronchon@gmail.com>)
* Update src/transformers/tokenization_utils_base.py (Co-authored-by: Leo Tronchon <leo.tronchon@gmail.com>)
* fix phobert
* skip some things, test more
* nits
* fixup
* fix deberta
* update
* update
* more updates
* skip one test
* more updates
* fix camembert
* can't test this one
* more good fixes
* kind of a major update
  - seperate what is only done in fast in fast init and refactor
  - add_token(AddedToken(..., speicla = True)) ignores it in fast
  - better loading
* fixup
* more fixups
* fix pegasus and mpnet
* remove skipped tests
* fix phoneme tokenizer if self.verbose
* fix individual models
* update common tests
* update testing files
* all over again
* nits
* skip test for markup lm
* fixups
* fix order of addition in fast by sorting the added tokens decoder
* proper defaults for deberta
* correct default for fnet
* nits on add tokens, string initialized to special if special
* skip irrelevant herbert tests
* main fixes
* update test added_tokens_serialization
* the fix for bart like models and class instanciating
* update bart
* nit!
* update idefix test
* fix whisper!
* some fixup
* fixups
* revert some of the wrong chanegs
* fixup
* fixup
* skip marian
* skip the correct tests
* skip for tf and flax as well

Co-authored-by: Lysandre Debut <hi@lysand.re>
Co-authored-by: Leo Tronchon <leo.tronchon@gmail.com>
What does this PR do?
* `added_tokens.json` file: a recent push made it save the entire added-tokens encoder, but it should only save the indices greater than the vocab size, for forward compatibility (see the sketch below).
* Fixes `additional_special_tokens` that were added twice / overwritten.
* `add_tokens`: if the added token is a string, we check whether it is already in the added vocab instead of always defaulting to strip left or right.
* Adds the `"__type": "AddedToken"` field back to the added tokens; otherwise previous versions of transformers will fail when they try to load them.

Fixes #26732, fixes #26775, fixes #26773, fixes #26768, fixes #26859
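To illustrate the first point, here is a sketch of the forward-compatibility rule for `added_tokens.json`. The names mirror the tokenizer API (`vocab_size`, `added_tokens_encoder`) but the values are made up and this is not the actual `save_pretrained` implementation:

```python
import json

vocab_size = 32000  # hypothetical base vocabulary size
added_tokens_encoder = {"<s>": 1, "</s>": 2, "<custom>": 32000}

# Only tokens whose index is at or beyond the base vocab size belong in
# added_tokens.json; tokens already inside the base vocab stay implicit.
added_tokens = {tok: idx for tok, idx in added_tokens_encoder.items() if idx >= vocab_size}

with open("added_tokens.json", "w", encoding="utf-8") as f:
    json.dump(added_tokens, f, indent=2, sort_keys=True, ensure_ascii=False)
# Writes {"<custom>": 32000}; "<s>" and "</s>" are not re-saved.
```

The `"__type"` point works the same way on the serialization side: each saved token carries `"__type": "AddedToken"` so that older versions of transformers can tell it apart from a plain dict when loading.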