🚨🚨 🚨🚨 [`Tokenizer`] attemp to fix add_token issues🚨🚨 🚨🚨 by ArthurZucker · Pull Request #23909 · huggingface/transformers

ArthurZucker · 2023-05-31T15:03:34Z

What does this PR do?

Adresses a lot of issues related to add_tokens, also adds more refine testing to make sure this does not happen again.

Adding a token with add_tokens ignores the arguments if the token is an AddedToken. reported in AddedToken 's argument are ignored when called in add_tokens 's method of slow tokenizers #20734 and Adding new tokens to various models changes tokenization of adjacent elements in strings #14770, PreTrainedTokenizer (slow) strip tokens that are around unique_no_split_tokens #21120, T5Tokenizer Fast and Slow give different results with AddedTokens #16334
Adding a token does not automatically adds it to the unique_no_split_token. Reported in LLaMATokenizerFast works abnormally #23818 , [Bug]? how does the tokenizer encode the special tokens? #23851, Adding custom tokens makes the T5Tokenizer always strip spaces #11531 but also skip_special_tokens has different behavior between slow and fast tokenizer #23250. Also linked to Add UDOP #22940 , should allow us to re-factor the way T5 tokenizes the inputs (convert_token_to_ids should not have a bunch of regex for special tokens) (also mT5 additional_special_tokens seems not work #9747)
Adding a token that is already in the vocabulary does not add it. This is debatable, but if someone explicitly adds it, it means he does not want to split it. Reported in ”never_split“ not working on BertTokenizer #23459
There is no support for single_word in slow. Reported in Adding new tokens to various models changes tokenization of adjacent elements in strings #14770
Initialising a model from a vocab file does not initialize the Trie. from_pretrained calls added_tokens = tokenizer.sanitize_special_tokens() which is when the tokens are added to no_unique_split. reported in Two tokenizer initialization methods result in inconsistent segmentation results for special words #23930

Fixes #20734, fixes #21120, fixes #16334, fixes #23818 , fixes #23851, fixes #11531 , fixes #9747, fixes #23459 , fixes #14770 , fixes #22935, fixes #23930, fixes #23250, fixes #7901, fixes #19873, fixes #25232, fixes #22414,

Spirit of the refactoring

The main idea is that the PreTrainedTokenizer's __init__ function is responsible for adding all the additional_special tokens, eos_token, etc and creating the token_trie that will be used for splitting the tokens.

All tokens that are added are stored in their AddedToken format in the added_tokens_decoder which becomes the only way to interact with them. The added_tokens_encoder cannot be modified, it is just a conversion of the added_tokens_decoder. The trie is only created based on the added_tokens_decoder. One possible addition is to keep unique_no_split tokens, but I am currently against.
All the added token information now lies in the tokenizer_config.json. Nuking the special_tokens_map.json and added_tokens.json.
Support for lstrip, rstrip and single_word is added. This is only possible because we store the AddedTokens and not only the strings.
Information on which tokens were added is also available for the Fast tokenizers. This is just a representation convenience but was not possible before.
add_special_tokens's replace_additional_special_tokens argument now works.
Remove some of the available surface functions (self.added_tokens_encoder, self.added_tokens_decoder, self.unique_no_split_tokens, etc) and more to come here especial for special tokens,

🚨🚨 Breaking changes 🚨🚨:

unique_no_split_tokens attribute removed and not used in the internal logic
sanitize_special_tokens() follows a deprecation cycle and does nothing
All attributes in SPECIAL_TOKENS_ATTRIBUTES are stored as AddedTokens and no strings.
loading a slow from a fast or a fast from a slow will no longer raise and error if the tokens added don't have the correct index. This is because they will always be added following the order of the added_tokens but will correct mistakes in the saved vocabulary if there are any. (And there are a lot in old format tokenizers)
the length of a tokenizer is now max(set(self.get_vocab().keys())) accounting for holes in the vocab. The vocab_size no longer takes into account the added vocab for most of the tokenizers (as it should not). Mostly breaking for T5
Adding a token using tokenizer.add_tokens([AddedToken("hey", rstrip=False, normalized=True)]) now takes into account rstrip, lstrip, normalized information.
added_tokens_decoder holds AddedToken, not strings.
add_tokens() for both fast and slow will always be updated if the token is already part of the vocab, allowing for custom stripping.
initializing a tokenizer form scratch will now add missing special tokens to the vocab.
stripping is not always done for special tokens! 🚨 Only if the AddedToken has lstrip=True and rstrip=True
fairseq_ids_to_tokens attribute removed for Barthez (was not used)

HuggingFaceDocBuilderDev · 2023-05-31T15:20:10Z

The documentation is not available anymore as the PR was closed or merged.

ArthurZucker · 2023-05-31T16:36:34Z

TODO: the test should be for all the TokenMixin rather than comparing slow and fast as the behaviour should work in any case.

src/transformers/tokenization_utils.py

tests/test_tokenization_common.py

ArthurZucker · 2023-06-02T15:02:39Z

src/transformers/tokenization_utils.py

+                        token = AddedToken(new_tokens[i].content.lower(), single_word = new_tokens[i].single_word, lstrip = new_tokens[i].lstrip, rstrip = new_tokens[i].rstrip, normalized = new_tokens[i].normalized)
+                    else:


This highlights how bad the API is:

you can't modify the content of the AddedToken

if you only add the content to the unique no split:

when decoding you will have a problem,since decoding uses _additional_special_tokens

when encoding, the rstrip and lstrip etc logic will be ignored

Adding a token does not necessarily update the unique_no_split etc

stephantul · 2023-06-14T10:51:29Z

Is work on this still going on? I'm interested in working on tokenization.

ArthurZucker · 2023-06-14T11:16:35Z

It is! Had to focus on a new model for a while but this is close to being over

stephantul · 2023-06-14T12:12:47Z

Cool! So, I think the best way to do this is that I fork your branch, and then do PRs there? Any changes you like would then propagate to this PR? Is that the preferred way of collaborating?

ArthurZucker · 2023-06-14T12:22:48Z

Haha sorry what I meant is that I’ll take care of this one probably today, tomorrow so no need to dive !
Otherwise forking transformers is always the best solution IMO, then add others as remotes.

stephantul · 2023-06-14T12:28:08Z

Ah ok, my bad! I totally misunderstood. Good luck with the PR 😸

…to fix-add-tokens

…23909) * fix test for bart. Order is correct now let's skip BPEs * ouf * styling * fix bert.... * slow refactoring * current updates * massive refactoring * update * NICE! * update to see where I am at * updates * update * update * revert * updates * updates * start supporting legacy_save * styling * big update * revert some changes * nits * nniiiiiice * small fixes * kinda fix t5 with new behaviour * major update * fixup * fix copies * today's updates * fix byt5 * upfate * update * update * updates * update vocab size test * Barthez does not use not need the fairseq offset ids * super calll must be after * calll super * move all super init * move other super init * fixup * nits * more fixes * nits * more fixes * nits * more fix * remove useless files * ouch all of them are affected * and more! * small imporvements * no more sanitize token * more changes around unique no split tokens * partially fix more things * keep legacy save but add warning * so... more fixes * updates * guess deberta tokenizer could be nuked * fixup * fixup did some bad things * nuke it if it breaks * remove prints and pretrain fast from slow with new format. * fixups * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * fiou * nit * by default specials should not be normalized? * update * remove brakpoint * updates * a lot of updates * fixup * fixes revert some changes to match fast * small nits * that makes it cleaner * fix camembert accordingly * update * some lest breaking changes * update * fixup * fix byt5 and whisper mostly * some more fixes, canine's byte vocab * fix gpt2 * fix most of the perceiver tests (4 left) * fix layout lmv3 * fixup * fix copies for gpt2 style * make sure to only warn once * fix perciever and gpt2 tests * some more backward compatibility: also read special tokens map because some ppl use it........////..... * fixup * add else when reading * nits * fresh updates * fix copies * will this make everything faster? * fixes * more fixes * update * more fixes * fixup * is the source of truth right? * sorry camembert for the troubles * current updates * fixup * update led * update * fix regression * fix single word * more model specific fixes * fix t5 tests * fixup * more comments * update * fix nllb * rstrip removed * small fixes * better handle additional_special_tokens and vocab sizes * fixing * styling * fix 4 / 21 * fixup * fix nlbb's tests * some fixes * fix t5 * fixes * style * fix canine tests * damn this is nice * nits * m2m100 nit * fixups * fixes! * fixup * stash * fix merge * revert bad change * fixup * correct order for code Llama * fix speecht5 post merge * styling * revert source of 11 fails * small nits * all changes in one go * fnet hack * fix 2 more tests * update based on main branch of tokenizers * fixup * fix VITS issues * more fixes * fix mgp test * fix camembert issues * oups camembert still has 2 failing tests * mluke fixes * decode fixes * small nits * nits * fix llama and vits * fix camembert * smal nits * more fixes when initialising a fast from a slow and etc * fix one of the last test * fix CPM tokenizer test * fixups * fix pop2piano * fixup * ⚠️ Change tokenizers required version ⚠️ * ⚠️ Change tokenizers required version ⚠️ * "tokenizers>=0.14,<0.15", don't forget smaller than * fix musicgen tests and pretraiendtokenizerfast * fix owlvit and all * update t5 * fix 800 red * fix tests * fix the fix of the fix of t5 * styling * documentation nits * cache _added_tokens_encoder * fixups * Nit * fix red tests * one last nit! * make eveything a lot simpler * Now it's over 😉 * few small nits * Apply suggestions from code review Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * updates that work for now * tests that should no be skipped / changed and fixed next * fixup * i am ashamed * pushe the fix * update * fixups * nits * fix added_tokens_encoder * fix canine test * fix pegasus vocab * fix transfoXL * fixup * whisper needs to be fixed for train new * pegasus nits * more pegasus fixes * minor update * better error message in failed test * fix whisper failing test * fix whisper failing test * fix pegasus * fixup * fix **** pegasus * reset things * remove another file * attempts to fix the strange custome encoder and offset * nits here and there * update * fixup * nit * fix the whisper test * nits nits * Apply suggestions from code review Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * updates based on review * some small update to potentially remove * nits * import rlu cache * Update src/transformers/tokenization_utils_base.py Co-authored-by: Lysandre Debut <hi@lysand.re> * move warning to `from_pretrained` * update tests results now that the special tokens are always added --------- Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> Co-authored-by: Lysandre Debut <hi@lysand.re>

LoicDagnas · 2023-10-09T18:59:39Z

@ArthurZucker I am not sure to understand the whole scope of this PR, but does it means that maintainer of such model have the responsibility to update the added_token.json file to still have their tokenizer usable with AutoTokenizer.from_pretrained?

In other words, is AutoTokenizer.from_pretrained("iarfmoose/t5-base-question-generator") supposed to fail now here

transformers/src/transformers/models/t5/tokenization_t5.py

Line 171 in e4dad4f

raise ValueError(

?

ArthurZucker · 2023-10-09T21:23:05Z

No it’s not! It might right now but the goal is to keep forward compatibility.
However long term wise authors should update the repo especially if they want to have the same behavior between fast and slow

v4.34.0 release did a complete refactor of the tokenizer module, see: huggingface/transformers#23909 Something about the difference is causing vila to produce literally billions of lines of log warning messages to Datadog in prod. I don't know if these warnings are meaningful, but they are expensive.

v4.34.0 release did a complete refactor of the tokenizer module, see: huggingface/transformers#23909 Something about the difference is causing vila to produce literally billions of lines of log warning messages to Datadog in prod. I don't know if these warnings are meaningful, but they are expensive. Example logs: https://app.datadoghq.com/logs?query=service%3Avila-v0%20&cols=host%2Cservice&index=%2A&messageDisplay=inline&refresh_mode=paused&stream_sort=desc&viz=stream&from_ts=1697556761689&to_ts=1697557153857&live=false

LoicDagnas · 2023-10-30T13:48:10Z

Hello @ArthurZucker, about my comment above, with the latest release 4.34.1 it doesn't fail anymore in tokenization_t5.py but here

transformers/src/transformers/models/t5/tokenization_t5_fast.py

Line 127 in e4dad4f

raise ValueError(

Do you want me to open an issue?

amyeroberts · 2023-10-30T14:52:35Z

Hi @LoicDagnas, @ArthurZucker is off for this week so won't be able to address until then. Yes, please create a separate issue, linking to this one with details on what's happening and a code snippet to reproduce.

ArthurZucker · 2023-11-06T09:03:05Z

I answered on the issue but it's fixed and part of the latest release! 🤗

ArthurZucker changed the title ~~[Tokenizer] attemp to fix add_token issues~~ [WIP][Tokenizer] attemp to fix add_token issues May 31, 2023

mnoukhov mentioned this pull request Jun 1, 2023

StackLlama: fixed RL training and added args huggingface/trl#400

Merged

vpegasus mentioned this pull request Jun 2, 2023

[Bug]? how does the tokenizer encode the special tokens? huggingface/tokenizers#1263

Closed

ArthurZucker commented Jun 2, 2023

View reviewed changes

src/transformers/tokenization_utils.py Outdated Show resolved Hide resolved

ArthurZucker commented Jun 2, 2023

View reviewed changes

src/transformers/tokenization_utils.py Outdated Show resolved Hide resolved

ArthurZucker commented Jun 2, 2023

View reviewed changes

tests/test_tokenization_common.py Outdated Show resolved Hide resolved

ArthurZucker added 4 commits June 2, 2023 14:06

fix test for bart. Order is correct now let's skip BPEs

4aea714

ouf

0df212f

styling

c04bd58

fix bert....

4a54920

ArthurZucker commented Jun 2, 2023

View reviewed changes

slow refactoring

189b259

ArthurZucker linked an issue Jun 7, 2023 that may be closed by this pull request

LLaMATokenizerFast works abnormally #23818

Closed

4 tasks

ArthurZucker removed a link to an issue Jun 7, 2023

LLaMATokenizerFast works abnormally #23818

Closed

4 tasks

current updates

f1c362a

ArthurZucker mentioned this pull request Jun 14, 2023

[TokenizerSlow] replace_additional_special_tokens is not doing much #24276

Closed

Merge branch 'main' of https://github.com/huggingface/transformers in…

f5b178a

…to fix-add-tokens

ArthurZucker mentioned this pull request Jun 16, 2023

[Tokenizer] skip_special_tokens not working as expected #24316

Closed

ArthurZucker linked an issue Jun 16, 2023 that may be closed by this pull request

[Tokenizer] skip_special_tokens not working as expected #24316

Closed

sanchit-gandhi mentioned this pull request Sep 22, 2023

[Wav2Vec2] Fix tokenizer set lang #26349

Merged

Spico197 mentioned this pull request Sep 22, 2023

fix tokenizer initialization bug with the latest version (4.34.0.dev0) of transformers OpenNLG/OpenBA#1

Closed

NielsRogge mentioned this pull request Sep 22, 2023

Add Nougat #25942

Merged

6 tasks

YYYasin19 added a commit to YYYasin19/transformers-feedstock that referenced this pull request Oct 7, 2023

fix tokenizer version based on huggingface/transformers/pull/23909

6173bee

ArthurZucker deleted the fix-add-tokens branch October 9, 2023 21:23

ydshieh mentioned this pull request Oct 12, 2023

Fix PerceiverModelIntegrationTest::test_inference_masked_lm #26760

Merged

cmwilhelm mentioned this pull request Oct 17, 2023

Caps max allowed version of transformers for vila. allenai/mmda#279

Merged

ArthurZucker mentioned this pull request Oct 19, 2023

tokenizer : special token handling ggml-org/llama.cpp#3538

Merged

5 tasks

jploski referenced this pull request in ggml-org/llama.cpp Oct 22, 2023

Clarify logic in conversion

74204cc

jploski mentioned this pull request Oct 23, 2023

support loading vocab from fast tokenizer config in convert.py ggml-org/llama.cpp#3633

Merged

LoicDagnas mentioned this pull request Oct 30, 2023

Fast T5 tokenization fails for models with additional special tokens not prefixed by extra_id_ #27150

Closed

4 tasks

jonatanklosko mentioned this pull request Nov 13, 2023

Support DeepSeek Coder Model elixir-nx/bumblebee#278

Closed

mingboiz mentioned this pull request Dec 12, 2023

sudachitra and other custom tokenizers no longer compatible with transformers later than 4.34 WorksApplications/SudachiTra#66

Closed

ArthurZucker mentioned this pull request Jan 3, 2024

[MBart50] Inconsistent decoding with additional special tokens between slow and fast tokenizers #28287

Closed

4 tasks

Apsod mentioned this pull request Jan 3, 2024

Remove token_type_ids from model_input_names (like #24788) #28325

Merged

5 tasks

ArthurZucker mentioned this pull request Jan 3, 2024

loading added_tokens.json huggingface/tokenizers#1422

Closed

amyeroberts mentioned this pull request Jan 5, 2024

Whisper v3 dependency issue #28156

Closed

4 tasks

shuttie mentioned this pull request Jan 12, 2024

Tokenizer.save_pretrained fails when add_special_tokens=True|False #28472

Closed

4 tasks

tlapusan mentioned this pull request Jan 31, 2024

Adding new tokens to various models changes tokenization of adjacent elements in strings #14770

Closed

Nice-try-zzw mentioned this pull request Jun 6, 2024

【昇腾AI创新大赛】switch_transformers mindspore-lab/mindnlp#1190

Merged

		token = AddedToken(new_tokens[i].content.lower(), single_word = new_tokens[i].single_word, lstrip = new_tokens[i].lstrip, rstrip = new_tokens[i].rstrip, normalized = new_tokens[i].normalized)
		else:

Conversation

ArthurZucker commented May 31, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Spirit of the refactoring

🚨🚨 Breaking changes 🚨🚨:

Uh oh!

HuggingFaceDocBuilderDev commented May 31, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ArthurZucker commented May 31, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ArthurZucker Jun 2, 2023

Choose a reason for hiding this comment

Uh oh!

stephantul commented Jun 14, 2023

Uh oh!

ArthurZucker commented Jun 14, 2023

Uh oh!

stephantul commented Jun 14, 2023

Uh oh!

ArthurZucker commented Jun 14, 2023

Uh oh!

stephantul commented Jun 14, 2023

Uh oh!

LoicDagnas commented Oct 9, 2023

Uh oh!

ArthurZucker commented Oct 9, 2023

Uh oh!

LoicDagnas commented Oct 30, 2023

Uh oh!

amyeroberts commented Oct 30, 2023

Uh oh!

ArthurZucker commented Nov 6, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

ArthurZucker commented May 31, 2023 •

edited

Loading

HuggingFaceDocBuilderDev commented May 31, 2023 •

edited

Loading

ArthurZucker commented May 31, 2023 •

edited

Loading