SPLIT PR: add_prefix_space fix#31315

Open
itazap wants to merge 35 commits into main from add_prefix_space_clean

Conversation

@itazap
Collaborator

@itazap itazap commented Jun 7, 2024

Fixes portion of #30824

Adds support for add_prefix_space.

TASKS:

convert_slow_tokenizer.py

  • add_prefix_space is set from the original_tokenizer when available, otherwise inferred from the proto's normalizer.

tokenization_llama.py

  • add_prefix_space is updated based on normalizer if unset.

tokenization_llama_fast.py

  • remove forcing a from_slow conversion when add_prefix_space is not set (see the tokenization_utils_fast.py updates for how this is handled without conversion). This allows passing the field without sentencepiece installed.

tokenization_seamless_m4t.py

  • updated the 'Copied from' comment (ruff complained)

tokenization_t5.py

  • do not hard-code add_prefix_space to True; set it based on the normalizer

tokenization_t5_fast.py

  • set add_prefix_space; do not force from_slow when it is set

tokenization_utils_fast.py

  • set add_prefix_space
  • allow using add_prefix_space without converting from slow, so it can be used without sentencepiece installed
  • update the rust pre_tokenizer and normalizer based on add_prefix_space and the underlying normalizer; legacy logic copied from convert_slow_tokenizer.py
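The "inferred from the normalizer" idea can be sketched with plain dicts shaped like the serialized normalizer state the PR reads via `json.loads(normalizer.__getstate__())`. The function name and exact field names here are illustrative, not the actual transformers implementation:

```python
def infer_add_prefix_space(normalizer_state, explicit=None):
    """Sketch: honor an explicitly passed add_prefix_space; otherwise
    infer it from the serialized normalizer (a Prepend normalizer is the
    slow-tokenizer way of adding a prefix space)."""
    if explicit is not None:
        return explicit
    if normalizer_state is None:
        return None  # nothing to infer from, leave unset
    items = (
        normalizer_state["normalizers"]
        if normalizer_state.get("type") == "Sequence"
        else [normalizer_state]
    )
    return any(n.get("type") == "Prepend" for n in items)
```

For example, `infer_add_prefix_space({"type": "Prepend", "prepend": "▁"})` yields `True`, while a sequence with no Prepend yields `False`.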

TESTS:

  • test legacy tokenizer in llama
  • test legacy tokenizer in t5
  • test tokenizer without sentencepiece installation (mock)
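The third test item, mocking a missing sentencepiece install, could be sketched like this. Everything here is a stand-in (the `deps` namespace and `init_fast_tokenizer` are hypothetical, not transformers APIs); the point is only the mocking pattern:

```python
from types import SimpleNamespace
from unittest import mock

# Stand-in for the real availability check (illustrative only).
deps = SimpleNamespace(is_sentencepiece_available=lambda: True)

def init_fast_tokenizer(add_prefix_space=None):
    # With the fix, add_prefix_space no longer forces a from_slow
    # conversion, so a missing sentencepiece install must not break init.
    return {"add_prefix_space": add_prefix_space, "from_slow": False}

# Simulate sentencepiece being absent and check init still succeeds.
with mock.patch.object(deps, "is_sentencepiece_available", return_value=False):
    assert not deps.is_sentencepiece_available()
    tok = init_fast_tokenizer(add_prefix_space=True)

assert tok == {"add_prefix_space": True, "from_slow": False}
```

`mock.patch.object` restores the original attribute when the `with` block exits, so the check behaves normally afterwards.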

Reviewer:
@ArthurZucker

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@itazap itazap force-pushed the add_prefix_space_clean branch from 61c9d5b to 1dd79be Compare June 10, 2024 11:47
@itazap itazap requested a review from ArthurZucker June 13, 2024 11:00
@itazap itazap marked this pull request as ready for review June 13, 2024 13:28
Collaborator

@ArthurZucker ArthurZucker left a comment


Cool! Nice splitting, let's simplify a bit when we do update the pre_tokenizer and normalizer + add mock tests!

Collaborator


Again, the last piece that is missing for me here is to update the __getstate__ that is used when we save the tokenizer, to make sure we reset the normalizer_spec to self.add_prefix_space

Collaborator Author


Looked into it: normalizer_spec.add_dummy_prefix is already reset when loading from a saved tokenizer. Checked with "huggyllama/llama-7b", and normalizer_spec.add_dummy_prefix was set to True before again being forced to False.

sequence["normalizers"] = [pt for pt in curr_state["normalizers"] if pt["type"] not in ["Prepend"]]
else:
return
if getattr(self, "legacy", True):
Collaborator


If self.legacy, we should not touch anything. What we want to fix is for people who don't want the legacy behaviour but don't want to do the slow conversion / don't have sentencepiece installed!

Collaborator


If self.legacy is True, we just return.

"""Updates the underlying normalizer with the current `add_prefix_space` and `legacy` settings."""
sequence = json.loads(normalizers.Sequence([]).__getstate__())
final_sequence = normalizers.Sequence([])
if self._tokenizer.normalizer is not None and type(self._tokenizer.normalizer) in (
Collaborator


We can use isinstance(self._tokenizer.normalizer, (normalizers.Prepend, normalizers.Precompiled)) rather than type

normalizers.Precompiled,
):
# If normalizer is not a Sequence, add it to a sequence
sequence["normalizers"].append(json.loads(self._tokenizer.normalizer.__getstate__().decode("utf-8")))
Collaborator


If we have a Prepend normalizer, we want to remove it (of course, everything should be inside the legacy is False branch).

Here again, we only want to update things for people who do not want to convert from slow / who load the fast tokenizer with legacy set to False.
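The removal the reviewer describes can be sketched on the serialized state, mirroring the list-comprehension filter in the diff above (the state dict is illustrative, shaped like `json.loads(tokenizer.normalizer.__getstate__())` might be for a Llama-style tokenizer):

```python
import json

legacy = False

# Illustrative serialized normalizer state for a sentencepiece-derived tokenizer.
state = {
    "type": "Sequence",
    "normalizers": [
        {"type": "Prepend", "prepend": "▁"},
        {"type": "Replace", "pattern": {"String": " "}, "content": "▁"},
    ],
}

if not legacy:
    # Non-legacy path: drop Prepend; the prefix space will come from the
    # Metaspace pre-tokenizer's prepend_scheme instead.
    state["normalizers"] = [n for n in state["normalizers"] if n["type"] != "Prepend"]

remaining = [n["type"] for n in state["normalizers"]]
print(json.dumps(remaining))  # prints ["Replace"]
```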

def _update_pre_tokenizer(self):
"""Updates the underlying pre-tokenizer with the current `add_prefix_space` setting."""

# 'add_prefix_space' not passed, try to read from normalizer, otherwise do not set.
Collaborator


Here we should also first check if we have legacy set to False. If it is True we return without doing anything. If legacy is set to False, then we can check if add_prefix_space was passed or not.

If it is not passed, then we :

  • remove Prepend if it was in the sequence. I think shipping a native pythonic way of doing this is going to be a priority for me
  • add the Metaspace pre-tokenizer if we had the Prepend normalizer
  • set prepend to "first" if we had Metaspace
  • set add_prefix_space of the ByteLevel pre-tokenizer to true/false if there was one
  • do the same for the decoder (if we update the pre_tokenizer we need to update the decoder as well)
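These steps can be sketched as one pass over dict-shaped components, in the same json-state style the PR uses. The function name and dict shapes are illustrative (the real objects are the Rust-backed `tokenizers` classes):

```python
def apply_prefix_space_fix(components, add_prefix_space):
    """Sketch of the reviewer's checklist, on dict-shaped tokenizer parts."""
    norms = list(components.get("normalizers", []))
    pre = dict(components["pre_tokenizer"]) if components.get("pre_tokenizer") else None
    dec = dict(components["decoder"]) if components.get("decoder") else None

    had_prepend = any(n["type"] == "Prepend" for n in norms)
    # 1. Remove Prepend from the normalizer sequence.
    norms = [n for n in norms if n["type"] != "Prepend"]
    # 2. If we had a Prepend normalizer, its job moves to a Metaspace pre-tokenizer.
    if had_prepend and pre is None:
        pre = {"type": "Metaspace"}
    # 3./4. Propagate add_prefix_space to a Metaspace or ByteLevel pre-tokenizer.
    if pre is not None and pre["type"] == "Metaspace":
        pre["prepend_scheme"] = "first" if add_prefix_space else "never"
    if pre is not None and pre["type"] == "ByteLevel":
        pre["add_prefix_space"] = add_prefix_space
    # 5. Keep a matching decoder consistent with the pre-tokenizer.
    if dec is not None and pre is not None and dec["type"] == pre["type"]:
        dec.update({k: v for k, v in pre.items() if k != "type"})
    return {"normalizers": norms, "pre_tokenizer": pre, "decoder": dec}
```

For a Llama-like starting state with only a Prepend normalizer, `apply_prefix_space_fix({"normalizers": [{"type": "Prepend", "prepend": "▁"}], "pre_tokenizer": None, "decoder": None}, add_prefix_space=True)` removes the Prepend and yields a `{"type": "Metaspace", "prepend_scheme": "first"}` pre-tokenizer.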

Collaborator Author


@ArthurZucker Thank you for the detailed review!

Q.

  • set prepend to "first" if we had Metaspace: based on convert_slow_tokenizer, Prepend is only added when legacy=True, so should it always be set to "first" here (given we are not touching legacy)?

  • do the same for the decoder (if we update the pre_tokenizer we need to update the decoder as well): what if there was no pre-tokenizer and a Metaspace one was added? Do we need to add a ByteLevel decoder? (If yes, for which cases: prepend_scheme="first" and/or "never"?)

Collaborator

@ArthurZucker ArthurZucker Jun 20, 2024


  • if legacy:
    • do nothing
  • else:
    • if there is a Metaspace:
      • set prepend to "first" if add_prefix_space
      • set prepend to "never" if not add_prefix_space
    • if there is a Prepend normalizer:
      • we could check the type; mostly Llama is the one using it, so let's just do a test on the class. I don't expect other classes to need this!
      • replace it with a Metaspace, with either "first" or "never"
      • either replace the decoder or leave it as is. Most probably we can leave it, since the Llama decoder was already stripping left and replacing. This is the tricky part, since you need to guess what to do for our users: if add_prefix_space, we need to make sure the decoder does not remove it, etc.
      • IMO we can just replace the decoder if it is Strip + Replace with a Metaspace decoder + the right prepend scheme!
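The decision tree above can be sketched as a small function over dict-shaped components (shapes are illustrative; `prepend_scheme` is the real Metaspace parameter name in `tokenizers`, but the rest is a sketch, not the actual implementation):

```python
def resolve(legacy, add_prefix_space, pre_tokenizer, decoder):
    """Sketch of the reviewer's decision table."""
    if legacy:
        return pre_tokenizer, decoder  # legacy: touch nothing
    scheme = "first" if add_prefix_space else "never"
    if pre_tokenizer and pre_tokenizer.get("type") == "Metaspace":
        pre_tokenizer = {**pre_tokenizer, "prepend_scheme": scheme}
        # A Strip + Replace decoder sequence (the Llama pattern) is swapped
        # for a Metaspace decoder carrying the same prepend scheme.
        if decoder and decoder.get("type") == "Sequence":
            decoder = {"type": "Metaspace", "prepend_scheme": scheme}
    return pre_tokenizer, decoder
```

With `legacy=True` both components pass through untouched; with `legacy=False` and `add_prefix_space=True` a Metaspace pre-tokenizer gets `prepend_scheme="first"` and a Sequence decoder is replaced to match.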

Collaborator Author


Thank you! Applied in the next revision!

notes on the decoder:

  • the decoder's add_prefix_space is updated if the decoder is ByteLevel
  • otherwise it is left as is
  • if there is no decoder, I do not add one
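These three decoder rules fit in one small helper, again on illustrative dict shapes rather than the real `tokenizers` objects:

```python
def update_decoder(decoder_state, add_prefix_space):
    # Only a ByteLevel decoder carries add_prefix_space; anything else,
    # including a missing decoder, is left untouched (none is added).
    if decoder_state is not None and decoder_state.get("type") == "ByteLevel":
        return {**decoder_state, "add_prefix_space": add_prefix_space}
    return decoder_state
```

So `update_decoder(None, True)` stays `None`, and a non-ByteLevel decoder passes through unchanged.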

@itazap itazap force-pushed the add_prefix_space_clean branch from 787ebd5 to 01dcf33 Compare June 18, 2024 14:51
@itazap itazap force-pushed the add_prefix_space_clean branch from 01dcf33 to ebf781e Compare June 18, 2024 14:52
@itazap itazap force-pushed the add_prefix_space_clean branch from a3e054d to f48f1b4 Compare June 19, 2024 15:10
@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Jul 23, 2024
@itazap itazap reopened this Aug 2, 2024
Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks for working on this feature!
Let's wait a bit for a simpler tokenizers API. I am planning to introduce access to the normalizer.Sequence's individual elements; this will make this a lot simpler!

@ArthurZucker
Collaborator

#1590 supports:

  • indexing
  • updating with None
  • updating the indexed element
