SPLIT PR: add_prefix_space fix#31315

Open
itazap wants to merge 35 commits into main from add_prefix_space_clean

Conversation

@itazap
Collaborator

@itazap itazap commented Jun 7, 2024

Fixes portion of #30824

Adds support for add_prefix_space.

TASKS:

convert_slow_tokenizer.py

  • add_prefix_space is set from the original_tokenizer when available, otherwise inferred from the proto's normalizer.

tokenization_llama.py

  • add_prefix_space is updated based on normalizer if unset.

tokenization_llama_fast.py

  • remove forcing a from_slow conversion when add_prefix_space is not set (see the tokenization_utils_fast.py updates for how this is handled without conversion). This allows passing the field without sentencepiece installed.

tokenization_seamless_m4t.py

  • updated the 'Copied from' comment (ruff complained)

tokenization_t5.py

  • do not hard-code add_prefix_space to True; set it based on the normalizer

tokenization_t5_fast.py

  • set add_prefix_space; do not force from_slow when it is set

tokenization_utils_fast.py

  • set add_prefix_space
  • allow using add_prefix_space without converting from slow, so it can be used without sentencepiece installed
  • update the rust pre_tokenizer and normalizer based on add_prefix_space and the underlying normalizer; legacy logic copied from convert_slow_tokenizer.py
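The "inferred from the normalizer" idea can be sketched with plain dicts shaped like the serialized normalizer state the PR reads via `json.loads(normalizer.__getstate__())`. The function name and exact field names here are illustrative, not the actual transformers implementation:

```python
def infer_add_prefix_space(normalizer_state, explicit=None):
    """Sketch: honor an explicitly passed add_prefix_space; otherwise
    infer it from the serialized normalizer (a Prepend normalizer is the
    slow-tokenizer way of adding a prefix space)."""
    if explicit is not None:
        return explicit
    if normalizer_state is None:
        return None  # nothing to infer from, leave unset
    items = (
        normalizer_state["normalizers"]
        if normalizer_state.get("type") == "Sequence"
        else [normalizer_state]
    )
    return any(n.get("type") == "Prepend" for n in items)
```

For example, `infer_add_prefix_space({"type": "Prepend", "prepend": "▁"})` yields `True`, while a sequence with no Prepend yields `False`.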

TESTS:

  • test legacy tokenizer in llama
  • test legacy tokenizer in t5
  • test tokenizer without sentencepiece installation (mock)
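The third test item, mocking a missing sentencepiece install, could be sketched like this. Everything here is a stand-in (the `deps` namespace and `init_fast_tokenizer` are hypothetical, not transformers APIs); the point is only the mocking pattern:

```python
from types import SimpleNamespace
from unittest import mock

# Stand-in for the real availability check (illustrative only).
deps = SimpleNamespace(is_sentencepiece_available=lambda: True)

def init_fast_tokenizer(add_prefix_space=None):
    # With the fix, add_prefix_space no longer forces a from_slow
    # conversion, so a missing sentencepiece install must not break init.
    return {"add_prefix_space": add_prefix_space, "from_slow": False}

# Simulate sentencepiece being absent and check init still succeeds.
with mock.patch.object(deps, "is_sentencepiece_available", return_value=False):
    assert not deps.is_sentencepiece_available()
    tok = init_fast_tokenizer(add_prefix_space=True)

assert tok == {"add_prefix_space": True, "from_slow": False}
```

`mock.patch.object` restores the original attribute when the `with` block exits, so the check behaves normally afterwards.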

Reviewer:
@ArthurZucker

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@itazap itazap force-pushed the add_prefix_space_clean branch from 61c9d5b to 1dd79be Compare June 10, 2024 11:47
@itazap itazap requested a review from ArthurZucker June 13, 2024 11:00
@itazap itazap marked this pull request as ready for review June 13, 2024 13:28
Collaborator

@ArthurZucker ArthurZucker left a comment


Cool! Nice splitting, let's simplify a bit when we do update the pre_tokenizer and normalizer + add mock tests!

Collaborator


Again, the last piece that is missing for me here is to update the __getstate__ that is used when we save the tokenizer, to make sure we reset the normalizer_spec to self.add_prefix_space

Collaborator Author


Looked into it: normalizer_spec.add_dummy_prefix is already reset when loading from a saved tokenizer. Checked with "huggyllama/llama-7b", and normalizer_spec.add_dummy_prefix was set to True before again being forced to False.

sequence["normalizers"] = [pt for pt in curr_state["normalizers"] if pt["type"] not in ["Prepend"]]
else:
return
if getattr(self, "legacy", True):
Collaborator


If self.legacy, we should not touch anything. What we want to fix is for people who don't want the legacy behaviour but don't want to do the slow conversion / don't have sentencepiece installed!

Collaborator


If self.legacy is True, we just return.

"""Updates the underlying normalizer with the current `add_prefix_space` and `legacy` settings."""
sequence = json.loads(normalizers.Sequence([]).__getstate__())
final_sequence = normalizers.Sequence([])
if self._tokenizer.normalizer is not None and type(self._tokenizer.normalizer) in (
Collaborator


We can use isinstance(self._tokenizer.normalizer, (normalizers.Prepend, normalizers.Precompiled)) rather than type

normalizers.Precompiled,
):
# If normalizer is not a Sequence, add it to a sequence
sequence["normalizers"].append(json.loads(self._tokenizer.normalizer.__getstate__().decode("utf-8")))
Collaborator


If we have a Prepend normalizer, we want to remove it (of course, everything should be inside the legacy is False branch).

Here again, we only want to update things for people who do not want to convert from slow / who load the fast tokenizer with legacy set to False.
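The removal the reviewer describes can be sketched on the serialized state, mirroring the list-comprehension filter in the diff above (the state dict is illustrative, shaped like `json.loads(tokenizer.normalizer.__getstate__())` might be for a Llama-style tokenizer):

```python
import json

legacy = False

# Illustrative serialized normalizer state for a sentencepiece-derived tokenizer.
state = {
    "type": "Sequence",
    "normalizers": [
        {"type": "Prepend", "prepend": "▁"},
        {"type": "Replace", "pattern": {"String": " "}, "content": "▁"},
    ],
}

if not legacy:
    # Non-legacy path: drop Prepend; the prefix space will come from the
    # Metaspace pre-tokenizer's prepend_scheme instead.
    state["normalizers"] = [n for n in state["normalizers"] if n["type"] != "Prepend"]

remaining = [n["type"] for n in state["normalizers"]]
print(json.dumps(remaining))  # prints ["Replace"]
```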

def _update_pre_tokenizer(self):
"""Updates the underlying pre-tokenizer with the current `add_prefix_space` setting."""

# 'add_prefix_space' not passed, try to read from normalizer, otherwise do not set.
Collaborator


Here we should also first check if we have legacy set to False. If it is True we return without doing anything. If legacy is set to False, then we can check if add_prefix_space was passed or not.

If it is not passed, then we :

  • remove Prepend if it was in the sequence. I think shipping a native pythonic way of doing this is going to be a priority for me
  • add the Metaspace pre-tokenizer if we had the Prepend normalizer
  • set prepend to "first" if we had Metaspace
  • set add_prefix_space of the ByteLevel pre-tokenizer to true/false if there was one
  • do the same for the decoder (if we update the pre_tokenizer we need to update the decoder as well)
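These steps can be sketched as one pass over dict-shaped components, in the same json-state style the PR uses. The function name and dict shapes are illustrative (the real objects are the Rust-backed `tokenizers` classes):

```python
def apply_prefix_space_fix(components, add_prefix_space):
    """Sketch of the reviewer's checklist, on dict-shaped tokenizer parts."""
    norms = list(components.get("normalizers", []))
    pre = dict(components["pre_tokenizer"]) if components.get("pre_tokenizer") else None
    dec = dict(components["decoder"]) if components.get("decoder") else None

    had_prepend = any(n["type"] == "Prepend" for n in norms)
    # 1. Remove Prepend from the normalizer sequence.
    norms = [n for n in norms if n["type"] != "Prepend"]
    # 2. If we had a Prepend normalizer, its job moves to a Metaspace pre-tokenizer.
    if had_prepend and pre is None:
        pre = {"type": "Metaspace"}
    # 3./4. Propagate add_prefix_space to a Metaspace or ByteLevel pre-tokenizer.
    if pre is not None and pre["type"] == "Metaspace":
        pre["prepend_scheme"] = "first" if add_prefix_space else "never"
    if pre is not None and pre["type"] == "ByteLevel":
        pre["add_prefix_space"] = add_prefix_space
    # 5. Keep a matching decoder consistent with the pre-tokenizer.
    if dec is not None and pre is not None and dec["type"] == pre["type"]:
        dec.update({k: v for k, v in pre.items() if k != "type"})
    return {"normalizers": norms, "pre_tokenizer": pre, "decoder": dec}
```

For a Llama-like starting state with only a Prepend normalizer, `apply_prefix_space_fix({"normalizers": [{"type": "Prepend", "prepend": "▁"}], "pre_tokenizer": None, "decoder": None}, add_prefix_space=True)` removes the Prepend and yields a `{"type": "Metaspace", "prepend_scheme": "first"}` pre-tokenizer.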

Collaborator Author


@ArthurZucker Thank you for the detailed review!

Q.

  • set prepend to "first" if we had Metaspace: based on convert_slow_tokenizer, Prepend is only added when legacy=True, so should it always be set to "first" here (given we are not touching legacy)?

  • do the same for the decoder (if we update the pre_tokenizer we need to update the decoder as well): what if there was no pre-tokenizer and a Metaspace one was added? Do we need to add a ByteLevel decoder? (If yes, for which cases: prepend_scheme="first" and/or "never"?)

Collaborator

@ArthurZucker ArthurZucker Jun 20, 2024


  • if legacy:
    • do nothing
  • else:
    • if there is a Metaspace:
      • set prepend to "first" if add_prefix_space
      • set prepend to "never" if not add_prefix_space
    • if there is a Prepend normalizer:
      • we could check the type; mostly Llama is the one using it, so let's just do a test on the class. I don't expect other classes to need this!
      • replace it with a Metaspace, with either "first" or "never"
      • either replace the decoder or leave it as is. Most probably we can leave it, since the Llama decoder was already stripping left and replacing. This is the tricky part, since you need to guess what to do for our users: if add_prefix_space, we need to make sure the decoder does not remove it, etc.
      • IMO we can just replace the decoder if it is Strip + Replace with a Metaspace decoder + the right prepend scheme!
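The decision tree above can be sketched as a small function over dict-shaped components (shapes are illustrative; `prepend_scheme` is the real Metaspace parameter name in `tokenizers`, but the rest is a sketch, not the actual implementation):

```python
def resolve(legacy, add_prefix_space, pre_tokenizer, decoder):
    """Sketch of the reviewer's decision table."""
    if legacy:
        return pre_tokenizer, decoder  # legacy: touch nothing
    scheme = "first" if add_prefix_space else "never"
    if pre_tokenizer and pre_tokenizer.get("type") == "Metaspace":
        pre_tokenizer = {**pre_tokenizer, "prepend_scheme": scheme}
        # A Strip + Replace decoder sequence (the Llama pattern) is swapped
        # for a Metaspace decoder carrying the same prepend scheme.
        if decoder and decoder.get("type") == "Sequence":
            decoder = {"type": "Metaspace", "prepend_scheme": scheme}
    return pre_tokenizer, decoder
```

With `legacy=True` both components pass through untouched; with `legacy=False` and `add_prefix_space=True` a Metaspace pre-tokenizer gets `prepend_scheme="first"` and a Sequence decoder is replaced to match.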

Collaborator Author


Thank you! Applied in the next revision!

notes on the decoder:

  • the decoder's add_prefix_space is updated if the decoder is ByteLevel
  • otherwise it is left as is
  • if there is no decoder, I do not add one
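These three decoder rules fit in one small helper, again on illustrative dict shapes rather than the real `tokenizers` objects:

```python
def update_decoder(decoder_state, add_prefix_space):
    # Only a ByteLevel decoder carries add_prefix_space; anything else,
    # including a missing decoder, is left untouched (none is added).
    if decoder_state is not None and decoder_state.get("type") == "ByteLevel":
        return {**decoder_state, "add_prefix_space": add_prefix_space}
    return decoder_state
```

So `update_decoder(None, True)` stays `None`, and a non-ByteLevel decoder passes through unchanged.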

@itazap itazap force-pushed the add_prefix_space_clean branch from 787ebd5 to 01dcf33 Compare June 18, 2024 14:51
@itazap itazap force-pushed the add_prefix_space_clean branch from 01dcf33 to ebf781e Compare June 18, 2024 14:52
@itazap itazap force-pushed the add_prefix_space_clean branch from a3e054d to f48f1b4 Compare June 19, 2024 15:10
@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Jul 23, 2024
@itazap itazap reopened this Aug 2, 2024
Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks for working on this feature!
Let's wait a bit for a simpler tokenizers API. I am planning to introduce access to the normalizer.Sequence's individual elements; this will make this a lot simpler!

@ArthurZucker
Collaborator

#1590 supports:

  • indexing
  • updating with None
  • updating the indexed element
