Skip to content

SPLIT PR: add user defined symbols and control symbols#31305

Merged
itazap merged 9 commits intomainfrom
user_defined_symbols_add_30824
Jun 21, 2024
Merged

SPLIT PR: add user defined symbols and control symbols#31305
itazap merged 9 commits intomainfrom
user_defined_symbols_add_30824

Conversation

@itazap
Copy link
Copy Markdown
Collaborator

@itazap itazap commented Jun 7, 2024

Fixes portion of #30824 ->

  • adds user_defined_symbols and control_symbols from proto so that they are not split during encoding / decoding.
  • tests gemma
  • removed same logic from GemmaConverter as it is handled in general SpmConverter class
  • updates common test (!!!) bc fast != slow since fast an read from SPM converter and get user_defined_symbols and control_symbols. @ArthurZucker thoughts on this?
  • copied test from common to camember and rembert tests
  • updated deberta v2 tests to not use '.' in test cases' texts because it is a user added token for fast tokenizers, so the spacing around it differs from slow.
  • add_special_tokens and add_tokens has same effect for now - but fix is already merged in Tokenizers by @ArthurZucker remove enforcement of non special when adding tokens tokenizers#1521

TODO in new PR:

@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@itazap itazap requested a review from ArthurZucker June 7, 2024 10:17
@itazap itazap changed the title PR SPLIT: moving origina changes for adding user defined symbols PR SPLIT: moving original changes for adding user defined symbols Jun 12, 2024
@ArthurZucker ArthurZucker removed their request for review June 12, 2024 15:13
@itazap itazap changed the title PR SPLIT: moving original changes for adding user defined symbols PR SPLIT: add user defined symbols and control symbols Jun 13, 2024
@itazap itazap requested a review from ArthurZucker June 13, 2024 08:39
@itazap itazap marked this pull request as ready for review June 13, 2024 13:28
@itazap itazap force-pushed the user_defined_symbols_add_30824 branch 6 times, most recently from 97d0d35 to befadd1 Compare June 17, 2024 08:57
@itazap itazap force-pushed the user_defined_symbols_add_30824 branch from befadd1 to 904121e Compare June 17, 2024 09:00
@itazap itazap changed the title PR SPLIT: add user defined symbols and control symbols SPLIT PR: add user defined symbols and control symbols Jun 18, 2024
Copy link
Copy Markdown
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I think we should check the type of the defined symbols see this comment offline:

            if piece.type == 4:
                tokens_to_add += [piece.piece]

regarding trainer_spec.user_defined_symbols not giving the same results

AddedToken(token, normalized=False, special=True) for token in self.proto.trainer_spec.control_symbols
]

tokenizer.add_tokens(user_defined_symbols)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! Related fix to only add once: huggingface/tokenizers@f2ec3b2

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe using map will be faster? but anyway it's good enough fro what we want !

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you change the encoded sequence here?

…e instead of trainer_spec for user_defined_symbols
@itazap itazap requested a review from ArthurZucker June 20, 2024 15:05
Copy link
Copy Markdown
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating! Last nit and good to go

# Add user defined symbols
user_defined_symbols = [
AddedToken(token, normalized=False, special=False)
for token in [p.piece for p in self.proto.pieces if p.type == 4]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add a comment that references the documentation of sentencepiece as to why we use 4 (arbitrary number)

AddedToken(token, normalized=False, special=True) for token in self.proto.trainer_spec.control_symbols
]

tokenizer.add_tokens(user_defined_symbols + control_symbols)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok this will work with the release of tokenizers, so fine for me! Good job

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can also propagate to gemma! (only has this for now:

  user_defined_symbols = [
            AddedToken(token, normalized=True, special=False) for token in proto.trainer_spec.user_defined_symbols
        ]
        tokenizer.add_tokens(user_defined_symbols)
       

@itazap itazap merged commit 1e79ead into main Jun 21, 2024
@itazap itazap deleted the user_defined_symbols_add_30824 branch June 21, 2024 08:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants