Skip to content

[DOCS] Update docstrings for GPT2 and Whisper tokenizer#26642

Merged
ydshieh merged 4 commits intohuggingface:mainfrom
McDonnellJoseph:docs/update-gpt2-docstring
Oct 12, 2023
Merged

[DOCS] Update docstrings for GPT2 and Whisper tokenizer#26642
ydshieh merged 4 commits intohuggingface:mainfrom
McDonnellJoseph:docs/update-gpt2-docstring

Conversation

@McDonnellJoseph
Copy link
Contributor

@McDonnellJoseph McDonnellJoseph commented Oct 6, 2023

What does this PR do?

  • [x ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x ] Did you read the contributor guideline,
    Pull Request section?
  • [ x] Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • [ x] Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

@McDonnellJoseph
Copy link
Contributor Author

@ydshieh Does this look good for you?

@McDonnellJoseph McDonnellJoseph changed the title [DOCS] Update docstrings for and tokenizer [DOCS] Update docstrings for GPT2 and Whisper tokenizer Oct 7, 2023
Copy link
Collaborator

@ydshieh ydshieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few nits but ready to go 🔥

eos_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
The end of sequence token.
pad_token (`str`, *optional*):
Id of the padding token in the vocabulary.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Id of the padding token in the vocabulary.
The token used for padding, for example when batching sequences of different lengths.

eos_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
The end of sequence token.
pad_token (`str`, *optional*):
Id of the padding token in the vocabulary.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Id of the padding token in the vocabulary.
The token used for padding, for example when batching sequences of different lengths.

@McDonnellJoseph
Copy link
Contributor Author

@ydshieh I tried troubleshooting on my side I don't understanding the difference between vocab_file and merges_file, both merges.txt and vocab.json from the GPT2 model card don't seem to work. Maybe something needs to be updated? I'd be happy to take a look

@ydshieh
Copy link
Collaborator

ydshieh commented Oct 9, 2023

Hi @McDonnellJoseph It's fine. No need to dive into those 2 files. I requests 2 tiny changes. Once you commit them, we are good to merge the PR.

@McDonnellJoseph
Copy link
Contributor Author

@ydshieh No problem I pushed the requested changes 😄

@McDonnellJoseph McDonnellJoseph force-pushed the docs/update-gpt2-docstring branch 2 times, most recently from 36e4143 to 04d9a82 Compare October 10, 2023 08:01
@ydshieh
Copy link
Collaborator

ydshieh commented Oct 10, 2023

Hi, thanks for the commit. May I know why there is irrelevant changes other than pad token id in the last commit pushed?

@McDonnellJoseph McDonnellJoseph force-pushed the docs/update-gpt2-docstring branch from 64c8e08 to b100203 Compare October 10, 2023 16:22
@McDonnellJoseph
Copy link
Contributor Author

Sorry my linter formatted automatically and I dind't notice it should be fixed now

@ydshieh
Copy link
Collaborator

ydshieh commented Oct 11, 2023

We are very close: just need to make style and push the change. tokenization_whisper.py has some format issue

@McDonnellJoseph
Copy link
Contributor Author

Ok I thing we're good now 😄

Copy link
Collaborator

@ydshieh ydshieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, thank you for the contribution 💯 !

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@ydshieh ydshieh merged commit b4199c2 into huggingface:main Oct 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants