Skip RoFormer ONNX test if rjieba not installed #16981
Conversation
The documentation is not available anymore as the PR was closed or merged.
Force-pushed from 94578a0 to 5ab3067
So this test is then currently just skipped on our ONNX tests? Should we maybe rather add `rjieba` as a test dependency?
It looks like RoFormer tokenization is completely untested, yes, so this package should be added to the test dependencies.
I agree we should include `rjieba` in the test dependencies.

Further remark: if we could not add `rjieba` as a dependency, skipping the test cleanly would be the fallback.
Thanks for the feedback - I'll add `rjieba` to the test dependencies.
Reviewed diff:

    # can't serialise custom PreTokenizer
    def test_save_slow_from_fast_and_reload_fast(self):
        pass
This test was failing with Exception: Custom PreTokenizer cannot be serialized:
    def test_save_slow_from_fast_and_reload_fast(self):
        if not self.test_slow_tokenizer or not self.test_rust_tokenizer:
            # we need both slow and fast versions
            return
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):
                with tempfile.TemporaryDirectory() as tmp_dir_1:
                    # Here we check that even if we have initialized a fast tokenizer with a tokenizer_file we can
                    # still save only the slow version and use these saved files to rebuild a tokenizer
                    tokenizer_fast_old_1 = self.rust_tokenizer_class.from_pretrained(
                        pretrained_name, **kwargs, use_fast=True
                    )
                    tokenizer_file = os.path.join(tmp_dir_1, "tokenizer.json")
>                   tokenizer_fast_old_1.backend_tokenizer.save(tokenizer_file)
E                   Exception: Custom PreTokenizer cannot be serialized
Looking at the similar issue huggingface/tokenizers#613, it seems that RoFormer belongs to the class of tokenizers that can't be saved with tokenizer.backend_tokenizer.save(). Here's an example to reproduce:
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("junnyu/roformer_chinese_small", use_fast=True)
    tokenizer.backend_tokenizer.save(".")

I'm not very familiar with RoFormer, so I'm happy to debug this further if we expect this test really should pass.
This one I think we can definitely skip, since this tokenizer cannot be saved in the "fast" format.
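For reference, the skip pattern agreed on here can be sketched in isolation. This is an illustrative stand-in, not the actual transformers test suite: `FakeBackendTokenizer` and the test class are hypothetical, and one common variant uses `unittest.skip` rather than an empty `pass` body.

```python
import unittest


class FakeBackendTokenizer:
    def save(self, path):
        # Mimics RoFormer's fast backend: the custom PreTokenizer
        # cannot be serialized, so save() raises.
        raise Exception("Custom PreTokenizer cannot be serialized")


class RoFormerTokenizationDemo(unittest.TestCase):
    # can't serialise custom PreTokenizer
    @unittest.skip("can't serialise custom PreTokenizer")
    def test_save_slow_from_fast_and_reload_fast(self):
        FakeBackendTokenizer().save("tokenizer.json")


result = unittest.TextTestRunner().run(
    unittest.defaultTestLoader.loadTestsFromTestCase(RoFormerTokenizationDemo)
)
print(result.testsRun, len(result.skipped))  # 1 1
```

The runner records the test as skipped rather than failed, which is exactly the behavior the review settles on.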
Reviewed diff:

    # can't serialise custom PreTokenizer
    def test_saving_tokenizer_trainer(self):
        pass
This test fails for a different reason - as far as I can tell, saving the fast tokenizer and loading it again fails because vocab_file is None in the __init__:
src/transformers/tokenization_utils_base.py:1783: in from_pretrained
    return cls._from_pretrained(
src/transformers/tokenization_utils_base.py:1809: in _from_pretrained
    slow_tokenizer = (cls.slow_tokenizer_class)._from_pretrained(
src/transformers/tokenization_utils_base.py:1918: in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
src/transformers/models/roformer/tokenization_roformer.py:145: in __init__
    if not os.path.isfile(vocab_file):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

path = None

    def isfile(path):
        """Test whether a path is a regular file"""
        try:
>           st = os.stat(path)
E           TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
Here's an example to reproduce:
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("junnyu/roformer_chinese_small", use_fast=True)
    tokenizer.save_pretrained("./tmp/roformer", legacy_format=False)
    AutoTokenizer.from_pretrained("./tmp/roformer/")
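The TypeError at the bottom of the traceback can be reproduced in isolation: passing None to os.path.isfile fails inside os.stat before any filesystem access. A minimal sketch:

```python
import os

# os.path.isfile swallows OSError/ValueError, but not TypeError,
# so a None vocab_file propagates exactly as in the traceback.
try:
    os.path.isfile(None)
    outcome = "no error"
except TypeError as err:
    outcome = type(err).__name__

print(outcome)  # TypeError
```

This is why the failure surfaces only at reload time: nothing validates `vocab_file` until `__init__` tries to stat it.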
I think you discovered a bug here @lewtun!
I've done some digging, and I think the bug is that the vocab file is not correctly defined for RoFormer.
Could you replace

    VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"}

by:

    VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt", "tokenizer_file": "tokenizer.json"}

This should make your code snippet above work and hopefully also make the test pass.
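To see why the extra "tokenizer_file" key matters, here is an illustrative sketch (not the transformers implementation): loading resolves one file per key in the names mapping, and any key that is absent, or whose file was never written, resolves to None.

```python
# Hypothetical names/files below are for illustration only.
OLD_NAMES = {"vocab_file": "vocab.txt"}
NEW_NAMES = {"vocab_file": "vocab.txt", "tokenizer_file": "tokenizer.json"}

# What a save in the non-legacy (fast) format writes to disk:
# only the serialized tokenizer.json, no slow-format vocab.txt.
saved_files = {"tokenizer.json"}


def resolve(names):
    # Map each expected key to a file that actually exists, else None.
    return {key: (fname if fname in saved_files else None) for key, fname in names.items()}


print(resolve(OLD_NAMES))  # {'vocab_file': None} -> slow reload hits None
print(resolve(NEW_NAMES))  # tokenizer.json is found, so the fast tokenizer can load
```

With the old mapping, every resolvable file is None, so the fallback slow tokenizer is constructed with vocab_file=None and crashes in os.path.isfile, matching the traceback above.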
Thanks for the advice! Fixed in 1abc58a
With this fix, the following slow tests now all pass:
RUN_SLOW=1 pytest tests/roformer/test_tokenization_roformer.py -s
Hey @sgugger @patrickvonplaten I'm hitting some peculiar issues with 2 of the slow tests of the RoFormer tokenizer. Would you mind taking a look and seeing if my decision to skip them is valid?
I'll let @patrickvonplaten decide as I know nothing on that model too :-)
patrickvonplaten
left a comment
Think we can fix 1 test by correcting a bug as mentioned in a comment. Happy to skip the other test and merge afterward :-)
There is a test job dedicated to custom tokenizers with specific dependencies: https://github.com/huggingface/transformers/blob/main/.circleci/config.yml#L538 It installs the extra packages these tokenizers require.

Thanks for the tip! Done in 3cafcb2
      - v0.4-{{ checksum "setup.py" }}
      - run: pip install --upgrade pip
    - - run: pip install .[ja,testing,sentencepiece,jieba,spacy,ftfy]
    + - run: pip install .[ja,testing,sentencepiece,jieba,spacy,ftfy,rjieba]
This adds `rjieba` to the custom tokenizer tests on CircleCI.
Hey @patrickvonplaten @LysandreJik I think this PR is ready for a final pass :) The failing test is unrelated to the PR itself (a failing Pegasus generate test)
sgugger
left a comment
There was a problem hiding this comment.
You'll need to rebase for the move of the test files. Otherwise LGTM!
* Skip RoFormer ONNX test if rjieba not installed
* Update deps table
* Skip RoFormer serialization test
* Fix RoFormer vocab
* Add rjieba to CircleCI
What does this PR do?
This PR adds the `@require_rjieba` decorator to the slow ONNX tests to deal with the following error in our daily CI runs:

I wasn't sure if `rjieba` should actually be installed in the GitHub workflow, but it doesn't seem to be the case for the RoFormer tests, so I omitted that for now.

Edit: I've added `rjieba` to the `"tests"` extras and also tested that the slow ONNX test passes when this dep is installed:
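The idea behind a require-style decorator like `@require_rjieba` can be sketched as follows. `require_pkg` is a hypothetical stand-in, not the transformers helper: it skips the decorated test whenever the named optional dependency is not importable.

```python
import importlib.util
import unittest


def require_pkg(pkg):
    """Skip the decorated test when an optional dependency is missing."""
    def decorator(fn):
        if importlib.util.find_spec(pkg) is None:
            return unittest.skip(f"test requires {pkg}")(fn)
        return fn
    return decorator


class Demo(unittest.TestCase):
    # A deliberately nonexistent package name, so the skip path triggers.
    @require_pkg("surely_not_an_installed_package")
    def test_needs_dep(self):
        self.fail("should never run without the dependency")


result = unittest.TextTestRunner().run(
    unittest.defaultTestLoader.loadTestsFromTestCase(Demo)
)
print(result.wasSuccessful(), len(result.skipped))  # True 1
```

Because the test is reported as skipped rather than failed, the CI run stays green on machines where the optional dependency is absent, which is the behavior this PR wants for the ONNX tests.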