System Info
- `transformers` version: 4.34.0.dev0
- Platform: macOS-13.5.1-arm64-arm-64bit
- Python version: 3.9.1
- Huggingface_hub version: 0.16.4
- Safetensors version: 0.3.1
- Accelerate version: 0.20.3
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.0.dev20230822 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- Create a model using version 4.33.3 (or anything earlier than main):

```python
from transformers import AutoTokenizer, GPTBigCodeConfig, GPTBigCodeForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small", padding_side="left")
config = GPTBigCodeConfig(
    vocab_size=len(tokenizer),  # note: length of tokenizer=384
    n_embd=16,
    n_layer=2,
    n_head=2,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)
model = GPTBigCodeForCausalLM(config)
model.save_pretrained("/path/to/model")
```

- Using the latest version on main (4.34.0.dev0), load the model and tokenizer:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# load the model and tokenizer using the version on the main branch
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small", padding_side="left")
# note: the model still has vocab_size=384 from the prior version, while the
# tokenizer now has length 381 in the current version
model = AutoModelForCausalLM.from_pretrained("/path/to/model")
```

- Simulated prediction of the model returns `generated_ids` that are between 381-384:
```python
# model predicts some value between 381-384
generated_ids = model.generate(...)
```

- Perform `batch_decode` on `generated_ids`:
```python
# note: generated_ids will contain a value between 381-384
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
```

This results in:
```python
def convert_tokens_to_string(self, tokens):
    """Converts a sequence of tokens (string) in a single string."""
    bstring = b""
    for token in tokens:
        if token in self.added_tokens_decoder:
            tok_string = self.added_tokens_decoder[token].encode("utf-8")
        elif token in self.added_tokens_encoder:
            tok_string = token.encode("utf-8")
        else:
>           tok_string = bytes([ord(token)])
E           ValueError: bytes must be in range(0, 256)
```
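The error comes from the final branch above: `bytes([ord(token)])` only succeeds when the token is a single character whose code point is below 256. A minimal standalone sketch of that failure mode (not using the tokenizer itself):

```python
# Reproduce the failing expression in isolation: bytes([n]) requires 0 <= n < 256.
def token_to_byte(token: str) -> bytes:
    return bytes([ord(token)])

print(token_to_byte("A"))    # a code point below 256 decodes fine
try:
    token_to_byte(chr(384))  # an out-of-vocab id mapped to a code point >= 256
except ValueError as exc:
    print(exc)               # bytes must be in range(0, 256)
```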
Expected behavior
The tokenizer length should not change from version to version; this can cause consistency issues between models and tokenizers.
Version 4.33.3:

```python
len(tokenizer)  # returns 384
```

Version 4.34.0.dev0:

```python
len(tokenizer)  # returns 381
```
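Until the lengths are consistent again, one possible stopgap (a sketch, not an official fix; `filter_oov_ids` is a hypothetical helper) is to drop generated ids the tokenizer no longer covers before decoding:

```python
# Hypothetical workaround: strip generated ids that fall outside the
# tokenizer's current vocabulary before calling batch_decode.
def filter_oov_ids(batch, vocab_len):
    return [[i for i in seq if i < vocab_len] for seq in batch]

# e.g. with len(tokenizer) == 381, the out-of-range id 382 is dropped:
print(filter_oov_ids([[10, 382, 20]], 381))  # [[10, 20]]
```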