
length of tokenizer changes when using main branch causing batch_decode to fail #26452

@JRosenkranz

Description

System Info

  • transformers version: 4.34.0.dev0
  • Platform: macOS-13.5.1-arm64-arm-64bit
  • Python version: 3.9.1
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.1
  • Accelerate version: 0.20.3
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0.dev20230822 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Create and save a model using version 4.33.3 (or any version earlier than main):
from transformers import AutoTokenizer, GPTBigCodeConfig, GPTBigCodeForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small", padding_side="left")

config = GPTBigCodeConfig(
    vocab_size=len(tokenizer), # note: length of tokenizer=384
    n_embd=16,
    n_layer=2,
    n_head=2,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id
)
model = GPTBigCodeForCausalLM(config)
model.save_pretrained("/path/to/model")
  2. Using the latest version on main (4.34.0.dev0), load the model and tokenizer:
# load the model and tokenizer using the version on the main branch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small", padding_side="left")

# note: model here now has vocab_size=384 from prior version, and tokenizer has length of 381 from current version
model = AutoModelForCausalLM.from_pretrained("/path/to/model")
  3. Run generation; since the model has vocab_size 384 but the tokenizer now has length 381, generated_ids can contain IDs in the 381-383 range:
# model may predict IDs in the 381-383 range
generated_ids = model.generate(...)
  4. Perform batch_decode on the generated_ids:
# note: generated_ids will contain a value between 381-384
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

results in:

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        bstring = b""
        for token in tokens:
            if token in self.added_tokens_decoder:
                tok_string = self.added_tokens_decoder[token].encode("utf-8")
            elif token in self.added_tokens_encoder:
                tok_string = token.encode("utf-8")
            else:
>               tok_string = bytes([ord(token)])
E               ValueError: bytes must be in range(0, 256)
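
The failure can also be reproduced without running generation at all, since any ID at or above the new tokenizer length hits the same code path. A minimal sketch, using 382 as an arbitrary ID in the problematic 381-383 range:

# sketch: decode an ID that is valid under the old vocab_size (384)
# but at or above the new tokenizer length (381)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small", padding_side="left")
print(len(tokenizer))  # 384 on 4.33.3, 381 on 4.34.0.dev0

tokenizer.batch_decode([[382]], skip_special_tokens=True)
# on 4.34.0.dev0 this raises ValueError: bytes must be in range(0, 256)

As a stopgap (not a fix for the underlying change), out-of-range IDs can be dropped before decoding, e.g. [[i for i in seq if i < len(tokenizer)] for seq in generated_ids].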

Expected behavior

The tokenizer length should not change from version to version; a model whose vocab_size was derived from len(tokenizer) under an earlier release can otherwise become inconsistent with the tokenizer loaded under a newer one.

Version 4.33.3:

# returns 384
len(tokenizer)

Version 4.34.0.dev0:

# returns 381
len(tokenizer)
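
To see where the three-token difference comes from, the tokenizer's base vocabulary can be compared with its registered added/special tokens under each version. A small diagnostic sketch; the split between base vocab and added tokens is what is being inspected here, not something this report asserts:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small", padding_side="left")
print(tokenizer.vocab_size)              # size of the base vocabulary
print(len(tokenizer.get_added_vocab()))  # tokens registered on top of the base vocab
print(len(tokenizer))                    # the value used above as the model's vocab_size
print(tokenizer.all_special_tokens)      # special tokens the tokenizer reports

Running this under both 4.33.3 and 4.34.0.dev0 should show which entries stopped being counted in len(tokenizer).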
