Adding a skip_special_tokens Parameter to .encode() in Transformers

### Feature request

I would like to propose adding a `skip_special_tokens` parameter to the `.encode()` method in Transformers. Currently, to get this behavior, I have to either maintain two different tokenizers or use a workaround: inserting a character in the middle of each special token and then removing it again.
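For illustration, the character-insertion workaround could look roughly like the sketch below. This is one reading of the workaround, not code from Transformers: `break_special_tokens`, `drop_marker_ids`, and the zero-width-space marker are hypothetical names and choices, and the sketch assumes every special token is at least two characters long and that the tokenizer maps the marker to a single, known ID (which a real BPE or byte-level tokenizer may not do).

```python
def break_special_tokens(text, special_tokens, marker="\u200b"):
    """Insert `marker` after the first character of every special-token
    occurrence so the tokenizer no longer recognises it as special.
    Hypothetical helper, not part of Transformers."""
    # Handle longer tokens first so overlapping tokens are broken consistently.
    for tok in sorted(special_tokens, key=len, reverse=True):
        text = text.replace(tok, tok[0] + marker + tok[1:])
    return text


def drop_marker_ids(ids, marker_id):
    """Remove the marker's ID from the encoded sequence.
    Assumes the marker was encoded as exactly one ID (`marker_id`)."""
    return [i for i in ids if i != marker_id]
```

The text would be passed through `break_special_tokens` before encoding and the resulting IDs through `drop_marker_ids` afterwards, which is why the approach is fragile: it depends on how the tokenizer happens to encode the marker.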

### Motivation

In real-world scenarios, users may enter any kind of text, including strings that happen to match the tokenizer's special tokens. If the tokenizer encodes that input as is, those strings are turned into special tokens, which can confuse the model and hurt the performance of the product. A `skip_special_tokens` parameter is therefore essential for processing user input correctly, not just in the `decode()` method but also in the `encode()` and `__call__()` methods.

### Your contribution

I have implemented my own tokenizer that subclasses a Transformers tokenizer and simulates this behavior by removing the special tokens from the vocab before encoding. However, I believe this approach **would not be efficient** at scale, as it would cause a lot of memory allocations and deallocations.

To address this issue, I suggest implementing **two separate dictionaries**, one for special tokens and one for the regular vocabulary, and adding an if-statement that checks the `skip_special_tokens` parameter. Encoding could then ignore the special-token dictionary when requested, without modifying the vocabulary on each call.
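A minimal sketch of the two-dictionary idea, using a toy character-level tokenizer rather than the actual Transformers implementation. `ToyTokenizer`, its constructor arguments, and `unk_id` are hypothetical and exist only to illustrate the proposed `skip_special_tokens` check:

```python
import re


class ToyTokenizer:
    """Toy character-level tokenizer; a hypothetical illustration only."""

    def __init__(self, vocab, special_tokens, unk_id):
        self.vocab = dict(vocab)                    # ordinary token -> id
        self.special_tokens = dict(special_tokens)  # special token -> id
        self.unk_id = unk_id
        # Match longer special tokens first so overlaps resolve greedily.
        ordered = sorted(self.special_tokens, key=len, reverse=True)
        self._special_re = (
            re.compile("(" + "|".join(re.escape(t) for t in ordered) + ")")
            if ordered
            else None
        )

    def _encode_plain(self, text):
        # Look up every character in the ordinary vocabulary only.
        return [self.vocab.get(ch, self.unk_id) for ch in text]

    def encode(self, text, skip_special_tokens=False):
        # The proposed if-statement: bypass the special-token dictionary.
        if skip_special_tokens or self._special_re is None:
            return self._encode_plain(text)
        ids = []
        # re.split with a capturing group keeps the matched special tokens.
        for piece in self._special_re.split(text):
            if not piece:
                continue
            if piece in self.special_tokens:
                ids.append(self.special_tokens[piece])
            else:
                ids.extend(self._encode_plain(piece))
        return ids


chars = "<>/abcdefghijklmnopqrstuvwxyz "
tokenizer = ToyTokenizer(
    vocab={ch: i for i, ch in enumerate(chars)},
    special_tokens={"<s>": 100, "</s>": 101},
    unk_id=len(chars),
)
default_ids = tokenizer.encode("hi</s>")
literal_ids = tokenizer.encode("hi</s>", skip_special_tokens=True)
```

With the default call, `</s>` is emitted as a single special-token ID; with `skip_special_tokens=True`, the same string is encoded character by character as ordinary text. Neither path modifies the vocabulary, which is the point of keeping the two dictionaries separate.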

Thank you for considering this feature request.

Adding a skip_special_tokens Parameter to .encode() in Transformers #22490

Feature request

Motivation

Your contribution

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Adding a skip_special_tokens Parameter to .encode() in Transformers #22490

Description

Feature request

Motivation

Your contribution

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions