-
Notifications
You must be signed in to change notification settings - Fork 32.5k
Description
Feature request
I would like to propose adding a skip_special_tokens parameter to the .encode() method in Transformers. Currently, in order to achieve this behavior, I have to either create two different tokenizers or use a workaround such as inserting a character in the middle of a special token and then removing it to simulate the desired behavior.
Motivation
The motivation for this feature request is that in real-world scenarios, users may enter any type of textual data, including special tokens used by the tokenizer. If the tokenizer were to tokenize the user's input as is, it would cause confusion for the whole model and impact the performance of the product. The skip_special_tokens parameter is essential for ensuring the correct processing of user inputs, not just for the decode() method but also for the encode() and __call__() methods.
Your contribution
I have implemented my own tokenizer that inherits from Transformers and simulates this behavior by removing the special tokens from the vocab before encoding. However, I believe this approach would not be efficient for scaling up, as it would cause a lot of memory allocations and deallocations.
To address this issue, I suggest implementing two separate dictionaries, one for special tokens and one for the vocabulary, and incorporating an if-statement to test for the skip_special_tokens parameter. This would make the implementation performant and efficient.
Thank you for considering this feature request.