I noticed that the tokenizer includes emotion- and annotation-style tokens such as [laughter] and [sigh], but the model doesn't seem to generate or follow them during inference, even in contexts where such expressions would be appropriate, such as transcriptions or dialogue. I'm curious: were these tokens actually used during training, and is there a way to prompt the model to use them more naturally? If they are not actively used, what's the rationale for keeping them in the tokenizer vocabulary? Could this inconsistency be due to a lack of properly annotated training data?