Token classification pipeline results different with tokenizers==0.11.6 vs tokenizers==0.12.0

I'm not sure whether this is an issue with transformers, an issue with tokenizers, or expected behavior. When running the token classification pipeline with aggregation_strategy="simple", the results differ with tokenizers==0.12.0: the `word` field comes back as a list of token strings instead of a single joined string.

The following code produces different results (both examples use transformers==4.17.0).

```python
from transformers import pipeline
nlp = pipeline("token-classification")
nlp("Hugging Face Inc. is a company based in New York City", aggregation_strategy="simple")
```

With tokenizers==0.11.6:

```
[{'entity_group': 'ORG', 'score': 0.99305606, 'word': 'Hugging Face Inc', 'start': 0, 'end': 16}, {'entity_group': 'LOC', 'score': 0.9988098, 'word': 'New York City', 'start': 40, 'end': 53}]
```

With tokenizers==0.12.0:

```
[{'entity_group': 'ORG', 'score': 0.99305606, 'word': ['Hu', 'gging', ' Face', ' Inc'], 'start': 0, 'end': 16}, {'entity_group': 'LOC', 'score': 0.9988098, 'word': ['New', ' York', ' City'], 'start': 40, 'end': 53}]
```
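
To compare the two outputs, I put together a small helper that collapses a list-valued `word` back into a single string. This is only a sketch, not a fix in either library: `normalize_word` is a hypothetical name of my own, and it assumes the token pieces can simply be concatenated, which is what the output above suggests.

```python
def normalize_word(entity):
    """Return a copy of the entity dict whose 'word' is a single string.

    Hypothetical helper: assumes list-valued 'word' entries are token
    pieces that can be concatenated as-is (the pieces in the output
    above already carry their own leading spaces).
    """
    word = entity["word"]
    if isinstance(word, (list, tuple)):
        word = "".join(word)
    return {**entity, "word": word.strip()}


# Outputs copied from the two runs above.
old_output = [
    {"entity_group": "ORG", "score": 0.99305606, "word": "Hugging Face Inc", "start": 0, "end": 16},
    {"entity_group": "LOC", "score": 0.9988098, "word": "New York City", "start": 40, "end": 53},
]
new_output = [
    {"entity_group": "ORG", "score": 0.99305606, "word": ["Hu", "gging", " Face", " Inc"], "start": 0, "end": 16},
    {"entity_group": "LOC", "score": 0.9988098, "word": ["New", " York", " City"], "start": 40, "end": 53},
]

normalized = [normalize_word(e) for e in new_output]
```

With this applied, the 0.12.0 output matches the 0.11.6 output for this sentence, so the scores and offsets are unchanged and only the type of `word` differs.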



Token classification pipeline results different with tokenizers==0.11.6 vs tokenizers==0.12.0 #16520
