💫 Rethink Doc.merge and Span.merge, add Token.split #1487

See also: #1474, #975, #758, #653, #616, #450, #429, #329, #375, #314, #214, #213, #156

Currently `Doc.merge()` and `Span.merge()` promise too much. They let you merge a span while holding references to other spans, and the span indices are supposed to be magically recalculated.

Unsurprisingly, this has proven very difficult to get right. It also makes `Doc.merge()` inefficient for repeated calls, because every time we merge something, we have to put the doc back into a consistent state.
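To make the index problem concrete, here is a minimal sketch in plain Python. It does not use spaCy at all: the "doc" is just a list of strings, a "span" is a `(start, end)` pair, and the `merge` helper is hypothetical. It only shows why a reference held across a merge goes stale unless something recalculates it.

```python
# Toy model only: a "doc" is a list of token strings and a "span" is a
# (start, end) index pair. None of this is spaCy's API.
tokens = ["I", "live", "in", "New", "York", "City", "near", "Central", "Park"]

city = (3, 6)  # "New York City"
park = (7, 9)  # "Central Park", a reference we keep holding


def merge(tokens, span):
    """Return a new token list with tokens[start:end] joined into one token."""
    start, end = span
    return tokens[:start] + [" ".join(tokens[start:end])] + tokens[end:]


merged = merge(tokens, city)

# The old indices for `park` now point past the end of the shorter list.
stale_park = merged[park[0]:park[1]]

# Every span after the merge has to be shifted by the number of tokens removed.
shift = (city[1] - city[0]) - 1
fixed_park = merged[park[0] - shift:park[1] - shift]
```

Doing that shift for every live span, after every single merge, is the bookkeeping that is hard to get right and costly to repeat.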

We'd also really like to have a `token.split()` function that divides a token into several tokens. This is a big hole at the moment, and it really lets us down for languages like Chinese, where the tokenization is an important part of the annotation and may need to change as the doc moves through the pipeline.

I think we should consider having a `Doc.retokenize()` context manager, which you would need to activate before calling `span.merge()` or `token.split()`. This should allow us to make the retokenization more reliable and efficient. Within the block, we could keep a reference to all new `Span` and `Token` objects. Only `Span` and `Token` objects created during retokenization should be used during retokenization.
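One way the context manager idea could work is sketched below. This is a toy under stated assumptions, not a proposed implementation: `ToyDoc` and `ToyRetokenizer` are hypothetical names, the doc is a plain list of words, and for simplicity `merge` and `split` live on the retokenizer object rather than on `Span` and `Token`. Edits are queued against the original indices and applied once, right to left, when the block exits.

```python
from contextlib import contextmanager


class ToyRetokenizer:
    """Hypothetical helper: collects edits, applies them all at once."""

    def __init__(self, doc):
        self.doc = doc
        self.edits = []  # (start, end, replacement_words), original indices

    def merge(self, start, end):
        # Queue a merge of words[start:end] into one space-joined word.
        self.edits.append((start, end, [" ".join(self.doc.words[start:end])]))

    def split(self, index, parts):
        # Queue a split of one word; the parts must spell the original word.
        if "".join(parts) != self.doc.words[index]:
            raise ValueError("parts must concatenate to the original token")
        self.edits.append((index, index + 1, list(parts)))

    def apply(self):
        ordered = sorted(self.edits, key=lambda edit: edit[0])
        # Reject overlapping edits before touching the doc.
        for (_, prev_end, _), (next_start, _, _) in zip(ordered, ordered[1:]):
            if next_start < prev_end:
                raise ValueError("overlapping retokenization edits")
        # Apply right to left so earlier indices stay valid.
        for start, end, replacement in reversed(ordered):
            self.doc.words[start:end] = replacement


class ToyDoc:
    """Hypothetical stand-in for a Doc: just a list of words."""

    def __init__(self, words):
        self.words = list(words)

    @contextmanager
    def retokenize(self):
        retokenizer = ToyRetokenizer(self)
        yield retokenizer
        # Only reached if the block finished without raising.
        retokenizer.apply()


doc = ToyDoc(["I", "live", "in", "New", "York", "nowadays"])
with doc.retokenize() as retokenizer:
    retokenizer.merge(3, 5)
    retokenizer.split(5, ["now", "a", "days"])
```

Because nothing changes until the block exits, indices inside the block always refer to the original tokenization, and the doc only has to be put back into a consistent state once.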

These changes are being left out of v2, because the v2 policy is to avoid breaking changes to the `Doc`, `Span` and `Token` objects (this makes it more predictable which parts of the application need to be updated).


