💫 Rethink Doc.merge and Span.merge, add Token.split #1487

@honnibal

See also: #1474, #975, #758, #653, #616, #450, #429, #329, #375, #314, #214, #213, #156

Currently Doc.merge() and Span.merge() promise too much. They let you merge a span while holding references to other spans, and the indices of those other spans are supposed to be magically recalculated.

Unsurprisingly, this has proven super difficult to get right. It also makes Doc.merge() inefficient for repeated calls, because after every merge we have to put the doc back into a consistent state.
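To make the index problem concrete, here is a toy sketch in plain Python. It does not use spaCy: the "doc" is just a list of strings, a "span" is a `(start, end)` pair, and the `merge` helper is hypothetical. It only illustrates why a span held across a merge goes stale unless something recalculates it.

```python
# Toy model (not spaCy): a doc is a list of token strings,
# and a span is a (start, end) pair of token indices.
tokens = ["I", "live", "in", "New", "York", "City", "with", "Mary", "Ann"]
place = (3, 6)   # "New York City"
person = (7, 9)  # "Mary Ann"


def merge(tokens, start, end):
    """Hypothetical helper: join tokens[start:end] into a single token."""
    return tokens[:start] + [" ".join(tokens[start:end])] + tokens[end:]


merged = merge(tokens, *place)

# The span we were still holding now points past the end of the doc.
stale = merged[person[0]:person[1]]

# Every span after the merged region has to be shifted left by the
# number of tokens that disappeared (here 3 tokens became 1).
shift = (place[1] - place[0]) - 1
fixed = (person[0] - shift, person[1] - shift)
fresh = merged[fixed[0]:fixed[1]]
```

With one merge the bookkeeping is a single shift. With repeated merges, every outstanding span has to be adjusted after every call, which is the cost described above.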

We'd also really like to have a token.split() function that divides a token into several tokens. This is a big hole at the moment, and it really lets us down for languages like Chinese, where the tokenization is an important part of the annotation and needs to be changeable as the doc moves through the pipeline.

I think we should consider having a Doc.retokenize() context manager, which you would need to activate before calling span.merge() or token.split(). This should allow us to make the retokenization more reliable and efficient. Within the block, we could keep a reference to all new Span and Token objects. Only Span and Token objects created during retokenization should be used during retokenization.
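A minimal sketch of what such a context manager could look like, again as a toy in plain Python rather than spaCy code. `ToyDoc`, `ToyRetokenizer`, and their methods are hypothetical names invented for illustration. The sketch also assumes one particular design: edits are queued against the original token indices and applied in a single pass when the block exits, which is only one way to get the reliability and efficiency mentioned above.

```python
from contextlib import contextmanager


class ToyRetokenizer:
    """Hypothetical helper that queues edits instead of applying them."""

    def __init__(self):
        self.merges = []   # (start, end) pairs, in original indices
        self.splits = {}   # original index -> list of replacement pieces

    def merge(self, start, end):
        self.merges.append((start, end))

    def split(self, index, pieces):
        self.splits[index] = list(pieces)

    def apply(self, words):
        # Validate everything up front so a bad edit leaves the doc untouched.
        prev_end = 0
        for start, end in sorted(self.merges):
            if not (prev_end <= start < end <= len(words)):
                raise ValueError(f"bad or overlapping merge: {(start, end)}")
            prev_end = end
        for index, pieces in self.splits.items():
            if "".join(pieces) != words[index]:
                raise ValueError(f"pieces do not spell token {index}")
            if any(s <= index < e for s, e in self.merges):
                raise ValueError(f"token {index} is both split and merged")

        # Single left-to-right pass that builds the new token list.
        merge_ends = dict(self.merges)
        out, i = [], 0
        while i < len(words):
            if i in merge_ends:
                out.append(" ".join(words[i:merge_ends[i]]))
                i = merge_ends[i]
            elif i in self.splits:
                out.extend(self.splits[i])
                i += 1
            else:
                out.append(words[i])
                i += 1
        return out


class ToyDoc:
    """Hypothetical stand-in for a Doc: just a list of token strings."""

    def __init__(self, words):
        self.words = list(words)

    @contextmanager
    def retokenize(self):
        retokenizer = ToyRetokenizer()
        yield retokenizer
        # Reached only if the block exits without an exception.
        self.words = retokenizer.apply(self.words)


doc = ToyDoc(["New", "York", "is", "bigger", "than", "LosAngeles"])
with doc.retokenize() as retokenizer:
    retokenizer.merge(0, 2)                       # "New" + "York"
    retokenizer.split(5, ["Los", "Angeles"])      # "LosAngeles"
```

Because nothing changes until the block exits, all indices inside the block refer to the same tokenization, and the doc is rebuilt once instead of once per call.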

These changes are being left out of v2, because the v2 policy is to avoid breaking changes to the Doc, Span and Token objects (this makes it more predictable which parts of the application need to be updated).

Labels: enhancement (Feature requests and improvements), feat / doc (Feature: Doc, Span and Token objects), help wanted (Contributions welcome!)
