💫 Rethink Doc.merge and Span.merge, add Token.split #1487
See also: #1474, #975, #758, #653, #616, #450, #429, #329, #375, #314, #214, #213, #156
Currently Doc.merge() and Span.merge() promise too much. They let you merge a span while holding references to other spans, and the span indices are supposed to be magically recalculated.
Unsurprisingly, this has proven very difficult to get right. It also makes Doc.merge() inefficient for repeated calls, because every time we merge something, we have to put the doc back into a correct state.
We'd also really like to have a token.split() function that divides a token into several tokens. This is a big hole at the moment, and it really lets us down for languages like Chinese, where the tokenization is an important part of the annotation and needs to be changed as the text moves through the pipeline.
I think we should consider having a Doc.retokenize() context manager, which you would need to activate before calling span.merge() or token.split(). This should allow us to make the retokenization more reliable and efficient. Within the block, we could keep a reference to all new Span and Token objects. Only Span and Token objects created during retokenization should be used during retokenization.
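To make the idea concrete, here is a rough sketch of how a context manager could queue merges and splits and apply them in one pass on exit. It is a toy in plain Python, not spaCy code: `ToyDoc`, `ToyRetokenizer`, and the index-based `merge(start, end)` / `split(index, parts)` signatures are all hypothetical names made up for illustration, and the real API may look quite different.

```python
from contextlib import contextmanager


class ToyDoc:
    """Hypothetical stand-in for Doc: just a list of token strings."""

    def __init__(self, words):
        self.words = list(words)

    @contextmanager
    def retokenize(self):
        retokenizer = ToyRetokenizer(self)
        yield retokenizer
        # Only reached if the block finished without raising.
        retokenizer.apply()


class ToyRetokenizer:
    """Hypothetical helper: queues edits in the block, applies them once on exit."""

    def __init__(self, doc):
        self.doc = doc
        self.edits = []  # (start, end, replacement_words), in original indices

    def merge(self, start, end):
        # Queue a merge of tokens [start, end) into a single token.
        merged = " ".join(self.doc.words[start:end])
        self.edits.append((start, end, [merged]))

    def split(self, index, parts):
        # Queue a split of one token into several; parts must spell the token.
        if "".join(parts) != self.doc.words[index]:
            raise ValueError("parts must concatenate to the original token")
        self.edits.append((index, index + 1, list(parts)))

    def apply(self):
        ordered = sorted(self.edits, key=lambda edit: edit[0])
        # Reject overlapping edits before touching the doc.
        for (_, prev_end, _), (next_start, _, _) in zip(ordered, ordered[1:]):
            if next_start < prev_end:
                raise ValueError("overlapping edits")
        # Apply right to left so earlier indices stay valid.
        for start, end, replacement in reversed(ordered):
            self.doc.words[start:end] = replacement


doc = ToyDoc(["I", "live", "in", "New", "York", "City", "nowadays"])
with doc.retokenize() as retokenizer:
    retokenizer.merge(3, 6)
    retokenizer.split(6, ["now", "a", "days"])
print(doc.words)
```

Because every edit is expressed against the original token indices and nothing is changed until the block exits, the doc only has to be brought into a consistent state once, no matter how many merges or splits were requested. The toy uses integer indices where the proposal talks about Span and Token objects created inside the block.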
These changes are being left out of v2, because the v2 policy is to avoid breaking changes to the Doc, Span and Token objects (this makes it more predictable which parts of the application need to be updated).