Slow entity merging on large texts #653

@ELind77

Description

First, I'd like to say that I'm really impressed with what you guys have been doing with spaCy, and the 1.x releases look great.

My issue is that when I do entity merging on a large document (novel length) it takes an unbelievably long time (as much as 10 minutes).

This takes about 12 seconds on four novels:

def tokenize(s):
    # `nlp` is a loaded spaCy model; parse=False skips the dependency parser
    return [tok.lemma_ for tok in nlp(s, parse=False) if not tok.is_space]

But this takes up to 50 minutes on the same inputs:

def tokenize_merge(s):
    doc = nlp(s, parse=False)
    # Collapse each named entity into a single token, in place
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, ent.label_)
    return [tok.lemma_ for tok in doc if not tok.is_space]

(It actually took 24 minutes on a system running 0.101 and 51 minutes on a comparable machine running 1.2.0, but I only ran it a couple of times on each machine, so that's not really conclusive.)

The second function is adapted from the old documentation and the sense2vec code.

My theory, based on the documentation saying that token indices are updated when a merge occurs, is that merge is O(n) in the length of the document. Rather than try to dive into the code, I thought I'd just ask: what is going on here?
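If that theory is right, the cost is easy to see with a toy model: if a Doc keeps its tokens in one flat array, collapsing a span means shifting every later token down. This sketch uses plain Python lists, not spaCy internals, just to show the shape of the operation; with k entities the total work would be O(k·n).

```python
# Toy illustration (NOT spaCy code): merging a span in a flat token
# array requires shifting every token after it, so each merge is O(n).
def merge_span(tokens, start, end, merged):
    """Replace tokens[start:end] with a single merged token."""
    # Slice assignment rewrites the tail of the list -- this copy
    # is the O(n) shift that would dominate on novel-length docs.
    tokens[start:end] = [merged]
    return tokens

merge_span(["New", "York", "is", "big"], 0, 2, "New York")
# -> ["New York", "is", "big"]
```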

Is there a better way to do this than what I'm doing? It looks like some of the new callback functionality might be able to perform the merging for me. And what if I can guarantee that the document will never be sliced into, only ever iterated over from the beginning, so it doesn't matter if the indices are stale?
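For what it's worth, one workaround along those lines is to skip merging entirely and build the output in a single pass, emitting each entity's text when its span begins. This is just a sketch against the attributes used above (doc.ents, ent.start, ent.end, ent.text, tok.lemma_, tok.is_space); tokenize_onepass is a name I made up, and it never mutates the Doc, so no index shifting happens at all.

```python
def tokenize_onepass(doc):
    # Map each entity's start index to its (end index, surface text).
    ent_spans = {ent.start: (ent.end, ent.text) for ent in doc.ents}
    out = []
    i = 0
    while i < len(doc):
        if i in ent_spans:
            end, text = ent_spans[i]
            out.append(text)  # emit the whole entity as one "token"
            i = end           # skip past the entity span
        else:
            tok = doc[i]
            if not tok.is_space:
                out.append(tok.lemma_)
            i += 1
    return out
```

This only works because I never slice back into earlier positions, which matches the guarantee described above.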

I love the ability to merge tokens, but unless there's a faster way to do it (I really hope I'm just doing it wrong), I don't think it's usable on input of the size I'm working with.

Details

  • Combined length of the input docs: ~829,000 tokens
  • Python 2.7.6
  • Ubuntu 14.04 (yes, I know it's time to upgrade, I'm working on it)

Side notes:

  • It's even worse when I merge noun phrases too (I assume because there are more of them), but I don't have timings because I didn't let it finish.
  • If I'm right and this is O(n), you could get a small constant-factor speedup by merging from the end of the document backwards.
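To spell out that last idea: merging back to front means each merge only shifts tokens after it, positions that no remaining (earlier) entity span refers to, so the stored start/end indices stay valid. A sketch of the same merge loop with the order reversed (whether this actually helps the constant factor would need profiling; tokenize_merge_reversed is my name for it):

```python
def tokenize_merge_reversed(doc):
    # Merge entities from the end of the document backwards, so each
    # merge only shifts indices that no still-pending entity uses.
    for ent in reversed(list(doc.ents)):
        ent.merge(ent.root.tag_, ent.text, ent.label_)
    return [tok.lemma_ for tok in doc if not tok.is_space]
```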
