Merging spans - discussion #156
Description
Hi!
I've been working on relation extraction using spaCy for the past week or so and the API has been very convenient and the results are great 😄 - lovely lib!
However, I have stumbled upon the problem of span merging, as pointed out in comments in the code: spans that were extracted earlier are invalidated by a merge. How are you thinking of approaching this problem?
I have solved it partially for my purposes, but it's a bit of an ugly hack:

    # iterating in reverse order doesn't invalidate the next spans we want to process,
    # since we are shrinking the doc starting from the end
    for sentence in list(reversed(list(doc.sents))):
        # same for entities
        for ent in reversed(sentence.ents):
            # collapse function basically merges and uses default stuff for labels
            ent.collapse()
    # get sentences again - with correct spans
    for sentence in doc.sents:
        # do stuff you wanted to do with collapsed entities here

I extended Span to have the same ents property that is defined on Doc, and added a collapse function which basically achieves the same as merging the span using reasonable labels. I know adding the ents property to a Span doesn't always make sense, but in the case where the span is a sentence or a noun phrase it does. A sentence seems to be quite a specific type of span: many properties of Doc make sense for a sentence span. But I digress.
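To see why the reverse order matters, here is a toy illustration that uses plain Python lists rather than spaCy objects. The `merge_spans` helper and the example spans are hypothetical and only mimic the index shifting that happens when a merge shrinks the doc; this is a sketch, not how spaCy implements merging.

```python
def merge_spans(tokens, spans, reverse=True):
    """Merge each (start, end) token-index span into a single token.

    `spans` holds indices computed against the ORIGINAL token list.
    Hypothetical helper for illustration only; not part of spaCy.
    """
    tokens = list(tokens)  # work on a copy
    for start, end in sorted(spans, reverse=reverse):
        # Replacing a slice shrinks the list, shifting everything after it.
        tokens[start:end] = [" ".join(tokens[start:end])]
    return tokens


tokens = "Barack Obama visited New York City yesterday".split()
spans = [(0, 2), (3, 6)]  # "Barack Obama", "New York City"

# Back to front: earlier spans are untouched by later merges.
back_to_front = merge_spans(tokens, spans, reverse=True)

# Front to back: the second span's indices are stale after the first merge.
front_to_back = merge_spans(tokens, spans, reverse=False)

print(back_to_front)
print(front_to_back)
```

Merging from the end only shifts positions that have already been processed, so the remaining indices stay valid. Merging from the front moves every later token, which is why the second span grabs the wrong words.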

    def collapse(self):
        # character offsets of the span within the doc
        start_idx = self[0].idx
        end_idx = self[-1].idx + len(self[-1])
        lemma = u' '.join(word.lemma_ for word in self)
        ent_type = max([word.ent_type_ for word in self])
        merged = self.doc.merge(start_idx, end_idx, self.root.tag_, lemma, ent_type)
        return merged

I thought of changing the merge function so that it doesn't shrink the doc, and instead replaces the tokens that have been merged with a special placeholder in the array. Then `__iter__` would iterate through the array as normal and only yield an object if it is not that placeholder. However, this isn't compatible with getting items using an index. Any ideas?
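The placeholder idea and its indexing problem can be sketched with a small stand-in class. Everything here (`PlaceholderDoc`, `PLACEHOLDER`, the `merge` signature) is hypothetical and built on plain strings, not spaCy's internals; it is only meant to make the trade-off concrete.

```python
PLACEHOLDER = object()  # hypothetical sentinel for slots consumed by a merge


class PlaceholderDoc:
    """Toy container whose merge() never changes the length of the array."""

    def __init__(self, words):
        self._slots = list(words)

    def merge(self, start, end):
        """Merge slots [start, end) into slot `start` and blank out the rest."""
        parts = [w for w in self._slots[start:end] if w is not PLACEHOLDER]
        merged = " ".join(parts)
        self._slots[start] = merged
        for i in range(start + 1, end):
            self._slots[i] = PLACEHOLDER
        return merged

    def __iter__(self):
        # Iteration hides the placeholders.
        return (w for w in self._slots if w is not PLACEHOLDER)

    def __getitem__(self, i):
        # Raw positional access still sees them.
        return self._slots[i]


doc = PlaceholderDoc("I love New York City".split())
doc.merge(2, 5)

visible = list(doc)      # what iteration yields
raw_third = doc[3]       # what indexing yields at an original position
```

Original offsets stay valid after the merge, which is the attraction of the approach. The cost is that iteration sees three items while indexing still addresses five slots, so `doc[i]` and `list(doc)[i]` no longer agree.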