spaCy v2.0.0 Span.merge() method has bugs #1474
Description
[Update 1] You can now copy the code snippet below as-is into a Jupyter notebook and test it.
[Update 2] This indeed seems to be a bug in v2. I do not face this problem in v1.9.0. In fact, NER in v1.9.0 already recognizes all the dates properly.
Original message
For my text corpus, the spaCy NER tagger consistently mislabels DATE entities. So I have written a simple script that checks whether a CARDINAL is immediately followed by a DATE and, if so, merges the two spans and applies the DATE label to the result. Consider the following script.
```python
import spacy
from spacy import displacy

nlp = spacy.load('en')
txt = 'In a letter dated 17 April 2007 Mr. X wrote to Mr. Y about the impending merger of \
Big Canyon Inc. and Little Mushroom Co Ltd. questioning two stock sell-offs.\
\n1. 01 September 2006: US$ 10000.00 worth of stocks.\
\n2. 03 October 2006: US$ 50000.00 worth of stocks.'
doc = nlp(txt)

lastCardinal = None
for e in doc.ents:
    if e.label_ == 'CARDINAL':
        lastCardinal = e
    if e.label_ == 'DATE' and lastCardinal:
        # The script does not produce the intended output if
        # this print statement is deleted or commented out.
        print("{} CARDINAL[{}, {}], {} DATE[{}, {}]".format(lastCardinal.text,
                                                            lastCardinal.start,
                                                            lastCardinal.end,
                                                            e.text, e.start, e.end))
        if lastCardinal.end == e.start:
            # A DATE span immediately follows a CARDINAL span.
            # So merge them, and apply DATE label to the result.
            doc[lastCardinal.start:e.end].merge(label=e.label)
        lastCardinal = None

displacy.render(doc, style='ent', jupyter=True)
```
Now, if I comment out the print statement, the script has almost no effect. I see the following result.
If I execute the script as is, i.e. with the print statement, I see the intended effect of my script.
Note, however, that in both cases the first date, 17 April 2007, was affected by the script (as seen from the output of the print statement). The other two dates seem unaffected in the first case.
AFAIK, there is no multi-threading at this level, which is usually a source of such unexpected behaviors. Is this a bug? How can I provide more detail to help trace it?
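One way to narrow this down might be to separate detection from merging, so that the document is not modified while `doc.ents` is still being iterated. The sketch below is only the detection half, written in plain Python so it can be checked without a model. The helper name `find_cardinal_date_pairs` and the `(label, start, end)` tuple layout are my own assumptions for illustration, not part of spaCy, and the token offsets in the example are made up.

```python
from typing import List, Optional, Tuple

# Hypothetical snapshot of one entity: (label, start_token, end_token),
# mirroring e.label_, e.start and e.end from the script above.
Entity = Tuple[str, int, int]


def find_cardinal_date_pairs(entities: List[Entity]) -> List[Tuple[int, int]]:
    """Return (start, end) token ranges where a DATE directly follows a CARDINAL."""
    pairs = []
    last_cardinal: Optional[Entity] = None
    for ent in entities:
        label, start, end = ent
        if label == 'CARDINAL':
            last_cardinal = ent
        elif label == 'DATE':
            # Adjacent only if the CARDINAL ends exactly where the DATE starts.
            if last_cardinal is not None and last_cardinal[2] == start:
                pairs.append((last_cardinal[1], end))
            last_cardinal = None
    return pairs


# Made-up offsets: two CARDINAL+DATE pairs with an unrelated entity in between.
example = [
    ('CARDINAL', 4, 5), ('DATE', 5, 7),
    ('ORG', 16, 19),
    ('CARDINAL', 30, 31), ('DATE', 31, 33),
]
print(find_cardinal_date_pairs(example))  # [(4, 7), (30, 33)]
```

The idea would be to build the input once from `doc.ents` as `(e.label_, e.start, e.end)` tuples, and only then call `doc[start:end].merge(...)` for each returned range, going from the last range to the first so that earlier token offsets are not shifted by a previous merge. If the result then no longer depends on the print statement, that would point to the merge-during-iteration pattern rather than the NER output itself, though I have not confirmed this.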
Thanks in advance.
My Environment
- Python version: 3.5.4
- Platform: Linux-4.10.0-38-generic-x86_64-with-debian-stretch-sid
- spaCy version: 2.0.0a17
- Models: en

