spaCy v2.0.0 Span.merge() method has bugs #1474

@chaturv3di

[Update 1] You can now copy the code snippet below as-is into a Jupyter notebook and test it.
[Update 2] This indeed seems to be a bug in v2. I do not face this problem in v1.9.0. In fact, NER in v1.9.0 already recognizes all the dates properly.

Original message

For my text corpus, the spaCy NER tagger consistently mislabels DATE entities. So I have written a simple script that checks whether a CARDINAL is immediately followed by a DATE and, if so, merges the two spans and applies the DATE label to the result. Consider the following script.

```python
import spacy
from spacy import displacy

nlp = spacy.load('en')
txt = 'In a letter dated 17 April 2007 Mr. X wrote to Mr. Y about the impending merger of \
      Big Canyon Inc. and Little Mushroom Co Ltd. questioning two stock sell-offs.\
      \n1. 01 September 2006: US$ 10000.00 worth of stocks.\
      \n2. 03 October 2006: US$ 50000.00 worth of stocks.'

doc = nlp(txt)
lastCardinal = None
for e in doc.ents:
    if e.label_ == 'CARDINAL':
        # Remember the most recent CARDINAL entity.
        lastCardinal = e
    if e.label_ == 'DATE' and lastCardinal:
        # The script does not produce the intended output if
        # this print statement is deleted or commented out.
        print("{} CARDINAL[{}, {}], {} DATE[{}, {}]".format(lastCardinal.text,
                                                            lastCardinal.start,
                                                            lastCardinal.end,
                                                            e.text, e.start, e.end))
        if lastCardinal.end == e.start:
            # A DATE span immediately follows a CARDINAL span.
            # So merge them, and apply DATE label to the result.
            doc[lastCardinal.start:e.end].merge(label=e.label)
        lastCardinal = None

displacy.render(doc, style='ent', jupyter=True)
```

Now, if I comment out the print statement, the script has almost no effect. I see the following result.

[Screenshot: screenshot-2017-10-31 str_ner-labeling]

If I execute the script as is, i.e. with the print statement in place, I see the intended effect of my script.

[Screenshot: screenshot-2017-10-31 str_ner-labeling 1]

Note, however, that the first date, 17 April 2007, was affected by the script in both cases (as seen from the output of the print statement). The other two dates, by contrast, seem unaffected in the first case.

AFAIK, there is no multi-threading at this level, which is the usual source of such unexpected behavior. Is this a bug? How can I provide more detail to help trace it?
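To help narrow this down, the adjacency check can be separated from the merge itself. The sketch below is not spaCy code and does not identify the cause. It only reproduces the CARDINAL/DATE pairing logic of the script on plain `(label, start, end)` tuples, so the ranges it reports can be compared with what actually gets merged. The helper name `find_cardinal_date_ranges` and the token offsets in the example are hypothetical, chosen for illustration.

```python
from typing import List, Optional, Tuple

# (label, start, end) in token offsets; `end` is exclusive, as for spaCy spans.
Entity = Tuple[str, int, int]


def find_cardinal_date_ranges(ents: List[Entity]) -> List[Tuple[int, int]]:
    """Return (start, end) ranges where a DATE directly follows a CARDINAL.

    Hypothetical helper: it mirrors the loop in the script above but only
    reads the entity list and never modifies anything.
    """
    ranges = []
    last_cardinal: Optional[Tuple[int, int]] = None
    for label, start, end in ents:
        if label == 'CARDINAL':
            last_cardinal = (start, end)
        if label == 'DATE' and last_cardinal is not None:
            if last_cardinal[1] == start:
                # The DATE starts exactly where the CARDINAL ends.
                ranges.append((last_cardinal[0], end))
            last_cardinal = None
    return ranges


# Made-up offsets, not taken from the real document.
example = [
    ('CARDINAL', 4, 5), ('DATE', 5, 7),
    ('PERSON', 8, 9),
    ('CARDINAL', 20, 21), ('DATE', 21, 23),
]
print(find_cardinal_date_ranges(example))
```

If the ranges are computed once, before any merge, they could then be merged from last to first, so that the offsets of the earlier ranges are still valid when their turn comes. Whether that changes the behavior described here is untested.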

Thanks in advance.

My Environment

  • Python version: 3.5.4
  • Platform: Linux-4.10.0-38-generic-x86_64-with-debian-stretch-sid
  • spaCy version: 2.0.0a17
  • Models: en

Labels: bug (Bugs and behaviour differing from documentation), feat / doc (Feature: Doc, Span and Token objects)