Merge hangs #214

@sdenning

Great job on spaCy; it is really impressive!

I'm POS-tagging news articles and merging each multi-word named entity into a single token. Many thousands of the articles process fine, but merge hangs on the text below.

from __future__ import unicode_literals

import spacy.en
from spacy.en import English

nlp = English()
doc = nlp('text', tag=True, parse=True)

text = '"But there is no telling what the hurricane damage will be," Anderson said, in reference to Hurricane Fran which swept through North and South Carolina last week. Some industry sources said the hurricane may have brought beneficial rains while others predicted losses and damage to open bolls. "Fran changes the picture considerably but it is too soon (to determine losses)," said Jarral Neeper, an analyst with Calcot Ltd. He pegged the estimate at 18.717 million bales. Smith Barney analyst David Brandon compared the effect of Hurricane Fran to that of Hurricane Hugo in 1989, which he said buoyed yields in North and South Carolina.'

doc = nlp(text, tag=True)

for ent in doc.ents:
    if len(ent.orth_.split()) > 1:
        start = text.index(ent.orth_)
        end = start + len(ent.orth_)
        print ent.orth_ + ' start: ' + str(start) + ' ' + 'end: ' + str(end) + ' ' + 'entity: ' + ent.label_
        doc.merge(start, end, '', '', ent.label_)
        for token in doc:
            print token.orth_

Here's the output I get:

Hurricane Fran start: 92 end: 106 entity: EVENT
South Carolina start: 137 end: 151 entity: GPE
last week start: 152 end: 161 entity: DATE
Jarral Neeper start: 381 end: 394 entity: PERSON
Calcot Ltd. start: 412 end: 423 entity: ORG
18.717 million bales start: 450 end: 470 entity: QUANTITY
Smith Barney start: 472 end: 484 entity: ORG
David Brandon start: 493 end: 506 entity: PERSON
Hurricane Fran start: 92 end: 106 entity: EVENT
Hurricane Hugo start: 556 end: 570 entity: EVENT
North and South Carolina start: 127 end: 151 entity: GPE
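
One detail in the output above: "Hurricane Fran" is reported at start 92 both times, because text.index() always returns the first match. Whether this is related to the hang is not established here. As a minimal sketch that does not use spaCy at all, the snippet below shows how every occurrence of a phrase could be located instead; the helper name find_all and the short sample string are hypothetical and not part of the original script.

```python
import re


def find_all(text, phrase):
    """Return (start, end) character offsets for every occurrence of phrase."""
    # re.escape keeps characters such as '.' in "Calcot Ltd." literal.
    return [(m.start(), m.end()) for m in re.finditer(re.escape(phrase), text)]


# Hypothetical sample, much shorter than the article text in the report.
sample = "Hurricane Fran hit. He compared Hurricane Fran to Hurricane Hugo."

# str.index only ever sees the first occurrence.
first = sample.index("Hurricane Fran")

# finditer yields one match per occurrence, in order of appearance.
spans = find_all(sample, "Hurricane Fran")
```

With offsets collected this way, the second mention gets its own start and end rather than repeating the first one.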
