Doc Example with bad html tags: Html tags are not marked as stop words #2657
Closed
Labels
bug (Bugs and behaviour differing from documentation), docs (Documentation and website)
Description
The sample below is taken from the docs at https://spacy.io/usage/linguistic-features, with small changes to demonstrate the error:
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

# we're using a class because the component needs to be initialised with
# the shared vocab via the nlp object
class BadHTMLMerger(object):
    def __init__(self, nlp):
        # register a new token extension to flag bad HTML
        Token.set_extension('bad_html', default=False, force=True)
        self.matcher = Matcher(nlp.vocab)
        self.matcher.add('BAD_HTML', None,
                         [{'ORTH': '<'}, {'LOWER': 'br'}, {'ORTH': '>'}],
                         [{'ORTH': '<'}, {'LOWER': 'br/'}, {'ORTH': '>'}])

    def __call__(self, doc):
        # this method is invoked when the component is called on a Doc
        matches = self.matcher(doc)
        spans = []  # collect the matched spans here
        for match_id, start, end in matches:
            spans.append(doc[start:end])
        for span in spans:
            span.merge()  # merge
            for token in span:
                token._.bad_html = True  # mark token as bad HTML
                doc.vocab[token.text].is_stop = True  # mark lexeme as stop word
        return doc

nlp = spacy.load('en_core_web_sm')
html_merger = BadHTMLMerger(nlp)
nlp.add_pipe(html_merger, last=True)  # add component to the pipeline
doc = nlp(u"Hello<br>world! <br/> This <br> is a test.")
for token in doc:
    print(token.text, token._.bad_html, token.is_stop)
Here the first <br> is not marked as a stop word, and token.is_stop is not writable. How can I mark all <br> tags as stop words?
Environment: spaCy 2.0.11, Python 3.6.6 :: Anaconda custom (64-bit), Windows 10 64-bit
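For context on why the sample sets the flag through doc.vocab rather than on the token: in spaCy, is_stop is a lexical attribute, stored on the vocabulary entry (the lexeme) and only read through the token. The sketch below is a toy model of that relationship in plain Python, not spaCy's implementation. ToyLexeme, ToyVocab and ToyToken are hypothetical names invented for illustration, and the sketch does not attempt to explain why the first <br> is missed in the report above.

```python
class ToyLexeme:
    """Hypothetical vocabulary entry: holds flags shared by all tokens with this text."""

    def __init__(self, text):
        self.text = text
        self.is_stop = False  # writable on the entry


class ToyVocab:
    """Hypothetical shared vocabulary: at most one ToyLexeme per distinct text."""

    def __init__(self):
        self._entries = {}

    def __getitem__(self, text):
        # Create the entry on first lookup, then always return the same object.
        if text not in self._entries:
            self._entries[text] = ToyLexeme(text)
        return self._entries[text]


class ToyToken:
    """Hypothetical token: a view onto its vocabulary entry."""

    def __init__(self, vocab, text):
        self.text = text
        self._lexeme = vocab[text]

    @property
    def is_stop(self):
        # Read-only here: the value comes from the shared entry.
        return self._lexeme.is_stop


vocab = ToyVocab()
tokens = [ToyToken(vocab, t) for t in ["Hello", "<br>", "world", "<br>"]]

# Setting the flag once on the entry affects every token with that exact text.
vocab["<br>"].is_stop = True
flags = [(t.text, t.is_stop) for t in tokens]
print(flags)

# Assigning through the token fails, because the property has no setter.
try:
    tokens[0].is_stop = True
    assignment_failed = False
except AttributeError:
    assignment_failed = True
print(assignment_failed)
```

In this toy model the flag reaches a token only if the token holds the same entry object that was flagged, so a token whose text differs from the flagged key, or whose entry is not the shared one, stays unmarked.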