Some css/html remains after running plain_text() #115

@feacluster

I am using html2text to reduce the output of plain_text() to actual plain text. It works well, but wiki table markup and its inline CSS, as in the example below, are not removed. I am not sure whether this is a problem with wikitextparser, a problem with html2text, or user error. A minimal reproducible example, including my workaround hack, follows:

[feacluster@micro wikipedia]$ cat test.py
text = """

==Comparison of green, teal, blue and ultramarine ==
{| class="wikitable sortable" style="width:100%"
|-
!Name
!width=100|Color
!HEX Code
!Red
!Green
!Blue
!Hue
!Sat
!Lum
|[[Ultramarine]] (Electric Ultramarine)
|style = "background-color: #3f00ff; color: #ffffff"|
|#3F00FF
|63
|0
|255
|255°
|100%
|100%
|}

"""

import wikitextparser as wtp
from html2text import html2text as htt
import re

text = wtp.parse(text).plain_text()  # strip wiki markup
text = htt(text)  # convert any leftover HTML to text

print(text)

text = re.sub(r'\{[^}]*\}', '', text)  # workaround: erase everything in curly braces

print(text)

[feacluster@micro wikipedia]$ python3 test.py
==Comparison of green, teal, blue and ultramarine == {| class="wikitable
sortable" style="width:100%" |- !Name !width=100|Color !HEX Code !Red !Green
!Blue !Hue !Sat !Lum |Ultramarine (Electric Ultramarine) |style = "background-
color: #3f00ff; color: #ffffff"| |#3F00FF |63 |0 |255 |255° |100% |100% |}


==Comparison of green, teal, blue and ultramarine ==


[feacluster@micro wikipedia]$
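
A narrower variant of the workaround might be to remove only the `{| ... |}` table blocks from the raw wikitext before parsing, instead of erasing everything between curly braces afterwards. The following is a minimal sketch using only the standard `re` module. It is not part of wikitextparser or html2text, the helper name `strip_wiki_tables` is hypothetical, and it assumes tables are not nested and that `{|` and `|}` each start a line.

```python
import re

# Matches a wikitext table from "{|" at the start of a line up to the first
# "|}" at the start of a line, plus the trailing newline if there is one.
# Non-greedy, so nested tables are NOT handled (assumption of this sketch).
TABLE_RE = re.compile(r'^\{\|.*?^\|\}[ \t]*\n?', re.DOTALL | re.MULTILINE)


def strip_wiki_tables(wikitext: str) -> str:
    """Hypothetical helper: drop top-level {| ... |} table blocks."""
    return TABLE_RE.sub('', wikitext)


sample = (
    "==Heading==\n"
    '{| class="wikitable"\n'
    "|-\n"
    "!Name\n"
    '|style = "background-color: #3f00ff"|\n'
    "|#3F00FF\n"
    "|}\n"
    "After the table.\n"
)

cleaned = strip_wiki_tables(sample)
print(cleaned)
```

The result could then be passed to `wtp.parse(...).plain_text()` as in test.py. Unlike the catch-all regex, this leaves other curly-brace content alone, so templates such as `{{...}}` would still be handled by the parser.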
