Remove a tag using BeautifulSoup but keep its contents

Question

Currently I have code that does something like this:

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        tag.extract()
soup.renderContents()

Except I don't want to throw away the contents inside the invalid tag. How do I get rid of the tag but keep the contents inside when calling soup.renderContents()?

slacy · Accepted Answer · 2011-12-09 20:15:06Z

86

Current versions of the BeautifulSoup library have an undocumented method on Tag objects called replaceWithChildren(). So, you could do something like this:

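from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
soup = BeautifulSoup(html)
for tag in invalid_tags:
    for match in soup.findAll(tag):
        # Replace the tag with its children, keeping the contents in place.
        match.replaceWithChildren()
print soup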

It behaves the way you want and the code is fairly straightforward, although it makes one pass through the DOM per invalid tag name; this could easily be optimized.
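For readers on BeautifulSoup 4, where replaceWithChildren() is called unwrap(), the same idea can be sketched in a single pass. This is only a sketch under assumptions not in the answer above: the bs4 package, Python 3, and the built-in "html.parser" (chosen here because it does not wrap the fragment in html/body tags).

```python
from bs4 import BeautifulSoup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ["b", "i", "u"]

# "html.parser" is an assumption; other parsers may add <html>/<body> wrappers.
soup = BeautifulSoup(html, "html.parser")

# find_all() accepts a list of names, so one pass over the tree is enough.
for match in soup.find_all(invalid_tags):
    # unwrap() removes the tag but leaves its children where the tag was.
    match.unwrap()

result = str(soup)
print(result)
```

find_all() returns its matches as a list built up front, so unwrapping tags inside the loop does not disturb the iteration.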

edited Dec 9, 2011 at 20:15

answered Dec 9, 2011 at 0:47

3 Comments

Jared Over a year ago

This is awesome! Any idea on how I'd be able to add a space? I tried concatenating a ' ' after match before .replaceWithChildren(), but I can't figure it out. Thanks!

Steven Potter Over a year ago

I like the simplicity. Just a note, the replaceWithChildren() method has been replaced with unwrap() in BS4

user94154 Over a year ago

Is there a way to do this by specifying only valid tags?

Jesse Dhillon · Accepted Answer · 2012-07-12 23:48:53Z

64

The strategy I used is to replace each invalid tag with its contents: children that are already NavigableString objects are kept as they are, and any other child is processed recursively until only strings remain. Try this:

from BeautifulSoup import BeautifulSoup, NavigableString

def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(html)

    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            s = ""

            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)

            tag.replaceWith(s)

    return soup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
print strip_tags(html, invalid_tags)

The result is:

<p>Good, bad, and ugly</p>

I gave this same answer on another question. It seems to come up a lot.

edited Jul 12, 2012 at 23:48

answered Jul 12, 2010 at 3:25

2 Comments

Jesse Dhillon Over a year ago

There was a bug here, introduced by an edit made by another user. You have to pass unicode strings on each call.

duhaime Over a year ago

maximum recursion depth exceeded :/

corford · Accepted Answer · 2014-03-11 00:39:46Z

21

Although this has already been mentioned by other people in the comments, I thought I'd post a full answer showing how to do it with Mozilla's Bleach. Personally, I think this is a lot nicer than using BeautifulSoup for this.

import bleach
html = "<b>Bad</b> <strong>Ugly</strong> <script>Evil()</script>"
clean = bleach.clean(html, tags=[], strip=True)
print clean # Should print: "Bad Ugly Evil()"

edited Mar 11, 2014 at 0:39

answered Oct 20, 2012 at 15:22

6 Comments

Jared Over a year ago

Can you have it remove tags selectively?

corford Over a year ago

You can pass a whitelist of tags (as a list, tuple or other iterable) that you deem acceptable and bleach will remove/escape everything else (which is a lot safer than the inverse, specifying a blacklist). See here for more info: bleach.readthedocs.org/en/latest/clean.html#tag-whitelist

Jared Over a year ago

Awesome! I missed this comment and have been stressing over this for a few days, hah!

Jared Over a year ago

Sorry to keep coming back to you on this, but how do I set a whitelist? I have the tags PRESOL, DATE, etc and tried this code: attrs = {'PRESOL':'DATE'} clean = bleach.clean(s2, attributes = attrs, strip=True) to no avail.

corford Over a year ago

Hi Jared. I think you might be getting mixed up with tags and attributes.

Etienne · Accepted Answer · 2012-07-11 14:35:43Z

11

I have a simpler solution but I don't know if there's a drawback to it.

UPDATE: there's a drawback, see Jesse Dhillon's comment. Another solution would be to use Mozilla's Bleach instead of BeautifulSoup.

from BeautifulSoup import BeautifulSoup

VALID_TAGS = ['div', 'p']

value = '<div><p>Hello <b>there</b> my friend!</p></div>'

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        tag.replaceWith(tag.renderContents())

print soup.renderContents()

This will also print <div><p>Hello there my friend!</p></div> as desired.
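For BeautifulSoup 4, a whitelist version that also copes with invalid tags nested inside other invalid tags might look like the sketch below. The assumptions are mine rather than the answer's: the bs4 package, Python 3, the built-in "html.parser", and the helper name keep_only, which is purely illustrative.

```python
from bs4 import BeautifulSoup

VALID_TAGS = {"div", "p"}


def keep_only(html, valid_tags):
    """Unwrap every tag whose name is not in valid_tags (hypothetical helper)."""
    # "html.parser" is an assumption; it leaves fragments unwrapped.
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) matches every tag, at any depth, in document order.
    for tag in soup.find_all(True):
        if tag.name not in valid_tags:
            # Remove the tag itself; its children stay in the tree.
            tag.unwrap()
    return str(soup)


flat = keep_only("<div><p>Hello <b>there</b> my friend!</p></div>", VALID_TAGS)
nested = keep_only("<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>", {"p"})
print(flat)
print(nested)
```

Because unwrap() keeps the children attached to the tree, nested invalid tags are still reachable and are handled later in the same loop.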

edited Jul 11, 2012 at 14:35

answered Nov 20, 2009 at 3:43

5 Comments

Pavel Vlasov Over a year ago

That code still needs work. It leaves the <p> untouched in case VALID_TAGS = 'b'

Etienne Over a year ago

I fixed the code; VALID_TAGS wasn't a list but it should have been.

Jesse Dhillon Over a year ago

This was my first attempt. It does not work if invalid tags are nested within other tags; you are not iterating the children of the tree, so your example only works for trees where depth == 1. Try your code with the example in my answer above.

Etienne Over a year ago

@JesseDhillon Looks like you're totally right! Your answer looks like the right one but, unfortunately, when I try it with your html, I get the same error as xralf (I'm using version 3.0.8.1). slacy's solution works for me, but the drawback is that it's not possible to specify only the valid tags (and maybe the speed).

Jesse Dhillon Over a year ago

@Etienne -- I fixed it. Another user had made an edit to the code which caused a bug.

jimmy · Accepted Answer · 2013-12-23 06:08:05Z

8

You can use soup.text.

.text removes all tags and concatenates all of the text.
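A small illustration of what that means in practice, assuming BeautifulSoup 4 (bs4), Python 3, and the built-in "html.parser". Note that every tag is dropped, including ones you might have wanted to keep, such as the p here.

```python
from bs4 import BeautifulSoup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
soup = BeautifulSoup(html, "html.parser")

# get_text() joins every text node in the document; .text is the same thing
# exposed as a property.
text = soup.get_text()
print(text)
```

This suits cases where only plain text is needed; it is not a way to remove some tags while keeping others.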

answered Dec 23, 2013 at 6:08

Alex Martelli · Accepted Answer · 2009-11-19 19:53:01Z

7

You'll presumably have to move tag's children to be children of tag's parent before you remove the tag -- is that what you mean?

If so, then, while inserting the contents in the right place is tricky, something like this should work:

from BeautifulSoup import BeautifulSoup

VALID_TAGS = 'div', 'p'

value = '<div><p>Hello <b>there</b> my friend!</p></div>'

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
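        # Find the position of the tag among its parent's children.
        for i, x in enumerate(tag.parent.contents):
            if x == tag: break
        else:
            print "Can't find", tag, "in", tag.parent
            continue
        # Insert the children in reverse so they end up in their original order.
        for r in reversed(tag.contents):
            tag.parent.insert(i, r)
        tag.extract()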
print soup.renderContents()

With the example value, this prints <div><p>Hello there my friend!</p></div> as desired.

edited Nov 19, 2009 at 19:53

answered Nov 19, 2009 at 19:42

2 Comments

Jason Christa Over a year ago

I still want value = "Hello <div>there</div> my friend!" to be valid.

Alex Martelli Over a year ago

@Jason, apart from needing an outermost tag, the string you give is perfectly valid and comes out unchanged from the code I give, so I have absolutely no idea what your comment is about!

Bishwas Mishra · Accepted Answer · 2016-12-26 09:11:30Z

3

Use unwrap().

unwrap() removes a single occurrence of the tag (the one it is called on) and keeps its contents in place.

Example:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('Hi. This is a <nobr> nobr </nobr>')
>>> soup
<html><body><p>Hi. This is a <nobr> nobr </nobr></p></body></html>
>>> soup.nobr.unwrap()
<nobr></nobr>
>>> soup
<html><body><p>Hi. This is a  nobr </p></body></html>

answered Dec 26, 2016 at 9:11

Olof Sjöbergh · Accepted Answer · 2013-04-22 10:04:54Z

None of the proposed answers seemed to work with BeautifulSoup for me. Here's a version that works with BeautifulSoup 3.2.1, and also inserts a space when joining content from different tags instead of concatenating words.

import re
from BeautifulSoup import BeautifulSoup

def strip_tags(html, whitelist=()):
    """
    Strip all HTML tags except for a list of whitelisted tags.
    """
    soup = BeautifulSoup(html)

    for tag in soup.findAll(True):
        if tag.name not in whitelist:
            tag.append(' ')
            tag.replaceWithChildren()

    result = unicode(soup)

    # Clean up any repeated spaces and spaces like this: '<a>test </a> '
    result = re.sub(' +', ' ', result)
    result = re.sub(r' (<[^>]*> )', r'\1', result)
    return result.strip()

Example:

strip_tags('<h2><a><span>test</span></a> testing</h2><p>again</p>', ['a'])
# result: u'<a>test</a> testing again'
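For BeautifulSoup 4, the same approach might be ported roughly as follows. This is a sketch rather than the author's code: it assumes the bs4 package, Python 3, and the built-in "html.parser", and it swaps replaceWithChildren() for its bs4 name, unwrap().

```python
import re

from bs4 import BeautifulSoup


def strip_tags(html, whitelist=()):
    """Strip all HTML tags except those named in whitelist."""
    # "html.parser" is an assumption; it does not add <html>/<body> wrappers.
    soup = BeautifulSoup(html, "html.parser")

    for tag in soup.find_all(True):
        if tag.name not in whitelist:
            # Add a trailing space so adjacent words are not glued together,
            # then remove the tag while keeping its children.
            tag.append(" ")
            tag.unwrap()

    result = str(soup)

    # Collapse repeated spaces and fix spaces like this: '<a>test </a> '
    result = re.sub(r" +", " ", result)
    result = re.sub(r" (<[^>]*> )", r"\1", result)
    return result.strip()


cleaned = strip_tags("<h2><a><span>test</span></a> testing</h2><p>again</p>", ["a"])
print(cleaned)
```

The whitespace clean-up is the same pair of regular expressions as in the BeautifulSoup 3 version above.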

robus gauli · Accepted Answer · 2016-09-25 17:13:35Z

2

Here is a simpler solution, without hassle or boilerplate code, for filtering out tags while keeping their content. Let's say you want to remove any child tags within a parent tag and keep only the contents/text; then you can simply do:

for p_tags in div_tags.find_all("p"):
    print(p_tags.get_text())

That's it: you no longer have to worry about br, i, or b tags inside the parent tags, and you get the clean text.

answered Sep 25, 2016 at 17:13

Dom DaFonte · Accepted Answer · 2019-06-01 14:04:25Z

2

Here is a python 3 friendly version of this function:

from bs4 import BeautifulSoup, NavigableString
invalidTags = ['br','b','font']
def stripTags(html, invalid_tags):
    soup = BeautifulSoup(html, "lxml")
    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            s = ""
            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    c = stripTags(str(c), invalid_tags)
                s += str(c)
            tag.replaceWith(s)
    return soup

answered Jun 1, 2019 at 14:04

Tommz · Accepted Answer · 2015-03-12 01:51:11Z

1

This is an old question, but it's worth mentioning some better ways to do it. First of all, BeautifulSoup 3 is no longer being developed, so you should use BeautifulSoup 4 (bs4) instead.

Also, lxml has just the function you need: the Cleaner class has a remove_tags attribute, which you can set to the tags that should be removed while their content is pulled up into the parent tag.

answered Mar 12, 2015 at 1:51

Thom Ives · Accepted Answer · 2023-08-24 21:43:32Z

What Worked For Me On Python 3.10 With BS4 And Unwrap

I initially liked Jesse Dhillon's answer a lot. However, I kept running into issues with the recursive calls due to recalling of the parser in BS4. I tried to change the level of recursion, but I kept running into problems with that too.

Then I looked into applying Bishwas Mishra's answer. Due to changes in BS4, I had to modify his code a bit, and I finally was able to develop a piece of code that would remove tags and maintain content.

I hope this helps some others.

from bs4 import BeautifulSoup


html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"

soup = BeautifulSoup(html, "html5lib")

for c in ["html", "head", "body", "b", "i", "u"]:
    while soup.find(c):
        soup.find(c).unwrap()

print(soup)

NOTE: It is necessary to add "html", "head", and "body" to the invalid tags list, because BS4 will add those into your html text if they were not originally there, and I did not want them for my specific case.

The output I got from the above code was ...

<p>Good, bad, and ugly</p>
