Why does BeautifulSoup add <html><body><p> to my results?

The problem

I have the following Page01.htm

<!DOCTYPE html><html lang="it-IT"><head>    <meta charset="utf-8">    <meta http-equiv="X-UA-Compatible" content="IE=Edge">    <head><title>Title here</title></head>
<body>
<script id="TargetID"><![CDATA[
{ "name":"Kate", "age":22, "city":"Boston"}
]]>
</script>
</body></html>

and I want to extract the information from the JSON between the script tags with id=TargetID.

What I’ve done

I wrote the following Python 3.6 code:

from bs4 import BeautifulSoup
import codecs

page_path="/Users/me/Page01.htm"

page = codecs.open(page_path, "r", "utf-8")

soup = BeautifulSoup(page.read(), "lxml")
vegas = soup.find_all(id="TargetID")

invalid_tags = ['script']
soup = BeautifulSoup(str(vegas),"lxml")
for tag in invalid_tags: 
    for match in soup.findAll(tag):
        match.replaceWithChildren()

JsonZ = str(soup)

Now, if I look inside the vegas variable, I can see

[<script id="TargetID"><![CDATA[
{ "name":"Kate", "age":22, "city":"Boston"}
]]>
</script>]

but if I try to remove the script tags (using the script from this answer), I get the following JsonZ variable:

'<html><body><p>[&lt;![CDATA[\n{ "name":"Kate", "age":22, "city":"Boston"}\n]]&gt;\n]</p></body></html>'

which has no script tags but has three other tags (<html><body><p>) that are of no use to me.
My goal is to get the string { "name":"Kate", "age":22, "city":"Boston"} so that I can load it with Python's json module.

Solution:

With the lxml parser, BeautifulSoup will take practically anything you give it and attempt to turn it into a complete HTML page. That's why you received '<html><body> ...': your second call parsed str(vegas), which is only a fragment, so the parser wrapped it. Usually this is a good thing, because the HTML can be quite badly formed and BeautifulSoup will still process it.
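To see the wrapping in isolation, here is a minimal sketch. It assumes bs4 is installed; the fragment string is made up for illustration. The standard library's html.parser leaves a fragment as it is, while lxml (tried here only if it happens to be installed) wraps it in a full document.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Made-up fragment, standing in for any incomplete piece of HTML.
fragment = "<b>just a fragment</b>"

# html.parser (standard library) does not add a document skeleton.
plain = str(BeautifulSoup(fragment, "html.parser"))
print(plain)

# lxml adds <html><body> around the fragment; skip it if lxml is missing.
try:
    wrapped = str(BeautifulSoup(fragment, "lxml"))
except FeatureNotFound:
    wrapped = None
print(wrapped)
```

So if you ever do need to re-parse a fragment, choosing "html.parser" is one way to avoid the extra tags, although for this question re-parsing is not needed at all.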

In your case, one way of extracting that JSON would be like this.

>>> import bs4
>>> page = bs4.BeautifulSoup(open('Page01.htm').read(), 'lxml')
>>> first_script = page.select('#TargetID')[0].text
>>> first_script 
'<![CDATA[\n{ "name":"Kate", "age":22, "city":"Boston"}\n]]>\n'
>>> content = first_script[first_script.find('{'): 1+first_script.rfind('}')]
>>> content
'{ "name":"Kate", "age":22, "city":"Boston"}'

Once you have this you can turn it into a Python dictionary, like this.

>>> import json
>>> d = json.loads(content)
>>> d['name']
'Kate'
>>> d['age']
22
>>> d['city']
'Boston'
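The same steps can be put together as a small script. This is only a sketch under a few assumptions: the HTML string is a stand-in for Page01.htm, the helper name extract_json is hypothetical, and it uses html.parser so that lxml is not required. It reads the script tag's .string and slices from the first '{' to the last '}', as in the session above.

```python
import json

from bs4 import BeautifulSoup

# Stand-in for the contents of Page01.htm (assumed layout of the script tag).
HTML = """<!DOCTYPE html><html lang="it-IT"><head><title>Title here</title></head>
<body>
<script id="TargetID"><![CDATA[
{ "name":"Kate", "age":22, "city":"Boston"}
]]>
</script>
</body></html>"""


def extract_json(html, element_id):
    """Return the JSON object embedded in the element with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find(id=element_id)
    if tag is None:
        raise ValueError("no element with id %r" % element_id)
    raw = tag.string or ""
    # Keep only the part between the first '{' and the last '}',
    # which drops the CDATA markers around the JSON.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in element %r" % element_id)
    return json.loads(raw[start:end + 1])


person = extract_json(HTML, "TargetID")
print(person["name"], person["age"], person["city"])
```

Note that slicing on braces only works when the tag holds a single JSON object; for anything more complicated you would need to strip the CDATA markers explicitly.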

Preserving BeautifulSoup selection order

If I have a simple document like:

<p> hi </p>
<q> hello </q>
<p> bye </p>
<q> try </q>
<p> why </p>

and I store it in a BeautifulSoup object called doc, then calling select gives:

> doc.select('p, q')
[<p> hi </p>, <p> bye </p>, <p> why </p>, <q> hello </q>, <q> try </q>]

Is it possible to get these elements in the correct order? I would like to number these tags so that “hi” gets 1, “hello” gets 2 and so on… This is a minimal example, but in practice I will have to select by class, id and tag name.

Solution:

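You can always pass your own matching function to find_all if the built-in methods don't suit your use case. The function receives each tag and returns True for the ones to keep, and the results come back in document order.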

def my_tag(tag):
    # find_all calls this for every tag; return True to keep the tag
    return tag.name in ('p', 'q')

doc.find_all(my_tag)

The result would be

[<p> hi </p>, <q> hello </q>, <p> bye </p>, <q> try </q>, <p> why </p>]
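Since the question also asks for numbering, here is a self-contained sketch of that step. It assumes the five-line document from the question, parsed with html.parser; the function name wanted and the variable numbered are made up for illustration. enumerate numbers the tags in the order find_all returns them.

```python
from bs4 import BeautifulSoup

# The document from the question.
HTML = "<p> hi </p>\n<q> hello </q>\n<p> bye </p>\n<q> try </q>\n<p> why </p>"
doc = BeautifulSoup(HTML, "html.parser")


def wanted(tag):
    """Match function for find_all: keep <p> and <q> tags."""
    return tag.name in ("p", "q")


# find_all walks the tree in document order, so numbering follows the page.
numbered = [
    (number, tag.get_text(strip=True))
    for number, tag in enumerate(doc.find_all(wanted), start=1)
]
print(numbered)

# For plain tag names, a list of names works as well.
same_order = [tag.name for tag in doc.find_all(["p", "q"])]
print(same_order)
```

To match by class or id as well, the function can inspect tag.get("class") or tag.get("id") in the same way it inspects tag.name.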