{"id":818,"date":"2022-07-25T23:05:00","date_gmt":"2022-07-25T17:35:00","guid":{"rendered":"https:\/\/geekpython.in\/?p=818"},"modified":"2023-08-15T16:43:31","modified_gmt":"2023-08-15T11:13:31","slug":"web-scraping-in-python-using-beautifulsoup","status":"publish","type":"post","link":"https:\/\/geekpython.in\/web-scraping-in-python-using-beautifulsoup","title":{"rendered":"Web Scraping In Python Using Beautifulsoup"},"content":{"rendered":"\n<p>The Internet is filled with lots of digital data that we might need for research or for personal interest. In order to get these data, we gonna need some&nbsp;<strong>web scraping<\/strong>&nbsp;skills.<\/p>\n\n\n\n<p>Python has enough powerful tools to carry out web scraping tasks easily and effectively on large data.<\/p>\n\n\n\n<p>In this tutorial, we are going to use&nbsp;<code><strong>requests<\/strong><\/code>&nbsp;and&nbsp;<code><strong>beautifulsoup<\/strong><\/code>&nbsp;libraries provided by Python.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"heading-what-is-web-scraping\"><a href=\"https:\/\/geekpython.in\/web-scraping-in-python-using-beautifulsoup#heading-what-is-web-scraping\"><\/a>What is web scraping?<\/h1>\n\n\n\n<p><strong>Web scraping<\/strong>&nbsp;or&nbsp;<strong>web data extraction<\/strong>&nbsp;is the process of gathering information from the Internet. 
It can be a simple copy-paste of data from specific websites or advanced, automated collection from websites that serve real-time data.<\/p>\n\n\n\n<p>Some websites don&#8217;t mind their data being extracted, while others strictly prohibit it.<\/p>\n\n\n\n<p>If you are scraping websites for educational purposes then you&#8217;re unlikely to have any problems, but if you are starting large-scale projects then be sure to check the website&#8217;s&nbsp;<a target=\"_blank\" href=\"https:\/\/benbernardblog.com\/web-scraping-and-crawling-are-perfectly-legal-right\/\" rel=\"noreferrer noopener\">Terms of Service<\/a>.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"heading-why-do-we-need-it\"><a href=\"https:\/\/geekpython.in\/web-scraping-in-python-using-beautifulsoup#heading-why-do-we-need-it\"><\/a>Why do we need it?<\/h1>\n\n\n\n<p>Not all websites have APIs to fetch their content, so we are often left with only one option: scraping it.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"heading-steps-for-web-scraping\"><a href=\"https:\/\/geekpython.in\/web-scraping-in-python-using-beautifulsoup#heading-steps-for-web-scraping\"><\/a>Steps for web scraping<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Inspecting the source of data<\/li>\n\n\n\n<li>Getting the HTML content<\/li>\n\n\n\n<li>Parsing the HTML with Beautifulsoup<\/li>\n<\/ul>\n\n\n\n<p>Now let&#8217;s move ahead and install the dependencies we&#8217;ll need for this tutorial.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"heading-installing-the-dependencies\"><a href=\"https:\/\/geekpython.in\/web-scraping-in-python-using-beautifulsoup#heading-installing-the-dependencies\"><\/a>Installing the dependencies<\/h1>\n\n\n\n<p>We are going to install the&nbsp;<code><strong>requests<\/strong><\/code>&nbsp;library, which helps us get the HTML content of the website, and&nbsp;<code><strong>beautifulsoup4<\/strong><\/code>, which parses 
the HTML.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:ps decode:true \">pip install requests beautifulsoup4<\/pre><\/div>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"heading-scraping-the-website\"><a href=\"https:\/\/geekpython.in\/web-scraping-in-python-using-beautifulsoup#heading-scraping-the-website\"><\/a>Scraping the website<\/h1>\n\n\n\n<p>We are going to scrape the Wikipedia article on&nbsp;<a target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/Python_(programming_language)\" rel=\"noreferrer noopener\">Python Programming Language<\/a>. This webpage contains almost all common types of HTML tags, which makes it a good page for testing all aspects of BeautifulSoup.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"heading-1-inspecting-the-source-of-data\"><a href=\"https:\/\/geekpython.in\/web-scraping-in-python-using-beautifulsoup#heading-1-inspecting-the-source-of-data\"><\/a>1. Inspecting the source of data<\/h2>\n\n\n\n<p>Before writing any Python code,&nbsp;<strong>you must take a good look at the website you are going to scrape<\/strong>.<\/p>\n\n\n\n<p>You need to understand the structure of the website to extract the information relevant to your project.<\/p>\n\n\n\n<p>Go through the website thoroughly, perform basic actions, understand how it works, and check the URLs, routes, query parameters, etc.<\/p>\n\n\n\n<p><strong>Inspecting the webpage using Developer Tools<\/strong><\/p>\n\n\n\n<p>Now, it&#8217;s time to inspect the&nbsp;<a target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/Document_Object_Model\" rel=\"noreferrer noopener\">DOM<\/a>&nbsp;(Document Object Model) of the website using Developer Tools.<\/p>\n\n\n\n<p><strong>Developer Tools<\/strong>&nbsp;help in understanding the structure of the website. They can do a range of things, from inspecting the loaded HTML, CSS, and JavaScript to showing which assets the page has requested and how long they took to load. 
All modern browsers come with Developer Tools installed.<\/p>\n\n\n\n<p>To open&nbsp;<strong><em>dev tools<\/em><\/strong>, simply&nbsp;<strong>right-click on the webpage<\/strong>&nbsp;and click on the&nbsp;<strong><em>Inspect<\/em><\/strong>&nbsp;option.&nbsp;<strong>In the Chrome browser on Windows, you can also use the following keyboard shortcut &#8211;<\/strong><\/p>\n\n\n\n<p><em><kbd>Ctrl<\/kbd>&nbsp;+&nbsp;<kbd>Shift<\/kbd>&nbsp;+&nbsp;<kbd>I<\/kbd><\/em><\/p>\n\n\n\n<p>For&nbsp;<strong>macOS<\/strong>, the shortcut is &#8211;<\/p>\n\n\n\n<p><em><kbd>\u2318<\/kbd>&nbsp;+&nbsp;<kbd>\u2325<\/kbd>&nbsp;+&nbsp;<kbd>I<\/kbd><\/em><\/p>\n\n\n\n<p>Now it&#8217;s time to look at the DOM of the webpage that we are going to scrape.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1347\" height=\"426\" src=\"https:\/\/geekpython.in\/wp-content\/uploads\/2023\/08\/dompage.png\" alt=\"DOM View\" class=\"wp-image-820\"\/><\/figure>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><strong><em>The HTML on the right represents the structure of the page rendered on the left.<\/em><\/strong><\/p>\n<\/blockquote>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"heading-2-get-the-html-content\"><a href=\"https:\/\/geekpython.in\/web-scraping-in-python-using-beautifulsoup#heading-2-get-the-html-content\"><\/a>2. 
Get the HTML content<\/h2>\n\n\n\n<p>We need the&nbsp;<code><strong>requests<\/strong><\/code>&nbsp;library, which we already installed, to fetch the HTML content of the website.<\/p>\n\n\n\n<p>Next, open up your favorite IDE or Code Editor and retrieve the site&#8217;s HTML in just a few lines of Python code.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">import requests\n\nurl = \"https:\/\/en.wikipedia.org\/wiki\/Python_(programming_language)\"\n\n# Step 1: Get the HTML\nr = requests.get(url)\nhtmlContent = r.content\n\n# Getting the raw content as bytes\nprint(htmlContent)\n\n# Getting the decoded text\nprint(r.text)<\/pre><\/div>\n\n\n\n<p>If we print&nbsp;<code><strong>r.text<\/strong><\/code>, we&#8217;ll get the same output as the HTML we inspected earlier with the browser&#8217;s developer tools. Now we have access to the site&#8217;s HTML in our Python script.<\/p>\n\n\n\n<p>Now let&#8217;s parse the HTML using Beautiful Soup.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"heading-3-parse-the-html-with-beautifulsoup\"><a href=\"https:\/\/geekpython.in\/web-scraping-in-python-using-beautifulsoup#heading-3-parse-the-html-with-beautifulsoup\"><\/a>3. Parse the HTML with Beautifulsoup<\/h2>\n\n\n\n<p>We have successfully scraped the HTML of the website, but there is a problem: HTML elements, attributes, and tags are scattered all over the lengthy response. We need to parse it using Python code to make it more readable and accessible.<\/p>\n\n\n\n<p><a rel=\"noreferrer noopener\" href=\"https:\/\/www.crummy.com\/software\/BeautifulSoup\/bs4\/doc\/\" target=\"_blank\">Beautiful Soup<\/a>&nbsp;helps us to&nbsp;<strong>parse the structured data<\/strong>. 
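As a first taste of that parsing, here is a minimal, self-contained sketch on an inline HTML string (the markup below is made up for illustration, not taken from the Wikipedia page):

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a real page (illustrative markup only)
html = "<html><body><h1>Demo</h1><p>Hello <b>world</b></p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# Tags become attributes of the parsed tree; .text flattens nested markup
print(soup.h1.text)  # Demo
print(soup.p.text)   # Hello world
```

The same calls work unchanged on the full Wikipedia response, just with far more markup in the tree.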
It is a Python library for pulling out data from <mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-luminous-vivid-orange-color\">HTML and XML files<\/mark>.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">import requests\nfrom bs4 import BeautifulSoup\n\nurl = \"https:\/\/en.wikipedia.org\/wiki\/Python_(programming_language)\"\n\n# Step 1: Get the HTML\nr = requests.get(url)\ncontent = r.content\n\n# Step 2: Parse the HTML\nsoup = BeautifulSoup(content, 'html.parser')\nprint(soup)<\/pre><\/div>\n\n\n\n<p>Here we added some lines to our previous code. We added an import statement for Beautiful Soup and then created a Beautiful Soup object that takes the&nbsp;<code><strong>content<\/strong><\/code>&nbsp;which holds the value of&nbsp;<code><strong>r.content<\/strong><\/code>.<\/p>\n\n\n\n<p>The second argument we added in our Beautiful Soup object is&nbsp;<code><strong>html.parser<\/strong><\/code>. You must choose the&nbsp;<a target=\"_blank\" href=\"https:\/\/www.crummy.com\/software\/BeautifulSoup\/bs4\/doc\/#differences-between-parsers\" rel=\"noreferrer noopener\">right parser<\/a>&nbsp;for the HTML content.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"heading-find-elements-by-id\"><a href=\"https:\/\/geekpython.in\/web-scraping-in-python-using-beautifulsoup#heading-find-elements-by-id\"><\/a>Find elements by ID<\/h3>\n\n\n\n<p>Elements in an HTML webpage can have an&nbsp;<strong><em>id<\/em><\/strong>&nbsp;attribute assigned to them. 
It makes an element in the page uniquely identifiable.<\/p>\n\n\n\n<p>Beautiful Soup allows us to find a specific HTML element by its ID.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">import requests\nfrom bs4 import BeautifulSoup\n\nurl = \"https:\/\/en.wikipedia.org\/wiki\/Python_(programming_language)\"\n\nr = requests.get(url)\ncontent = r.content\n\nsoup = BeautifulSoup(content, 'html.parser')\n\nid_content = soup.find(id=\"firstHeading\")<\/pre><\/div>\n\n\n\n<p>We can call&nbsp;<code><strong>.prettify()<\/strong><\/code>&nbsp;on any Beautiful Soup object to format the HTML for easier viewing. Here we called&nbsp;<code><strong>.prettify()<\/strong><\/code>&nbsp;on the&nbsp;<code><strong>id_content<\/strong><\/code>&nbsp;variable from above.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">print(id_content.prettify())<\/pre><\/div>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>Note: We cannot call&nbsp;<code><strong>.prettify()<\/strong><\/code>&nbsp;on the result of the&nbsp;<code><strong>.find_all()<\/strong><\/code>&nbsp;method, because it returns a list of elements rather than a single tag.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"heading-find-elements-by-tag\"><a href=\"https:\/\/geekpython.in\/web-scraping-in-python-using-beautifulsoup#heading-find-elements-by-tag\"><\/a>Find elements by Tag<\/h3>\n\n\n\n<p>In an HTML webpage, we encounter lots of HTML tags and we might want the data that resides in those tags. 
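The difference between .find() and .find_all() is easy to see on a small offline snippet (illustrative markup, not from Wikipedia):

```python
from bs4 import BeautifulSoup

# Illustrative snippet with repeated tags
html = '<p>First</p><p>Second</p><a href="/home">link</a>'

soup = BeautifulSoup(html, "html.parser")

# .find() returns only the first matching tag
first_p = soup.find("p")
print(first_p.text)              # First

# .find_all() returns every matching tag as a list-like ResultSet
all_p = soup.find_all("p")
print([p.text for p in all_p])   # ['First', 'Second']
```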
For example, we might want the hyperlinks that reside in&nbsp;<code><strong>\"a\"<\/strong><\/code>&nbsp;(anchor) tags or the descriptions inside&nbsp;<code><strong>\"p\"<\/strong><\/code>&nbsp;(paragraph) tags.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">import requests\nfrom bs4 import BeautifulSoup\n\nurl = \"https:\/\/en.wikipedia.org\/wiki\/Python_(programming_language)\"\n\nr = requests.get(url)\ncontent = r.content\n\nsoup = BeautifulSoup(content, 'html.parser')\n\n# Getting the first &lt;code&gt; tag\nfind_tag = soup.find(\"code\")\nprint(find_tag.prettify())\n\n# Getting all the &lt;pre&gt; tags\nall_pre_tag = soup.find_all(\"pre\")\n\nfor pre_tag in all_pre_tag:\n    print(pre_tag)<\/pre><\/div>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"heading-find-elements-by-html-class-name\"><a href=\"https:\/\/geekpython.in\/web-scraping-in-python-using-beautifulsoup#heading-find-elements-by-html-class-name\"><\/a>Find elements by HTML Class Name<\/h3>\n\n\n\n<p>We can see hundreds of elements like&nbsp;<code>&lt;div&gt;<\/code>,&nbsp;<code>&lt;p&gt;<\/code>, or&nbsp;<code>&lt;a&gt;<\/code>&nbsp;with&nbsp;<strong>classes<\/strong>&nbsp;in an HTML webpage, and&nbsp;<strong>through these classes, we can access the content present inside a specific element<\/strong>.<\/p>\n\n\n\n<p>Beautiful Soup provides a&nbsp;<code><strong>class_<\/strong><\/code>&nbsp;argument to find the content present inside an element with a specified class name.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">import requests\nfrom bs4 import BeautifulSoup\n\nurl = \"https:\/\/en.wikipedia.org\/wiki\/Python_(programming_language)\"\n\nr = requests.get(url)\ncontent = r.content\n\nsoup = BeautifulSoup(content, 'html.parser')\n\n# Getting the \"div\" element with class name \"mw-highlight\"\nclass_elem = soup.find(\"div\", 
class_=\"mw-highlight\")\nprint(class_elem.prettify())<\/pre><\/div>\n\n\n\n<p>The first argument we provided inside the beautiful soup object is the&nbsp;<strong><em>element<\/em><\/strong>&nbsp;and the second argument we provided is the&nbsp;<strong><em>class name<\/em><\/strong>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"heading-find-elements-by-text-content-and-class-name\"><a href=\"https:\/\/geekpython.in\/web-scraping-in-python-using-beautifulsoup#heading-find-elements-by-text-content-and-class-name\"><\/a>Find elements by Text Content and Class name<\/h3>\n\n\n\n<p>Beautiful Soup provides a&nbsp;<a rel=\"noreferrer noopener\" href=\"https:\/\/www.crummy.com\/software\/BeautifulSoup\/bs4\/doc\/#the-string-argument\" target=\"_blank\"><strong><em>string argument<\/em><\/strong><\/a>&nbsp;that allows us to search for a string instead of a tag. We can pass in&nbsp;<em>a string, a regular expression, a list, a function, or the value True<\/em>.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \"># Getting all the strings whose value is \"Python\"\nfind_str = soup.find_all(string=\"Python\")\nprint(find_str)\n\n.........\n['Python', 'Python', 'Python', 'Python', 'Python', 'Python', 'Python', 'Python', 'Python', 'Python', 'Python']<\/pre><\/div>\n\n\n\n<p>We can also find the&nbsp;<strong>tags<\/strong>&nbsp;whose value matches the&nbsp;<strong>specified value for the string argument<\/strong>.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">find_str_tag = soup.find_all(\"p\", string=\"Python\")<\/pre><\/div>\n\n\n\n<p>Here we are looking for the&nbsp;<code>&lt;p&gt;<\/code>&nbsp;tag in which the value&nbsp;<em>&#8220;Python&#8221;<\/em>&nbsp;must be there. 
But if we move ahead and&nbsp;<strong>try to print the result, we&#8217;ll get an empty list<\/strong>.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">print(find_str_tag)\n.........\n[]<\/pre><\/div>\n\n\n\n<p>This is because when we use&nbsp;<em>string=<\/em>, Beautiful Soup looks for exactly the value we provide. Any extra whitespace, difference in spelling, or difference in capitalization will prevent the element from matching.<\/p>\n\n\n\n<p>If we provide an exact value, the search succeeds.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">find_str_tag = soup.find_all(\"span\", string=\"Typing\")\nprint(find_str_tag)\n\n.........\n[&lt;span class=\"toctext\"&gt;Typing&lt;\/span&gt;, &lt;span class=\"mw-headline\" id=\"Typing\"&gt;Typing&lt;\/span&gt;]<\/pre><\/div>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"heading-passing-a-function\"><a href=\"https:\/\/geekpython.in\/web-scraping-in-python-using-beautifulsoup#heading-passing-a-function\"><\/a>Passing a Function<\/h3>\n\n\n\n<p>In the above section, when we tried to find&nbsp;<code>&lt;p&gt;<\/code>&nbsp;tags containing the string&nbsp;<em>&#8220;Python&#8221;<\/em>, we got an empty result.<\/p>\n\n\n\n<p>But Beautiful Soup allows us to pass a function as the&nbsp;<em>string<\/em>&nbsp;argument. 
We can modify the above code to work correctly by using a function.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \"># Creating a function that receives a tag's string\ndef has_python(text):\n    return text is not None and \"Python\" in text\n\nfind_str_tag = soup.find_all(\"p\", string=has_python)\nprint(len(find_str_tag))<\/pre><\/div>\n\n\n\n<p>Here we created a function called&nbsp;<code><strong>has_python<\/strong><\/code>&nbsp;which receives a tag&#8217;s string as&nbsp;<code>text<\/code>&nbsp;and returns&nbsp;<em>True<\/em>&nbsp;if it contains&nbsp;<em>&#8220;Python&#8221;<\/em>.<\/p>\n\n\n\n<p>Next, we passed the&nbsp;<strong>function itself<\/strong>&nbsp;(not a call to it) to the&nbsp;<strong><em>string argument<\/em><\/strong>. Then we printed the number of matching&nbsp;<code>&lt;p&gt;<\/code>&nbsp;tags. Note that a tag&#8217;s string is only set when the tag contains a single string, so paragraphs with nested tags won&#8217;t match.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"heading-extract-text-from-html-elements\"><a href=\"https:\/\/geekpython.in\/web-scraping-in-python-using-beautifulsoup#heading-extract-text-from-html-elements\"><\/a>Extract Text from HTML elements<\/h3>\n\n\n\n<p>What if we do not want the content with the HTML tags attached to them? 
What if we want clean and simple text data from the elements and tags?<\/p>\n\n\n\n<p>We can use&nbsp;<code><strong>.text<\/strong><\/code>&nbsp;or&nbsp;<code><strong>.get_text()<\/strong><\/code>&nbsp;to return&nbsp;<strong>only the text content<\/strong>&nbsp;of the HTML elements that we pass in the Beautiful Soup object.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">import requests\nfrom bs4 import BeautifulSoup\n\nurl = \"https:\/\/en.wikipedia.org\/wiki\/Python_(programming_language)\"\n\nr = requests.get(url)\ncontent = r.content\n\nsoup = BeautifulSoup(content, 'html.parser')\n\ntable_elements = soup.find_all(\"table\", class_=\"wikitable\")\n\nfor table_data in table_elements:\n    table_body = table_data.find(\"tbody\")\n\n    print(table_body.text) # or\n\n    print(table_body.get_text())<\/pre><\/div>\n\n\n\n<p>We&#8217;ll get the whole table as output in text format, but there will be a lot of&nbsp;<strong>whitespace<\/strong>&nbsp;around the text, which we can remove by simply using the&nbsp;<code><strong>.strip()<\/strong><\/code>&nbsp;<strong>method<\/strong>.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">print(table_body.text.strip())<\/pre><\/div>\n\n\n\n<p>There are other ways to remove whitespace as well. 
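For instance, .get_text() itself accepts strip and separator keyword arguments that clean up the text while extracting it. A short sketch on an inline table fragment (illustrative markup only):

```python
from bs4 import BeautifulSoup

# Illustrative table fragment with noisy whitespace
html = "<table><tr><td> Name </td><td> Python </td></tr></table>"

soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text fragment; separator joins fragments readably
clean = soup.get_text(separator=" | ", strip=True)
print(clean)  # Name | Python
```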
Check it out&nbsp;<a target=\"_blank\" href=\"https:\/\/geekpython.in\/ways-to-remove-whitespaces-from-the-string-in-python-with-examples-beginner-s-guide\" rel=\"noreferrer noopener\">here<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"heading-extract-attributes-from-html-elements\"><a href=\"https:\/\/geekpython.in\/web-scraping-in-python-using-beautifulsoup#heading-extract-attributes-from-html-elements\"><\/a>Extract Attributes from HTML elements<\/h3>\n\n\n\n<p>An HTML page has numerous attributes like&nbsp;<strong><em>href, src, style, title, and more<\/em><\/strong>. Since an HTML webpage contains a large number of&nbsp;<code>&lt;a&gt;<\/code>&nbsp;tags with&nbsp;<strong><em>href<\/em><\/strong>&nbsp;attributes, we are going to scrape all the&nbsp;<em>href<\/em>&nbsp;attributes present on our page.<\/p>\n\n\n\n<p>We cannot scrape attributes the same way we scraped tags in the above examples; instead, we read them with the&nbsp;<code><strong>.get()<\/strong><\/code>&nbsp;method.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \"># Accessing href in the main content of the HTML page\nanchor_in_body_content = soup.find(id=\"bodyContent\")\n\n# Finding all the anchor tags\nanchors = anchor_in_body_content.find_all(\"a\")\n\n# Looping over all the anchor tags to get the href attribute\nfor link in anchors:\n    links = link.get('href')\n    print(links)<\/pre><\/div>\n\n\n\n<p>We simply looped over all the&nbsp;<code>&lt;a&gt;<\/code>&nbsp;tags in the main content of the HTML page and then used&nbsp;<code><strong>.get('href')<\/strong><\/code>&nbsp;to read each&nbsp;<em>href<\/em>&nbsp;attribute.<\/p>\n\n\n\n<p>You can do the same for&nbsp;<code><strong>src<\/strong><\/code>&nbsp;attributes.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \"># Accessing src in body of the HTML page\nimg_in_body_content = soup.find(id=\"bodyContent\")\n\n# Finding all the img tags\nmedia = img_in_body_content.find_all(\"img\")\n\n# Looping over all the 
img tags to get the src attribute\nfor img in media:\n    images = img.get('src')\n    print(images)<\/pre><\/div>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"heading-access-parent-and-sibling-elements\"><a href=\"https:\/\/geekpython.in\/web-scraping-in-python-using-beautifulsoup#heading-access-parent-and-sibling-elements\"><\/a>Access Parent and Sibling elements<\/h3>\n\n\n\n<p>Beautiful Soup allows us to access an element&#8217;s parent by simply using the&nbsp;<code><strong>.parent<\/strong><\/code>&nbsp;attribute.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">import requests\nfrom bs4 import BeautifulSoup\n\nurl = \"https:\/\/en.wikipedia.org\/wiki\/Python_(programming_language)\"\n\nr = requests.get(url)\ncontent = r.content\n\nsoup = BeautifulSoup(content, 'html.parser')\n\nid_content = soup.find(id=\"cite_ref-123\")\n\nparent_elem = id_content.parent\nprint(parent_elem)<\/pre><\/div>\n\n\n\n<p>We can&nbsp;<strong>find the grandparent<\/strong>&nbsp;or&nbsp;<strong>great-grandparent elements<\/strong>&nbsp;of a specific element by chaining the&nbsp;<code><strong>.parent<\/strong><\/code>&nbsp;attribute.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">id_content = soup.find(id=\"cite_ref-123\")\n\ngrandparent_elem = id_content.parent.parent\nprint(grandparent_elem)<\/pre><\/div>\n\n\n\n<p>Beautiful Soup also provides&nbsp;<code><strong>.parents<\/strong><\/code>, which helps us&nbsp;<strong>iterate over all of an element&#8217;s parents<\/strong>.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">id_content = soup.find(id=\"cite_ref-123\")\n\nfor elem in id_content.parents:\n    print(elem) # to print the elements\n\n    print(elem.name) # to print only the names of elements<\/pre><\/div>\n\n\n\n<p><strong>Note: This program might take a little while to complete, so wait until the 
program is finished.<\/strong><\/p>\n\n\n\n<p><strong>Output for<\/strong>&nbsp;<code><strong>elem.name<\/strong><\/code>&nbsp;<strong>would be<\/strong><\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">p\ndiv\ndiv\ndiv\ndiv\nbody\nhtml\n[document]<\/pre><\/div>\n\n\n\n<p>Similarly, we can access an element&#8217;s&nbsp;<strong>next<\/strong>&nbsp;and&nbsp;<strong>previous siblings<\/strong>&nbsp;by using&nbsp;<code><strong>.next_sibling<\/strong><\/code>&nbsp;and&nbsp;<code><strong>.previous_sibling<\/strong><\/code>&nbsp;respectively.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">id_content = soup.find(id=\"cite_ref-123\")\n\n# To print the next sibling of an element\nnext_sibling_elem = id_content.next_sibling\n\nprint(next_sibling_elem)<\/pre><\/div>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">id_content = soup.find(id=\"cite_ref-123\")\n\n# To print the previous sibling of an element\nprevious_sibling_elem = id_content.previous_sibling\n\nprint(previous_sibling_elem)<\/pre><\/div>\n\n\n\n<p>We can iterate over a tag&#8217;s siblings using&nbsp;<code><strong>.next_siblings<\/strong><\/code>&nbsp;or&nbsp;<code><strong>.previous_siblings<\/strong><\/code>.<\/p>\n\n\n\n<p><strong>Iterating over all the next siblings<\/strong><\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">id_content = soup.find(id=\"cite_ref-123\")\n\nfor next_elem in id_content.next_siblings:\n    print(next_elem)<\/pre><\/div>\n\n\n\n<p><strong>Iterating over all the previous siblings<\/strong><\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">id_content = soup.find(id=\"cite_ref-123\")\n\nfor previous_elem in id_content.previous_siblings:\n    
print(previous_elem)<\/pre><\/div>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"heading-using-regular-expression\"><a href=\"https:\/\/geekpython.in\/web-scraping-in-python-using-beautifulsoup#heading-using-regular-expression\"><\/a>Using Regular Expression<\/h3>\n\n\n\n<p>Last but not least, we can use&nbsp;<strong>regular expressions<\/strong>&nbsp;to search for an&nbsp;<em>element<\/em>,&nbsp;<em>tag<\/em>,&nbsp;<em>text<\/em>, etc., in the HTML tree.<\/p>\n\n\n\n<p>This code will find all the tags whose names start with&nbsp;<code><strong>p<\/strong><\/code>&nbsp;in the HTML element having&nbsp;<code><strong>id=bodyContent<\/strong><\/code>.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">import requests\nfrom bs4 import BeautifulSoup\nimport re\n\nurl = \"https:\/\/en.wikipedia.org\/wiki\/Python_(programming_language)\"\n\nr = requests.get(url)\ncontent = r.content\n\nsoup = BeautifulSoup(content, 'html.parser')\n\nid_content = soup.find(id=\"bodyContent\")\n\nfor tag in id_content.find_all(re.compile(\"^p\")):\n    print(tag.name)<\/pre><\/div>\n\n\n\n<p>The next snippet uses&nbsp;<code>\\w<\/code>, which matches all the alphanumeric characters, meaning&nbsp;<code>a-z<\/code>,&nbsp;<code>A-Z<\/code>, and&nbsp;<code>0-9<\/code>. It also matches the underscore,&nbsp;<code>_<\/code>. 
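To see tag-name regex matching in isolation, here is a small offline sketch (the markup is made up for illustration):

```python
import re
from bs4 import BeautifulSoup

# Illustrative snippet containing tags that do and don't start with "p"
html = "<div><p>text</p><pre>code</pre><a href='#'>link</a></div>"

soup = BeautifulSoup(html, "html.parser")

# re.compile("^p") matches any tag whose name starts with "p"
names = [tag.name for tag in soup.find_all(re.compile("^p"))]
print(names)  # ['p', 'pre']
```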
Since we don&#8217;t have tag names starting with a digit or an underscore, it will return every tag inside the element passed to the search.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:python decode:true \">id_content = soup.find(id=\"bodyContent\")\n\nfor tag in id_content.find_all(re.compile(r\"\\w\")):\n    print(tag.name)<\/pre><\/div>\n\n\n\n<h1 class=\"wp-block-heading\" id=\"heading-conclusion\"><a href=\"https:\/\/geekpython.in\/web-scraping-in-python-using-beautifulsoup#heading-conclusion\"><\/a>Conclusion<\/h1>\n\n\n\n<p>We learned how to scrape a static website. The process can differ for&nbsp;<strong>dynamic websites<\/strong>, which return different data on different requests, or for websites behind authentication. There are more powerful scraping tools available for these cases, like&nbsp;<strong>Selenium<\/strong>&nbsp;and&nbsp;<strong>Scrapy<\/strong>.<\/p>\n\n\n\n<p>The&nbsp;<code><strong>requests<\/strong><\/code>&nbsp;library gives us access to the site&#8217;s HTML, from which we can then pull out data using&nbsp;<strong>Beautiful Soup<\/strong>.<\/p>\n\n\n\n<p>There are many more methods and functions available that we haven&#8217;t seen, but we discussed the ones that are used most commonly.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><strong>That&#8217;s all for now<\/strong><\/p>\n\n\n\n<p><strong>Keep Coding\u270c\u270c<\/strong><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Internet is filled with digital data that we might need for research or personal interest. To get this data, we will need some&nbsp;web scraping&nbsp;skills. Python has powerful tools to carry out web scraping tasks easily and effectively, even on large amounts of data. 
In this tutorial, we are going to use&nbsp;requests&nbsp;and&nbsp;beautifulsoup&nbsp;libraries [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":821,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"ocean_post_layout":"","ocean_both_sidebars_style":"","ocean_both_sidebars_content_width":0,"ocean_both_sidebars_sidebars_width":0,"ocean_sidebar":"0","ocean_second_sidebar":"0","ocean_disable_margins":"enable","ocean_add_body_class":"","ocean_shortcode_before_top_bar":"","ocean_shortcode_after_top_bar":"","ocean_shortcode_before_header":"","ocean_shortcode_after_header":"","ocean_has_shortcode":"","ocean_shortcode_after_title":"","ocean_shortcode_before_footer_widgets":"","ocean_shortcode_after_footer_widgets":"","ocean_shortcode_before_footer_bottom":"","ocean_shortcode_after_footer_bottom":"","ocean_display_top_bar":"default","ocean_display_header":"default","ocean_header_style":"","ocean_center_header_left_menu":"0","ocean_custom_header_template":"0","ocean_custom_logo":0,"ocean_custom_retina_logo":0,"ocean_custom_logo_max_width":0,"ocean_custom_logo_tablet_max_width":0,"ocean_custom_logo_mobile_max_width":0,"ocean_custom_logo_max_height":0,"ocean_custom_logo_tablet_max_height":0,"ocean_custom_logo_mobile_max_height":0,"ocean_header_custom_menu":"0","ocean_menu_typo_font_family":"0","ocean_menu_typo_font_subset":"","ocean_menu_typo_font_size":0,"ocean_menu_typo_font_size_tablet":0,"ocean_menu_typo_font_size_mobile":0,"ocean_menu_typo_font_size_unit":"px","ocean_menu_typo_font_weight":"","ocean_menu_typo_font_weight_tablet":"","ocean_menu_typo_font_weight_mobile":"","ocean_menu_typo_transform":"","ocean_menu_typo_transform_tablet":"","ocean_menu_typo_transform_mobile":"","ocean_menu_typo_line_height":0,"ocean_menu_typo_line_height_tablet":0,"ocean_menu_typo_line_height_mobile":0,"ocean_menu_typo_line_height_unit":"","ocean_menu_typo_spacing":0,"ocean_menu_typo_spacing_tablet":0,"ocean_menu_typo_spacin
g_mobile":0,"ocean_menu_typo_spacing_unit":"","ocean_menu_link_color":"","ocean_menu_link_color_hover":"","ocean_menu_link_color_active":"","ocean_menu_link_background":"","ocean_menu_link_hover_background":"","ocean_menu_link_active_background":"","ocean_menu_social_links_bg":"","ocean_menu_social_hover_links_bg":"","ocean_menu_social_links_color":"","ocean_menu_social_hover_links_color":"","ocean_disable_title":"default","ocean_disable_heading":"default","ocean_post_title":"","ocean_post_subheading":"","ocean_post_title_style":"","ocean_post_title_background_color":"","ocean_post_title_background":0,"ocean_post_title_bg_image_position":"","ocean_post_title_bg_image_attachment":"","ocean_post_title_bg_image_repeat":"","ocean_post_title_bg_image_size":"","ocean_post_title_height":0,"ocean_post_title_bg_overlay":0.5,"ocean_post_title_bg_overlay_color":"","ocean_disable_breadcrumbs":"default","ocean_breadcrumbs_color":"","ocean_breadcrumbs_separator_color":"","ocean_breadcrumbs_links_color":"","ocean_breadcrumbs_links_hover_color":"","ocean_display_footer_widgets":"default","ocean_display_footer_bottom":"default","ocean_custom_footer_template":"0","ocean_post_oembed":"","ocean_post_self_hosted_media":"","ocean_post_video_embed":"","ocean_link_format":"","ocean_link_format_target":"self","ocean_quote_format":"","ocean_quote_format_link":"post","ocean_gallery_link_images":"off","ocean_gallery_id":[],"footnotes":""},"categories":[2,45],"tags":[12,31,46],"class_list":["post-818","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-python","category-web-scraping","tag-python","tag-python3","tag-web-scraping","entry","has-media"],"_links":{"self":[{"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/posts\/818","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/
\/geekpython.in\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/comments?post=818"}],"version-history":[{"count":3,"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/posts\/818\/revisions"}],"predecessor-version":[{"id":1369,"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/posts\/818\/revisions\/1369"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/media\/821"}],"wp:attachment":[{"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/media?parent=818"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/categories?post=818"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/geekpython.in\/wp-json\/wp\/v2\/tags?post=818"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}