Accessing HTML source code using Python Selenium.

We can access HTML source code with Selenium WebDriver in two main ways. The page_source property retrieves the complete HTML of the current page, while JavaScript execution through execute_script lets us read specific portions of the DOM, such as the body content.

Syntax

Following is the syntax for accessing HTML source using the page_source property −

src = driver.page_source

Following is the syntax for accessing HTML source using JavaScript execution −

h = driver.execute_script("return document.body.innerHTML")

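Method 1: Using the page_source Property

The page_source property returns the HTML of the current page as a string. It is read without parentheses because it is a property, not a method. The result typically covers the entire document from <html> to </html> and reflects the DOM at the moment the property is read, so content that scripts have already added is included, while content that loads later is not.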

Example

Following example demonstrates how to access HTML source code using the page_source property −

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

# Set up the Chrome driver using the Service and Options classes
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run in background
service = Service("path/to/chromedriver")  # Update with your chromedriver path
driver = webdriver.Chrome(service=service, options=chrome_options)

try:
    driver.implicitly_wait(10)
    driver.get("https://www.tutorialspoint.com/index.htm")
    
    # Access complete HTML source code
    page_source = driver.page_source
    print("HTML Source Length:", len(page_source))
    print("First 200 characters:")
    print(page_source[:200])
    
finally:
    driver.quit()

The output shows the length and the first portion of the HTML source. The exact values depend on the page content at the time of the run −

HTML Source Length: 45620
First 200 characters:
<!DOCTYPE html><html lang="en"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1"><title>Online Tutorials Library</title><meta name="description"
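
Once the source is held in a Python string, it can be written to disk for later inspection. The following is a minimal sketch of that step. The helper name save_page_source and the output path are assumptions for illustration and are not part of Selenium; the code relies only on the driver exposing page_source.

```python
from pathlib import Path


def save_page_source(driver, path):
    """Write driver.page_source to `path` as UTF-8 and return its length.

    `save_page_source` is a hypothetical helper, not a Selenium API.
    """
    html = driver.page_source  # property access, so no parentheses
    Path(path).write_text(html, encoding="utf-8")
    return len(html)
```

Inside the try block of the example above, a call such as save_page_source(driver, "page.html") would store the page next to the script.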

Method 2: Using JavaScript Execution

The execute_script method allows us to run JavaScript commands within the browser context. Using document.body.innerHTML returns only the content within the <body> tags, excluding the <head> section.

Example

Following example shows how to access HTML source using JavaScript execution −

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

# Setup Chrome driver
chrome_options = Options()
chrome_options.add_argument("--headless")
service = Service("path/to/chromedriver")  # Update with your chromedriver path
driver = webdriver.Chrome(service=service, options=chrome_options)

try:
    driver.implicitly_wait(10)
    driver.get("https://www.tutorialspoint.com/index.htm")
    
    # Access body content using JavaScript
    body_html = driver.execute_script("return document.body.innerHTML")
    print("Body HTML Length:", len(body_html))
    print("First 200 characters of body:")
    print(body_html[:200])
    
    # Read another document property, here the page title
    title = driver.execute_script("return document.title")
    print("Page Title:", title)
    
finally:
    driver.quit()

The output displays the body content and page title −

Body HTML Length: 42150
First 200 characters of body:
<div class="header"><div class="container"><div class="row"><div class="col-md-12"><nav class="navbar navbar-default"><div class="navbar-header"><button type="button" class="navbar-toggle
Page Title: Online Tutorials Library
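
execute_script also accepts extra positional arguments, which the script sees as arguments[0], arguments[1], and so on. This makes it possible to return the markup of a single element rather than the whole body. The following is a minimal sketch, assuming a hypothetical helper named element_outer_html and an element that has already been located (for example with driver.find_element) −

```python
def element_outer_html(driver, element):
    """Return the outer HTML of one already-located element.

    `element_outer_html` is a hypothetical helper; `element` is assumed to be
    a WebElement obtained earlier, e.g. via driver.find_element(...).
    """
    # Extra positional arguments are exposed to the script as arguments[i].
    return driver.execute_script("return arguments[0].outerHTML;", element)
```

For a single element, element.get_attribute("outerHTML") is an alternative that avoids writing JavaScript.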

Advanced JavaScript Execution

JavaScript execution provides more flexibility for accessing specific parts of the DOM or performing complex operations.

Example − Accessing Multiple DOM Properties

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
service = Service("path/to/chromedriver")
driver = webdriver.Chrome(service=service, options=chrome_options)

try:
    driver.get("https://www.tutorialspoint.com/index.htm")
    
    # Execute multiple JavaScript commands
    scripts = {
        "full_html": "return document.documentElement.outerHTML",
        "head_content": "return document.head.innerHTML",
        "body_content": "return document.body.innerHTML",
        "page_title": "return document.title",
        "url": "return window.location.href"
    }
    
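    # Keys whose values are large HTML strings; print their length instead
    html_keys = {"full_html", "head_content", "body_content"}

    results = {}
    for key, script in scripts.items():
        results[key] = driver.execute_script(script)
        label = key.replace("_", " ").title()
        # The title and URL are also strings, so select by key, not by type
        value = len(results[key]) if key in html_keys else results[key]
        print(f"{label}: {value}")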
    
finally:
    driver.quit()

The output shows various DOM properties and their lengths −

Full Html: 45620
Head Content: 2890
Body Content: 42150
Page Title: Online Tutorials Library
Url: https://www.tutorialspoint.com/index.htm
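
Each execute_script call is a separate round trip to the browser. When several values are needed at once, a single script can return them together as a JavaScript object, which Selenium converts to a Python dict. The following is a sketch of that idea; the helper name dom_snapshot and the key names are illustrative assumptions −

```python
# One script returning several values at once; the key names are illustrative.
SNAPSHOT_SCRIPT = """
return {
    title: document.title,
    url: window.location.href,
    headLength: document.head.innerHTML.length,
    bodyLength: document.body.innerHTML.length
};
"""


def dom_snapshot(driver):
    """Return title, URL and markup lengths from a single execute_script call.

    `dom_snapshot` is a hypothetical helper, not a Selenium API.
    """
    # Selenium maps the returned JavaScript object to a Python dict.
    return driver.execute_script(SNAPSHOT_SCRIPT)
```

The lengths here are computed in the browser, so only small values travel back to Python instead of the full markup.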

Comparison of Methods

Following table compares the two methods for accessing HTML source code −

| Method | Coverage | Use Case | Notes |
|---|---|---|---|
| page_source | Complete HTML document | Full page analysis, saving complete HTML | Single property access |
| execute_script | Specific DOM elements | Targeted content extraction, custom operations | Flexible but requires JavaScript knowledge |
| page_source | Includes DOCTYPE and html tags | Complete document structure needed | Direct WebDriver property |
| execute_script | Can access any DOM property | Dynamic content, computed styles | Executes in browser context |
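
For the full page analysis use case, the string returned by page_source can be handed to any HTML parser. The following is a minimal sketch that uses only the standard library html.parser module to count start tags. TagCounter and count_tags are hypothetical names chosen for illustration −

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags in an HTML string (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased from HTMLParser.
        self.counts[tag] += 1


def count_tags(html):
    """Return a Counter mapping tag name to number of start tags."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts
```

With a live driver, count_tags(driver.page_source)["a"] would give the number of anchor start tags on the page. Running the same function on the result of document.body.innerHTML reports no head tag, which reflects the coverage difference in the table.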

Error Handling and Best Practices

When accessing HTML source code, it is important to handle potential errors and follow best practices for reliable automation.

Example − Robust Implementation

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import WebDriverException
import time

def get_html_source(url, method="page_source"):
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    service = Service("path/to/chromedriver")
    driver = webdriver.Chrome(service=service, options=chrome_options)
    
    try:
        driver.implicitly_wait(10)
        driver.get(url)
        
        # Fixed pause to give late-loading content time to appear.
        # This is crude: the right delay differs from page to page.
        time.sleep(2)
        
        if method == "page_source":
            return driver.page_source
        elif method == "javascript":
            return driver.execute_script("return document.body.innerHTML")
        else:
            raise ValueError("Method must be 'page_source' or 'javascript'")
            
    except WebDriverException as e:
        print(f"WebDriver error: {e}")
        return None
    except Exception as e:
        print(f"Error: {e}")
        return None
    finally:
        driver.quit()

# Usage example
if __name__ == "__main__":
    url = "https://www.tutorialspoint.com/index.htm"
    
    # Get full HTML
    full_html = get_html_source(url, "page_source")
    if full_html:
        print(f"Full HTML retrieved: {len(full_html)} characters")
    
    # Get body HTML
    body_html = get_html_source(url, "javascript")
    if body_html:
        print(f"Body HTML retrieved: {len(body_html)} characters")

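This implementation catches WebDriver errors, returns None on failure, and always quits the driver in the finally block. A successful run prints output similar to the following −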

Full HTML retrieved: 45620 characters
Body HTML retrieved: 42150 characters
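
A fixed time.sleep either waits longer than needed or not long enough. One alternative is to poll document.readyState until the browser reports "complete". The sketch below hand-rolls the polling loop so that it stays self-contained; the name wait_for_document_ready and its timeout and poll parameters are assumptions for illustration, and Selenium's WebDriverWait class serves the same purpose. Note that readyState only describes the initial document load, so content fetched later by scripts may still need a wait on a specific element.

```python
import time


def wait_for_document_ready(driver, timeout=10.0, poll=0.25):
    """Poll document.readyState until it is 'complete' or `timeout` elapses.

    Returns True if the document finished loading in time, False otherwise.
    `wait_for_document_ready` is a hypothetical helper, not a Selenium API.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = driver.execute_script("return document.readyState")
        if state == "complete":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

In get_html_source, a call to wait_for_document_ready(driver) could take the place of time.sleep(2), with the boolean result deciding whether to read the source or report a timeout.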

Conclusion

Selenium WebDriver provides two main approaches to access HTML source code: the page_source property for complete HTML retrieval and the execute_script method for targeted DOM access using JavaScript. Choose page_source for full document analysis and execute_script for specific content extraction or dynamic operations.

Updated on: 2026-03-16T21:38:54+05:30
