Accessing HTML source code using Python Selenium.
We can access HTML source code with Selenium WebDriver in two primary ways. The page_source property retrieves the complete HTML of the current page, while JavaScript execution through execute_script allows us to access specific portions of the DOM, such as the body content.
Syntax
Following is the syntax for accessing HTML source using the page_source property −
src = driver.page_source
Following is the syntax for accessing HTML source using JavaScript execution −
h = driver.execute_script("return document.body.innerHTML")
Method 1: Using the page_source Property
The page_source property returns the HTML source of the current page as a string, serialized from the DOM as the browser holds it at the time of the call. It covers the entire document from <html> to </html>, including content that scripts have already added to the DOM.
Example
Following example demonstrates how to access HTML source code using the page_source property −
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

# Setup Chrome driver with modern syntax
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run in background
service = Service("path/to/chromedriver")  # Update with your chromedriver path
driver = webdriver.Chrome(service=service, options=chrome_options)

try:
    driver.implicitly_wait(10)
    driver.get("https://www.tutorialspoint.com/index.htm")

    # Access complete HTML source code
    page_source = driver.page_source
    print("HTML Source Length:", len(page_source))
    print("First 200 characters:")
    print(page_source[:200])
finally:
    # Always close the browser, even if an error occurred
    driver.quit()
The output shows the length and first portion of the HTML source −
HTML Source Length: 45620
First 200 characters:
<!DOCTYPE html><html lang="en"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1"><title>Online Tutorials Library</title><meta name="description"
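Because page_source is returned as an ordinary Python string, it can be saved or parsed without any further browser calls. The following is a minimal sketch of that idea using only the standard library. The helper names save_html and extract_title, the TitleParser class, and the sample markup are hypothetical and stand in for a real page_source value:

```python
from html.parser import HTMLParser
from pathlib import Path


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def extract_title(html):
    """Return the page title found in an HTML string, or None."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


def save_html(html, path):
    """Write the HTML string to disk as UTF-8 and return the path."""
    target = Path(path)
    target.write_text(html, encoding="utf-8")
    return target


# Stand-in for driver.page_source (sample markup, not real output)
sample_source = (
    "<!DOCTYPE html><html><head><title>Sample Page</title></head>"
    "<body><p>Hi</p></body></html>"
)
print(extract_title(sample_source))
```

In a real script, sample_source would be replaced by driver.page_source, and save_html(driver.page_source, "page.html") would keep a copy of the document for later analysis.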
Method 2: Using JavaScript Execution
The execute_script method allows us to run JavaScript commands within the browser context. Using document.body.innerHTML returns only the content within the <body> tags, excluding the <head> section.
Example
Following example shows how to access HTML source using JavaScript execution −
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

# Setup Chrome driver
chrome_options = Options()
chrome_options.add_argument("--headless")
service = Service("path/to/chromedriver")  # Update with your chromedriver path
driver = webdriver.Chrome(service=service, options=chrome_options)

try:
    driver.implicitly_wait(10)
    driver.get("https://www.tutorialspoint.com/index.htm")

    # Access body content using JavaScript
    body_html = driver.execute_script("return document.body.innerHTML")
    print("Body HTML Length:", len(body_html))
    print("First 200 characters of body:")
    print(body_html[:200])

    # Access specific elements using JavaScript
    title = driver.execute_script("return document.title")
    print("Page Title:", title)
finally:
    # Always close the browser, even if an error occurred
    driver.quit()
The output displays the body content and page title −
Body HTML Length: 42150
First 200 characters of body:
<div class="header"><div class="container"><div class="row"><div class="col-md-12"><nav class="navbar navbar-default"><div class="navbar-header"><button type="button" class="navbar-toggle
Page Title: Online Tutorials Library
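The same approach can be narrowed from the whole body to a single element. The sketch below assumes a driver object that exposes execute_script(script, *args), as Selenium WebDriver does. The helper name element_outer_html is hypothetical. Extra arguments passed to execute_script are available inside the script as the JavaScript arguments array, which avoids building the selector into the script string:

```python
def element_outer_html(driver, css_selector):
    """Return the outerHTML of the first element matching css_selector.

    Returns None when no element matches.
    """
    script = (
        "const el = document.querySelector(arguments[0]);"
        "return el ? el.outerHTML : null;"
    )
    # css_selector is passed as a script argument rather than
    # concatenated into the JavaScript source
    return driver.execute_script(script, css_selector)
```

A call such as element_outer_html(driver, "nav.navbar") would return the markup of the matching element only. The selector here is an illustration taken from the sample output above, not a guaranteed part of the page.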
Advanced JavaScript Execution
JavaScript execution provides more flexibility for accessing specific parts of the DOM or performing complex operations.
Example − Accessing Multiple DOM Properties
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
service = Service("path/to/chromedriver")
driver = webdriver.Chrome(service=service, options=chrome_options)

try:
    driver.get("https://www.tutorialspoint.com/index.htm")

    # Execute multiple JavaScript commands
    scripts = {
        "full_html": "return document.documentElement.outerHTML",
        "head_content": "return document.head.innerHTML",
        "body_content": "return document.body.innerHTML",
        "page_title": "return document.title",
        "url": "return window.location.href"
    }

    results = {}
    for key, script in scripts.items():
        results[key] = driver.execute_script(script)
        print(f"{key.replace('_', ' ').title()}: {len(results[key]) if isinstance(results[key], str) else results[key]}")
finally:
    driver.quit()
The output shows various DOM properties and their lengths −
Full Html: 45620
Head Content: 2890
Body Content: 42150
Page Title: Online Tutorials Library
Url: https://www.tutorialspoint.com/index.htm
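Each execute_script call is a separate round trip to the browser, so related properties can also be collected in one call by returning a JavaScript object, which Selenium hands back as a Python dictionary. The following is a sketch under that assumption. The names SNAPSHOT_SCRIPT, page_snapshot, and summarize_snapshot are hypothetical:

```python
# One script that gathers several DOM properties at once
SNAPSHOT_SCRIPT = """
return {
    title: document.title,
    url: window.location.href,
    head: document.head.innerHTML,
    body: document.body.innerHTML
};
"""


def page_snapshot(driver):
    """Collect several DOM properties with a single execute_script call."""
    # The returned JavaScript object arrives as a dict-like value
    return dict(driver.execute_script(SNAPSHOT_SCRIPT))


def summarize_snapshot(snapshot):
    """Replace the large HTML strings with their lengths for printing."""
    return {
        "title": snapshot["title"],
        "url": snapshot["url"],
        "head_length": len(snapshot["head"]),
        "body_length": len(snapshot["body"]),
    }
```

Because all values come from the same script run, they describe the page at one point in time, which may matter on pages that keep changing after load.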
Comparison of Methods
Following table compares the two methods for accessing HTML source code −
| Method | Coverage | Use Case | Performance |
|---|---|---|---|
| page_source | Complete HTML document | Full page analysis, saving complete HTML | Single property access |
| execute_script | Specific DOM elements | Targeted content extraction, custom operations | Flexible but requires JavaScript knowledge |
| page_source | Includes DOCTYPE and html tags | Complete document structure needed | Direct WebDriver property |
| execute_script | Can access any DOM property | Dynamic content, computed styles | Executes in browser context |
Error Handling and Best Practices
When accessing HTML source code, it is important to handle potential errors and follow best practices for reliable automation.
Example − Robust Implementation
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import WebDriverException
import time

def get_html_source(url, method="page_source"):
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    service = Service("path/to/chromedriver")
    driver = webdriver.Chrome(service=service, options=chrome_options)

    try:
        driver.implicitly_wait(10)
        driver.get(url)

        # Wait for page to load completely
        time.sleep(2)

        if method == "page_source":
            return driver.page_source
        elif method == "javascript":
            return driver.execute_script("return document.body.innerHTML")
        else:
            raise ValueError("Method must be 'page_source' or 'javascript'")
    except WebDriverException as e:
        print(f"WebDriver error: {e}")
        return None
    except Exception as e:
        print(f"Error: {e}")
        return None
    finally:
        # Runs on every path, so the browser is always closed
        driver.quit()

# Usage example
if __name__ == "__main__":
    url = "https://www.tutorialspoint.com/index.htm"

    # Get full HTML
    full_html = get_html_source(url, "page_source")
    if full_html:
        print(f"Full HTML retrieved: {len(full_html)} characters")

    # Get body HTML
    body_html = get_html_source(url, "javascript")
    if body_html:
        print(f"Body HTML retrieved: {len(body_html)} characters")
This implementation includes error handling and resource cleanup. Running it produces output similar to the following −
Full HTML retrieved: 45620 characters
Body HTML retrieved: 42150 characters
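The fixed time.sleep(2) pause in the example above either waits longer than needed or not long enough on slow pages. One common alternative is to poll for a condition until it holds or a timeout expires. The sketch below shows that pattern with the standard library only. The helper names wait_until and body_has_content are hypothetical, and the condition used here (the body has at least one child element) is just one possible readiness check. Selenium also ships a WebDriverWait class in selenium.webdriver.support.ui that implements the same polling idea:

```python
import time


def wait_until(driver, condition, timeout=10.0, poll_interval=0.25):
    """Call condition(driver) repeatedly until it returns a truthy value.

    Returns that value, or None if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition(driver)
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)


def body_has_content(driver):
    """Truthy once document.body exists and has at least one child element."""
    return driver.execute_script(
        "return document.body !== null && document.body.children.length > 0"
    )
```

Inside get_html_source, a call such as wait_until(driver, body_has_content) could take the place of time.sleep(2), with the source read only when the call returns a truthy value.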
Conclusion
Selenium WebDriver provides two main approaches to access HTML source code: the page_source property for complete HTML retrieval and the execute_script method for targeted DOM access using JavaScript. Choose page_source for full document analysis and execute_script for specific content extraction or dynamic operations.
