crawl4ai version
0.6.3
Expected Behavior
The expected link is: https://www.adhirainternationalschool.co.in/pdf/Comprehensive Information.pdf
Because result.links already contains an incorrectly resolved link, normalize_url_for_deep_crawl in bff_strategy.py also produces the incorrect URL, even though it calls urljoin.
def normalize_url_for_deep_crawl(href, base_url):
    """Normalize URLs to ensure consistent format"""
    from urllib.parse import urljoin, urlparse, urlunparse, parse_qs, urlencode

    # Handle None or empty values
    if not href:
        return None

    # Use urljoin to handle relative URLs
    full_url = urljoin(base_url, href.strip())
    ....
I think we need to use urljoin at the point where the links are extracted, so that result.links['internal'][x].href is correct in the first place.
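The following is a minimal sketch of why the later urljoin call cannot repair the link. It assumes the raw href on the page is the relative path `pdf/Comprehensive Information.pdf`, and it reproduces the reported URL with a plain string concatenation; both are assumptions for illustration, not a confirmed description of how crawl4ai extracts links.

```python
from urllib.parse import urljoin

page_url = "https://www.adhirainternationalschool.co.in/cbse-curriculum.php"
# Assumed raw href as written in the page HTML (not confirmed from the source).
href = "pdf/Comprehensive Information.pdf"

# Standard relative resolution: the last path segment of the page URL
# ("cbse-curriculum.php") is dropped before the href is appended.
correct = urljoin(page_url, href)

# Hypothetical naive join that reproduces the reported incorrect link.
naive = page_url.rstrip("/") + "/" + href

# Once the href is already absolute, urljoin returns it unchanged,
# so a later normalization step cannot recover the correct URL.
renormalized = urljoin(page_url, naive)

print(correct)
print(naive)
print(renormalized == naive)
```

Because urljoin leaves an absolute URL untouched, resolution against the page URL has to happen while the href is still relative, which is the point of the suggestion above.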
Current Behavior
I am using BestFirstCrawlingStrategy on the website: https://www.adhirainternationalschool.co.in
For the page: https://www.adhirainternationalschool.co.in/cbse-curriculum.php
one of the extracted links is: https://www.adhirainternationalschool.co.in/cbse-curriculum.php/pdf/Comprehensive Information.pdf
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Run the code snippet given below with:
python tests/test_DeepCrawler_relative_links.py --max-pages 1 --url https://www.adhirainternationalschool.co.in/cbse-curriculum.php
Code snippets
import argparse
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig, LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer


async def main():
    # Parse command line arguments
    parser = argparse.ArgumentParser(description='Test iframe processing with output directory')
    parser.add_argument('--url', default="https://www.adhirainternationalschool.co.in/cbse-curriculum.php", help='URL to crawl')
    parser.add_argument('--max-pages', type=int, default=1, help='Maximum number of pages to crawl')
    args = parser.parse_args()

    print(f"Crawling URL: {args.url}")

    # Create a scorer
    keyword_scorer = KeywordRelevanceScorer(
        keywords=["school", "fee", "information", "student", "admission", "sports"],
        weight=0.7
    )

    # Configure the strategy
    best_strategy = BestFirstCrawlingStrategy(
        max_depth=2,
        include_external=False,
        url_scorer=keyword_scorer,
        max_pages=args.max_pages,
    )

    config = CrawlerRunConfig(
        deep_crawl_strategy=best_strategy,
        scraping_strategy=LXMLWebScrapingStrategy(),
        cache_mode=CacheMode.BYPASS,
        stream=True,
        remove_forms=False,
        process_iframes=False,
        scan_full_page=True,
        wait_for_images=False,
        exclude_all_images=True,
        wait_until="domcontentloaded",
    )

    async with AsyncWebCrawler(verbose=True, headless=False) as crawler:
        async for result in await crawler.arun(args.url, config=config):
            print(f"Links: {result.links['internal']}")


if __name__ == "__main__":
    asyncio.run(main())
OS
Linux PopOS
Python version
3.9.23
Browser
Chrome
Browser version
No response
Error logs & Screenshots (if applicable)
No response