[Bug]: Incorrect URLs in result.links for relative urls #1323

@nitirajrathore

crawl4ai version

0.6.3

Expected Behavior

The expected link is: https://www.adhirainternationalschool.co.in/pdf/Comprehensive Information.pdf

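Because result.links contains an incorrectly resolved link, the links in bff_strategy.py also resolve to the incorrect URL in normalize_url_for_deep_crawl, even though it uses urljoin. Presumably this is because the href it receives is already an absolute (but wrong) URL, which urljoin returns unchanged.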

def normalize_url_for_deep_crawl(href, base_url):
    """Normalize URLs to ensure consistent format"""
    from urllib.parse import urljoin, urlparse, urlunparse, parse_qs, urlencode

    # Handle None or empty values
    if not href:
        return None

    # Use urljoin to handle relative URLs
    full_url = urljoin(base_url, href.strip())
....

I think we need to use urljoin at the point where the links are extracted, so that result.links['internal'][x].href is correct in the first place.
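To illustrate the point, here is a minimal sketch using only the standard library. `resolve_href` is a hypothetical helper name, not an existing crawl4ai function; it only shows what resolving at extraction time could look like, and why re-joining an already-absolute URL later cannot repair it.

```python
from urllib.parse import urljoin

PAGE_URL = "https://www.adhirainternationalschool.co.in/cbse-curriculum.php"
# The incorrect absolute URL reported in result.links for this page.
BAD_HREF = PAGE_URL + "/pdf/Comprehensive Information.pdf"


def resolve_href(page_url, href):
    """Hypothetical helper: resolve a raw href against the page URL."""
    if not href:
        return None
    return urljoin(page_url, href.strip())


# urljoin leaves an already-absolute URL untouched, so a later
# normalization step cannot undo the earlier mistake.
rejoined = urljoin(PAGE_URL, BAD_HREF)
print(rejoined == BAD_HREF)

# Resolving the raw relative href (assumed value) gives the expected link.
resolved = resolve_href(PAGE_URL, "pdf/Comprehensive Information.pdf")
print(resolved)
```

The raw href value passed to `resolve_href` is an assumption about the page's markup; the issue only shows the final extracted URL.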

Current Behavior

I am using BestFirstCrawlingStrategy on the website https://www.adhirainternationalschool.co.in.

For the page https://www.adhirainternationalschool.co.in/cbse-curriculum.php

one of the extracted links is: https://www.adhirainternationalschool.co.in/cbse-curriculum.php/pdf/Comprehensive Information.pdf
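One way a URL of this shape can arise is by appending the relative href to the full page URL instead of resolving it. This is an assumption for illustration, not something confirmed from crawl4ai's source, and the raw href value below is likewise assumed. The sketch contrasts that with `urllib.parse.urljoin`:

```python
from urllib.parse import urljoin

page_url = "https://www.adhirainternationalschool.co.in/cbse-curriculum.php"
raw_href = "pdf/Comprehensive Information.pdf"  # assumed raw attribute value

# Naive concatenation treats the page itself as a directory.
naive = page_url.rstrip("/") + "/" + raw_href

# urljoin drops the last path segment of the base before appending.
joined = urljoin(page_url, raw_href)

print(naive)
print(joined)
```

The first value matches the link currently reported; the second matches the expected link.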

Is this reproducible?

Yes

Inputs Causing the Bug

Steps to Reproduce

Just run the given code snippet with:

python tests/test_DeepCrawler_relative_links.py --max-pages 1 --url https://www.adhirainternationalschool.co.in/cbse-curriculum.php

Code snippets

import argparse
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig, LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
import asyncio


async def main():
    # Parse command line arguments
    parser = argparse.ArgumentParser(description='Test iframe processing with output directory')
    parser.add_argument('--url', default="https://www.adhirainternationalschool.co.in/cbse-curriculum.php", help='URL to crawl')
    parser.add_argument('--max-pages', type=int, default=1, help='Maximum number of pages to crawl')
    args = parser.parse_args()
    
    print(f"Crawling URL: {args.url}")
    

    # Create a scorer
    keyword_scorer = KeywordRelevanceScorer(
        keywords=["school", "fee", "information", "student", "admission", "sports"],
        weight=0.7
    )

    # Configure the strategy
    best_strategy = BestFirstCrawlingStrategy(
        max_depth=2,
        include_external=False,
        url_scorer=keyword_scorer,
        max_pages=args.max_pages,            
    )

    config = CrawlerRunConfig(
        deep_crawl_strategy=best_strategy,
        scraping_strategy=LXMLWebScrapingStrategy(),
        cache_mode=CacheMode.BYPASS,
        stream=True,
        remove_forms=False,
        process_iframes=False,
        scan_full_page=True,
        wait_for_images=False,
        exclude_all_images=True,
        wait_until="domcontentloaded",
    )
    

    async with AsyncWebCrawler(verbose=True, headless=False) as crawler:
        async for result in await crawler.arun(args.url, config=config):        
            print(f"Links: {result.links['internal']}")

if __name__ == "__main__":
    asyncio.run(main())

OS

Linux PopOS

Python version

3.9.23

Browser

Chrome

Browser version

No response

Error logs & Screenshots (if applicable)

No response

Labels

🐞 Bug (Something isn't working), 🩺 Needs Triage (Needs attention of maintainers)
