[Bug]: Relative Urls in the webpage not extracted properly #570

@Sparshsing

crawl4ai version

0.4.247

Expected Behavior

When generating the Markdown for a given webpage:

  1. If the href in an anchor/img/link tag is a relative URL, it should be resolved against the base URL correctly (or left relative).
  2. If the href in an anchor/img/link tag is an absolute URL, it should not be combined with the base URL.
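
For reference, the two rules above match what the standard library's `urllib.parse.urljoin` does. This is only a sketch of the expected resolution, not crawl4ai's actual code; `resolve_href` and `BASE_URL` are names made up for this illustration.

```python
from urllib.parse import urljoin

# Base URL of the page being crawled (taken from this report).
BASE_URL = "https://docs.crawl4ai.com/"


def resolve_href(base_url: str, href: str) -> str:
    """Resolve href against base_url.

    Relative hrefs are joined onto the base; absolute hrefs are
    returned unchanged because they already carry a scheme and host.
    """
    return urljoin(base_url, href)


if __name__ == "__main__":
    for href in ["core/quickstart/", ".", "https://docs.crawl4ai.com/"]:
        print(f"{href!r} -> {resolve_href(BASE_URL, href)}")
```

With this behaviour, `core/quickstart/` becomes `https://docs.crawl4ai.com/core/quickstart/`, while an href that is already absolute comes back untouched.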

The code to extract the Markdown:

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url=url,
    )
    return result.markdown

This is the expected Markdown of the webpage "https://docs.crawl4ai.com/"

[Crawl4AI Documentation (v0.4.3b2)](https://docs.crawl4ai.com/)
  * [ Home ](https://docs.crawl4ai.com/)
  * [ Quick Start ](https://docs.crawl4ai.com/core/quickstart/)
  * [ Search ](https://docs.crawl4ai.com/#)


  * Home
  * Setup & Installation
    * [Installation](https://docs.crawl4ai.com/core/installation/)
    * [Docker Deployment](https://docs.crawl4ai.com/core/docker-deploymeny/)
  * [Quick Start](https://docs.crawl4ai.com/core/quickstart/)

Current Behavior

When generating the Markdown for a given webpage, relative URLs are not converted properly. Each relative URL is appended to the base URL as base_url/<relative_url>, with the literal angle brackets '<' and '>' left in the output.
Additionally, the base URL is prepended even when the href already contains an absolute URL.

This is the current Markdown of the webpage "https://docs.crawl4ai.com/"

[Crawl4AI Documentation (v0.4.3b2)](https://docs.crawl4ai.com/<https:/docs.crawl4ai.com/>)
  * [ Home ](https://docs.crawl4ai.com/<.>)
  * [ Quick Start ](https://docs.crawl4ai.com/<core/quickstart/>)
  * [ Search ](https://docs.crawl4ai.com/<#>)


  * Home
  * Setup & Installation
    * [Installation](https://docs.crawl4ai.com/<core/installation/>)
    * [Docker Deployment](https://docs.crawl4ai.com/<core/docker-deploymeny/>)
  * [Quick Start](https://docs.crawl4ai.com/<core/quickstart/>)
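
Until this is fixed, a possible workaround is to post-process the generated Markdown. The sketch below is based only on the output shown above, not on crawl4ai internals: it assumes every broken target looks like `](base<href>)`, and that an absolute href may have its scheme collapsed to a single slash (`https:/...`). `repair_links` and both regex names are hypothetical.

```python
import re
from urllib.parse import urljoin

# Matches link targets of the form "](<base><href>)", e.g.
# "](https://docs.crawl4ai.com/<core/quickstart/>)".
MALFORMED_LINK = re.compile(r"\]\((?P<base>[^()<>\s]*)<(?P<href>[^<>]*)>\)")

# Matches a scheme followed by a single slash, e.g. "https:/host"
# (assumption: this is how absolute hrefs appear in the broken output).
COLLAPSED_SCHEME = re.compile(r"^(https?):/(?!/)")


def repair_links(markdown: str) -> str:
    """Rewrite "](base<href>)" targets into properly resolved URLs."""

    def fix(match: re.Match) -> str:
        base = match.group("base")
        href = COLLAPSED_SCHEME.sub(r"\1://", match.group("href"))
        if href.startswith("#"):
            # Fragment-only hrefs are appended directly so the "#" is kept.
            resolved = base + href
        else:
            resolved = urljoin(base, href)
        return "](" + resolved + ")"

    return MALFORMED_LINK.sub(fix, markdown)


if __name__ == "__main__":
    broken = "[ Quick Start ](https://docs.crawl4ai.com/<core/quickstart/>)"
    print(repair_links(broken))
```

Links that are already well formed contain no angle brackets, so they do not match the pattern and pass through unchanged.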

Side note: the docs URL https://docs.crawl4ai.com/core/docker-deploymeny/ contains a spelling mistake ("deploymeny" instead of "deployment").

Is this reproducible?

Yes

Inputs Causing the Bug

URL: https://docs.crawl4ai.com/

Steps to Reproduce

1. Run the code snippet below for the URL mentioned above:
   python webpage_crawler.py https://docs.crawl4ai.com/ crawl4ai.md
2. Compare the generated Markdown with the raw HTML.
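
To make the comparison in step 2 less manual, the hrefs can be pulled out of the raw HTML and resolved against the page URL, giving a list of the URLs the Markdown should contain. This is a rough sketch using only the standard library; `LinkCollector` and `expected_links` are hypothetical helpers, and reading `src` for img tags is my assumption about which attribute matters there.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attribute that carries the URL for each tag of interest.
URL_ATTRIBUTE = {"a": "href", "link": "href", "img": "src"}


class LinkCollector(HTMLParser):
    """Collect (raw, resolved) URL pairs from a/link/img tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        wanted = URL_ATTRIBUTE.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Record the original value next to its resolved form.
                self.links.append((value, urljoin(self.base_url, value)))


def expected_links(html: str, base_url: str) -> list[tuple[str, str]]:
    """Return every link in html with the URL it should resolve to."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    sample = '<a href="core/quickstart/">Quick Start</a>'
    for raw, resolved in expected_links(sample, "https://docs.crawl4ai.com/"):
        print(f"{raw} -> {resolved}")
```

Any resolved URL from this list that is missing from the generated Markdown, or appears there wrapped in angle brackets, points to the bug described above.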

Code snippets

# filename: webpage_crawler.py

import asyncio
from crawl4ai import AsyncWebCrawler

async def get_markdown(url: str) -> str:
    async with AsyncWebCrawler() as crawler:
        try:
            result = await crawler.arun(
                url=url,
            )
            if result.url != url:
                print(f"Redirected to {result.url}")
            if not result.success:
                raise Exception(result.error_message)
            if result.status_code == 404:
                raise Exception("URL not found")
            return result.markdown
        except Exception:
            print("Crawler failed for", url)
            # Re-raise the original exception with its traceback intact.
            raise
        

async def get_cleaned_html(url: str) -> str:
    async with AsyncWebCrawler() as crawler:
        try:
            result = await crawler.arun(
                url=url,
            )
            if result.url != url:
                print(f"Redirected to {result.url}")
            if not result.success:
                raise Exception(result.error_message)
            if result.status_code == 404:
                raise Exception("URL not found")
            return result.cleaned_html
        except Exception:
            print("Crawler failed for", url)
            # Re-raise the original exception with its traceback intact.
            raise

if __name__ == "__main__":
    import sys
    
    if len(sys.argv) != 3:
        print("Usage: python webpage_crawler.py <url> <output_file>")
        sys.exit(1)
    
    url = sys.argv[1]
    output_file = sys.argv[2]
    
    markdown = asyncio.run(get_markdown(url))
    print(len(markdown))
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(markdown)

OS

Windows 10 (also observed on Linux)

Python version

3.11

Browser

Google Chrome

Browser version

131.0.6778.265

Error logs & Screenshots (if applicable)

No response

Labels

  * ⚙️ Under Test: Bug fix / Feature request that's under testing
  * ⚡ High: Priority - High
  * 🐞 Bug: Something isn't working
  * 💪 - Beginner: Difficulty level - Beginners
  * 📌 Root caused: identified the root cause of bug
