crawl4ai version
0.4.247
Expected Behavior
When generating the Markdown for a given webpage:
- if the href in an anchor/img/link tag is a relative URL, it should be combined with the base URL properly (or left relative)
- if the href in an anchor/img/link tag is an absolute URL, it should not be combined with the base URL
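The two rules above match standard URL reference resolution, so the expected result can be sketched with Python's `urllib.parse.urljoin`. This is only an illustration of the expected output, not crawl4ai's actual implementation; `resolve_href` is a hypothetical helper name used for this example.

```python
from urllib.parse import urljoin


def resolve_href(base_url: str, href: str) -> str:
    """Hypothetical helper: resolve an href against the page's base URL.

    urljoin leaves absolute URLs untouched and only combines
    relative references with the base URL.
    """
    return urljoin(base_url, href)


base = "https://docs.crawl4ai.com/"

# Relative href: combined with the base URL.
print(resolve_href(base, "core/quickstart/"))
# -> https://docs.crawl4ai.com/core/quickstart/

# Absolute href: returned unchanged.
print(resolve_href(base, "https://docs.crawl4ai.com/"))
# -> https://docs.crawl4ai.com/
```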
The code to extract the Markdown:
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url=url,
    )
    return result.markdown
This is the expected Markdown of the webpage "https://docs.crawl4ai.com/"
[Crawl4AI Documentation (v0.4.3b2)](https://docs.crawl4ai.com/)
* [ Home ](https://docs.crawl4ai.com/)
* [ Quick Start ](https://docs.crawl4ai.com/core/quickstart/)
* [ Search ](https://docs.crawl4ai.com/#)
* Home
* Setup & Installation
* [Installation](https://docs.crawl4ai.com/core/installation/)
* [Docker Deployment](https://docs.crawl4ai.com/core/docker-deploymeny/)
* [Quick Start](https://docs.crawl4ai.com/core/quickstart/)
Current Behavior
When generating the Markdown for a given webpage, relative URLs are not converted properly: each href is appended to the base URL as base_url/<relative_url>, with literal angle brackets ('<' and '>') around it.
Additionally, the href is combined with the base URL in the same way even when it already contains an absolute URL.
This is the current Markdown of the webpage "https://docs.crawl4ai.com/"
[Crawl4AI Documentation (v0.4.3b2)](https://docs.crawl4ai.com/<https:/docs.crawl4ai.com/>)
* [ Home ](https://docs.crawl4ai.com/<.>)
* [ Quick Start ](https://docs.crawl4ai.com/<core/quickstart/>)
* [ Search ](https://docs.crawl4ai.com/<#>)
* Home
* Setup & Installation
* [Installation](https://docs.crawl4ai.com/<core/installation/>)
* [Docker Deployment](https://docs.crawl4ai.com/<core/docker-deploymeny/>)
* [Quick Start](https://docs.crawl4ai.com/<core/quickstart/>)
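Until the conversion is fixed, the malformed links could be repaired in a post-processing step. The following is a minimal sketch, assuming every broken target has the shape `(prefix<href>)` seen in the output above and that absolute URLs inside the brackets have lost one slash (`https:/`). `repair_links` is a hypothetical name and is not part of crawl4ai.

```python
import re
from urllib.parse import urljoin

# Matches a Markdown link target of the form "](prefix<href>)".
# Assumption: the prefix and the href contain no spaces or parentheses.
_MALFORMED_TARGET = re.compile(r"\]\([^()<>\s]*<([^<>\s]*)>\)")


def repair_links(markdown: str, base_url: str) -> str:
    """Hypothetical workaround: rebuild link targets from the bracketed href."""

    def _fix(match: re.Match) -> str:
        href = match.group(1)
        # Restore the double slash that appears collapsed in "https:/host/...".
        href = re.sub(r"^(https?):/(?!/)", r"\1://", href)
        # Resolve relative hrefs; absolute hrefs pass through unchanged.
        return f"]({urljoin(base_url, href)})"

    return _MALFORMED_TARGET.sub(_fix, markdown)


broken = "* [ Quick Start ](https://docs.crawl4ai.com/<core/quickstart/>)"
print(repair_links(broken, "https://docs.crawl4ai.com/"))
# -> * [ Quick Start ](https://docs.crawl4ai.com/core/quickstart/)
```

Links that are already well formed do not match the pattern and are left untouched.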
Side note: the URL https://docs.crawl4ai.com/core/docker-deploymeny/ contains a spelling mistake ("deploymeny" instead of "deployment").
Is this reproducible?
Yes
Inputs Causing the Bug
URL: https://docs.crawl4ai.com/
Steps to Reproduce
1. Save the code snippet below as webpage_crawler.py and run it for the URL above:
python webpage_crawler.py https://docs.crawl4ai.com/ crawl4ai.md
2. Compare the generated Markdown with the raw HTML.
Code snippets
# filename: webpage_crawler.py
import asyncio

from crawl4ai import AsyncWebCrawler


async def get_markdown(url: str) -> str:
    """Crawl `url` and return the generated Markdown."""
    async with AsyncWebCrawler() as crawler:
        try:
            result = await crawler.arun(
                url=url,
            )
            if result.url != url:
                print(f"Redirected to {result.url}")
            if not result.success:
                raise Exception(result.error_message)
            if result.status_code == 404:
                raise Exception("url not found")
            return result.markdown
        except Exception:
            print("Crawler failed for", url)
            raise


async def get_cleaned_html(url: str) -> str:
    """Crawl `url` and return the cleaned HTML, for comparison with the Markdown."""
    async with AsyncWebCrawler() as crawler:
        try:
            result = await crawler.arun(
                url=url,
            )
            if result.url != url:
                print(f"Redirected to {result.url}")
            if not result.success:
                raise Exception(result.error_message)
            if result.status_code == 404:
                raise Exception("url not found")
            return result.cleaned_html
        except Exception:
            print("Crawler failed for", url)
            raise


if __name__ == "__main__":
    import sys

    if len(sys.argv) != 3:
        print("Usage: python webpage_crawler.py <url> <output_file>")
        sys.exit(1)
    url = sys.argv[1]
    output_file = sys.argv[2]
    markdown = asyncio.run(get_markdown(url))
    print(len(markdown))
    # Write the generated Markdown so it can be compared with the page's HTML.
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(markdown)
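The comparison in step 2 can be made less manual with a small standard-library check. The sketch below is an assumption about how one might do it, not part of the original script: `html_hrefs`, `markdown_targets`, and `suspicious_targets` are hypothetical helpers that list the hrefs in the HTML and flag Markdown link targets containing angle brackets.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def html_hrefs(html: str) -> list[str]:
    """Return the anchor hrefs exactly as written in the HTML."""
    collector = HrefCollector()
    collector.feed(html)
    return collector.hrefs


# Simplified pattern: a link target is everything between "](" and ")".
_LINK_TARGET = re.compile(r"\]\(([^)\s]*)\)")


def markdown_targets(markdown: str) -> list[str]:
    """Return every link target found in the Markdown."""
    return _LINK_TARGET.findall(markdown)


def suspicious_targets(markdown: str) -> list[str]:
    """Return link targets that contain literal angle brackets."""
    return [t for t in markdown_targets(markdown) if "<" in t or ">" in t]
```

Running `suspicious_targets` on the Markdown written by the script, next to `html_hrefs` on the output of `get_cleaned_html`, shows which original hrefs were wrapped in angle brackets.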
OS
Windows 10 (also observed on Linux)
Python version
3.11
Browser
Google Chrome
Browser version
131.0.6778.265
Error logs & Screenshots (if applicable)
No response