[Bug]: Incorrect crawled code format of import xxx #1181

@haoyang9804

Description

crawl4ai version

0.5.0.post2

Expected Behavior

When crawling code blocks from the Triton tutorial page https://triton-lang.org/main/getting-started/tutorials/01-vector-add.html#sphx-glr-getting-started-tutorials-01-vector-add-py, the space between `import` and the package name is omitted.

The webpage contains several import statements:

import torch

import triton
import triton.language as tl

The crawled results should contain exactly the same code snippet.

Current Behavior

The crawled import-related results are:

importtorch
importtriton
importtriton.languageastl
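
One possible explanation, offered only as an illustration and not as a confirmed diagnosis: syntax-highlighted pages often wrap each token in its own `<span>`, with the space between `import` and the module name sitting in a whitespace-only node. If an extractor drops such nodes, the tokens fuse. The sketch below uses only the Python standard library; the HTML fragment, the `TextCollector` class, and the `drop_blank_nodes` flag are hypothetical and are not taken from the Triton page or from crawl4ai.

```python
from html.parser import HTMLParser

# Hypothetical fragment modeled on typical highlighter output;
# the real page markup may differ.
HIGHLIGHTED = (
    '<pre><span class="kn">import</span><span class="w"> </span>'
    '<span class="nn">torch</span></pre>'
)


class TextCollector(HTMLParser):
    """Collects text nodes, optionally dropping whitespace-only ones."""

    def __init__(self, drop_blank_nodes: bool) -> None:
        super().__init__()
        self.drop_blank_nodes = drop_blank_nodes
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        if self.drop_blank_nodes and not data.strip():
            return  # dropping this node is what fuses the two tokens
        self.parts.append(data)


def extract_text(html: str, drop_blank_nodes: bool) -> str:
    """Return the concatenated text content of an HTML fragment."""
    parser = TextCollector(drop_blank_nodes)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(extract_text(HIGHLIGHTED, drop_blank_nodes=True))   # tokens fused
print(extract_text(HIGHLIGHTED, drop_blank_nodes=False))  # space preserved
```

If the real cause is along these lines, preserving whitespace-only text nodes inside `<pre>` and `<code>` elements would keep the statements intact.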

Is this reproducible?

Yes

Inputs Causing the Bug

A simple reproducible script:


import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Create an instance of AsyncWebCrawler
    async with AsyncWebCrawler() as crawler:
        # Run the crawler on a URL
        result = await crawler.arun(url="https://triton-lang.org/main/getting-started/tutorials/01-vector-add.html#sphx-glr-getting-started-tutorials-01-vector-add-py")

        # Print the extracted content
        print(result.markdown)

asyncio.run(main())

Steps to Reproduce

Run the script above. The printed Markdown contains the incorrect `import` lines.
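
To spot the symptom without reading the whole output, a small heuristic check can help. This is a sketch only: `find_fused_imports` is a hypothetical helper, not part of crawl4ai, and it assumes the crawl result has already been converted to a plain string. It can also flag legitimate lines that start with names such as `importlib`, so treat matches as hints.

```python
import re

# A line that starts with `import` or `from` immediately followed by an
# identifier character, i.e. with no separating space.
FUSED_IMPORT = re.compile(r"^(?:import|from)[A-Za-z_]\S*$")


def find_fused_imports(markdown: str) -> list[str]:
    """Return the lines that look like import statements with the space missing."""
    stripped_lines = (line.strip() for line in markdown.splitlines())
    return [line for line in stripped_lines if FUSED_IMPORT.match(line)]


broken = "importtorch\nimporttriton\nimporttriton.languageastl\n"
intact = "import torch\n\nimport triton\nimport triton.language as tl\n"

print(find_fused_imports(broken))
print(find_fused_imports(intact))
```

In the reproduction script, the same check could be applied to `str(result.markdown)`, assuming that attribute converts cleanly to text.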

Code snippets

OS

macOS

Python version

3.11.11

Browser

Chrome

Browser version

No response

Error logs & Screenshots (if applicable)

No response

Metadata

Labels

⚙ Done: Bug fix, enhancement, FR that's completed pending release
🐞 Bug: Something isn't working
📌 Root caused: identified the root cause of bug
