
Add crawl4ai docs functionality and update related files#85

Closed
CCwithAi wants to merge 0 commits into coleam00:main from CCwithAi:main

@CCwithAi


Add crawl4ai docs functionality and update related files

@CCwithAi

Author

Added some missing code, refined the prompt, and added a small example.

@CCwithAi

Author

Final issues resolved. You may have done this differently than I have. I was not sure whether to add this as a separate AI coder, but implementation would be difficult as we need to use both code sources. I asked the agent a fairly random query: "Can you provide me an advanced website agent using crawl4ai and pydantic.ai using async, headless browser, etc.?"

from __future__ import annotations as _annotations

import os
from dataclasses import dataclass
from typing import Any

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from pydantic_ai import Agent, RunContext
import logfire

logfire.configure(send_to_logfire='if-token-present')

@dataclass
class Deps:
    crawler: AsyncWebCrawler

advanced_crawler_agent = Agent(
    'example:advanced-web-crawler',
    system_prompt=("You are an advanced web crawler that can scrape data from specified URLs."
                   " Make sure to check for data extraction and logging."),
    deps_type=Deps
)

@advanced_crawler_agent.tool
async def crawl_url(ctx: RunContext[Deps], url: str) -> dict[str, Any]:
    """Crawl a specified URL and return extracted information.

    Args:
        ctx: The context.
        url: The URL to crawl.

    Returns:
        A dictionary containing extraction results.
    """
    run_config = CrawlerRunConfig(
        cache_mode='BYPASS',
        css_selector="article",
        word_count_threshold=10,
        screenshot=True
    )

    async with ctx.deps.crawler as crawler:
        result = await crawler.arun(url, config=run_config)

    return {
        "url": result.url,
        "html_length": len(result.cleaned_html),
        "screenshot": result.screenshot
    }

async def main():
    browser_cfg = BrowserConfig(
        browser_type="chromium",
        headless=True,
        verbose=True
    )
    deps = Deps(crawler=AsyncWebCrawler(config=browser_cfg))

    result = await advanced_crawler_agent.run("Visit https://example.com and scrape the content.", deps=deps)
    print("Crawled Result:", result.data)

if __name__ == '__main__':
    import asyncio
    asyncio.run(main())

Then I asked it to refine this so we can use the crawler run config and cache mode for crawl4ai.

Output code:

from __future__ import annotations

import os
import asyncio
from dataclasses import dataclass

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from pydantic_ai import Agent, RunContext

@dataclass
class Deps:
    crawler: AsyncWebCrawler

advanced_crawl_agent = Agent(
    'pydantic:generic-model',  # Specify your model
    deps_type=Deps,
    retries=2,
)

@advanced_crawl_agent.tool
async def crawl_webpage(ctx: RunContext[Deps], url: str) -> dict:
    """Crawl a webpage using specified URL.

    Args:
        ctx: The context containing dependencies.
        url: The target URL to crawl.

    Returns:
        dict: Result of crawling, including HTML content and status.
    """
    # Example configuration for CrawlerRunConfig
    crawler_run_config = CrawlerRunConfig(
        cache_mode=CacheMode.OPTIMIZE,  # Choose an appropriate cache mode
        css_selector='body',
    )

    results = await ctx.deps.crawler.arun(url, config=crawler_run_config)
    return {
        "url": results.url,
        "success": results.success,
        "html": results.cleaned_html,
        "error_message": results.error_message
    }

async def main():
    browser_config = BrowserConfig(
        browser_type="chromium",
        headless=True,
        verbose=False
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        deps = Deps(crawler=crawler)

        url = "https://example.com"  # Replace with the target URL
        result = await advanced_crawl_agent.run(url, deps=deps)
        print("Crawl Result:", result)

if __name__ == '__main__':
    asyncio.run(main())

@CCwithAi

Author

Then I asked: "Make me an advanced Crawl4AI and Pydantic AI agent implementation incorporating asynchronous crawling, headless browser, comprehensive site map crawling, and markdown conversion, using the Google Vertex AI Gemini model properly via the documentation provided."

The output was actually quite good, even though it's still not complete.

Folder & File Structure:

your_agent_project/
├── agent.py
├── agent_tools.py
├── agent_prompts.py
├── .env.example
└── requirements.txt

requirements.txt

pydantic-ai
crawl4ai
httpx
python-dotenv
google-auth
google-cloud-aiplatform
beautifulsoup4
markdownify

.env.example

# Google Vertex credentials, set up following Vertex AI guidelines.

GOOGLE_APPLICATION_CREDENTIALS=path/to/credentials.json

agent.py

from pydantic_ai import Agent, RunContext
from dataclasses import dataclass
from agent_tools import full_site_crawl_and_markdown
from dotenv import load_dotenv
import os

load_dotenv()

@dataclass
class Deps:
    google_credentials: str

agent = Agent(
    "gemini-1.5-flash",
    system_prompt="Scrape coding/AI websites by URL, download full site map HTML content and convert it to markdown. Utilize provided tools effectively.",
    deps_type=Deps,
    retries=2
)

@agent.tool
async def scrape_site_and_markdown(ctx: RunContext[Deps], url: str) -> dict:
    result = await full_site_crawl_and_markdown(url)
    return result

agent_tools.py

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
import asyncio
from bs4 import BeautifulSoup
from markdownify import markdownify as md
import httpx

async def fetch_sitemap(url):
    sitemap_url = url.rstrip('/') + "/sitemap.xml"
    async with httpx.AsyncClient() as client:
        response = await client.get(sitemap_url)
        response.raise_for_status()
        return response.text

async def parse_sitemap(xml_content):
    soup = BeautifulSoup(xml_content, "xml")
    urls = [loc.text for loc in soup.find_all("loc")]
    return urls

async def crawl_page(crawler, url):
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    result = await crawler.arun(url, config=config)
    return md(result.cleaned_html)

async def full_site_crawl_and_markdown(base_url):
    browser_cfg = BrowserConfig(browser_type="chromium", headless=True)

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        sitemap_xml = await fetch_sitemap(base_url)
        urls = await parse_sitemap(sitemap_xml)

        markdown_pages = {}

        tasks = [crawl_page(crawler, url) for url in urls]

        markdown_results = await asyncio.gather(*tasks)

        for page_url, markdown in zip(urls, markdown_results):
            markdown_pages[page_url] = markdown

        return markdown_pages
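
One thing worth noting about the generated `full_site_crawl_and_markdown`: it passes every sitemap URL to `asyncio.gather` at once, and a single failed page raises out of the whole crawl. The sketch below is one possible way to cap concurrency and collect per-page failures using only the standard library. `crawl_all`, `max_concurrency`, and `fake_crawl` are hypothetical names introduced here for illustration; `fake_crawl` stands in for `crawl_page(crawler, url)` so the example runs without a browser.

```python
import asyncio
from typing import Awaitable, Callable


async def crawl_all(
    urls: list[str],
    crawl_one: Callable[[str], Awaitable[str]],
    max_concurrency: int = 5,
) -> tuple[dict[str, str], dict[str, BaseException]]:
    """Run crawl_one over urls with at most max_concurrency calls in flight.

    Returns (pages, errors): successful results and raised exceptions, keyed by URL.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> str:
        # Wait for a free slot before starting the crawl.
        async with semaphore:
            return await crawl_one(url)

    # return_exceptions=True keeps one failing page from aborting the rest.
    outcomes = await asyncio.gather(
        *(guarded(url) for url in urls), return_exceptions=True
    )

    pages: dict[str, str] = {}
    errors: dict[str, BaseException] = {}
    for url, outcome in zip(urls, outcomes):
        if isinstance(outcome, BaseException):
            errors[url] = outcome
        else:
            pages[url] = outcome
    return pages, errors


async def fake_crawl(url: str) -> str:
    # Hypothetical stand-in for crawl_page(crawler, url); no network access.
    await asyncio.sleep(0)
    if "broken" in url:
        raise RuntimeError(f"failed to crawl {url}")
    return f"# Markdown for {url}"
```

In the agent's code this would presumably be called as `await crawl_all(urls, lambda u: crawl_page(crawler, u))` inside the `async with AsyncWebCrawler(...)` block, though I have not tested that against crawl4ai itself.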

agent_prompts.py

SYSTEM_PROMPT = """
You are an agent specialized in web crawling for coding and AI websites. Perform site crawls using the tools provided. Output markdown format for easy readability and archive purposes.
"""

Quick Run Example

Create a new file main_async.py:

import asyncio
import os
from agent import agent, Deps
from dotenv import load_dotenv

load_dotenv()

deps = Deps(google_credentials=os.getenv("GOOGLE_APPLICATION_CREDENTIALS"))

async def main():
    url = "https://example-ai-coding-site.com"
    result = await agent.run(f"Scrape and markdown the full site: {url}", deps=deps)
    print(result.data)

if __name__ == "__main__":
    asyncio.run(main())

Explanation and Setup Instructions

To run the agent:

Install dependencies:

pip install -r requirements.txt

Rename .env.example to .env and fill in the path to your actual Google service account credentials.

Execute script:

python main_async.py
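
Since `main_async.py` passes `os.getenv("GOOGLE_APPLICATION_CREDENTIALS")` straight into `Deps`, a missing or mistyped `.env` entry would only surface later as `None`. A small check like the following could fail earlier with a clearer message. This is a sketch only; `resolve_credentials_path` is a hypothetical helper, not part of the generated project.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_credentials_path(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the credentials file path from the environment, or raise if unusable."""
    env = os.environ if env is None else env
    raw = env.get("GOOGLE_APPLICATION_CREDENTIALS", "").strip()
    if not raw:
        raise RuntimeError(
            "GOOGLE_APPLICATION_CREDENTIALS is not set; "
            "rename .env.example to .env and fill it in"
        )
    path = Path(raw).expanduser()
    if not path.is_file():
        raise RuntimeError(f"Credentials file not found: {path}")
    return path
```

It would be called after `load_dotenv()`, for example `deps = Deps(google_credentials=str(resolve_credentials_path()))`.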

Agent Features:

Utilizes Google Vertex AI Model (gemini-1.5-flash) integration correctly with proper imports.
Asynchronously crawls entire website based on sitemap.
Converts downloaded HTML into Markdown for clarity and easy storage.
Implements failure recovery and logging implicitly via raised exceptions and clear structure.
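
On the "crawls entire website based on sitemap" point: under the sitemaps.org protocol, `/sitemap.xml` can be either a `<urlset>` of pages or a `<sitemapindex>` that points at further sitemap files, and the generated `parse_sitemap` treats every `<loc>` as a page. The sketch below shows one way to tell the two apart with the standard library's `xml.etree.ElementTree`. `extract_sitemap_urls` is a hypothetical name, and the code assumes the standard sitemap namespace is declared on the document.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol, in ElementTree's {uri}tag form.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_sitemap_urls(xml_content: str) -> tuple[list[str], list[str]]:
    """Return (page_urls, child_sitemap_urls) from a sitemap or sitemap index."""
    root = ET.fromstring(xml_content)
    # <urlset><url><loc>...</loc></url></urlset> lists crawlable pages.
    page_urls = [
        loc.text.strip()
        for loc in root.findall(f"{SITEMAP_NS}url/{SITEMAP_NS}loc")
        if loc.text
    ]
    # <sitemapindex><sitemap><loc>...</loc></sitemap></sitemapindex> lists more sitemaps.
    child_sitemaps = [
        loc.text.strip()
        for loc in root.findall(f"{SITEMAP_NS}sitemap/{SITEMAP_NS}loc")
        if loc.text
    ]
    return page_urls, child_sitemaps
```

A crawler built on this would fetch each child sitemap in turn and only hand the page URLs to the browser.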

Would you like me to further refine this implementation autonomously, or would you prefer to review it and provide feedback for specific updates?

@CCwithAi closed this Apr 1, 2025
POWERFULMOVES added a commit to POWERFULMOVES/PMOVES-Archon that referenced this pull request Feb 12, 2026
…ublish-method-and-remove-encodes

fix: publish events with string NATS subjects
coleam00 pushed a commit that referenced this pull request Apr 7, 2026
* Migrate from Node.js/npm/Jest to Bun runtime

- Replace npm with bun for package management (bun.lock)
- Replace Jest with bun:test for testing
- Update tsconfig for Bun (ESNext module, bundler resolution)
- Update Dockerfile to use oven/bun:1-slim
- Update CI workflow to use oven-sh/setup-bun@v2
- Remove dynamic import hack from codex.ts (direct ESM imports)
- Fix test mocking for Bun (export execFileAsync, use spyOn)
- Update all documentation (CLAUDE.md, README.md, CONTRIBUTING.md)

All 395 tests pass, type-check passes, E2E validated with curl.

* ci: retrigger CI build

* fix: make execFileAsync a function for better Bun mockability

* fix: ensure execFileAsync returns string not Buffer

* fix: rename _execFileAsync to comply with naming convention

* fix: make mkdirAsync mockable for Bun tests

* fix: update engines to bun>=1.0.0 and add mkdirAsync mock

* fix: pin Bun to 1.3.4 in CI to fix mock.module test failures

Newer Bun versions have different mock.module() behavior that causes
cross-test module pollution, resulting in 71 test failures in CI while
tests pass locally. Pinning to 1.3.4 ensures consistent behavior.

* fix: run orchestrator tests last to avoid mock.module pollution

Bun's mock.module() pollutes the global module cache, causing tests to
fail when orchestrator.test.ts (which mocks command-handler and factory)
runs before those modules' own test files.

Fix by running tests in two batches:
1. All tests except orchestrator
2. Orchestrator tests last

This ensures orchestrator's mocks don't affect other test files.
Tyone88 pushed a commit to Tyone88/Archon that referenced this pull request Apr 16, 2026
* Migrate from Node.js/npm/Jest to Bun runtime

Wirasm added a commit that referenced this pull request Apr 29, 2026
#1483)

* chore(deps): remove stale package-lock.json to clear Dependabot noise

This file was deleted in #85 (Bun migration) but accidentally re-committed
in #89 unrelated to that PR's actual fix. It hasn't been touched since
April and isn't used by anything (CI runs `bun install`), but Dependabot
keeps scanning it — every one of the 21 open alerts triaged in #1353 is
against this file, not bun.lock.

Removing it closes all 21 alerts. The axios `^1.15.0` override in
package.json stays — it's doing real work for the bun tree because
@slack/bolt pulls in a vulnerable axios transitively (CVE-2025-62718).

Add package-lock.json (and yarn/pnpm lockfiles) to .gitignore so this
can't silently slip back in.

Closes #1353

* chore(deps): patch four runtime CVEs in bun.lock via overrides

Targets #1353 alerts that resolve in the actual runtime tree (bun.lock),
not just the stale package-lock.json removed in the previous commit.

Added overrides:
- follow-redirects ^1.16.0 — auth-header leak on cross-domain redirect
  (GHSA-r4q5-vmmm-2653); via @slack/bolt
- path-to-regexp ^8.4.2 — DoS via sequential optional groups
  (CVE-2026-4926, CVE-2026-4923); via @slack/bolt + claude-agent-sdk
- qs ^6.15.1 — arrayLimit bypass DoS
  (CVE-2025-15284, CVE-2026-2391); via @slack/bolt
- flatted ^3.4.2 — prototype pollution in parse()
  (CVE-2026-33228); dev-only via eslint chain

bun audit confirms each resolves to a single non-vulnerable version
across the tree. bun run validate green. No code changes — purely
transitive bumps; we don't import any of these directly.

Skipped (require deeper triage): undici, lodash, picomatch — each has
multiple major versions resolved in the bun tree, so a single override
would force-downgrade other consumers.