Add crawl4ai docs functionality and update related files by CCwithAi · Pull Request #85 · coleam00/Archon

CCwithAi · 2025-03-23T22:55:47Z

Add crawl4ai docs functionality and update related files

CCwithAi · 2025-03-24T02:46:37Z

Added some missing code, refined the prompt, and added a small example.

CCwithAi · 2025-03-24T03:54:53Z

Final issues resolved. You may have done this differently than I have. I was not sure whether to add this as a separate AI coder, but the implementation would be difficult because we need to use both code sources. I asked the agent a very random query, "can you provide me an advanced website agent using crawl4ai and pydantic.ai using async, headless browser etc", and it output:

from __future__ import annotations as _annotations

import os
from dataclasses import dataclass
from typing import Any

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig
from pydantic_ai import Agent, RunContext
import logfire

logfire.configure(send_to_logfire='if-token-present')


@dataclass
class Deps:
    crawler: AsyncWebCrawler


advanced_crawler_agent = Agent(
    'example:advanced-web-crawler',  # placeholder model name; replace with a real one
    system_prompt=("You are an advanced web crawler that can scrape data from specified URLs."
                   " Make sure to check for data extraction and logging."),
    deps_type=Deps
)


@advanced_crawler_agent.tool
async def crawl_url(ctx: RunContext[Deps], url: str) -> dict[str, Any]:
    """Crawl a specified URL and return extracted information.

    Args:
        ctx: The context.
        url: The URL to crawl.

    Returns:
        A dictionary containing extraction results.
    """
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,  # enum member rather than the string 'BYPASS'
        css_selector="article",
        word_count_threshold=10,
        screenshot=True
    )

    # The crawler is started and closed around this single call.
    async with ctx.deps.crawler as crawler:
        result = await crawler.arun(url, config=run_config)

    return {
        "url": result.url,
        "html_length": len(result.cleaned_html),
        "screenshot": result.screenshot
    }


async def main():
    browser_cfg = BrowserConfig(
        browser_type="chromium",
        headless=True,
        verbose=True
    )
    deps = Deps(crawler=AsyncWebCrawler(config=browser_cfg))

    result = await advanced_crawler_agent.run("Visit https://example.com and scrape the content.", deps=deps)
    print("Crawled Result:", result.data)


if __name__ == '__main__':
    import asyncio
    asyncio.run(main())

Then I asked it to refine this so we can use the crawler run config and cache mode for crawl4ai.

Output code:

from __future__ import annotations

import os
import asyncio
from dataclasses import dataclass

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from pydantic_ai import Agent, RunContext


@dataclass
class Deps:
    crawler: AsyncWebCrawler


advanced_crawl_agent = Agent(
    'pydantic:generic-model',  # Specify your model
    deps_type=Deps,
    retries=2,
)


@advanced_crawl_agent.tool
async def crawl_webpage(ctx: RunContext[Deps], url: str) -> dict:
    """Crawl a webpage using specified URL.

    Args:
        ctx: The context containing dependencies.
        url: The target URL to crawl.

    Returns:
        dict: Result of crawling, including HTML content and status.
    """
    # Example configuration for CrawlerRunConfig
    crawler_run_config = CrawlerRunConfig(
        # Use a member that CacheMode defines (ENABLED, DISABLED, READ_ONLY, WRITE_ONLY, BYPASS)
        cache_mode=CacheMode.ENABLED,
        css_selector='body',
    )

    results = await ctx.deps.crawler.arun(url, config=crawler_run_config)
    return {
        "url": results.url,
        "success": results.success,
        "html": results.cleaned_html,
        "error_message": results.error_message
    }


async def main():
    browser_config = BrowserConfig(
        browser_type="chromium",
        headless=True,
        verbose=False
    )

    # The crawler stays open for the whole agent run and is shared through Deps.
    async with AsyncWebCrawler(config=browser_config) as crawler:
        deps = Deps(crawler=crawler)

        url = "https://example.com"  # Replace with the target URL
        result = await advanced_crawl_agent.run(url, deps=deps)
        print("Crawl Result:", result)


if __name__ == '__main__':
    asyncio.run(main())

CCwithAi · 2025-03-24T04:07:26Z

Then I asked: "make me an advanced Crawl4AI and Pydantic AI agent implementation incorporating asynchronous crawling, headless browser, comprehensive site map crawling, and markdown conversion, using the Google Vertex AI Gemini model properly via the documentation provided."

The output was actually quite good, even though it's still not complete.

Folder & File Structure:

your_agent_project/
├── agent.py
├── agent_tools.py
├── agent_prompts.py
├── .env.example
└── requirements.txt

requirements.txt

pydantic-ai
crawl4ai
httpx
python-dotenv
google-auth
google-cloud-aiplatform
beautifulsoup4
markdownify

.env.example

# Google Vertex credentials, set up following Vertex AI guidelines.
GOOGLE_APPLICATION_CREDENTIALS=path/to/credentials.json

agent.py

from pydantic_ai import Agent, RunContext
from dataclasses import dataclass
from agent_tools import full_site_crawl_and_markdown
from dotenv import load_dotenv
import os

load_dotenv()


@dataclass
class Deps:
    google_credentials: str


agent = Agent(
    "google-vertex:gemini-1.5-flash",  # provider prefix routes the model through Vertex AI
    system_prompt="Scrape coding/AI websites by URL, download full site map HTML content and convert it to markdown. Utilize provided tools effectively.",
    deps_type=Deps,
    retries=2
)


@agent.tool
async def scrape_site_and_markdown(ctx: RunContext[Deps], url: str) -> dict:
    # Returns a mapping of page URL to markdown for every URL in the sitemap.
    result = await full_site_crawl_and_markdown(url)
    return result

agent_tools.py

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
import asyncio
from bs4 import BeautifulSoup
from markdownify import markdownify as md
import httpx


async def fetch_sitemap(url):
    # Assumes the sitemap lives at <base_url>/sitemap.xml.
    sitemap_url = url.rstrip('/') + "/sitemap.xml"
    async with httpx.AsyncClient() as client:
        response = await client.get(sitemap_url)
        response.raise_for_status()
        return response.text


async def parse_sitemap(xml_content):
    soup = BeautifulSoup(xml_content, "xml")
    urls = [loc.text for loc in soup.find_all("loc")]
    return urls


async def crawl_page(crawler, url):
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    result = await crawler.arun(url, config=config)
    return md(result.cleaned_html)


async def full_site_crawl_and_markdown(base_url):
    browser_cfg = BrowserConfig(browser_type="chromium", headless=True)

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        sitemap_xml = await fetch_sitemap(base_url)
        urls = await parse_sitemap(sitemap_xml)

        markdown_pages = {}

        # One crawl task per sitemap URL, all sharing the same crawler.
        tasks = [crawl_page(crawler, url) for url in urls]

        markdown_results = await asyncio.gather(*tasks)

        for page_url, markdown in zip(urls, markdown_results):
            markdown_pages[page_url] = markdown

        return markdown_pages
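
A note on parse_sitemap above: it collects every <loc> element. Under the sitemaps.org protocol, /sitemap.xml may be a sitemap index whose <loc> entries point at further sitemap files rather than pages, so those would be handed to the crawler as if they were pages. A minimal sketch of separating the two with only the standard library follows; extract_sitemap_urls is a hypothetical helper and not part of the agent's output.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_sitemap_urls(xml_content: str) -> tuple[list[str], list[str]]:
    """Return (page_urls, nested_sitemap_urls) from one sitemap document.

    Hypothetical helper: <url> entries are pages, <sitemap> entries are
    further sitemap files that would need to be fetched and parsed in turn.
    """
    root = ET.fromstring(xml_content)
    pages: list[str] = []
    nested: list[str] = []
    for entry in root:
        loc = entry.find(f"{NS}loc")
        if loc is None or not (loc.text or "").strip():
            continue  # skip entries without a usable <loc>
        url = loc.text.strip()
        if entry.tag == f"{NS}sitemap":
            nested.append(url)
        elif entry.tag == f"{NS}url":
            pages.append(url)
    return pages, nested
```

The nested list could be fed back through fetch and parse before crawling, which is left out here to keep the sketch short.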

agent_prompts.py

SYSTEM_PROMPT = """
You are an agent specialized in web crawling for coding and AI websites. Perform site crawls using the tools provided. Output markdown format for easy readability and archive purposes.
"""

Quick Run Example

Create a new file main_async.py:

import asyncio
import os
from agent import agent, Deps
from dotenv import load_dotenv

load_dotenv()

deps = Deps(google_credentials=os.getenv("GOOGLE_APPLICATION_CREDENTIALS"))


async def main():
    url = "https://example-ai-coding-site.com"
    result = await agent.run(f"Scrape and markdown the full site: {url}", deps=deps)
    print(result.data)


if __name__ == "__main__":
    asyncio.run(main())

Explanation and Setup Instructions

To run the agent:

Install dependencies:

pip install -r requirements.txt

Copy .env.example to .env and fill in the path to your actual Google service account credentials file.

Execute script:

python main_async.py

Agent Features:

Utilizes Google Vertex AI Model (gemini-1.5-flash) integration correctly with proper imports.
Asynchronously crawls entire website based on sitemap.
Converts downloaded HTML into Markdown for clarity and easy storage.
Implements failure recovery and logging implicitly via raised exceptions and clear structure.
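
On the failure recovery point: in agent_tools.py, asyncio.gather is called without return_exceptions, so an exception raised by any single crawl_page call propagates out of full_site_crawl_and_markdown and the results for the other pages are not returned. The sketch below shows one way to isolate per-page failures and cap concurrency using only asyncio. crawl_all, its fetch parameter, and max_concurrency are hypothetical names introduced for illustration; fetch stands in for something like crawl_page bound to a crawler.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def crawl_all(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 5,
) -> tuple[dict[str, str], dict[str, str]]:
    """Run fetch(url) for every URL, at most max_concurrency at a time.

    Returns (pages, errors): results keyed by URL for the calls that
    succeeded, and error messages keyed by URL for the calls that raised.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> str:
        # The semaphore bounds how many fetches are in flight at once.
        async with semaphore:
            return await fetch(url)

    # With return_exceptions=True a failing page does not abort the batch;
    # its exception is returned in place of a result.
    results = await asyncio.gather(
        *(guarded(url) for url in urls), return_exceptions=True
    )

    pages: dict[str, str] = {}
    errors: dict[str, str] = {}
    for url, result in zip(urls, results):
        if isinstance(result, BaseException):
            errors[url] = f"{type(result).__name__}: {result}"
        else:
            pages[url] = result
    return pages, errors
```

Returning the errors alongside the pages lets the tool report partial results to the model instead of failing the whole site crawl.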

Would you like me to further refine this implementation autonomously, or would you prefer to review it and provide feedback for specific updates?

CCwithAi closed this Apr 1, 2025

POWERFULMOVES added a commit to POWERFULMOVES/PMOVES-Archon that referenced this pull request Feb 12, 2026

Merge pull request coleam00#85 from POWERFULMOVES/codex/update-nats-p…

f8dc7ae

…ublish-method-and-remove-encodes fix: publish events with string NATS subjects

Wirasm mentioned this pull request Apr 29, 2026

chore(deps): remove stale package-lock.json to clear Dependabot alerts #1483

Merged

CCwithAi wants to merge 0 commits into coleam00:main from CCwithAi:main
