Summary
Implement automatic discovery and parsing of llms.txt, sitemap.xml, and related files to enhance archons/crawl4ai's capabilities for AI-driven content consumption and comprehensive site crawling.
Problem Statement
As AI-driven content consumption becomes standard practice, crawlers need to automatically discover and use the specialized files that help LLMs understand websites. Currently, crawl4ai requires these files to be specified manually, so it misses structured information that could improve crawling efficiency and content extraction quality.
Proposed Solution
Core Discovery Features
1. File Types to Discover
DISCOVERY_TARGETS = {
    'llm_files': [
        '/llms.txt',           # Primary LLM documentation
        '/llms-full.txt',      # Comprehensive version
        '/llms.md',            # Markdown variant
        '/llms-ctx.txt',       # Context-optimized version
    ],
    'sitemap_files': [
        '/sitemap.xml',
        '/sitemap_index.xml',
        '/sitemap-*.xml',      # Numbered/dated variants
        '/sitemaps/*.xml',     # Subdirectory patterns
    ],
    'metadata_files': [
        '/robots.txt',         # Contains sitemap references
        '/.well-known/*',      # RFC 8615 directory
        '/humans.txt',
        '/security.txt',
    ],
}
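As a rough illustration of how these targets might be turned into probe URLs, the sketch below resolves each exact path against a site's base URL and skips wildcard entries, which cannot be requested directly. The function name `candidate_urls` is hypothetical and not part of crawl4ai.

```python
from urllib.parse import urljoin


def candidate_urls(base_url: str, targets: dict[str, list[str]]) -> list[str]:
    """Return absolute URLs for every non-wildcard path in `targets`.

    Hypothetical helper for illustration only; not an existing crawl4ai API.
    """
    urls = []
    for paths in targets.values():
        for path in paths:
            if "*" in path:
                # Wildcard patterns cannot be fetched as-is; they would have
                # to be matched against URLs found elsewhere (for example in
                # robots.txt or a sitemap index).
                continue
            # Paths start with "/", so urljoin resolves them from the site root.
            urls.append(urljoin(base_url, path))
    return urls


if __name__ == "__main__":
    sample = {"llm_files": ["/llms.txt"], "sitemap_files": ["/sitemap-*.xml"]}
    print(candidate_urls("https://example.com/docs/", sample))
```

Wildcard entries are left unexpanded here on purpose: one option would be to treat them as filters over URLs discovered by the other methods rather than as paths to request.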
2. Discovery Methods
Priority Order:
- Parse robots.txt for Sitemap directives
- Check standard URL patterns (root directory)
- Parse HTML meta tags and link elements
- Check .well-known directory
- Try common variations with wildcards
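The first step in this order, reading Sitemap directives from robots.txt, could look roughly like the sketch below. It assumes the robots.txt body has already been fetched as text; `extract_sitemaps` is a hypothetical name, not an existing crawl4ai function.

```python
def extract_sitemaps(robots_txt: str) -> list[str]:
    """Collect the URLs named by Sitemap directives in a robots.txt body.

    Hypothetical helper for illustration only; not an existing crawl4ai API.
    """
    sitemaps = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments. This would also cut a URL containing "#",
        # which is assumed not to occur in sitemap locations.
        line = raw_line.split("#", 1)[0].strip()
        # Split on the first colon only, so the "https:" in the value survives.
        key, sep, value = line.partition(":")
        value = value.strip()
        # Directive names are matched case-insensitively.
        if sep and key.strip().lower() == "sitemap" and value:
            sitemaps.append(value)
    return sitemaps


if __name__ == "__main__":
    body = "User-agent: *\nDisallow: /private\nSitemap: https://example.com/sitemap.xml\n"
    print(extract_sitemaps(body))
```

Python's standard library also exposes `urllib.robotparser.RobotFileParser.site_maps()` (Python 3.8+), which may be preferable when the crawler already uses that parser for allow/disallow rules.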