Automatic discovery of llms.txt or sitemap.xml #430

@leex279

Summary

Implement automatic discovery and parsing of llms.txt, sitemap.xml, and related files so that Archon/crawl4ai can consume AI-oriented content and crawl sites more comprehensively.

Problem Statement

As AI-driven content consumption becomes standard practice, crawlers need to automatically discover and use the specialized files that sites publish to help LLMs understand them. Currently, crawl4ai requires these files to be specified manually, so it misses structured information that could improve crawling efficiency and content extraction quality.

Proposed Solution

Core Discovery Features

1. File Types to Discover

DISCOVERY_TARGETS = {
    'llm_files': [
        '/llms.txt',           # Primary LLM documentation
        '/llms-full.txt',      # Comprehensive version
        '/llms.md',            # Markdown variant
        '/llms-ctx.txt',       # Context-optimized version
    ],
    'sitemap_files': [
        '/sitemap.xml',
        '/sitemap_index.xml',
        '/sitemap-*.xml',      # Numbered/dated variants
        '/sitemaps/*.xml',     # Subdirectory patterns
    ],
    'metadata_files': [
        '/robots.txt',         # Contains sitemap references
        '/.well-known/*',      # RFC 8615 directory
        '/humans.txt',
        '/security.txt',       # Legacy location; RFC 9116 prefers /.well-known/security.txt
    ]
}
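
A minimal sketch of how these targets might be turned into concrete URLs to probe. It assumes wildcard entries are skipped, since a pattern such as `/sitemap-*.xml` cannot be requested directly over HTTP. The helper name `candidate_urls` is hypothetical and not part of crawl4ai, and only a subset of the targets is repeated to keep the example self-contained.

```python
from urllib.parse import urljoin

# Subset of DISCOVERY_TARGETS from the proposal, repeated for self-containment.
DISCOVERY_TARGETS = {
    'llm_files': ['/llms.txt', '/llms-full.txt'],
    'sitemap_files': ['/sitemap.xml', '/sitemap-*.xml'],
}

def candidate_urls(base_url, targets=DISCOVERY_TARGETS):
    """Hypothetical helper: expand literal target paths into absolute URLs."""
    urls = []
    for paths in targets.values():
        for path in paths:
            if '*' in path:
                # Wildcards need another source (e.g. a sitemap index) to resolve.
                continue
            # A leading '/' makes urljoin resolve against the site root.
            urls.append(urljoin(base_url, path))
    return urls
```

Resolving against the site root means the same candidates are produced no matter which page of the site the crawl started from.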

2. Discovery Methods

Priority Order:

  1. Parse robots.txt for Sitemap directives
  2. Check standard URL patterns (root directory)
  3. Parse HTML meta tags and link elements
  4. Check .well-known directory
  5. Try common filename variations (wildcard patterns must be expanded into concrete candidate URLs first)
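
The first two steps of this order could be sketched roughly as follows, assuming the robots.txt body has already been fetched as text. The function names `sitemaps_from_robots` and `discovery_order` are hypothetical; the sketch relies on the standard library's `RobotFileParser.site_maps()` (Python 3.8+) and is not a description of how crawl4ai does it.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def sitemaps_from_robots(robots_txt):
    """Return the URLs listed in 'Sitemap:' directives of a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap directive is present.
    return parser.site_maps() or []

def discovery_order(base_url, robots_txt, standard_paths=('/llms.txt', '/sitemap.xml')):
    """Hypothetical ordering: robots.txt sitemaps first, then standard root paths."""
    ordered = list(sitemaps_from_robots(robots_txt))   # step 1
    for path in standard_paths:                        # step 2
        url = urljoin(base_url, path)
        if url not in ordered:                         # avoid probing a URL twice
            ordered.append(url)
    return ordered
```

Putting the robots.txt results first reflects the stated priority: a declared sitemap is an explicit signal from the site, while the standard paths are only guesses.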

Status
Done (In Stable)
