feat: Advanced Web Crawling with Domain Configuration #546

@leex279

Description

📋 Feature Request

Add advanced web crawling capabilities with comprehensive domain filtering and pattern matching configuration options.

🎯 Problem

Users need more control over web crawling to:

  • Filter crawling to specific domains or subdomains
  • Exclude unwanted domains from crawl scope
  • Use URL pattern matching for inclusion/exclusion
  • Configure advanced crawling behavior per website
  • Handle complex multi-domain documentation sites
  • Optimize crawling efficiency and relevance

✨ Proposed Solution

Advanced Domain Configuration UI

  • Allowed Domains - Whitelist specific domains for crawling
  • Excluded Domains - Blacklist domains to skip during crawl
  • Include Patterns - URL patterns to include (regex support)
  • Exclude Patterns - URL patterns to exclude (regex support)
  • Domain Management - Add/remove domains with validation
  • Pattern Testing - Validate patterns before crawling
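
The pattern-testing step could look roughly like the sketch below. It assumes patterns are Python regular expressions matched with search semantics; the issue only says "regex support", so the dialect and the helper names `validate_patterns` and `preview_pattern` are hypothetical.

```python
import re


def validate_patterns(patterns):
    """Split patterns into (valid, errors); errors maps pattern -> message."""
    valid, errors = [], {}
    for pattern in patterns:
        try:
            re.compile(pattern)
            valid.append(pattern)
        except re.error as exc:
            # Keep the compiler's message so the UI can show it next to the input.
            errors[pattern] = str(exc)
    return valid, errors


def preview_pattern(pattern, sample_urls):
    """Return the sample URLs that the pattern would match anywhere in the URL."""
    compiled = re.compile(pattern)
    return [url for url in sample_urls if compiled.search(url)]


valid, errors = validate_patterns([r"/docs/.*", r"[unclosed"])
matches = preview_pattern(
    r"/docs/",
    ["https://example.com/docs/a", "https://example.com/blog/b"],
)
```

Running the preview against a handful of known URLs before the crawl starts gives the user a quick check that a pattern is neither too broad nor too narrow.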

Enhanced Crawling API

  • CrawlConfig Interface - Structured domain filtering configuration
  • Server-side Filtering - Backend validation and application of filters
  • Pattern Matching - Support for complex URL pattern rules
  • Performance Optimization - Skip filtered content during crawl process

Technical Requirements

  • API: Enhanced /api/knowledge-items/crawl-v2 with crawl_config parameter
  • Frontend: Collapsible advanced configuration panel
  • Backend: Domain filtering integration in crawling service
  • Validation: URL and pattern validation before crawl starts
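
A request to the enhanced endpoint might be built as in the sketch below. Only the path `/api/knowledge-items/crawl-v2`, the `crawl_config` parameter, and its four field names come from this issue; the base URL, the `url` field, and the HTTP method are assumptions for illustration. The request is constructed but not sent.

```python
import json
import urllib.request

# Hypothetical base URL; the issue does not say where the API is served.
BASE_URL = "http://localhost:8000"


def build_crawl_request(start_url, crawl_config):
    """Build (but do not send) a POST request carrying crawl_config."""
    body = {
        "url": start_url,  # assumed field name for the crawl entry point
        "crawl_config": crawl_config,
    }
    return urllib.request.Request(
        BASE_URL + "/api/knowledge-items/crawl-v2",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_crawl_request(
    "https://docs.example.com/",
    {
        "allowed_domains": ["docs.example.com"],
        "excluded_domains": [],
        "include_patterns": [r"/guide/"],
        "exclude_patterns": [r"\.pdf$"],
    },
)
```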

🔧 Implementation Features

Domain Configuration Interface

  • Expandable configuration panel in crawl modal
  • Domain input with real-time validation
  • Pattern input with regex support
  • Clear visual feedback for added rules
  • Remove/edit capability for existing rules
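
Domain input validation could follow the sketch below, shown in Python for consistency with the other examples even though the panel itself lives in the frontend. The rules are assumptions, not part of the issue: pasted URLs are reduced to their hostname, a domain needs at least two labels, and labels follow the usual letters, digits, and hyphens convention. `normalize_domain` is a hypothetical helper name.

```python
import re

# One DNS label: 1-63 chars of a-z, 0-9, or "-", not starting or ending with "-".
_LABEL = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)")


def normalize_domain(raw):
    """Return a lowercased bare hostname, or raise ValueError if it is invalid."""
    value = raw.strip().lower()
    # Accept pasted URLs by dropping the scheme, then any path and port.
    value = re.sub(r"^[a-z][a-z0-9+.-]*://", "", value)
    value = value.split("/", 1)[0].split(":", 1)[0].rstrip(".")
    labels = value.split(".")
    if (
        len(value) > 253
        or len(labels) < 2
        or not all(_LABEL.fullmatch(label) for label in labels)
    ):
        raise ValueError(f"invalid domain: {raw!r}")
    return value


def add_domain(domains, raw):
    """Return a new list with the normalized domain appended, skipping duplicates."""
    domain = normalize_domain(raw)
    return domains if domain in domains else domains + [domain]
```

Normalizing before storing means the same domain typed as `Example.com` and `https://example.com/` ends up as one rule rather than two.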

Crawling Engine Integration

  • CrawlConfig passed to crawling service
  • Domain filtering applied during URL discovery
  • Pattern matching for fine-grained control
  • Optimized crawling with early filtering
  • Progress reporting with filtering status
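
Early filtering during URL discovery might be sketched as follows. The issue does not define how the four lists interact, so the precedence here is an assumption: exclusions win over inclusions, an empty allow or include list means "no restriction", and a listed domain also covers its subdomains. The function names are hypothetical.

```python
import re
from urllib.parse import urlsplit


def domain_matches(host, domain):
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def should_crawl(url, config):
    """Decide whether a discovered URL passes the crawl_config filters."""
    host = (urlsplit(url).hostname or "").lower()
    allowed = config.get("allowed_domains") or []
    excluded = config.get("excluded_domains") or []
    include = config.get("include_patterns") or []
    exclude = config.get("exclude_patterns") or []

    if any(domain_matches(host, d) for d in excluded):
        return False
    if allowed and not any(domain_matches(host, d) for d in allowed):
        return False
    if any(re.search(p, url) for p in exclude):
        return False
    if include and not any(re.search(p, url) for p in include):
        return False
    return True


def filter_discovered(urls, config):
    """Return (kept_urls, skipped_count) so progress reports can show both."""
    kept = [u for u in urls if should_crawl(u, config)]
    return kept, len(urls) - len(kept)
```

Applying the check before a URL is queued, rather than after the page is fetched, is what lets filtered content be skipped during the crawl, and the skipped count gives the progress report something to display.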

Configuration Options

  • allowed_domains: Array of allowed domain strings
  • excluded_domains: Array of domains to skip
  • include_patterns: Array of URL patterns to include
  • exclude_patterns: Array of URL patterns to exclude
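
On the backend, the four options could be held in a structure like the one below. The field names are taken from the list above; the class layout, the empty-list defaults, and the `from_dict` helper are assumptions rather than the project's actual `CrawlConfig` definition.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class CrawlConfig:
    """Domain and pattern filters for one crawl; empty lists mean no filter."""

    allowed_domains: List[str] = field(default_factory=list)
    excluded_domains: List[str] = field(default_factory=list)
    include_patterns: List[str] = field(default_factory=list)
    exclude_patterns: List[str] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data):
        """Build a config from a parsed JSON object, rejecting unknown keys."""
        unknown = set(data) - set(cls.__dataclass_fields__)
        if unknown:
            raise ValueError(f"unknown crawl_config keys: {sorted(unknown)}")
        return cls(**{key: list(value) for key, value in data.items()})


config = CrawlConfig.from_dict({"allowed_domains": ["example.com"]})
as_json_ready = asdict(config)
```

Rejecting unknown keys up front surfaces typos such as `allow_domains` as an error instead of silently running an unfiltered crawl.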

✅ Acceptance Criteria

  • Advanced configuration panel toggles correctly
  • Domain input validates and adds domains to lists
  • Pattern input supports regex validation
  • Crawling respects domain filtering configuration
  • URL patterns work correctly (include/exclude)
  • Configuration persists during crawl session
  • Clear visual feedback for applied filters
  • Measurable performance improvement over unfiltered crawling (filtered URLs are skipped rather than fetched)
  • Error handling for invalid patterns/domains
  • Integration with existing crawling workflow

🔗 Related

This feature extends the core crawling functionality with advanced configuration options for better control and efficiency.


This issue tracks advanced crawling configuration only. Document browsing and upload features are handled separately.

Status
Done (In Stable)
