📋 Feature Request
Add advanced web crawling configuration: domain filtering (allowed and excluded domains) and URL pattern matching (include and exclude patterns).
🎯 Problem
Users need more control over web crawling to:
- Filter crawling to specific domains or subdomains
- Exclude unwanted domains from crawl scope
- Use URL pattern matching for inclusion/exclusion
- Configure advanced crawling behavior per website
- Handle complex multi-domain documentation sites
- Optimize crawling efficiency and relevance
✨ Proposed Solution
Advanced Domain Configuration UI
- Allowed Domains - Whitelist specific domains for crawling
- Excluded Domains - Blacklist domains to skip during crawl
- Include Patterns - URL patterns to include (regex support)
- Exclude Patterns - URL patterns to exclude (regex support)
- Domain Management - Add/remove domains with validation
- Pattern Testing - Validate patterns before crawling
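
As a rough illustration of the Pattern Testing item, the sketch below checks that each pattern compiles as a regular expression and previews which sample URLs it would match. The function names `find_invalid_patterns` and `preview_matches` are hypothetical, and it assumes patterns are Python-style regexes matched anywhere in the URL; the actual regex dialect and matching semantics are not specified in this issue.

```python
import re
from typing import Dict, List


def find_invalid_patterns(patterns: List[str]) -> Dict[str, str]:
    """Return a mapping of each invalid regex pattern to its error message."""
    errors: Dict[str, str] = {}
    for pattern in patterns:
        try:
            re.compile(pattern)
        except re.error as exc:
            # Keep the message so the UI could show it next to the input field.
            errors[pattern] = str(exc)
    return errors


def preview_matches(pattern: str, sample_urls: List[str]) -> List[str]:
    """Return the sample URLs that the pattern matches anywhere in the URL."""
    compiled = re.compile(pattern)
    return [url for url in sample_urls if compiled.search(url)]
```

A panel could call `find_invalid_patterns` on every keystroke and only enable the crawl button when the result is empty.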
Enhanced Crawling API
- CrawlConfig Interface - Structured domain filtering configuration
- Server-side Filtering - Backend validation and application of filters
- Pattern Matching - Support for complex URL pattern rules
- Performance Optimization - Skip filtered content during crawl process
Technical Requirements
- API: Extend /api/knowledge-items/crawl-v2 to accept a crawl_config parameter
- Frontend: Collapsible advanced configuration panel
- Backend: Domain filtering integration in crawling service
- Validation: URL and pattern validation before crawl starts
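
To make the API requirement concrete, here is one possible shape for a request body sent to /api/knowledge-items/crawl-v2. Only the `crawl_config` parameter and its four keys come from this issue; the top-level `url` field and the helper name `build_crawl_request` are assumptions for illustration, and the real endpoint may expect a different structure.

```python
import json
from typing import Any, Dict, List, Optional


def build_crawl_request(
    url: str,
    allowed_domains: Optional[List[str]] = None,
    excluded_domains: Optional[List[str]] = None,
    include_patterns: Optional[List[str]] = None,
    exclude_patterns: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Assemble a JSON-serializable request body with a nested crawl_config."""
    return {
        "url": url,  # assumed field name, not confirmed by the issue
        "crawl_config": {
            "allowed_domains": allowed_domains or [],
            "excluded_domains": excluded_domains or [],
            "include_patterns": include_patterns or [],
            "exclude_patterns": exclude_patterns or [],
        },
    }


if __name__ == "__main__":
    body = build_crawl_request(
        "https://docs.example.com",
        allowed_domains=["docs.example.com"],
        exclude_patterns=[r"\.pdf$"],
    )
    print(json.dumps(body, indent=2))
```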
🔧 Implementation Features
Domain Configuration Interface
- Expandable configuration panel in crawl modal
- Domain input with real-time validation
- Pattern input with regex support
- Clear visual feedback for added rules
- Remove/edit capability for existing rules
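
The domain input validation could look something like the following sketch. It normalizes what the user typed (including a pasted full URL) to a lowercase hostname and rejects anything that is not a plausible domain. The function name `normalize_domain` and the specific rules (at least two labels, standard hostname characters) are assumptions, not requirements stated in this issue.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# One hostname label: 1-63 chars, letters/digits/hyphens, no leading or trailing hyphen.
_LABEL = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$")


def normalize_domain(raw: str) -> Optional[str]:
    """Return a lowercase hostname, or None if the input is not a valid domain."""
    text = raw.strip().lower()
    if not text:
        return None
    # Accept pasted URLs such as "https://docs.example.com/path" by extracting the host.
    if "://" not in text:
        text = "//" + text
    try:
        host = urlsplit(text).hostname
    except ValueError:
        return None
    if not host or len(host) > 253:
        return None
    labels = host.split(".")
    # Require at least two labels (e.g. "example.com"); this also rejects "localhost".
    if len(labels) < 2 or not all(_LABEL.match(label) for label in labels):
        return None
    return host
```

Returning `None` rather than raising keeps the check cheap enough to run on every input change for real-time feedback.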
Crawling Engine Integration
- CrawlConfig passed to crawling service
- Domain filtering applied during URL discovery
- Pattern matching for fine-grained control
- Optimized crawling with early filtering
- Progress reporting with filtering status
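
A minimal sketch of how early filtering during URL discovery might work is shown below. The precedence is an assumption, since this issue does not define it: excluded domains win over allowed domains, an empty allow or include list means "no restriction", exclude patterns win over include patterns, and a domain rule also covers its subdomains. The names `is_url_allowed` and `filter_discovered` are hypothetical.

```python
import re
from typing import Iterable, List, Sequence
from urllib.parse import urlsplit


def _host_matches(host: str, domain: str) -> bool:
    """True if host equals domain or is a subdomain of it."""
    domain = domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)


def is_url_allowed(
    url: str,
    allowed_domains: Sequence[str] = (),
    excluded_domains: Sequence[str] = (),
    include_patterns: Sequence[str] = (),
    exclude_patterns: Sequence[str] = (),
) -> bool:
    """Decide whether a discovered URL should be crawled under the given rules."""
    host = (urlsplit(url).hostname or "").lower()
    if not host:
        return False
    # Domain checks run first because they are cheaper than regex matching.
    if any(_host_matches(host, d) for d in excluded_domains):
        return False
    if allowed_domains and not any(_host_matches(host, d) for d in allowed_domains):
        return False
    if any(re.search(p, url) for p in exclude_patterns):
        return False
    if include_patterns and not any(re.search(p, url) for p in include_patterns):
        return False
    return True


def filter_discovered(urls: Iterable[str], **rules: Sequence[str]) -> List[str]:
    """Keep only the URLs that pass the rules, before any page is fetched."""
    return [u for u in urls if is_url_allowed(u, **rules)]
```

Applying the filter at discovery time, rather than after fetching, is what lets the crawler skip filtered pages entirely; the count of dropped URLs could feed the progress report.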
Configuration Options
- allowed_domains: Array of allowed domain strings
- excluded_domains: Array of domains to skip
- include_patterns: Array of URL patterns to include
- exclude_patterns: Array of URL patterns to exclude
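
On the backend, the four options above might be modeled roughly as follows. This is a sketch only: the issue names a CrawlConfig interface but does not define it, so the dataclass, the empty-list defaults, and the strict `from_dict` validation are all assumptions.

```python
from dataclasses import asdict, dataclass, field, fields
from typing import Any, Dict, List


@dataclass
class CrawlConfig:
    """Domain and pattern filters for a single crawl (all optional)."""

    allowed_domains: List[str] = field(default_factory=list)
    excluded_domains: List[str] = field(default_factory=list)
    include_patterns: List[str] = field(default_factory=list)
    exclude_patterns: List[str] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "CrawlConfig":
        """Build a config from a dict, rejecting unknown keys and non-string lists."""
        known = {f.name for f in fields(cls)}
        unknown = set(data) - known
        if unknown:
            raise ValueError(f"Unknown crawl_config keys: {sorted(unknown)}")
        for key, value in data.items():
            if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
                raise ValueError(f"{key} must be a list of strings")
        return cls(**data)

    def to_dict(self) -> Dict[str, List[str]]:
        """Return a plain dict suitable for JSON serialization."""
        return asdict(self)
```

Rejecting unknown keys catches typos such as `allowed_domain` early instead of silently crawling without the intended filter.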
✅ Acceptance Criteria
🔗 Related
This enhances the core crawling functionality with advanced configuration options for better control and efficiency.
This issue tracks advanced crawling configuration only. Document browsing and upload features are handled separately.