gotor
Streamed HTML parsing + content sniffing
Summary
Adopt streamed parsing for HTML to reduce allocations, and sniff the content type early so that binary or oversized payloads are skipped unless the configuration explicitly allows them.
Motivation
- Lower memory usage during large crawls
- Skip non-HTML payloads by default
Scope
- `internal/parse`:
  - Streaming parse (`net/html` and/or `goquery` on a `Reader`)
  - Extract absolute links (respect `base` tags)
  - Sniff Content-Type + size guardrails
- Config flag to allow binary downloads
Acceptance Criteria
- Heap profile shows fewer allocations vs baseline
- Tests cover: base href, meta refresh, unusual encodings
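For the base href and meta refresh cases, the resolution logic can be exercised without any HTML parser. The sketch below assumes the extractor has already pulled out the raw attribute values. `ResolveLink` and `ParseMetaRefresh` are hypothetical helper names, and the meta refresh handling covers only the common `N; url=...` form.

```go
package main

import (
	"net/url"
	"strconv"
	"strings"
)

// ResolveLink turns href into an absolute URL. baseHref is the value of the
// page's <base href>, or "" if the page has none. Hypothetical helper.
func ResolveLink(pageURL, baseHref, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	if baseHref != "" {
		// A <base href> may itself be relative to the page URL.
		b, err := base.Parse(strings.TrimSpace(baseHref))
		if err != nil {
			return "", err
		}
		base = b
	}
	u, err := base.Parse(strings.TrimSpace(href))
	if err != nil {
		return "", err
	}
	return u.String(), nil
}

// ParseMetaRefresh parses the content attribute of <meta http-equiv="refresh">.
// It returns the delay in seconds, the target URL (possibly empty), and
// whether the value was well formed. Hypothetical helper.
func ParseMetaRefresh(content string) (int, string, bool) {
	head, rest, _ := strings.Cut(content, ";")
	delay, err := strconv.Atoi(strings.TrimSpace(head))
	if err != nil || delay < 0 {
		return 0, "", false
	}
	rest = strings.TrimSpace(rest)
	// Match the "url=" prefix case-insensitively; other forms are ignored.
	if len(rest) < 4 || !strings.EqualFold(rest[:4], "url=") {
		return delay, "", true
	}
	target := strings.Trim(strings.TrimSpace(rest[4:]), `"'`)
	return delay, target, true
}
```

The target returned by `ParseMetaRefresh` is still relative in general, so it would go through `ResolveLink` like any other link. Fixture pages can then assert on the final absolute URLs.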
Tasks
- [ ] Implement streamed extraction
- [ ] Add content-type guards
- [ ] Unit tests with fixture pages