gotor

Streamed HTML parsing + content sniffing

Open KingAkeem opened this issue 6 months ago • 0 comments

Summary

Adopt streamed HTML parsing to reduce allocations, and sniff the content type early so that binary or oversized responses are skipped unless a config flag allows them.

Motivation

  • Lower memory usage during large crawls
  • Skip non-HTML payloads by default

Scope

  • internal/parse:
    • Streaming parse (net/html and/or goquery on a Reader)
    • Extract absolute links (respect base tags)
    • Sniff Content-Type + size guardrails
  • Config flag to allow binary downloads

Acceptance Criteria

  • Heap profile shows fewer allocations vs baseline
  • Tests cover: base href, meta refresh, unusual encodings
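
The base href and meta refresh cases could be exercised against small helpers like the ones below. This is a sketch under assumptions: `ResolveLink` and `MetaRefreshTarget` are hypothetical names, and the sketch assumes the streaming tokenizer has already pulled out the raw `href`, the `<base href>` value, and the meta `content` attribute as strings. Only `net/url` and `strings` from the standard library are used.

```go
package main

import (
	"net/url"
	"strings"
)

// ResolveLink turns href into an absolute URL. If the page declared a
// <base href>, links resolve against it; the base itself is first resolved
// against the page URL because it may be relative.
func ResolveLink(pageURL, baseHref, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	if baseHref != "" {
		b, err := url.Parse(strings.TrimSpace(baseHref))
		if err != nil {
			return "", err
		}
		base = base.ResolveReference(b)
	}
	ref, err := url.Parse(strings.TrimSpace(href))
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

// MetaRefreshTarget extracts the URL from a meta refresh content value
// such as "5; url=/next". It reports false when no target is present.
func MetaRefreshTarget(content string) (string, bool) {
	_, rest, found := strings.Cut(content, ";")
	if !found {
		return "", false
	}
	rest = strings.TrimSpace(rest)
	if len(rest) < 4 || !strings.EqualFold(rest[:3], "url") {
		return "", false
	}
	rest = strings.TrimSpace(rest[3:])
	if !strings.HasPrefix(rest, "=") {
		return "", false
	}
	// Strip optional quotes around the target.
	target := strings.Trim(strings.TrimSpace(rest[1:]), `"'`)
	return target, target != ""
}
```

The target returned by `MetaRefreshTarget` is still relative in general, so it would be passed through `ResolveLink` like any other link. Encoding handling is left out of this sketch.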

Tasks

  • [ ] Implement streamed extraction
  • [ ] Add content-type guards
  • [ ] Unit tests with fixture pages

KingAkeem, Oct 03 '25 22:10