gotor

Streamed HTML parsing + content sniffing

Open KingAkeem opened this issue 6 months ago • 0 comments

Summary

Adopt streamed HTML parsing to reduce allocations, and sniff the content type early so that binary or oversized responses are skipped unless configuration allows them.

Motivation

Lower memory usage during large crawls
Skip non-HTML payloads by default

Scope

internal/parse:
- Streaming parse (net/html and/or goquery on a Reader)
- Extract absolute links (respect base tags)
- Sniff Content-Type + size guardrails
Config flag to allow binary downloads

Acceptance Criteria

Heap profile shows fewer allocations than the pre-change baseline
Tests cover: base href, meta refresh, unusual encodings
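
To make the base href and meta refresh cases concrete, here is a rough sketch of the two helpers such tests might exercise. `ResolveLink` and `MetaRefreshTarget` are hypothetical names, and the refresh parsing is deliberately simplified; unusual encodings are not covered here.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// ResolveLink resolves href against the document's effective base URL:
// the <base href> value (itself resolved against the page URL) when
// present, otherwise the page URL.
func ResolveLink(pageURL, baseHref, href string) (string, error) {
	page, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	base := page
	if baseHref != "" {
		if base, err = page.Parse(baseHref); err != nil {
			return "", err
		}
	}
	u, err := base.Parse(strings.TrimSpace(href))
	if err != nil {
		return "", err
	}
	return u.String(), nil
}

// MetaRefreshTarget extracts the URL from a meta refresh content value
// such as "5; url=/next". It reports false when no URL part is present.
func MetaRefreshTarget(content string) (string, bool) {
	_, rest, found := strings.Cut(content, ";")
	if !found {
		return "", false
	}
	rest = strings.TrimSpace(rest)
	if len(rest) < 4 || !strings.EqualFold(rest[:3], "url") {
		return "", false
	}
	rest = strings.TrimSpace(rest[3:])
	if !strings.HasPrefix(rest, "=") {
		return "", false
	}
	target := strings.Trim(strings.TrimSpace(rest[1:]), `"'`)
	return target, target != ""
}

func main() {
	link, _ := ResolveLink("https://example.com/a/b.html", "/docs/", "guide.html")
	fmt.Println(link)
	target, ok := MetaRefreshTarget("5; URL='/next'")
	fmt.Println(target, ok)
}
```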

Tasks

[ ] Implement streamed extraction
[ ] Add content-type guards
[ ] Unit tests with fixture pages

Oct 03 '25 22:10 KingAkeem