gotor
This program provides efficient web scraping services for Tor and non-Tor sites. It has both a CLI and a REST API.
There are two methods of interacting with a node:
- Crawl: traverse a node and its children without storing them in memory
- Load: store a node and its children in...
It should be possible to open multiple Tor connections using different SOCKS/CONTROL ports. Executing requests over different connections may provide a performance boost. How to open multiple connections:...
A 429 response indicates too many requests and may carry a `Retry-After` header that says when the request should be retried. This could be used to pull those requests into a...
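As a rough sketch of how the header could be read: `Retry-After` may hold either a number of seconds or an HTTP date, so both forms are handled. `retryDelay` and its `fallback` parameter are hypothetical names chosen for illustration, not existing gotor code.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// retryDelay reports how long to wait before retrying resp and whether a
// retry is warranted at all. Only 429 responses are treated as retryable.
// fallback is used when the header is missing or cannot be parsed.
func retryDelay(resp *http.Response, now time.Time, fallback time.Duration) (time.Duration, bool) {
	if resp.StatusCode != http.StatusTooManyRequests {
		return 0, false
	}
	value := resp.Header.Get("Retry-After")
	if value == "" {
		return fallback, true
	}
	// Form 1: delay in seconds, e.g. "120".
	if secs, err := strconv.Atoi(value); err == nil && secs >= 0 {
		return time.Duration(secs) * time.Second, true
	}
	// Form 2: an HTTP date, e.g. "Mon, 01 Jan 2024 00:01:30 GMT".
	if when, err := http.ParseTime(value); err == nil {
		if d := when.Sub(now); d > 0 {
			return d, true
		}
		return 0, true // the date is already in the past, retry immediately
	}
	return fallback, true
}

func main() {
	resp := &http.Response{StatusCode: http.StatusTooManyRequests, Header: http.Header{}}
	resp.Header.Set("Retry-After", "120")
	delay, retry := retryDelay(resp, time.Now(), 5*time.Second)
	fmt.Println(delay, retry)
}
```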
I found this wonderful snippet in `gocolly`. This file could make generating random headers rather simple: https://github.com/gocolly/colly/blob/master/extensions/random_user_agent.go
## Summary
Split docs into a small docs site with quickstarts, configuration reference, API reference, and an ops guide (metrics/pprof).

## Motivation
- Faster onboarding
- Clear runbooks for operating...
## Summary
Adopt streamed parsing for HTML to reduce allocations, and do early content-type sniffing to skip binary/large content unless configured.

## Motivation
- Lower memory usage during large crawls...
## Summary
Introduce structured logging with fields per fetch and level control via flags.

## Motivation
- Parseable logs for pipelines/SIEM
- Easier debugging at scale

## Scope
- One...
## Summary
Expose Prometheus metrics and pprof to operate and profile the crawler.

## Motivation
- Visibility into throughput, latency, errors
- Ability to capture CPU/heap profiles

## Scope
-...
## Summary
Add per-host concurrency caps and optional inter-request delay to avoid bans and reduce tail latencies.

## Motivation
- Be a good citizen to hosts
- Smooth out heavy-tailed...