feat: add URLScan.io as passive subdomain source #1710
Conversation
Walkthrough

Adds a new urlscan passive source with API-key support, paginated search_after queries, retry/backoff, and statistics; registers it in passive sources, updates default rate limits and tests, and adjusts error wrapping in a ctlogs file.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Runner as Runner/Caller
    participant Source as urlscan.Source
    participant API as urlscan.io API
    participant Consumer as Results Consumer
    rect rgba(200,230,255,0.5)
        Runner->>Source: Run(ctx, domain, session)
    end
    rect rgba(230,255,200,0.5)
        Source->>Source: select random API key
        Source->>API: GET /search/?q=... (api-key header, paginated, search_after)
        API-->>Source: JSON response (results, has_more, sort values)
        Source->>Source: parse results, extract subdomains, update stats
        Source->>API: GET next page (with search_after) [retry/backoff on 429/503]
    end
    rect rgba(255,230,200,0.5)
        Source->>Consumer: emit Subdomain results (channel)
        Consumer-->>Source: context cancel / receive
        Source->>Source: update counters, close channel
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
@dogancanbakir The leakix test is failing here. Let me know what needs to be done.
@Jigardjain This is great! 🔥 We can do the following:
@dogancanbakir URLScan supports pagination and it has been implemented. Also, I have created an issue for the failing leakix test.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@pkg/subscraping/sources/urlscan/urlscan.go`:
- Around line 60-66: The Run method resets s.errors, s.results, and s.requests
but not s.skipped, causing stale stats; update the Run(ctx context.Context,
domain string, session *subscraping.Session) method to set s.skipped = false at
the start alongside the other counters so Statistics() reflects the current run
state (modify the Source.Run function where s.errors/s.results/s.requests are
initialized).
- Around line 210-244: The current request loop returns resp,nil for non-OK
statuses (when err==nil), causing callers like enumerate to decode error
payloads as success; update the logic in the retry block (the code handling
resp, attempt, maxRetries, backoff in urlscan.go) so that for any resp with
StatusCode != http.StatusOK and not a retryable status (not 429/503) you call
session.DiscardHTTPResponse(resp) and return a non-nil error (e.g.
fmt.Errorf("unexpected status %d", resp.StatusCode)) instead of returning
resp,nil; keep existing retry behavior for 429/503 using the
X-Rate-Limit-Reset-After header, ctx cancellation, and exponential backoff.
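A minimal sketch of the first item above: resetting `s.skipped` alongside the other counters at the start of `Run`. The struct fields are assumed from the review comment, not copied from the actual source file.

```go
package urlscan

import (
	"context"

	"github.com/projectdiscovery/subfinder/v2/pkg/subscraping"
)

// Source is an assumed shape of the urlscan source struct, based on the
// fields named in the review comment above.
type Source struct {
	apiKeys  []string
	errors   int
	results  int
	requests int
	skipped  bool
}

// Run resets all per-run statistics up front, so Statistics() reflects only
// the current run; previously s.skipped kept its value from the prior run.
func (s *Source) Run(ctx context.Context, domain string, session *subscraping.Session) <-chan subscraping.Result {
	results := make(chan subscraping.Result)
	s.errors = 0
	s.results = 0
	s.requests = 0
	s.skipped = false // the missing reset

	go func() {
		defer close(results)
		// ... enumeration logic ...
	}()
	return results
}
```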
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@pkg/subscraping/sources/urlscan/urlscan.go`:
- Around line 210-259: The final "return nil, fmt.Errorf(\"max retries
exceeded\")" after the retry loop is unreachable because every path inside the
for attempt := 0; attempt <= maxRetries; attempt++ loop returns; remove that
trailing return to clean up dead code (delete the return nil, fmt.Errorf("max
retries exceeded") line in pkg/subscraping/sources/urlscan/urlscan.go) or
alternatively change the loop to a strict < maxRetries form if you intend a
post-loop fallback; target the retry loop using the session.Get / attempt logic
and the trailing return statement for the fix.
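Putting both review rounds together, a sketch of the retry helper, building on the `Source` sketch above: non-retryable statuses surface an error immediately, retryable ones (429/503) back off and fall through to the next iteration, which keeps the post-loop fallback reachable. Constant values follow the commit message; the `session.Get` signature is an assumption based on other subfinder sources, not the exact diff.

```go
package urlscan

import (
	"context"
	"fmt"
	"net/http"
	"strconv"
	"time"

	"github.com/projectdiscovery/subfinder/v2/pkg/subscraping"
)

const (
	maxRetries     = 2
	initialBackoff = 20 * time.Second
)

// makeRequestWithRetry issues a GET and retries only on 429/503. Any other
// non-200 status is discarded and returned as an error, so callers never
// decode an error payload as a success.
func (s *Source) makeRequestWithRetry(ctx context.Context, session *subscraping.Session, url string, headers map[string]string) (*http.Response, error) {
	backoff := initialBackoff
	for attempt := 0; attempt <= maxRetries; attempt++ {
		resp, err := session.Get(ctx, url, "", headers)
		if err != nil {
			return nil, err
		}
		switch resp.StatusCode {
		case http.StatusOK:
			return resp, nil
		case http.StatusTooManyRequests, http.StatusServiceUnavailable:
			// Retryable: prefer the server-provided reset delay, else back off.
			wait := backoff
			if v := resp.Header.Get("X-Rate-Limit-Reset-After"); v != "" {
				if secs, perr := strconv.Atoi(v); perr == nil {
					wait = time.Duration(secs) * time.Second
				}
			}
			session.DiscardHTTPResponse(resp)
			select {
			case <-ctx.Done():
				return nil, ctx.Err()
			case <-time.After(wait):
			}
			backoff *= 2
		default:
			// Non-retryable error status: discard the body and fail fast.
			session.DiscardHTTPResponse(resp)
			return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
		}
	}
	// Reachable, because retryable iterations fall through to the next loop pass.
	return nil, fmt.Errorf("max retries exceeded")
}
```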
🧹 Nitpick comments (1)
pkg/subscraping/sources/urlscan/urlscan.go (1)
116-122: Unnecessary `DiscardHTTPResponse` call when `resp` is `nil`.

When `makeRequestWithRetry` returns an error, it already discards the response and returns `nil` for `resp`. Line 120 calls `DiscardHTTPResponse(resp)` with `nil`, which is a no-op.

🧹 Suggested cleanup
```diff
 resp, err := s.makeRequestWithRetry(ctx, session, searchURL, headers)
 if err != nil {
 	results <- subscraping.Result{Source: s.Name(), Type: subscraping.Error, Error: err}
 	s.errors++
-	session.DiscardHTTPResponse(resp)
 	return
 }
```
Add URLScan.io as a new passive subdomain enumeration source with full pagination support and robust rate limiting handling.

Features:
- Fetches subdomains from the URLScan.io Search API
- Implements cursor-based pagination using the search_after parameter
- Extracts domains from the task.domain, task.url, page.domain, and page.url fields
- Requires an API key (free tier available at urlscan.io)

Rate Limiting:
- Conservative pagination delay (10s between pages) to respect strict burst limits
- Exponential backoff retry logic for 429/503 responses
- Respects the X-Rate-Limit-Reset-After header for dynamic backoff
- Limited to 5 pages max (500 results) to avoid quota exhaustion

Configuration:
- Max 5 pages per enumeration (configurable via the maxPages constant)
- 100 results per page (configurable via the maxPerPage constant)
- 2 retry attempts for rate-limited requests
- 20 second initial backoff, doubling on each retry

Changes:
- pkg/subscraping/sources/urlscan/urlscan.go: new URLScan source implementation
- pkg/passive/sources.go: register the URLScan source
- pkg/passive/sources_test.go: add URLScan to test lists
- pkg/runner/options.go: add urlscan to source options
- .github/workflows/build-test.yml: add the URLSCAN_API_KEY secret

Closes: feature request for URLScan.io integration
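To make the cursor mechanics concrete, here is a standalone sketch of search_after pagination as described in the commit message. The response struct is an assumed subset of the urlscan.io search schema, and `enumeratePages`/`emit` are hypothetical names for illustration, not the PR's actual code.

```go
package urlscan

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

const (
	maxPages   = 5
	maxPerPage = 100
)

// searchResponse models only the fields this sketch needs from the
// urlscan.io search response (an assumed subset, not the full schema).
type searchResponse struct {
	Results []struct {
		Task struct {
			Domain string `json:"domain"`
			URL    string `json:"url"`
		} `json:"task"`
		Page struct {
			Domain string `json:"domain"`
			URL    string `json:"url"`
		} `json:"page"`
		Sort []any `json:"sort"` // cursor values for the next page
	} `json:"results"`
	HasMore bool `json:"has_more"`
}

// enumeratePages follows the search_after cursor: each request passes the
// previous page's last sort values, until has_more is false or maxPages hit.
func enumeratePages(apiKey, domain string, emit func(string)) error {
	searchAfter := ""
	for page := 0; page < maxPages; page++ {
		url := fmt.Sprintf("https://urlscan.io/api/v1/search/?q=domain:%s&size=%d", domain, maxPerPage)
		if searchAfter != "" {
			url += "&search_after=" + searchAfter
		}
		req, err := http.NewRequest(http.MethodGet, url, nil)
		if err != nil {
			return err
		}
		req.Header.Set("API-Key", apiKey) // urlscan accepts the key via this header
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return err
		}
		var data searchResponse
		err = json.NewDecoder(resp.Body).Decode(&data)
		resp.Body.Close()
		if err != nil {
			return err
		}
		for _, r := range data.Results {
			emit(r.Task.Domain) // page.domain / task.url / page.url handled similarly
		}
		if !data.HasMore || len(data.Results) == 0 {
			return nil
		}
		// Comma-join the last result's sort values to form the next cursor.
		cursor := data.Results[len(data.Results)-1].Sort
		parts := make([]string, 0, len(cursor))
		for _, v := range cursor {
			parts = append(parts, fmt.Sprint(v))
		}
		searchAfter = strings.Join(parts, ",")
	}
	return nil
}
```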
Force-pushed from a324595 to 978addb
Replace deprecated github.com/projectdiscovery/utils/errors package with standard Go error wrapping using fmt.Errorf to fix staticcheck SA1019 linter errors.
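An illustrative before/after for that swap; the function and error message are hypothetical, not the actual ctlogs code:

```go
package ctlogs

import "fmt"

// wrapErr shows the replacement pattern: fmt.Errorf with %w wraps the cause
// in a way errors.Is / errors.As can unwrap, with no external dependency.
//
// before (deprecated, trips staticcheck SA1019):
//   import errorutil "github.com/projectdiscovery/utils/errors"
//   return errorutil.NewWithErr(err).Msgf("parsing ct log entry")
func wrapErr(err error) error {
	return fmt.Errorf("parsing ct log entry: %w", err)
}
```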
Add leakix, reconeer, and sitedossier to the ignored sources list:
- leakix: now requires an API key (returns 401)
- reconeer: now requires an API key (returns 401)
- sitedossier: flaky, returns no results in CI
Remove custom pagination delay and retry logic since the session already handles rate limiting via MultiRateLimiter. This aligns with how other sources (shodan, virustotal) are implemented.
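A sketch of the resulting simplified request path; the `session.Get` signature and helper names are assumptions based on other subfinder sources, not the exact diff:

```go
package urlscan

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"

	"github.com/projectdiscovery/subfinder/v2/pkg/subscraping"
)

// fetchPage performs one paginated request with no source-local sleep or
// retry: session.Get blocks on the session's shared MultiRateLimiter, which
// enforces the configured rate (urlscan=2/s in this PR), the same pattern
// shodan and virustotal use.
func fetchPage(ctx context.Context, session *subscraping.Session, searchURL string, headers map[string]string, out any) error {
	resp, err := session.Get(ctx, searchURL, "", headers)
	if err != nil {
		return err
	}
	defer session.DiscardHTTPResponse(resp)
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return json.NewDecoder(resp.Body).Decode(out)
}
```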
Add URLScan.io as Passive Subdomain Source
Description
This PR adds URLScan.io as a new high-quality passive subdomain enumeration source for Subfinder.
Changes
- pkg/subscraping/sources/urlscan/urlscan.go
- pkg/passive/sources.go
- pkg/runner/options.go (urlscan=2/s)

Source Details
- API endpoint: https://urlscan.io/api/v1/search/
- Authentication: API key (api-key header)

Why URLScan.io?
Testing
✅ Successfully tested with uber.com (19+ subdomains found)
✅ No linter errors
✅ Follows project code conventions
✅ Proper error handling for rate limits
Configuration
Users add their API key to the provider config under the `urlscan` entry, as shown below:
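A minimal sketch of the entry, with a placeholder key (subfinder's provider config takes a list of one or more keys per source):

```yaml
urlscan:
  - <YOUR_URLSCAN_API_KEY>
```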
Summary by CodeRabbit
New Features
Tests
Bug Fixes