rDNS-based noise filtering for known scanners. Closes #527#934
Conversation
|
Covers the rDNS-based filtering (step 2) plus the Happy to address any feedback. |
30ebac7 to
8ed025d
Compare
There was a problem hiding this comment.
Hey @Sahityaaryan ! First of all, thanks for your work. Before going into a detailed code review, we need to discuss some conceptual challenges.
-
You introduced a new model but I would argue that we could use the existing
Tagmodel for what you are doing. This would also have the advantage, that the DNS information automatically are available at our API endpoints. What do you think? -
The central piece, the DNS lookup logic, won't perform good enough. Say we have 20.000 new IoCs in a day, which is not unrealistic, and 10% of them time out. Then this job will take 1 hour to complete. Do you see a way to process the IP addresses in batches? Maybe there is a python library that can do the lookups asynchronously and/or in parallel?
8ed025d to
84f1ddb
Compare
|
@regulartim Great points, both addressed:
Other changes in the rework:
Let me know if anything else needs adjusting |
regulartim
left a comment
There was a problem hiding this comment.
Hey @Sahityaaryan ! I tested your code and I am still worried about performance. The lookup for 500 IoCs took 15 seconds. Yes, we are now in the "minutes" and not in the "hours" anymore, but this is still a lot of time for a low priority job. I thought about ways to solve that and I have some ideas that I would like to discuss with you:
-
We want to find mass scanners. So instead of taking all IOCs from the last 24 hours, we could narrow down the list of candidates to check based on behavior. Mass scanners are seen regularly (so
len(days_seen) > 2), they have 0login_attemptsand they perform not many interactions per attack (something likeinteraction_count < 2*attack_count). So what do you think of taking these metrics to get the 500 most probable candidates and only check them? -
If we can get the runtime down to seconds, we can run the job daily.
-
I am not sure anymore if we should track IP addresses with an empty ptr record. If the record ever gets updated, we will miss it. This could be the case for very new mass scanners.
What do you think of the above?
| self.exclude_reputation.append("tor exit node") | ||
| if "include_known_scanners" not in query_params: | ||
| self.exclude_reputation.append("known scanner") | ||
|
|
There was a problem hiding this comment.
What is this change for? It changes the way the API behaves , right?
There was a problem hiding this comment.
Yes, it changes the default API behavior — IPs with ip_reputation="known scanner" (set by the rDNS job) are now excluded from feeds by default, same as "mass scanner" and "tor exit node" on the lines above.
Users can opt-in with ?include_known_scanners in the query params.
Without this, the rDNS job would identify known scanners but they'd still appear in feeds — the "reduce noise" goal from #527 wouldn't be achieved.
That said, if you'd prefer to keep this PR focused on detection only and add the feed exclusion separately, I'm happy to remove it.
There was a problem hiding this comment.
Ah, now I see! Yeah, please don't change the API right now. The "known scanner" wording is also a bit confusing. Please set the reputation to "mass scanner" and don't introduce a new wording.
| 2. Process in batches: resolve PTR in parallel, store as tags, | ||
| update reputation for matches. | ||
| """ | ||
| cutoff = timezone.now() - timedelta(hours=24) |
There was a problem hiding this comment.
Please use datetime.now() to stay consistent with the rest of the codebase.
| Args: | ||
| ip_address: IP address to update. | ||
| """ | ||
| updated = self.ioc_repo.update_ioc_reputation(ip_address, "known scanner") |
There was a problem hiding this comment.
Here: please set the reputation to "mass scanner" instead of "known scanner".
Great to see that both of us are online :_) I have made some changes in the local following your review
|
Sorry, totally missed that! |
|
No worries! Pushing the updated code now with all the changes we discussed |
84f1ddb to
a94ea04
Compare
|
@regulartim |
regulartim
left a comment
There was a problem hiding this comment.
Looks good! :) Only minor details left, then we can merge.
a94ea04 to
fcdc4ae
Compare
regulartim
left a comment
There was a problem hiding this comment.
Thank you, good work! :)
Please, for future PRs, try to avoid force-pushing changes. It's easier to review when you can explicitly look at the changes that happened since the last review. Force-pushes make that impossible.
sure I will take care of it next time. Thank you very much @regulartim :) |
|
Hi @regulartim , I've been diving deeper into the codebase and found some architectural observations I'd love to discuss. I tried finding you on the Honeynet Discord but there are few people with the similar name - could you let me know your Discord username or just send me a hi over it, so I can DM you, please :) |
|
I am "tim." on discord. But I don't use discord that often. You can also reach out on the Honeynet GSoC Slack. |
|
@regulartim ,
:( |

Description
Add a daily reverse DNS cron job that identifies IPs belonging to known scanning
services (Shodan, Censys, Onyphe, etc.) and marks them as "known scanner" via
ip_reputation. This implements step 2 of #527 — rDNS-based filtering,decoupled from the extraction pipeline.
What it does:
ip_reputationseen in the last 24hReverseDNSRecordtable (so IPs are never re-queried)KNOWN_SCANNER_DOMAINSlistip_reputation = "known scanner"Also fixes
_merge_iocs()to only overwriteip_reputationwhen empty, preventingextraction from wiping out reputation set by enrichment crons (discussed in #527).
Related issues
Type of change
Checklist
Formalities
<feature name>. Closes #999develop.develop.Docs and tests
docs repository.
Ruff) gave 0 errors. If you have correctly installed pre-commit, it doesthese checks and adjustments on your behalf.
GUI changes
Ignore this section if you did not make any changes to the GUI.