rDNS-based noise filtering for known scanners. Closes #527 by Sahityaaryan · Pull Request #934 · GreedyBear-Project/GreedyBear

Sahityaaryan · 2026-03-03T14:48:55Z

Description

Add a daily reverse DNS cron job that identifies IPs belonging to known scanning
services (Shodan, Censys, Onyphe, etc.) and marks them as "known scanner" via
ip_reputation. This implements step 2 of #527 — rDNS-based filtering,
decoupled from the extraction pipeline.

What it does:

Queries IOCs with empty ip_reputation seen in the last 24h
Resolves PTR records with a 2s timeout per IP
Caches all results in a ReverseDNSRecord table (so IPs are never re-queried)
Matches PTR hostnames against a curated KNOWN_SCANNER_DOMAINS list
Updates matching IOCs to ip_reputation = "known scanner"
Feeds exclude "known scanner" by default (same pattern as mass scanner / tor exit node)
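
The matching step in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the entries in KNOWN_SCANNER_DOMAINS below are placeholders, the helper name is_known_scanner is hypothetical, and suffix matching (rather than substring matching) is an assumption.

```python
# Illustrative entries only; the PR's actual KNOWN_SCANNER_DOMAINS list is not shown in this thread.
KNOWN_SCANNER_DOMAINS = ("shodan.io", "censys-scanner.com", "onyphe.net")


def is_known_scanner(ptr_hostname, domains=KNOWN_SCANNER_DOMAINS):
    """Return True if the PTR hostname equals a listed domain or is a subdomain of it."""
    # PTR answers are often fully qualified (trailing dot) and case-insensitive.
    host = ptr_hostname.strip().rstrip(".").lower()
    return any(host == d or host.endswith("." + d) for d in domains)
```

Anchoring the match at a label boundary avoids false positives such as "notshodan.io" or "shodan.io.evil.example", which a plain substring check would accept.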

Also fixes _merge_iocs() to only overwrite ip_reputation when empty, preventing
extraction from wiping out reputation set by enrichment crons (discussed in #527).
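
A rough sketch of the reputation-preserving merge described above. The IocRecord class and merge_reputation function are hypothetical stand-ins for the project's model and _merge_iocs(); the reading that "empty" refers to the already stored value is an assumption.

```python
from dataclasses import dataclass


@dataclass
class IocRecord:
    # Hypothetical stand-in for the IOC model; only the fields needed here.
    name: str
    ip_reputation: str = ""


def merge_reputation(existing, incoming):
    """Take the incoming reputation only if the stored record has none yet."""
    if not existing.ip_reputation:
        existing.ip_reputation = incoming.ip_reputation
    return existing
```

With this rule, a reputation written by an enrichment cron survives a later extraction run that carries an empty value for the same IP.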

Related issues

Closes #527, "Reduce noise with more reliability" (step 2: rDNS-based filtering)
Builds on #541, "add mass scanner extraction" (step 1: mass scanner list)

Type of change

Bug fix (non-breaking change which fixes an issue).
New feature (non-breaking change which adds functionality).
Breaking change (fix or feature that would cause existing functionality to not work as expected).
Chore (refactoring, dependency updates, CI/CD changes, code cleanup, docs-only changes).

Checklist

Formalities

I have read and understood the rules about how to Contribute to this project.
I chose an appropriate title for the pull request in the form: <feature name>. Closes #999
My branch is based on develop.
The pull request is for the branch develop.
I have reviewed and verified any LLM-generated code included in this PR.

Docs and tests

I documented my code changes with docstrings and/or comments.
I have checked if my changes affect user-facing behavior that is described in the docs. If so, I also created a pull request in the docs repository.
Linter (Ruff) gave 0 errors. If you have correctly installed pre-commit, it does these checks and adjustments on your behalf.
I have added tests for the feature/bug I solved.
All the tests gave 0 errors.

GUI changes

Ignore this section if you did not make any changes to the GUI.

I have provided a screenshot of the result in the PR.
I have created new frontend tests for the new component or updated existing ones.

Sahityaaryan · 2026-03-03T15:02:04Z

@regulartim

Covers the rDNS-based filtering (step 2) plus the _merge_iocs() reputation
overwrite fix we discussed. 29 new tests, full suite passes.

Happy to address any feedback.

regulartim

Hey @Sahityaaryan ! First of all, thanks for your work. Before going into a detailed code review, we need to discuss some conceptual challenges.

You introduced a new model, but I would argue that we could use the existing Tag model for what you are doing. This would also have the advantage that the DNS information is automatically available at our API endpoints. What do you think?
The central piece, the DNS lookup logic, won't perform well enough. Say we have 20,000 new IoCs in a day, which is not unrealistic, and 10% of them time out. With a 2s timeout per IP, the 2,000 timeouts alone mean this job will take about 1 hour to complete. Do you see a way to process the IP addresses in batches? Maybe there is a Python library that can do the lookups asynchronously and/or in parallel?
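
The runtime estimate above can be reproduced with a small back-of-the-envelope helper. It is a lower bound that only counts time spent waiting on timeouts; the function name and the workers parameter (ideal parallel speed-up) are assumptions made for illustration.

```python
def timeout_wait_seconds(n_ips, timeout_ratio, timeout_s=2.0, workers=1):
    """Seconds spent waiting on timed-out lookups, assuming ideal parallelism."""
    return n_ips * timeout_ratio * timeout_s / workers


sequential = timeout_wait_seconds(20_000, 0.10)           # one lookup at a time
parallel = timeout_wait_seconds(20_000, 0.10, workers=20)  # hypothetical 20 workers
```

Sequentially this comes to roughly 4,000 seconds (a bit over an hour); spread evenly over 20 workers it would be roughly 200 seconds.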

Sahityaaryan · 2026-03-08T01:37:34Z

@regulartim Great points, both addressed:

Tag model — You're right, I've replaced the custom ReverseDNSRecord model with Tag entries (source="rdns", key="ptr_record"). PTR data is now automatically available via the existing API endpoints and supports filtering with ?tag_key=ptr_record&tag_value=shodan. No new model or migration needed.
Parallel resolution: Switched from sequential lookups to concurrent.futures.ThreadPoolExecutor with 20 workers and batch processing (500 IPs/batch). With 20k IPs and 10% timeouts, this brings the runtime from ~1 hour down to a few minutes. Also added TagRepository.add_tags() for incremental tag creation (as opposed to the atomic replace used by ThreatFox/AbuseIPDB).

Other changes in the rework:

Added _merge_iocs() tests for the reputation preservation fix
Exception isolation per IP in batch resolution (one failure can't crash the batch)
ignore_conflicts=True on bulk_create for crash-recovery safety
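
A minimal sketch of batched, parallel PTR resolution with per-IP exception isolation, under stated assumptions: the function names are hypothetical, the resolver is injectable so the logic can be exercised without network access, and socket.gethostbyaddr is used only as a simple default. It does not accept a timeout argument, so the 2s per-IP timeout mentioned in the PR would need a different mechanism that is not shown in this thread.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def resolve_ptr(ip):
    """Blocking PTR lookup; returns None when the address does not resolve."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def _safe(resolver, ip):
    # Isolate failures: one bad lookup must not abort the whole batch.
    try:
        return ip, resolver(ip)
    except Exception:
        return ip, None


def resolve_in_batches(ips, resolver=resolve_ptr, batch_size=500, workers=20):
    """Resolve IPs batch by batch, running each batch on a shared thread pool."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, len(ips), batch_size):
            batch = ips[start:start + batch_size]
            for ip, host in pool.map(lambda addr: _safe(resolver, addr), batch):
                results[ip] = host
    return results
```

Processing one batch at a time leaves a natural point to persist results between batches, so a crash loses at most the batch in flight.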

Let me know if anything else needs adjusting

regulartim

Hey @Sahityaaryan ! I tested your code and I am still worried about performance. The lookup for 500 IoCs took 15 seconds. Yes, we are now in the "minutes" and not in the "hours" anymore, but this is still a lot of time for a low priority job. I thought about ways to solve that and I have some ideas that I would like to discuss with you:

We want to find mass scanners. So instead of taking all IOCs from the last 24 hours, we could narrow down the list of candidates to check based on behavior. Mass scanners are seen regularly (so len(days_seen) > 2), they have 0 login_attempts and they perform few interactions per attack (something like interaction_count < 2*attack_count). So what do you think of using these metrics to get the 500 most probable candidates and only checking them?
If we can get the runtime down to seconds, we can run the job daily.
I am not sure anymore if we should track IP addresses with an empty ptr record. If the record ever gets updated, we will miss it. This could be the case for very new mass scanners.

What do you think of the above?

regulartim · 2026-03-09T09:31:56Z

            self.exclude_reputation.append("tor exit node")
+        if "include_known_scanners" not in query_params:
+            self.exclude_reputation.append("known scanner")



What is this change for? It changes the way the API behaves, right?

Yes, it changes the default API behavior — IPs with ip_reputation="known scanner" (set by the rDNS job) are now excluded from feeds by default, same as "mass scanner" and "tor exit node" on the lines above.
Users can opt-in with ?include_known_scanners in the query params.
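
The opt-in behavior described here could look roughly like the sketch below. Only the include_known_scanners parameter appears in the diff; the other two parameter names and the standalone function are assumptions made for illustration.

```python
def excluded_reputations(query_params):
    """Build the list of ip_reputation values hidden from feeds by default."""
    excluded = []
    # Parameter names for the first two checks are assumed, not taken from the diff.
    if "include_mass_scanners" not in query_params:
        excluded.append("mass scanner")
    if "include_tor_exit_nodes" not in query_params:
        excluded.append("tor exit node")
    # This check mirrors the two added lines shown in the diff above.
    if "include_known_scanners" not in query_params:
        excluded.append("known scanner")
    return excluded
```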

Without this, the rDNS job would identify known scanners but they'd still appear in feeds — the "reduce noise" goal from #527 wouldn't be achieved.

That said, if you'd prefer to keep this PR focused on detection only and add the feed exclusion separately, I'm happy to remove it.

Ah, now I see! Yeah, please don't change the API right now. The "known scanner" wording is also a bit confusing. Please set the reputation to "mass scanner" and don't introduce a new wording.

regulartim · 2026-03-09T09:41:10Z

+        2. Process in batches: resolve PTR in parallel, store as tags,
+           update reputation for matches.
+        """
+        cutoff = timezone.now() - timedelta(hours=24)


Please use datetime.now() to stay consistent with the rest of the codebase.

regulartim · 2026-03-09T17:15:12Z

+        Args:
+            ip_address: IP address to update.
+        """
+        updated = self.ioc_repo.update_ioc_reputation(ip_address, "known scanner")


Here: please set the reputation to "mass scanner" instead of "known scanner".

Sahityaaryan · 2026-03-09T17:19:30Z

Great to see that both of us are online :_)

I have made some changes locally following your review:

Behavioral filtering — Implemented. The candidate query now filters on number_of_days_seen > 2, login_attempts = 0, and interaction_count < 2 * attack_count, ordered by -number_of_days_seen and capped at
This should bring the actual candidate count well below 500 in most cases, keeping runtime in the low seconds.
Daily schedule — Already in place (06:07 daily), so no change needed there.
Empty PTR records — Agreed, removed. Only IPs with actual PTR records are tagged now. IPs that don't resolve are left untagged so they're rechecked on the next run.
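
The behavioral candidate filter can be sketched in plain Python as below. The Candidate class and select_candidates function are hypothetical stand-ins for the actual database query, and the default limit of 500 is an assumption taken from the reviewer's "500 most probable candidates" suggestion.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical stand-in for an IOC row; field names follow the comment above.
    name: str
    number_of_days_seen: int
    login_attempts: int
    interaction_count: int
    attack_count: int


def select_candidates(iocs, limit=500):
    """Keep IOCs that behave like mass scanners, most frequently seen first."""
    matches = [
        ioc for ioc in iocs
        if ioc.number_of_days_seen > 2
        and ioc.login_attempts == 0
        and ioc.interaction_count < 2 * ioc.attack_count
    ]
    matches.sort(key=lambda ioc: ioc.number_of_days_seen, reverse=True)
    return matches[:limit]
```

Filtering before the DNS step bounds the number of lookups per run regardless of how many IOCs were seen that day.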

regulartim · 2026-03-09T18:16:13Z

Already in place (06:07 daily), so no change needed there.

Sorry, totally missed that!

Sahityaaryan · 2026-03-09T18:19:43Z

No worries! Pushing the updated code now with all the changes we discussed

Sahityaaryan · 2026-03-09T19:01:58Z

@regulartim
Pushed the updated changes — behavioral filtering, empty PTR removal, and "mass scanner" wording are all in place. Ready for re-review!

regulartim

Looks good! :) Only minor details left, then we can merge.

regulartim

Thank you, good work! :)

Please, for future PRs, try to avoid force-pushing changes. It's easier to review when you can explicitly look at the changes that happened since the last review. Force-pushes make that impossible.

Sahityaaryan · 2026-03-10T18:13:05Z

Sure, I will take care of it next time. Thank you very much @regulartim :)

Sahityaaryan · 2026-03-10T22:52:15Z

Hi @regulartim ,

I've been diving deeper into the codebase and have some architectural observations I'd love to discuss. I tried finding you on the Honeynet Discord, but there are a few people with similar names. Could you let me know your Discord username, or just send me a hi there, so I can DM you, please :)

regulartim · 2026-03-11T06:23:13Z

I am "tim." on discord. But I don't use discord that often. You can also reach out on the Honeynet GSoC Slack.

Sahityaaryan · 2026-03-11T15:54:20Z

@regulartim ,
Thanks Tim! I tried the Honeynet GSoC Slack but the invite link (gsoc-slack.honeynet.org) seems to have expired. Could you share an updated invite link? In the meantime, I'll DM you on Discord — I'm "Sahityaaryan" there.

:(

Sahityaaryan force-pushed the fix/reduce-noise-with-more-reliability-527 branch 2 times, most recently from 30ebac7 to 8ed025d Compare March 4, 2026 14:49

regulartim reviewed Mar 5, 2026

Sahityaaryan force-pushed the fix/reduce-noise-with-more-reliability-527 branch from 8ed025d to 84f1ddb Compare March 8, 2026 01:28

Sahityaaryan changed the title ~~feat: rDNS-based noise filtering for known scanners. Closes #527~~ rDNS-based noise filtering for known scanners. Closes #527 Mar 8, 2026

regulartim requested changes Mar 9, 2026

regulartim reviewed Mar 9, 2026

Sahityaaryan force-pushed the fix/reduce-noise-with-more-reliability-527 branch from 84f1ddb to a94ea04 Compare March 9, 2026 18:57

regulartim requested changes Mar 10, 2026

Comment thread greedybear/cronjobs/extraction/ioc_processor.py Outdated

Comment thread greedybear/cronjobs/repositories/tag.py

Comment thread greedybear/cronjobs/reverse_dns.py

Add rDNS-based mass scanner detection via behavioral heuristics

fcdc4ae

Sahityaaryan force-pushed the fix/reduce-noise-with-more-reliability-527 branch from a94ea04 to fcdc4ae Compare March 10, 2026 14:52

regulartim approved these changes Mar 10, 2026

regulartim merged commit 5cbda72 into GreedyBear-Project:develop Mar 10, 2026
4 checks passed
