rDNS-based noise filtering for known scanners. Closes #527 #934

Merged
regulartim merged 1 commit into GreedyBear-Project:develop from
Sahityaaryan:fix/reduce-noise-with-more-reliability-527
Mar 10, 2026

@Sahityaaryan
Contributor

Description

Add a daily reverse DNS cron job that identifies IPs belonging to known scanning
services (Shodan, Censys, Onyphe, etc.) and marks them as "known scanner" via
ip_reputation. This implements step 2 of #527 — rDNS-based filtering,
decoupled from the extraction pipeline.

What it does:

  • Queries IOCs with empty ip_reputation seen in the last 24h
  • Resolves PTR records with a 2s timeout per IP
  • Caches all results in a ReverseDNSRecord table (so IPs are never re-queried)
  • Matches PTR hostnames against a curated KNOWN_SCANNER_DOMAINS list
  • Updates matching IOCs to ip_reputation = "known scanner"
  • Feeds exclude "known scanner" by default (same pattern as mass scanner / tor exit node)
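
The hostname-matching step in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the domain entries and the helper name `is_known_scanner` are assumptions, and the real KNOWN_SCANNER_DOMAINS list may contain different entries.

```python
# Hypothetical entries for illustration; the PR's actual list may differ.
KNOWN_SCANNER_DOMAINS = ("shodan.io", "censys-scanner.com", "onyphe.net")


def is_known_scanner(ptr_hostname: str, domains=KNOWN_SCANNER_DOMAINS) -> bool:
    """Return True if the PTR hostname equals, or is a subdomain of, a listed domain."""
    # Normalize: PTR answers are case-insensitive and may end with a trailing dot.
    host = ptr_hostname.strip().rstrip(".").lower()
    # Match on a label boundary so "notshodan.io" does not match "shodan.io".
    return any(host == d or host.endswith("." + d) for d in domains)
```

Matching on a label boundary, rather than with a plain substring test, avoids false positives from look-alike domains.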

Also fixes _merge_iocs() to only overwrite ip_reputation when empty, preventing
extraction from wiping out reputation set by enrichment crons (discussed in #527).
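
A minimal sketch of the "only overwrite when empty" rule, assuming a simplified stand-in for the IOC model. The `IOC` dataclass and the `merge_reputation` helper are hypothetical; the real _merge_iocs() merges more fields than shown here.

```python
from dataclasses import dataclass


@dataclass
class IOC:
    """Simplified stand-in for the real IOC model (assumption)."""
    name: str
    ip_reputation: str = ""


def merge_reputation(existing: IOC, new: IOC) -> IOC:
    """Keep a reputation that is already set; fill it in only when empty."""
    if not existing.ip_reputation:
        existing.ip_reputation = new.ip_reputation
    return existing


# A reputation set earlier by an enrichment job survives a later extraction run.
stored = IOC("192.0.2.10", ip_reputation="mass scanner")
merged = merge_reputation(stored, IOC("192.0.2.10", ip_reputation=""))
```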

Related issues

Type of change

  • Bug fix (non-breaking change which fixes an issue).
  • New feature (non-breaking change which adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • Chore (refactoring, dependency updates, CI/CD changes, code cleanup, docs-only changes).

Checklist

Formalities

  • I have read and understood the rules about how to Contribute to this project.
  • I chose an appropriate title for the pull request in the form: <feature name>. Closes #999
  • My branch is based on develop.
  • The pull request is for the branch develop.
  • I have reviewed and verified any LLM-generated code included in this PR.

Docs and tests

  • I documented my code changes with docstrings and/or comments.
  • I have checked if my changes affect user-facing behavior that is described in the docs. If so, I also created a pull request in the
    docs repository.
  • Linter (Ruff) gave 0 errors. If you have correctly installed pre-commit, it does
    these checks and adjustments on your behalf.
  • I have added tests for the feature/bug I solved.
  • All the tests gave 0 errors.

GUI changes

Ignore this section if you did not make any changes to the GUI.

  • I have provided a screenshot of the result in the PR.
  • I have created new frontend tests for the new component or updated existing ones.

@Sahityaaryan
Contributor Author

@regulartim

Covers the rDNS-based filtering (step 2) plus the _merge_iocs() reputation
overwrite fix we discussed. 29 new tests, full suite passes.

Happy to address any feedback.

@Sahityaaryan Sahityaaryan force-pushed the fix/reduce-noise-with-more-reliability-527 branch 2 times, most recently from 30ebac7 to 8ed025d Compare March 4, 2026 14:49
Member

@regulartim regulartim left a comment

Hey @Sahityaaryan ! First of all, thanks for your work. Before going into a detailed code review, we need to discuss some conceptual challenges.

  1. You introduced a new model, but I would argue that we could use the existing Tag model for what you are doing. This would also have the advantage that the DNS information is automatically available at our API endpoints. What do you think?

  2. The central piece, the DNS lookup logic, won't perform well enough. Say we have 20,000 new IoCs in a day, which is not unrealistic, and 10% of them time out. With a 2s timeout per IP, the 2,000 timeouts alone add up to 4,000 seconds, so this job will take about 1 hour to complete. Do you see a way to process the IP addresses in batches? Maybe there is a Python library that can do the lookups asynchronously and/or in parallel?
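
The runtime estimate in point 2 can be reproduced with a small back-of-the-envelope calculation. This sketch counts only the time spent waiting on timeouts and ignores successful lookups, so it is a lower bound. The function name is hypothetical, and the `workers` parameter assumes ideal parallel speedup.

```python
def timeout_cost_seconds(total_ips: int, timeout_percent: int,
                         timeout_s: float, workers: int = 1) -> float:
    """Seconds spent waiting on timed-out lookups, assuming ideal parallelism."""
    timed_out = total_ips * timeout_percent // 100
    return timed_out * timeout_s / workers


# 20,000 IPs, 10% timing out, 2 s per timeout, looked up one after another.
sequential = timeout_cost_seconds(20_000, 10, 2.0)
print(f"{sequential:.0f} s, roughly {sequential / 60:.0f} minutes")
```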

@Sahityaaryan Sahityaaryan force-pushed the fix/reduce-noise-with-more-reliability-527 branch from 8ed025d to 84f1ddb Compare March 8, 2026 01:28
@Sahityaaryan Sahityaaryan changed the title feat: rDNS-based noise filtering for known scanners. Closes #527 rDNS-based noise filtering for known scanners. Closes #527 Mar 8, 2026
@Sahityaaryan
Contributor Author

Sahityaaryan commented Mar 8, 2026

@regulartim Great points, both addressed:

  1. Tag model — You're right, I've replaced the custom ReverseDNSRecord model with Tag entries (source="rdns", key="ptr_record"). PTR data is now automatically available via the existing API endpoints and supports filtering with ?tag_key=ptr_record&tag_value=shodan. No new model or migration needed.

  2. Parallel resolution — Switched from sequential lookups to concurrent.futures.ThreadPoolExecutor with 20 workers and batch processing (500 IPs/batch). With 20k IPs and 10% timeouts, this brings the runtime from ~1 hour down to a few minutes. Also added TagRepository.add_tags() for incremental tag creation (as opposed to the atomic replace used by
    ThreatFox/AbuseIPDB).

Other changes in the rework:

  • Added _merge_iocs() tests for the reputation preservation fix
  • Exception isolation per IP in batch resolution (one failure can't crash the batch)
  • ignore_conflicts=True on bulk_create for crash-recovery safety
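
A rough sketch of batched, parallel PTR resolution with per-IP exception isolation, as described above. It is not the PR's implementation: the function names are hypothetical, and `socket.gethostbyaddr` has no timeout parameter, so the 2s per-IP timeout mentioned in the description would need a different mechanism than the one shown here.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


def lookup_ptr(ip: str) -> Optional[str]:
    """Resolve one PTR record; return None when the lookup fails."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except OSError:  # socket.herror and socket.gaierror are subclasses of OSError
        return None


def resolve_in_batches(ips: List[str],
                       resolver: Callable[[str], Optional[str]] = lookup_ptr,
                       batch_size: int = 500,
                       max_workers: int = 20) -> Dict[str, Optional[str]]:
    """Resolve IPs batch by batch; a failing IP yields None instead of raising."""

    def safe(ip: str):
        # Isolate failures so one bad lookup cannot crash the whole batch.
        try:
            return ip, resolver(ip)
        except Exception:
            return ip, None

    results: Dict[str, Optional[str]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for start in range(0, len(ips), batch_size):
            batch = ips[start:start + batch_size]
            for ip, hostname in pool.map(safe, batch):
                results[ip] = hostname
    return results
```

Passing the resolver in as a parameter keeps the batching logic testable without network access.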

Let me know if anything else needs adjusting

Member

@regulartim regulartim left a comment

Hey @Sahityaaryan ! I tested your code and I am still worried about performance. The lookup for 500 IoCs took 15 seconds. Yes, we are now in the "minutes" and not in the "hours" anymore, but this is still a lot of time for a low priority job. I thought about ways to solve that and I have some ideas that I would like to discuss with you:

  1. We want to find mass scanners. So instead of taking all IOCs from the last 24 hours, we could narrow down the list of candidates to check based on behavior. Mass scanners are seen regularly (so len(days_seen) > 2), they have 0 login_attempts, and they perform few interactions per attack (something like interaction_count < 2*attack_count). So what do you think of using these metrics to pick the 500 most probable candidates and only checking them?

  2. If we can get the runtime down to seconds, we can run the job daily.

  3. I am not sure anymore if we should track IP addresses with an empty ptr record. If the record ever gets updated, we will miss it. This could be the case for very new mass scanners.

What do you think of the above?
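
The behavioral pre-filter from point 1 could look roughly like this in plain Python. The `Candidate` dataclass and `select_candidates` are hypothetical stand-ins. Sorting by the number of days seen is one possible reading of "most probable", and in the project itself this would more likely be a database query than an in-memory filter.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Candidate:
    """Simplified stand-in for an IOC row (assumption)."""
    name: str
    days_seen: List[str] = field(default_factory=list)
    login_attempts: int = 0
    interaction_count: int = 0
    attack_count: int = 0


def select_candidates(iocs: List[Candidate], limit: int = 500) -> List[Candidate]:
    """Keep IOCs that behave like mass scanners, most regularly seen first."""
    likely = [
        ioc for ioc in iocs
        if len(ioc.days_seen) > 2                         # seen regularly
        and ioc.login_attempts == 0                       # never tries to log in
        and ioc.interaction_count < 2 * ioc.attack_count  # few interactions per attack
    ]
    likely.sort(key=lambda ioc: len(ioc.days_seen), reverse=True)
    return likely[:limit]
```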

Comment thread api/views/utils.py
    self.exclude_reputation.append("tor exit node")
if "include_known_scanners" not in query_params:
    self.exclude_reputation.append("known scanner")

Member

What is this change for? It changes the way the API behaves, right?

Contributor Author

Yes, it changes the default API behavior — IPs with ip_reputation="known scanner" (set by the rDNS job) are now excluded from feeds by default, same as "mass scanner" and "tor exit node" on the lines above.
Users can opt-in with ?include_known_scanners in the query params.

Without this, the rDNS job would identify known scanners but they'd still appear in feeds — the "reduce noise" goal from #527 wouldn't be achieved.

That said, if you'd prefer to keep this PR focused on detection only and add the feed exclusion separately, I'm happy to remove it.
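
For illustration, the opt-in behavior described in this comment could be sketched as below. Only `include_known_scanners` appears in the diff under discussion. The other two parameter names and the helper function are assumptions made for the example.

```python
def build_excluded_reputations(query_params) -> list:
    """Collect reputations to hide from feeds unless the caller opts in."""
    excluded = []
    # Parameter names for the two existing categories are assumed here.
    if "include_mass_scanners" not in query_params:
        excluded.append("mass scanner")
    if "include_tor_exit_nodes" not in query_params:
        excluded.append("tor exit node")
    # The parameter proposed in this comment thread.
    if "include_known_scanners" not in query_params:
        excluded.append("known scanner")
    return excluded
```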

Member

Ah, now I see! Yeah, please don't change the API right now. The "known scanner" wording is also a bit confusing. Please set the reputation to "mass scanner" and don't introduce a new wording.

Comment thread greedybear/cronjobs/reverse_dns.py Outdated
2. Process in batches: resolve PTR in parallel, store as tags,
update reputation for matches.
"""
cutoff = timezone.now() - timedelta(hours=24)

Member

Please use datetime.now() to stay consistent with the rest of the codebase.

Comment thread greedybear/cronjobs/reverse_dns.py Outdated
Args:
ip_address: IP address to update.
"""
updated = self.ioc_repo.update_ioc_reputation(ip_address, "known scanner")

Member

Here: please set the reputation to "mass scanner" instead of "known scanner".

@Sahityaaryan
Contributor Author

Great to see that both of us are online :_)

I have made some changes locally following your review:

  1. Behavioral filtering: implemented. The candidate query now filters on number_of_days_seen > 2, login_attempts = 0, and interaction_count < 2 * attack_count, ordered by -number_of_days_seen and capped at 500. This should bring the actual candidate count well below 500 in most cases, keeping runtime in the low seconds.
  2. Daily schedule: already in place (06:07 daily), so no change needed there.
  3. Empty PTR records: agreed, removed. Only IPs with actual PTR records are tagged now. IPs that don't resolve are left untagged so they're rechecked on the next run.

@regulartim
Member

> Already in place (06:07 daily), so no change needed there.

Sorry, totally missed that!

@Sahityaaryan
Contributor Author

No worries! Pushing the updated code now with all the changes we discussed

@Sahityaaryan Sahityaaryan force-pushed the fix/reduce-noise-with-more-reliability-527 branch from 84f1ddb to a94ea04 Compare March 9, 2026 18:57
@Sahityaaryan
Contributor Author

@regulartim
Pushed the updated changes — behavioral filtering, empty PTR removal, and "mass scanner" wording are all in place. Ready for re-review!

Member

@regulartim regulartim left a comment

Looks good! :) Only minor details left, then we can merge.

Comment thread greedybear/cronjobs/extraction/ioc_processor.py Outdated
Comment thread greedybear/cronjobs/repositories/tag.py
Comment thread greedybear/cronjobs/reverse_dns.py
@Sahityaaryan Sahityaaryan force-pushed the fix/reduce-noise-with-more-reliability-527 branch from a94ea04 to fcdc4ae Compare March 10, 2026 14:52
Member

@regulartim regulartim left a comment

Thank you, good work! :)

Please, for future PRs, try to avoid force-pushing changes. It's easier to review when you can explicitly look at the changes that happened since the last review. Force-pushes make that impossible.

@regulartim regulartim merged commit 5cbda72 into GreedyBear-Project:develop Mar 10, 2026
4 checks passed
@Sahityaaryan
Contributor Author

Sure, I will take care of it next time. Thank you very much @regulartim :)

@Sahityaaryan
Contributor Author

Hi @regulartim ,

I've been diving deeper into the codebase and have some architectural observations I'd love to discuss. I tried finding you on the Honeynet Discord, but there are a few people with similar names. Could you let me know your Discord username, or just send me a hi there, so I can DM you, please? :)

@regulartim
Member

I am "tim." on discord. But I don't use discord that often. You can also reach out on the Honeynet GSoC Slack.

@Sahityaaryan
Contributor Author

Sahityaaryan commented Mar 11, 2026

@regulartim ,
Thanks Tim! I tried the Honeynet GSoC Slack but the invite link (gsoc-slack.honeynet.org) seems to have expired. Could you share an updated invite link? In the meantime, I'll DM you on Discord — I'm "Sahityaaryan" there.

:(
