fix: Refactor MassScannersCron to handle flexible IP formats. Closes #678 by opbot-xd · Pull Request #685 · intelowlproject/GreedyBear

opbot-xd · 2026-01-06T14:11:07Z

Description

This PR refactors the MassScannersCron job to handle diverse IP address formats from the external data source, eliminating false "unexpected line" warnings.

Problem: The old strict regex pattern (IPv4 # reason) flagged valid data as "unexpected" (plain IPs, IPv6, special strings).

Solution: Implemented both suggestions from issue discussion:

Flexible IP validation - Simple regex extracts candidates, is_valid_ipv4() validates them
Reduced log level - Changed WARNING → DEBUG for non-IP lines (expected from external source)

Changes:

Added is_valid_ipv4() utility in greedybear/cronjobs/extraction/utils.py (DRY principle)
Refactored get_ioc_type() to use new utility
Updated MassScannersCron parser with flexible extraction + validation
Added 26 comprehensive tests (11 for is_valid_ipv4(), 15 for MassScannersCron)

Now correctly handles:

Plain IPs: 45.83.67.252 ✅
IPs with comments: 192.168.1.100 # normal comment ✅
IPv6 (skipped at DEBUG): 2001:db8::1 ✅
Invalid strings (skipped at DEBUG): /w00tw00t.at.ISC.SANS.DFind:) ✅

Related issues

Closes #678

Type of change

Bug fix (non-breaking change which fixes an issue).

Checklist

I have read and understood the rules about how to Contribute to this project.
The pull request is for the branch develop.
I have added documentation of the new features.
Linter (Ruff) gave 0 errors. If you have correctly installed pre-commit, it does these checks and adjustments on your behalf.
I have added tests for the feature/bug I solved. All the tests (new and old ones) gave 0 errors.
If changes were made to an existing model/serializer/view, the docs were updated and regenerated (check CONTRIBUTE.md).

- Added is_valid_ipv4() utility function in extraction/utils.py for centralized IPv4 validation following DRY principle - Refactored get_ioc_type() to use the new is_valid_ipv4() utility - Updated MassScannersCron to use flexible regex pattern that extracts IP candidates and validates them programmatically - Changed log level from WARNING to DEBUG for non-IP lines since external data sources naturally contain various formats (IPv6, plain IPs without comments, other strings) - Added comprehensive tests for is_valid_ipv4() covering edge cases: * Valid IPs with/without whitespace * Out-of-range octets (>255) * Incomplete/malformed IPs * IPv6 addresses (correctly rejected) * Random strings and special characters - Added MassScannersCron integration tests using real-world examples: * Plain IPs without comments * IPs with comments (various formats) * IPv6 addresses (should be skipped) * Invalid strings like /w00tw00t.at.ISC.SANS.DFind:) * Mixed valid/invalid data All tests pass (67 total: 52 extraction utils + 15 mass scanners) Fixes issues with 'unexpected line' warnings for valid data formats that don't match the old strict regex pattern

opbot-xd · 2026-01-06T14:23:05Z

Test Coverage Rationale

Added 26 comprehensive tests because:

External data variability - The upstream source contains diverse formats (plain IPs, IPs with comments, IPv6, random strings) that we don't control
Core utility function - is_valid_ipv4() is now used by multiple modules, requiring thorough edge case validation
Regression prevention - Ensures the flexible parser correctly accepts valid IPs while rejecting invalid inputs
Documentation - Tests serve as living documentation of expected input formats and behavior

All 67 tests pass, giving confidence this handles real-world threat intelligence data correctly. ✅

opbot-xd · 2026-01-06T14:35:42Z

@regulartim Please review this PR when you get time...

mlodic

lgtm, waiting for a thumbs up from Tim too

regulartim

Sorry for being late to the party. Very nice work @opbot-xd ! Just a few minor things.

regulartim · 2026-01-08T07:19:16Z

greedybear/cronjobs/mass_scanners.py

+        # Simple regex to extract potential IPv4 addresses
+        ip_candidate_regex = re.compile(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})")
+        # Regex to extract optional comment/reason after '#'
+        comment_regex = re.compile(r"#\s*(.+)", re.DOTALL)


I know that this was in the code already, but why do we have re.DOTALL here? To capture comments spanning multiple lines? @mlodic

Good catch! You're right - re.DOTALL is likely unnecessary here since we're already processing the data line-by-line with r.iter_lines().

The comment extraction happens on individual lines, so newlines within a single line shouldn't occur. I kept re.DOTALL from the original implementation to maintain the same behavior, but it's probably redundant.

I can remove it and simplify to:

comment_regex = re.compile(r"#\s*(.+)")

Let's see what @mlodic says. He surely had a reason to put it in.

I investigated whether multi-line comments actually occur:
Finding: They don't. After checking the actual data source, the format is strictly:
IP # single-line comment

Why re.DOTALL can't help anyway:

r.iter_lines() splits the HTTP response by newlines

Each iteration processes one line (newlines are delimiters, not content)

Even if a comment had \n in it, we'd never see it - it would be split into separate lines before reaching our regex

Sorry for the confusion! When I made the changes, I only saw the two specific review comments (about re.DOTALL and log placement), so I addressed both and pushed.

Your comment asking to defer to @mlodic loaded after I had already committed and pushed - GitHub timing issue on my end. (forgot to refresh the page.) 😅

I've already pushed the changes addressing both review points, but I'm happy to revert if @mlodic prefers to keep re.DOTALL for any reason I'm not aware of. Just let me know!

thanks for letting me check but honestly I don't remember. Let's keep your change

greedybear/cronjobs/mass_scanners.py

- Remove redundant re.DOTALL flag from comment_regex Since we process line-by-line with iter_lines(), multi-line comments cannot occur (newlines are delimiters, not content). Flag is unnecessary. - Move logging after save() to avoid misleading logs if DB operation fails Ensures we only log on successful database saves

opbot-xd marked this pull request as ready for review January 6, 2026 14:26

mlodic reviewed Jan 6, 2026

View reviewed changes

regulartim requested changes Jan 8, 2026

View reviewed changes

mlodic merged commit bdf1a18 into intelowlproject:develop Jan 8, 2026
5 checks passed

mlodic mentioned this pull request Jan 8, 2026

Thousands of warnings from MassScannersCron #678

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: Refactor MassScannersCron to handle flexible IP formats. Closes #678#685

fix: Refactor MassScannersCron to handle flexible IP formats. Closes #678#685
mlodic merged 2 commits intointelowlproject:developfrom
opbot-xd:fix/mass-scanners-unexpected-warnings

opbot-xd commented Jan 6, 2026

Uh oh!

opbot-xd commented Jan 6, 2026

Uh oh!

opbot-xd commented Jan 6, 2026

Uh oh!

mlodic left a comment

Uh oh!

regulartim left a comment

Uh oh!

regulartim Jan 8, 2026

Uh oh!

opbot-xd Jan 8, 2026

Uh oh!

regulartim Jan 8, 2026

Uh oh!

opbot-xd Jan 8, 2026

Uh oh!

opbot-xd Jan 8, 2026 •

edited

Loading

Uh oh!

mlodic Jan 8, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

opbot-xd commented Jan 6, 2026

Description

Related issues

Type of change

Checklist

Uh oh!

opbot-xd commented Jan 6, 2026

Test Coverage Rationale

Uh oh!

opbot-xd commented Jan 6, 2026

Uh oh!

mlodic left a comment

Choose a reason for hiding this comment

Uh oh!

regulartim left a comment

Choose a reason for hiding this comment

Uh oh!

regulartim Jan 8, 2026

Choose a reason for hiding this comment

Uh oh!

opbot-xd Jan 8, 2026

Choose a reason for hiding this comment

Uh oh!

regulartim Jan 8, 2026

Choose a reason for hiding this comment

Uh oh!

opbot-xd Jan 8, 2026

Choose a reason for hiding this comment

Uh oh!

opbot-xd Jan 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mlodic Jan 8, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

opbot-xd Jan 8, 2026 •

edited

Loading