fix: Refactor MassScannersCron to handle flexible IP formats. Closes #678#685
Conversation
- Added is_valid_ipv4() utility function in extraction/utils.py for centralized IPv4 validation following DRY principle - Refactored get_ioc_type() to use the new is_valid_ipv4() utility - Updated MassScannersCron to use flexible regex pattern that extracts IP candidates and validates them programmatically - Changed log level from WARNING to DEBUG for non-IP lines since external data sources naturally contain various formats (IPv6, plain IPs without comments, other strings) - Added comprehensive tests for is_valid_ipv4() covering edge cases: * Valid IPs with/without whitespace * Out-of-range octets (>255) * Incomplete/malformed IPs * IPv6 addresses (correctly rejected) * Random strings and special characters - Added MassScannersCron integration tests using real-world examples: * Plain IPs without comments * IPs with comments (various formats) * IPv6 addresses (should be skipped) * Invalid strings like /w00tw00t.at.ISC.SANS.DFind:) * Mixed valid/invalid data All tests pass (67 total: 52 extraction utils + 15 mass scanners) Fixes issues with 'unexpected line' warnings for valid data formats that don't match the old strict regex pattern
Test Coverage RationaleAdded 26 comprehensive tests because:
All 67 tests pass, giving confidence this handles real-world threat intelligence data correctly. ✅ |
|
@regulartim Please review this PR when you get time... |
mlodic
left a comment
There was a problem hiding this comment.
lgtm, waiting for a thumbs up from Tim too
regulartim
left a comment
There was a problem hiding this comment.
Sorry for being late to the party. Very nice work @opbot-xd ! Just a few minor things.
greedybear/cronjobs/mass_scanners.py
Outdated
| # Simple regex to extract potential IPv4 addresses | ||
| ip_candidate_regex = re.compile(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})") | ||
| # Regex to extract optional comment/reason after '#' | ||
| comment_regex = re.compile(r"#\s*(.+)", re.DOTALL) |
There was a problem hiding this comment.
I know that this was in the code already, but why do we have re.DOTALL here? To capture comments spanning multiple lines? @mlodic
There was a problem hiding this comment.
Good catch! You're right - re.DOTALL is likely unnecessary here since we're already processing the data line-by-line with r.iter_lines().
The comment extraction happens on individual lines, so newlines within a single line shouldn't occur. I kept re.DOTALL from the original implementation to maintain the same behavior, but it's probably redundant.
I can remove it and simplify to:
comment_regex = re.compile(r"#\s*(.+)")There was a problem hiding this comment.
Let's see what @mlodic says. He surely had a reason to put it in.
There was a problem hiding this comment.
I investigated whether multi-line comments actually occur:
Finding: They don't. After checking the actual data source, the format is strictly:
IP # single-line comment
Why re.DOTALL can't help anyway:
r.iter_lines()splits the HTTP response by newlines- Each iteration processes one line (newlines are delimiters, not content)
- Even if a comment had
\nin it, we'd never see it - it would be split into separate lines before reaching our regex
There was a problem hiding this comment.
Sorry for the confusion! When I made the changes, I only saw the two specific review comments (about re.DOTALL and log placement), so I addressed both and pushed.
Your comment asking to defer to @mlodic loaded after I had already committed and pushed - GitHub timing issue on my end. (forgot to refresh the page.) 😅
I've already pushed the changes addressing both review points, but I'm happy to revert if @mlodic prefers to keep re.DOTALL for any reason I'm not aware of. Just let me know!
There was a problem hiding this comment.
thanks for letting me check but honestly I don't remember. Let's keep your change
- Remove redundant re.DOTALL flag from comment_regex Since we process line-by-line with iter_lines(), multi-line comments cannot occur (newlines are delimiters, not content). Flag is unnecessary. - Move logging after save() to avoid misleading logs if DB operation fails Ensures we only log on successful database saves
Description
This PR refactors the MassScannersCron job to handle diverse IP address formats from the external data source, eliminating false "unexpected line" warnings.
Problem: The old strict regex pattern (
IPv4 # reason) flagged valid data as "unexpected" (plain IPs, IPv6, special strings).Solution: Implemented both suggestions from issue discussion:
Changes:
Now correctly handles:
45.83.67.252✅192.168.1.100 # normal comment✅2001:db8::1✅/w00tw00t.at.ISC.SANS.DFind:)✅Related issues
Closes #678
Type of change
Checklist
develop.Ruff) gave 0 errors. If you have correctly installed pre-commit, it does these checks and adjustments on your behalf.