Skip to content

fix: Refactor MassScannersCron to handle flexible IP formats. Closes #678#685

Merged
mlodic merged 2 commits intointelowlproject:developfrom
opbot-xd:fix/mass-scanners-unexpected-warnings
Jan 8, 2026
Merged

fix: Refactor MassScannersCron to handle flexible IP formats. Closes #678#685
mlodic merged 2 commits intointelowlproject:developfrom
opbot-xd:fix/mass-scanners-unexpected-warnings

Conversation

@opbot-xd
Copy link
Copy Markdown
Contributor

@opbot-xd opbot-xd commented Jan 6, 2026

Description

This PR refactors the MassScannersCron job to handle diverse IP address formats from the external data source, eliminating false "unexpected line" warnings.

Problem: The old strict regex pattern (IPv4 # reason) flagged valid data as "unexpected" (plain IPs, IPv6, special strings).

Solution: Implemented both suggestions from issue discussion:

  1. Flexible IP validation - Simple regex extracts candidates, is_valid_ipv4() validates them
  2. Reduced log level - Changed WARNING → DEBUG for non-IP lines (expected from external source)

Changes:

  • Added is_valid_ipv4() utility in greedybear/cronjobs/extraction/utils.py (DRY principle)
  • Refactored get_ioc_type() to use new utility
  • Updated MassScannersCron parser with flexible extraction + validation
  • Added 26 comprehensive tests (11 for is_valid_ipv4(), 15 for MassScannersCron)

Now correctly handles:

  • Plain IPs: 45.83.67.252
  • IPs with comments: 192.168.1.100 # normal comment
  • IPv6 (skipped at DEBUG): 2001:db8::1
  • Invalid strings (skipped at DEBUG): /w00tw00t.at.ISC.SANS.DFind:)

Related issues

Closes #678

Type of change

  • Bug fix (non-breaking change which fixes an issue).

Checklist

  • I have read and understood the rules about how to Contribute to this project.
  • The pull request is for the branch develop.
  • I have added documentation of the new features.
  • Linter (Ruff) gave 0 errors. If you have correctly installed pre-commit, it does these checks and adjustments on your behalf.
  • I have added tests for the feature/bug I solved. All the tests (new and old ones) gave 0 errors.
  • If changes were made to an existing model/serializer/view, the docs were updated and regenerated (check CONTRIBUTE.md).

- Added is_valid_ipv4() utility function in extraction/utils.py for
  centralized IPv4 validation following DRY principle
- Refactored get_ioc_type() to use the new is_valid_ipv4() utility
- Updated MassScannersCron to use flexible regex pattern that
  extracts IP candidates and validates them programmatically
- Changed log level from WARNING to DEBUG for non-IP lines since
  external data sources naturally contain various formats (IPv6,
  plain IPs without comments, other strings)
- Added comprehensive tests for is_valid_ipv4() covering edge cases:
  * Valid IPs with/without whitespace
  * Out-of-range octets (>255)
  * Incomplete/malformed IPs
  * IPv6 addresses (correctly rejected)
  * Random strings and special characters
- Added MassScannersCron integration tests using real-world examples:
  * Plain IPs without comments
  * IPs with comments (various formats)
  * IPv6 addresses (should be skipped)
  * Invalid strings like /w00tw00t.at.ISC.SANS.DFind:)
  * Mixed valid/invalid data

All tests pass (67 total: 52 extraction utils + 15 mass scanners)

Fixes issues with 'unexpected line' warnings for valid data formats
that don't match the old strict regex pattern
@opbot-xd
Copy link
Copy Markdown
Contributor Author

opbot-xd commented Jan 6, 2026

Test Coverage Rationale

Added 26 comprehensive tests because:

  1. External data variability - The upstream source contains diverse formats (plain IPs, IPs with comments, IPv6, random strings) that we don't control
  2. Core utility function - is_valid_ipv4() is now used by multiple modules, requiring thorough edge case validation
  3. Regression prevention - Ensures the flexible parser correctly accepts valid IPs while rejecting invalid inputs
  4. Documentation - Tests serve as living documentation of expected input formats and behavior

All 67 tests pass, giving confidence this handles real-world threat intelligence data correctly. ✅

@opbot-xd opbot-xd marked this pull request as ready for review January 6, 2026 14:26
@opbot-xd
Copy link
Copy Markdown
Contributor Author

opbot-xd commented Jan 6, 2026

@regulartim Please review this PR when you get time...

Copy link
Copy Markdown
Member

@mlodic mlodic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, waiting for a thumbs up from Tim too

Copy link
Copy Markdown
Collaborator

@regulartim regulartim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for being late to the party. Very nice work @opbot-xd ! Just a few minor things.

# Simple regex to extract potential IPv4 addresses
ip_candidate_regex = re.compile(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})")
# Regex to extract optional comment/reason after '#'
comment_regex = re.compile(r"#\s*(.+)", re.DOTALL)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that this was in the code already, but why do we have re.DOTALL here? To capture comments spanning multiple lines? @mlodic

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! You're right - re.DOTALL is likely unnecessary here since we're already processing the data line-by-line with r.iter_lines().

The comment extraction happens on individual lines, so newlines within a single line shouldn't occur. I kept re.DOTALL from the original implementation to maintain the same behavior, but it's probably redundant.

I can remove it and simplify to:

comment_regex = re.compile(r"#\s*(.+)")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see what @mlodic says. He surely had a reason to put it in.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I investigated whether multi-line comments actually occur:
Finding: They don't. After checking the actual data source, the format is strictly:
IP # single-line comment

Why re.DOTALL can't help anyway:

  • r.iter_lines() splits the HTTP response by newlines
  • Each iteration processes one line (newlines are delimiters, not content)
  • Even if a comment had \n in it, we'd never see it - it would be split into separate lines before reaching our regex

Copy link
Copy Markdown
Contributor Author

@opbot-xd opbot-xd Jan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the confusion! When I made the changes, I only saw the two specific review comments (about re.DOTALL and log placement), so I addressed both and pushed.

Your comment asking to defer to @mlodic loaded after I had already committed and pushed - GitHub timing issue on my end. (forgot to refresh the page.) 😅

I've already pushed the changes addressing both review points, but I'm happy to revert if @mlodic prefers to keep re.DOTALL for any reason I'm not aware of. Just let me know!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for letting me check but honestly I don't remember. Let's keep your change

- Remove redundant re.DOTALL flag from comment_regex
  Since we process line-by-line with iter_lines(), multi-line comments
  cannot occur (newlines are delimiters, not content). Flag is unnecessary.
- Move logging after save() to avoid misleading logs if DB operation fails
  Ensures we only log on successful database saves
@mlodic mlodic merged commit bdf1a18 into intelowlproject:develop Jan 8, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants