Strategy-specific E2E tests and edge cases for ExtractionPipeline. Closes #636 by opbot-xd · Pull Request #740 · GreedyBear-Project/GreedyBear

opbot-xd · 2026-01-28T12:25:04Z

Description

This PR adds comprehensive strategy-specific E2E tests and edge cases for the ExtractionPipeline class, completing the test coverage requested in issue #636.

Building on PR #735 (pipeline infrastructure + core flow tests), this PR adds:

New Test Classes (24 tests total, ~795 lines):

TestCowrieStrategyE2E (3 tests)
- Scanner extraction flow with session hits
- Payload extraction from login messages with embedded URLs
- File download URL extraction
TestLog4potStrategyE2E (3 tests)
- JNDI/LDAP exploit extraction flow
- Base64-encoded payload extraction with hidden URLs
- Non-exploit hits filtering verification
TestGenericStrategyE2E (3 tests)
- Fallback behavior for unknown honeypots
- Heralding honeypot processing
- Dionaea honeypot processing
TestMixedHoneypotE2E (1 test)
- Mixed honeypot hits with correct strategy selection
TestEdgeCasesE2E (10 tests)
- Malformed hits with missing required fields
- Sensor extraction validation
- Honeypot skipping when not ready (is_ready_for_extraction returning False)
- Empty IOC records handling
- Multiple strategy exceptions handling
- Partial strategy success scenarios
- Hit grouping verification
- Special characters in fields
- Large batch processing (1000 hits)
TestFactoryIntegration (4 tests)
- Factory creates correct strategy for Cowrie
- Factory creates correct strategy for Log4pot
- Factory creates generic strategy for unknown honeypots
- Case-sensitive honeypot name matching
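
The factory behaviour covered by TestFactoryIntegration can be pictured with a minimal sketch. All class and method names below (`StrategyFactory`, `get_strategy`, the registry dict, the toy strategy classes) are hypothetical stand-ins and not GreedyBear's actual API; the sketch only illustrates an exact, case-sensitive name lookup with a generic fallback.

```python
class GenericStrategy:
    """Hypothetical fallback for honeypots without a dedicated strategy."""

    def __init__(self, honeypot: str):
        self.honeypot = honeypot


class CowrieStrategy(GenericStrategy):
    """Hypothetical stand-in for a Cowrie-specific strategy."""


class Log4potStrategy(GenericStrategy):
    """Hypothetical stand-in for a Log4pot-specific strategy."""


class StrategyFactory:
    # Assumed layout: exact, case-sensitive names map to dedicated strategies.
    _registry = {"Cowrie": CowrieStrategy, "Log4pot": Log4potStrategy}

    def get_strategy(self, honeypot: str) -> GenericStrategy:
        # Unknown names (including wrong casing) fall back to the generic class.
        strategy_cls = self._registry.get(honeypot, GenericStrategy)
        return strategy_cls(honeypot)


factory = StrategyFactory()
assert type(factory.get_strategy("Cowrie")) is CowrieStrategy
assert type(factory.get_strategy("cowrie")) is GenericStrategy  # case-sensitive
assert type(factory.get_strategy("Heralding")) is GenericStrategy
```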

Related issues

Closes #636

Type of change

New feature (non-breaking change which adds functionality).

Checklist

I have read and understood the rules about how to Contribute to this project.
The pull request is for the branch develop.
I have added documentation of the new features.
Linter (Ruff) gave 0 errors. If you have correctly installed pre-commit, it does these checks and adjustments on your behalf.
I have added tests for the feature/bug I solved. All the tests (new and old ones) gave 0 errors.
If changes were made to an existing model/serializer/view, the docs were updated and regenerated (check CONTRIBUTE.md).
If the GUI has been modified:
- I have provided a screenshot of the result in the PR.
- I have created new frontend tests for the new component or updated existing ones.

Copilot

Pull request overview

This PR adds comprehensive end-to-end tests for the ExtractionPipeline class, building on the foundation laid in PR #735. The tests verify the complete extraction workflow from Elasticsearch hits through strategy selection to IOC persistence and scoring.

Changes:

- Added 24 new tests (~795 lines) covering strategy-specific E2E scenarios, edge cases, and factory integration
- Tests validate Cowrie, Log4pot, and Generic extraction strategies through the full pipeline
- Edge case tests verify error handling, validation, and boundary conditions

…mocks
- Replace weak assertGreaterEqual(result, 0) with specific mock.call_count assertions
- Fix E2E tests to use proper ExtractionStrategyFactory mocking pattern
- Remove unnecessary UpdateScores patch decorators from factory tests
- Remove unused mock_scores parameters
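
To illustrate why a check like assertGreaterEqual(result, 0) is weak compared with a call-count assertion, here is a small sketch using `unittest.mock`. The `run_pipeline` function and the `add_ioc` method are hypothetical and exist only for this illustration; they are not taken from the PR.

```python
from unittest.mock import MagicMock


def run_pipeline(hits, repo):
    """Hypothetical stand-in: persists one record per hit, returns the count."""
    for hit in hits:
        repo.add_ioc(hit["src_ip"])
    return len(hits)


repo = MagicMock()
result = run_pipeline([{"src_ip": "192.0.2.1"}, {"src_ip": "192.0.2.2"}], repo)

# Weak: holds for any non-negative return value, even if nothing was persisted.
assert result >= 0

# Specific: pins down exactly how often the collaborator was used, and with what.
assert repo.add_ioc.call_count == 2
repo.add_ioc.assert_any_call("192.0.2.1")
```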

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

opbot-xd · 2026-01-28T14:30:15Z

Coverage

| Module | Coverage |
| --- | --- |
| `pipeline.py` | 100% |
| `strategies/factory.py` | 100% |
| `strategies/base.py` | 100% |
| `strategies/generic.py` | 100% |
| `strategies/__init__.py` | 100% |

All 446 tests passing.

Hello @regulartim the PR is ready for review.
I believe the test_extraction_pipeline.py is getting huge (1183 lines). Any suggestions for that?

regulartim

Nice! 👍 Two things might need to be changed though:

- Yes, the file is too long. I think you can split it up by test category (for example factory tests, edge cases, strategy E2E) and name the files something like test_extraction_pipeline_factory.py.
- In the strategy-specific tests, you are mocking the ExtractionStrategyFactory and the strategies themselves. Is there a good reason to do this? Isn't it best to just mock the repository, so that we are testing the whole extraction process as it runs in production?

opbot-xd · 2026-01-29T10:43:09Z

Hi @regulartim
My initial approach was to unit-test the pipeline orchestration, verifying that it correctly coordinates between the factory, strategies, and scoring. I mocked the factory and strategies to isolate the pipeline logic from strategy implementation details.

However, you're right that for true E2E testing, we should let the real components run and only mock at the repository boundary. This gives us confidence that the actual integration works. I'll refactor to use the real factory and strategies.
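
A rough sketch of what "mock only at the repository boundary" can look like is shown below. `ToyStrategy`, `ToyPipeline`, and the repository's `save` method are hypothetical names invented for this example; the real GreedyBear classes and signatures may differ. The point is that the strategy logic runs for real and only the I/O collaborator is replaced.

```python
from unittest.mock import MagicMock


class ToyStrategy:
    """Hypothetical strategy whose real logic runs un-mocked in the test."""

    def __init__(self, ioc_repo):
        self.ioc_repo = ioc_repo

    def extract(self, hits):
        # Deduplicate source IPs and skip malformed hits without the field.
        ips = sorted({h["src_ip"] for h in hits if "src_ip" in h})
        for ip in ips:
            self.ioc_repo.save(ip)
        return ips


class ToyPipeline:
    """Hypothetical pipeline wiring a real strategy to an injected repository."""

    def __init__(self, ioc_repo):
        self.strategy = ToyStrategy(ioc_repo)

    def execute(self, hits):
        return len(self.strategy.extract(hits))


# Only the repository (the I/O boundary) is replaced by a mock.
ioc_repo = MagicMock()
pipeline = ToyPipeline(ioc_repo)
count = pipeline.execute(
    [{"src_ip": "198.51.100.7"}, {"src_ip": "198.51.100.7"}, {}]
)

assert count == 1
ioc_repo.save.assert_called_once_with("198.51.100.7")
```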

- Split monolithic test file into 4 focused files
- E2E tests now use real ExtractionStrategyFactory and strategies
- Only mock repositories at the boundary
- Tests actual integration path as it runs in production

- test_honeypot_skipped_when_not_ready (grouping file)
- test_strategy_returns_empty_ioc_records (E2E file)
- test_partial_strategy_success (E2E file)
- test_large_batch_of_hits (E2E file)

regulartim

Sorry, I just realized that none of the tests verifies that the extracted IOCs are actually populated. Could you add at least one E2E test that inspects the actual IOC content (IP, honeypot type) rather than just the count?

- Add TestIocContentVerification class with 3 tests for IOC content verification
- Move E2ETestCase class to tests/__init__.py for shared usage (reviewer feedback)
- Split edge cases into test_extraction_pipeline_edge_cases.py

Edge cases now clearly document when mocking is required:
- test_partial_strategy_success: Mocks factory (needs to force exception)
- test_large_batch_of_hits_with_real_strategy: Uses REAL strategy

Tests added:
- test_cowrie_ioc_content_verified: Verifies IOC has correct IP
- test_multiple_honeypots_ioc_content_verified: Verifies multiple IOCs
- test_ioc_scanner_field_contains_honeypot_type: Verifies scanner field

Addresses reviewer feedback to:
1. Verify actual IOC content, not just count
2. Move shared test infrastructure to tests/__init__.py
3. Keep test files focused and manageable in size
4. Use real strategies where possible in tests
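
One way to verify IOC content rather than just a count is to inspect the arguments a mocked repository received. The sketch below assumes a hypothetical `Ioc` dataclass, a hypothetical `extract_and_save` helper, and a repository method called `save`; none of these names are taken from the PR.

```python
from dataclasses import dataclass, field
from unittest.mock import MagicMock


@dataclass
class Ioc:
    """Hypothetical IOC record; field names are illustrative only."""

    name: str
    honeypots: list = field(default_factory=list)


def extract_and_save(hits, repo):
    """Hypothetical extraction: one IOC per hit, handed to the repository."""
    for hit in hits:
        repo.save(Ioc(name=hit["src_ip"], honeypots=[hit["type"]]))


repo = MagicMock()
extract_and_save([{"src_ip": "203.0.113.5", "type": "Cowrie"}], repo)

# Pull the actual object passed to the mock and check its fields,
# instead of only asserting that save() was called once.
saved = repo.save.call_args.args[0]
assert saved.name == "203.0.113.5"
assert saved.honeypots == ["Cowrie"]
```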

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

opbot-xd · 2026-01-29T14:11:54Z

Hi @regulartim

Regarding the TestIocContentVerification class in tests/greedybear/cronjobs/test_extraction_pipeline_e2e.py:

We encountered a trade-off between using real strategies (as requested) and having unconditional deterministic assertions.

Copilot issue: The tests use conditional assertions (if mock_scores.return_value.score_only.called:). This is technically non-deterministic because if extraction fails silently, the test still passes without verifying content.
The Conflict: Guaranteeing IOC extraction with real strategies requires complex setup (specific hit fields, database state via patched repositories like CowrieSessionRepository). If we don't mock the factory to force an IOC return, we rely on the real strategy's logic.

Decision:
I have prioritized using real production code.

The tests use real strategies and only mock the repositories (CowrieSessionRepository, IocRepository).
The assertions remain conditional: IF the real strategy extracts IOCs (which it should, given the mock hits), THEN we verify the content strictly.

I believe this adheres best to the goal of "testing the actual integration path" while still adding the requested content verification.
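
The trade-off described in this comment can be sketched as follows. `score_only` is the method named in the comment; the helper functions and the assumption that it receives a list of IP strings as its first positional argument are hypothetical. The conditional variant passes even when nothing reached scoring, whereas asserting the precondition first surfaces that case.

```python
from unittest.mock import MagicMock


def check_conditionally(score_mock, expected_ip):
    """Conditional style: verifies nothing if scoring never ran."""
    if score_mock.score_only.called:
        iocs = score_mock.score_only.call_args.args[0]
        assert expected_ip in iocs
    return True


def check_strictly(score_mock, expected_ip):
    """Strict style: assert the precondition first, then the content."""
    assert score_mock.score_only.called, "no IOCs reached scoring"
    iocs = score_mock.score_only.call_args.args[0]
    assert expected_ip in iocs
    return True


never_called = MagicMock()  # simulates an extraction that produced nothing
assert check_conditionally(never_called, "192.0.2.9")  # passes vacuously

try:
    check_strictly(never_called, "192.0.2.9")
    strict_failed = False
except AssertionError:
    strict_failed = True
assert strict_failed  # the strict variant surfaces the silent failure
```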

regulartim · 2026-01-29T14:16:40Z

Hey @opbot-xd ! Thank you for your work. Are you done with working on this? To me it looks good now. Would like to approve and merge.

opbot-xd · 2026-01-29T14:21:19Z

Yes, I have finished my work on this PR. Should I update the PR description?

regulartim · 2026-01-29T14:21:51Z

No, all good! :)

Add strategy-specific E2E tests and edge cases for ExtractionPipeline. …

f35ce0c

…Closes GreedyBear-Project#636

Copilot AI review requested due to automatic review settings January 28, 2026 12:25

Copilot started reviewing on behalf of opbot-xd January 28, 2026 12:25

Copilot AI reviewed Jan 28, 2026

opbot-xd marked this pull request as draft January 28, 2026 13:52

opbot-xd requested a review from Copilot January 28, 2026 14:25

Copilot AI reviewed Jan 28, 2026

opbot-xd marked this pull request as ready for review January 28, 2026 14:25

opbot-xd requested a review from Copilot January 28, 2026 14:25

Copilot started reviewing on behalf of opbot-xd January 28, 2026 14:25

Copilot AI reviewed Jan 28, 2026

regulartim requested changes Jan 29, 2026

opbot-xd added 2 commits January 29, 2026 16:19

refactor: split pipeline tests and use real factory/strategies in E2E

0f2231f

test: add back edge cases for pipeline tests

be02f92

opbot-xd requested a review from regulartim January 29, 2026 10:58

regulartim requested changes Jan 29, 2026

Comment thread tests/greedybear/cronjobs/test_extraction_pipeline_e2e.py Outdated

Copilot AI review requested due to automatic review settings January 29, 2026 13:30

Copilot started reviewing on behalf of opbot-xd January 29, 2026 13:31

Copilot AI reviewed Jan 29, 2026

opbot-xd added 2 commits January 29, 2026 19:08

Fix misleading comment in large batch test

ce87bac

test: explicitly assert IOC extraction count before verifying scoring…

ed038d6

… call in e2e pipeline test

regulartim mentioned this pull request Jan 29, 2026

Reduce memory usage by chunking Elasticsearch queries. Closes #630 #750

Merged

12 tasks

regulartim merged commit a5e95cb into GreedyBear-Project:develop Jan 29, 2026
4 checks passed
