Refactor extraction process. Closes #622. #624

Merged
regulartim merged 23 commits into develop from
refactor_extraction_process
Dec 20, 2025
@regulartim regulartim commented Dec 18, 2025

Description

This PR introduces a complete rework of the extraction process. The idea is to improve testability, extensibility and maintainability by following some best practices:

  • repository pattern: repositories handle data access without containing any processing logic
  • single responsibility: every class in the process has one clear and recognizable responsibility
  • dependency injection: dependencies are injected through constructors which makes testing much easier
  • strategy pattern: makes it easier to add new "special treatment" for honeypots
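
As a rough illustration of how the strategy pattern keeps honeypot-specific "special treatment" isolated, here is a minimal sketch. Only the names `ExtractionStrategy`, `CowrieExtractionStrategy`, `StrategyFactory`, `get_strategy` and `iocs_from_hits` appear in this PR. Everything else (the `GenericExtractionStrategy` class, the `src_ip` and `eventid` hit fields, the dict-based IOCs, and the login counting) is an assumption made for the example and not the actual GreedyBear code.

```python
from abc import ABC, abstractmethod


class ExtractionStrategy(ABC):
    """Base class: one subclass per kind of honeypot-specific extraction logic."""

    def __init__(self, honeypot: str):
        self.honeypot = honeypot

    @abstractmethod
    def iocs_from_hits(self, hits: list) -> list:
        """Turn raw hits into IOC records (plain dicts in this sketch)."""


class GenericExtractionStrategy(ExtractionStrategy):
    """Hypothetical default: one IOC per distinct source IP."""

    def iocs_from_hits(self, hits):
        # "src_ip" is an assumed field name for this example.
        names = sorted({h["src_ip"] for h in hits if "src_ip" in h})
        return [{"name": n, "honeypot": self.honeypot} for n in names]


class CowrieExtractionStrategy(GenericExtractionStrategy):
    """Hypothetical special treatment: also count failed logins per IP."""

    def iocs_from_hits(self, hits):
        iocs = super().iocs_from_hits(hits)
        for ioc in iocs:
            ioc["login_attempts"] = sum(
                1
                for h in hits
                if h.get("src_ip") == ioc["name"]
                and h.get("eventid") == "cowrie.login.failed"  # assumed field/value
            )
        return iocs


class StrategyFactory:
    """Maps honeypot names to strategies; unknown names get the default."""

    def __init__(self):
        self._special = {"Cowrie": CowrieExtractionStrategy}

    def get_strategy(self, honeypot: str) -> ExtractionStrategy:
        strategy_cls = self._special.get(honeypot, GenericExtractionStrategy)
        return strategy_cls(honeypot)
```

In a layout like this, adding special treatment for another honeypot would only require a new subclass and one extra entry in the factory's mapping.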

The new process flow looks like this:

sequenceDiagram
    participant Job as ExtractionJob
    participant Pipeline as ExtractionPipeline
    participant Elastic as ElasticRepository
    participant Factory as StrategyFactory
    participant Strategy as ExtractionStrategy
    participant Processor as IocProcessor
    participant Repo as IocRepository
    
    Job->>Pipeline: execute()
    Pipeline->>Elastic: search(minutes_back)
    Elastic-->>Pipeline: hits[]
    
    loop Each honeypot
        Pipeline->>Factory: get_strategy(honeypot)
        Factory-->>Pipeline: strategy
        Pipeline->>Strategy: extract_from_hits(hits)
        Strategy->>Strategy: iocs_from_hits(hits)
        
        loop Each IOC
            Strategy->>Processor: add_ioc(ioc)
            Processor->>Repo: get_ioc_by_name(name)
            alt IOC exists
                Processor->>Processor: merge_iocs()
                Processor->>Repo: save(ioc)
            else New IOC
                Processor->>Repo: save(ioc)
            end
        end
    end
    
    Pipeline->>Pipeline: UpdateScores()

A single ExtractionPipeline instance orchestrates the extraction of all available honeypots. It uses the ElasticRepository to retrieve a list of all honeypot hits from a certain time window. For each honeypot it gets the corresponding ExtractionStrategy, which contains all the extraction logic that is specific to a certain type of honeypot (e.g. Cowrie). The ExtractionStrategy uses this logic to create IOC objects and hands them to the IocProcessor, which is responsible for - well - processing them so they can be written to the database via the IocRepository.
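
To make the flow described above more concrete, here is a small, self-contained sketch that wires the pieces together with constructor injection and in-memory stand-ins. The class and method names follow the diagram (`ExtractionPipeline.execute`, `search(minutes_back)`, `extract_from_hits`, `add_ioc`, `get_ioc_by_name`, `save`, `merge_iocs`). The `Ioc` dataclass, the fake repositories, the `SourceIpStrategy` class, the `type` and `src_ip` hit fields, the merge rules, and passing the factory as a plain callable are all assumptions for illustration. The score update step is left out.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Ioc:
    name: str
    honeypots: set = field(default_factory=set)
    attack_count: int = 1


class InMemoryIocRepository:
    """Stand-in for IocRepository: data access only, no processing logic."""

    def __init__(self):
        self._store = {}

    def get_ioc_by_name(self, name):
        return self._store.get(name)

    def save(self, ioc):
        self._store[ioc.name] = ioc
        return ioc


class FakeElasticRepository:
    """Stand-in for ElasticRepository that returns canned hits."""

    def __init__(self, hits):
        self._hits = hits

    def search(self, minutes_back):
        return list(self._hits)


class IocProcessor:
    def __init__(self, ioc_repo):
        self.ioc_repo = ioc_repo  # injected, so tests can pass a fake

    def add_ioc(self, ioc):
        existing = self.ioc_repo.get_ioc_by_name(ioc.name)
        if existing is not None:
            ioc = self.merge_iocs(existing, ioc)
        return self.ioc_repo.save(ioc)

    @staticmethod
    def merge_iocs(existing, new):
        # Assumed merge rules: union the honeypots, add up the counts.
        existing.honeypots |= new.honeypots
        existing.attack_count += new.attack_count
        return existing


class SourceIpStrategy:
    """Hypothetical strategy: one IOC per distinct source IP."""

    def __init__(self, honeypot, processor):
        self.honeypot = honeypot
        self.processor = processor

    def iocs_from_hits(self, hits):
        ips = sorted({h["src_ip"] for h in hits})
        return [Ioc(name=ip, honeypots={self.honeypot}) for ip in ips]

    def extract_from_hits(self, hits):
        for ioc in self.iocs_from_hits(hits):
            self.processor.add_ioc(ioc)


class ExtractionPipeline:
    def __init__(self, elastic_repo, get_strategy):
        self.elastic_repo = elastic_repo
        self.get_strategy = get_strategy  # callable: honeypot name -> strategy

    def execute(self, minutes_back=10):
        hits_by_honeypot = defaultdict(list)
        for hit in self.elastic_repo.search(minutes_back):
            hits_by_honeypot[hit["type"]].append(hit)
        for honeypot, hits in hits_by_honeypot.items():
            self.get_strategy(honeypot).extract_from_hits(hits)
        return len(hits_by_honeypot)  # number of honeypots processed


hits = [
    {"type": "Cowrie", "src_ip": "198.51.100.7"},
    {"type": "Heralding", "src_ip": "198.51.100.7"},
    {"type": "Cowrie", "src_ip": "203.0.113.9"},
]
repo = InMemoryIocRepository()
processor = IocProcessor(repo)
pipeline = ExtractionPipeline(
    FakeElasticRepository(hits),
    lambda honeypot: SourceIpStrategy(honeypot, processor),
)
processed = pipeline.execute(minutes_back=10)
merged = repo.get_ioc_by_name("198.51.100.7")
```

Because every collaborator is passed in through a constructor, a test can swap the Elasticsearch and database access for fakes like the ones above without patching anything.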

Key changes (functional)

  • Sensors are now extracted in every extraction run. No extra job needed.
  • General honeypots that are not yet in the database are automatically added and extracted (until disabled manually).
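
The second change could be pictured roughly as follows. This is only a sketch under assumptions: `InMemoryHoneypotRepository`, `ensure_exists`, `disable`, `is_enabled` and `honeypots_to_extract` are hypothetical names invented for this example, and the real implementation stores honeypots in the database rather than in a dict.

```python
class InMemoryHoneypotRepository:
    """Hypothetical stand-in for a database-backed honeypot table."""

    def __init__(self):
        self._enabled = {}  # honeypot name -> enabled flag

    def ensure_exists(self, name: str) -> bool:
        """Register an unknown honeypot as enabled; return True if newly added."""
        if name in self._enabled:
            return False
        self._enabled[name] = True
        return True

    def disable(self, name: str) -> None:
        self._enabled[name] = False

    def is_enabled(self, name: str) -> bool:
        return self._enabled.get(name, False)


def honeypots_to_extract(hit_types, repo):
    """Return the honeypot names that should be extracted in this run."""
    selected = []
    for name in sorted(set(hit_types)):
        repo.ensure_exists(name)   # auto-add on first sight
        if repo.is_enabled(name):  # skip manually disabled honeypots
            selected.append(name)
    return selected
```

A honeypot seen for the first time is extracted right away, and it keeps being extracted on later runs until someone calls `disable` for it.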

Next steps

  • Thoroughly test the new process in a production-like environment. Although I wrote a lot of tests, we might still find some bugs, as the extraction process is quite complex. This should be done before we merge the changes to main.
  • Create a honeypot exclusion list, which contains all honeypots that we do not want to have in our database (e.g. Ddospot) and stop them from being extracted.
  • Remove the hard-coded "general honeypots".
  • Refactor the Cowrie extraction process (=CowrieExtractionStrategy) and write tests for it.
  • Write end-to-end pipeline tests. This should be done after Cowrie extraction is refactored.
  • Use the repositories for other purposes as well (e.g. scoring).

(I will open separate issues / PRs for them.)

Related issues

Type of change

  • Bug fix (non-breaking change which fixes an issue).

Checklist

  • I have read and understood the rules about how to Contribute to this project.
  • The pull request is for the branch develop.
  • I have added documentation of the new features.
  • Linters (Black, Flake, Isort) gave 0 errors. If you have correctly installed pre-commit, it does these checks and adjustments on your behalf.
  • I have added tests for the feature/bug I solved. All the tests (new and old ones) gave 0 errors.
  • If changes were made to an existing model/serializer/view, the docs were updated and regenerated (check CONTRIBUTE.md).
  • If the GUI has been modified:
    • I have provided a screenshot of the result in the PR.
    • I have created new frontend tests for the new component or updated existing ones.

Important Rules

  • If you fail to complete the Checklist properly, your PR won't be reviewed by the maintainers.
  • If your changes decrease the overall test coverage (you will know after the Codecov CI job is done), you should add the required tests to fix the problem.
  • Every time you make changes to the PR and you think the work is done, you should explicitly ask for a review. After being reviewed and having received a "change request", you should explicitly ask for a review again once you have made the requested changes.

@regulartim regulartim marked this pull request as ready for review December 18, 2025 14:16
@regulartim regulartim requested a review from mlodic December 18, 2025 14:17
@regulartim

Sorry @mlodic for the huge amount of changes in a single PR. But don't be scared, most of the lines are doc strings and tests anyway. :D

@mlodic mlodic left a comment

great job! can you also add the schema and its explanation in a separate .md file in the root of the greedybear folder? In that way it is easier to find its reference, otherwise it would be easily lost between all the PRs.

@regulartim regulartim merged commit d5a9906 into develop Dec 20, 2025
5 checks passed
@regulartim regulartim deleted the refactor_extraction_process branch December 20, 2025 16:57
