expose tags in API responses and add tag-based filtering. Closes #522#899
Conversation
There was a problem hiding this comment.
Pull request overview
This PR implements Phase 2 of the IP enrichment tagging feature by exposing IOC tags in JSON API responses and enabling tag-based filtering on the authenticated advanced feeds endpoint.
Changes:
- Adds
tagsto IOC objects returned by/api/feeds/advanced/,/api/feeds/(paginated JSON), and/api/enrichment. - Implements tag-based filtering (
tag_key,tag_value) for/api/feeds/advanced/only. - Adds a dedicated test suite to validate tag inclusion, structure, filtering behavior, and access controls.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
api/views/utils.py |
Adds tag filter parameters, tag-based queryset filtering (advanced-only), and batch tag fetch to avoid N+1 queries in JSON feed responses. |
api/views/feeds.py |
Enables tag filtering for feeds_advanced and documents new query parameters. |
api/serializers.py |
Adds TagSerializer, exposes tags through IOCSerializer, and updates request/response serializers to accept/describe tag fields. |
tests/api/views/test_feeds_tags.py |
Introduces comprehensive tests for tag exposure and tag filtering behavior across feeds and enrichment endpoints. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Hi @regulartim the PR is ready for review. Review it when you get time. Thanks. |
|
Hey @opbot-xd ! There is a merge conflict, probably due to a recent merge. Could you please solve it? I am planning to review Monday morning (european time 😉). |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (3)
greedybear/migrations/0038_tag.py:21
- This migration file is being modified to add db_index=True on existing fields. In Django, migrations are intended to be immutable once merged/applied; editing an existing migration can leave already-migrated databases without the new indexes while fresh installs get them, causing schema drift.
Instead, revert changes to this existing migration and add a new migration (e.g. AlterField/AddIndex) that adds the desired indexes.
tests/api/views/test_feeds_tags.py:75
- These filter tests don’t currently assert that IOCs with non-matching tags (e.g. self.ioc_2, which has a different tag_key) are excluded. As written, the test would still pass if the endpoint returned extra IOCs alongside the expected one.
Consider asserting that all returned IOCs match the requested tag criteria (or at least explicitly assert self.ioc_2 is not present) to prevent false positives.
def test_200_filter_by_tag_key(self):
"""Filtering by tag_key should return only IOCs with matching tags."""
response = self.client.get("/api/feeds/advanced/?tag_key=malware")
self.assertEqual(response.status_code, 200)
iocs = response.json()["iocs"]
values = {i["value"] for i in iocs}
self.assertIn(self.ioc.name, values)
self.assertNotIn(self.ioc_domain.name, values)
api/views/utils.py:81
- FeedRequestParams’ class docstring lists the supported attributes but doesn’t mention the newly added tag_key/tag_value fields. Updating the attribute list would keep the inline documentation accurate for future maintainers.
self.verbose = query_params.get("verbose", "false").lower()
self.paginate = query_params.get("paginate", "false").lower()
self.format = query_params.get("format_", "json").lower()
self.tag_key = query_params.get("tag_key", "").strip()
self.tag_value = query_params.get("tag_value", "").strip()
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
greedybear/models.py
Outdated
| key = models.CharField(max_length=128, db_index=True) | ||
| value = models.CharField(max_length=256, db_index=True) |
There was a problem hiding this comment.
Adding a plain btree index on Tag.value is unlikely to help the new tag_value substring filter, since PostgreSQL typically cannot use a btree index for ILIKE/__icontains queries. This adds write/storage overhead without improving the main query.
If tag_value filtering needs to be performant, consider a trigram GIN/GiST index (pg_trgm) or changing the filter semantics to equality/prefix where an index can be used.
There was a problem hiding this comment.
Hi @regulartim
When I was integrating Phase 2, my PR had previously created the 0038_add_tag_model migration file. Due to copilot review feedback, i needed to optimize the database queries by adding db_index=True directly into the Tag model table. To avoid polluting the DB migrations with two sequential model modifications back-to-back, I deleted 0038 and completely regenerated it as a clean new migration called 0038_tag.py.
In the meantime, develop was merged into my branch, bringing in a new migration 0039 which had 0038_add_tag_model hardcoded as its dependency. I manually updated dependencies = [('greedybear', '0038_tag')] inside 0039_ioc_attacker_country_sensor_country_and_more.py to fix this clash and restore the DB schema graph cleanly!
is it fine or would it create some complications?
There was a problem hiding this comment.
Hey @opbot-xd ! :) I think it's better to just create a new migration. Changing old migration files after they landed in a branch that other people could use, might break their instances in unpredictable ways.
I think your changes DO affect user-facing behavior as the API responses contain additional fields now, right? |
55a4e61 to
498807c
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Yes, would raise a PR at docs repo once this get merged. Updated the description. Also sorry for late responses, was busy with university midterm exams. |
No problem, take your time! :) If you want to pause your work on this, feel free to just convert the PR to a draft again. This is not blocking anything and I think you are a reliable person, so I don't mind if you pause your work for a few days or even weeks. |
Thanks for understanding! My exams are over now, so I’ll like to resume work on this. |
regulartim
left a comment
There was a problem hiding this comment.
Hey @opbot-xd ! :) I still see a few issues here:
- I am not sure about queriing tags in the feed_response function, see comment
- there is a pending migration, you have to run makemigrations again
api/views/utils.py
Outdated
| # Annotate tags via a DB JOIN rather than a Python-level join. | ||
| # `tags_json` avoids conflicting with the `tags` reverse FK on IOC. | ||
| # For paginated (list) paths the queryset is already evaluated, so we | ||
| # run a second targeted query on the slice IDs instead. | ||
| tags_lookup: dict = {} | ||
| if isinstance(iocs, list): | ||
| ioc_ids = [ioc.id for ioc in iocs if hasattr(ioc, "id")] | ||
| if ioc_ids: | ||
| tag_rows = ( | ||
| IOC.objects.filter(id__in=ioc_ids) | ||
| .annotate( | ||
| tags_json=ArrayAgg( | ||
| JSONObject(key=F("tags__key"), value=F("tags__value"), source=F("tags__source")), | ||
| filter=Q(tags__isnull=False), | ||
| default=Value([]), | ||
| distinct=True, # prevent duplication from the honeypot JOIN | ||
| ) | ||
| ) | ||
| .values("id", "tags_json") | ||
| ) | ||
| tags_lookup = {row["id"]: row["tags_json"] for row in tag_rows} | ||
| else: | ||
| iocs = iocs.annotate( | ||
| tags_json=ArrayAgg( | ||
| JSONObject(key=F("tags__key"), value=F("tags__value"), source=F("tags__source")), | ||
| filter=Q(tags__isnull=False), | ||
| default=Value([]), | ||
| distinct=True, # prevent duplication from the honeypot JOIN | ||
| ) | ||
| ) | ||
| required_fields = tuple(f if f != "tags" else "tags_json" for f in required_fields) | ||
|
|
||
| # Collect values; `honeypots` will contain the list of associated honeypot names | ||
| iocs = (ioc_as_dict(ioc, required_fields) for ioc in iocs) if isinstance(iocs, list) else iocs.values(*required_fields) | ||
| for ioc in iocs: | ||
| # Generate dictionaries iterably to avoid holding both dicts and json_list in memory | ||
| iocs_iter: object | ||
| if isinstance(iocs, list): | ||
| iocs_iter = (ioc_as_dict(ioc, set(required_fields) | {"id"}) for ioc in iocs) # id needed for tags_lookup | ||
| else: | ||
| iocs_iter = iocs.values(*required_fields).iterator(chunk_size=2000) | ||
| for ioc in iocs_iter: |
There was a problem hiding this comment.
Why do you fetch the tag here and not in get_queryset? There you could just join them like the general honeypots:
if not is_aggregated:
iocs = iocs.filter(general_honeypot__active=True)
iocs = iocs.annotate(honeypots=ArrayAgg("general_honeypot__name"))
iocs = iocs.annotate(tags_json=ArrayAgg(...)
iocs = iocs.order_by(feed_params.ordering)
iocs = iocs[: int(feed_params.feed_size)]There was a problem hiding this comment.
This would be much simpler as we don't need to cover the case where iocs is no queryset.
There was a problem hiding this comment.
Nice!
This approach is much cleaner! Though I had to also add the tags_json annotation to the repository methods (get_scanners_by_pks and get_recent_scanners) since they bypass get_queryset but still call feeds_response for ML scoring workflows.
The benefit is worth it though.
Changes made:
- Moved
tags_jsonannotation toget_querysetalongsidehoneypots - Removed
tags_lookupdictionary logic fromfeeds_response - Added
tags_jsonannotation toIocRepository.get_scanners_by_pks()andIocRepository.get_recent_scanners()
…ndex in 0040 - Rename 0038_tag.py → 0038_add_tag_model.py to match upstream/develop exactly - Remove db_index=True from 0038 (keep it identical to upstream) - Add 0040_tag_key_index.py: separate migration for Tag.key btree index on top of upstream's 0039 (GeoIP attacker_country/sensor.country)
…ject) for tags - Remove _prefetch_tags helper (manual reimplementation of prefetch_related) - Annotate tags directly on the queryset using ArrayAgg(JSONObject(key, value, source)) with Q(tags__isnull=False) to skip IOCs without tags and distinct=True to prevent duplication from the general_honeypot JOIN - For paginated (list) paths: run a targeted ArrayAgg query on the slice IDs - Use annotation name tags_json to avoid conflict with the tags reverse FK on IOC - Remove redundant comments, keep only non-obvious ones Addresses code review feedback from regulartim.
Pass tag_key and tag_value as explicit kwargs to get_queryset instead of storing them in FeedRequestParams. They are read directly from request.query_params only in feeds_advanced, so the standard feeds endpoint can never trigger tag filtering regardless of what query params are passed. Removes enable_tag_filtering flag entirely. Suggested by regulartim.
- Annotate tags_json in get_queryset alongside honeypots annotation - Simplifies feeds_response by removing dual list/queryset handling - Add tags_json to repository methods (get_scanners_by_pks, get_recent_scanners) - Eliminates separate query for paginated IOCs Addresses regulartim feedback.
cbcc2f1 to
e4083e9
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
- Only annotate tags_json when format is JSON to avoid unnecessary JOINs and aggregation for txt/csv downloads - Add distinct=True to honeypots ArrayAgg in repository methods to prevent duplicate names when IOCs have multiple tags Addresses Copilot feedback.
regulartim
left a comment
There was a problem hiding this comment.
Looks good, thanks @opbot-xd ! :) Only minor details left.
There was a problem hiding this comment.
If I understand your code correctly, the changes to this file should server the tags to the ML jobs, right? But there, they are thrown away again, because get_features does not know of them.
Also, I am not a 100% sure if it is a good idea to use the Tags for training. This might introduce weird biases. I had the similar doubts when I added the ip_reputation as a feature, but decided to include it after testing the inclusion thoroughly.
So, bottom line: I would prefer not to add the Tags to scoring for now. This would be a separate issue / PR requiring a lot of research and testing.
There was a problem hiding this comment.
Thanks for pointing this out!
Updated the logic to handle both cases correctly:
if isinstance(iocs, list):
has_tags_annotation = bool(iocs) and hasattr(iocs[0], "tags_json")
else:
has_tags_annotation = "tags_json" in getattr(
iocs, "query", type("", (), {"annotations": {}})()"
).annotations…nd adapt API views to conditionally include it and truncate tag query parameters.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
api/views/utils.py:306
- In the
isinstance(iocs, list)branch,set(required_fields)is rebuilt for every IOC because it’s inside the generator expression. Precomputing the set once (e.g.,fields_set = set(required_fields)) and reusing it will avoid repeated allocations when paginating large result sets.
iocs_iter: object
if isinstance(iocs, list):
iocs_iter = (ioc_as_dict(ioc, set(required_fields)) for ioc in iocs)
else:
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
…ers in advanced feed view.
regulartim
left a comment
There was a problem hiding this comment.
Alright, thanks for your work! :)
…lowlproject#522 (intelowlproject#899) * feat: expose tags in API responses and add tag-based filtering * feat(api): expose tags in API responses and add tag-based filtering (intelowlproject#522) * refactor(migrations): rename 0038 to match upstream and add Tag.key index in 0040 - Rename 0038_tag.py → 0038_add_tag_model.py to match upstream/develop exactly - Remove db_index=True from 0038 (keep it identical to upstream) - Add 0040_tag_key_index.py: separate migration for Tag.key btree index on top of upstream's 0039 (GeoIP attacker_country/sensor.country) * refactor(feeds): replace _prefetch_tags with DB-level ArrayAgg(JSONObject) for tags - Remove _prefetch_tags helper (manual reimplementation of prefetch_related) - Annotate tags directly on the queryset using ArrayAgg(JSONObject(key, value, source)) with Q(tags__isnull=False) to skip IOCs without tags and distinct=True to prevent duplication from the general_honeypot JOIN - For paginated (list) paths: run a targeted ArrayAgg query on the slice IDs - Use annotation name tags_json to avoid conflict with the tags reverse FK on IOC - Remove redundant comments, keep only non-obvious ones Addresses code review feedback from regulartim. * refactor(feeds): move tag_key/tag_value out of FeedRequestParams Pass tag_key and tag_value as explicit kwargs to get_queryset instead of storing them in FeedRequestParams. They are read directly from request.query_params only in feeds_advanced, so the standard feeds endpoint can never trigger tag filtering regardless of what query params are passed. Removes enable_tag_filtering flag entirely. Suggested by regulartim. * refactor(feeds): move tags_json annotation to get_queryset - Annotate tags_json in get_queryset alongside honeypots annotation - Simplifies feeds_response by removing dual list/queryset handling - Add tags_json to repository methods (get_scanners_by_pks, get_recent_scanners) - Eliminates separate query for paginated IOCs Addresses regulartim feedback. * perf: optimize tags_json annotation and fix honeypots deduplication - Only annotate tags_json when format is JSON to avoid unnecessary JOINs and aggregation for txt/csv downloads - Add distinct=True to honeypots ArrayAgg in repository methods to prevent duplicate names when IOCs have multiple tags Addresses Copilot feedback. * Refactor: Remove `tags_json` annotation from IOC repository queries and adapt API views to conditionally include it and truncate tag query parameters. * refactor: remove redundant assignment of verbose and paginate parameters in advanced feed view.
Description
Implemented Phase 2 of the IP enrichment feature by exposing tags and enabling tag-based filtering.
Key changes:
Tag.objects.filter(ioc_id__in=...))./api/enrichment/endpoint.Related issues
Closes #522
Type of change
Checklist
Please complete this checklist carefully. It helps guide your contribution and lets maintainers verify that all requirements are met.
Formalities
<feature name>. Closes #999develop.develop.Docs and tests
Ruff) gave 0 errors. If you have correctly installed pre-commit, it does these checks and adjustments on your behalf.GUI changes
Ignore this section if you did not make any changes to the GUI.