feat(api): add ASN-aggregated IOC statistics . CLOSES #458#718
feat(api): add ASN-aggregated IOC statistics . CLOSES #458#718regulartim merged 7 commits intointelowlproject:developfrom
Conversation
regulartim
left a comment
There was a problem hiding this comment.
Nice, thank you!
A few general things:
- In my mind the aggregation would be done on a database level. Your approach also works, of course, but I am a little bit concerned regarding performance on instances with millions of IoCs.
- Ordering is kind of tricky here. I tested your code and I was not able to order by
ioc_count. But intuitively I would expect that to work. I think we should support to order by any field that is present in the output, if sensible (ordering byhoneypotsis not). - The IOC model has more field that have not been included here yet. Do you think you could also include
first_seenandlast_seen? Or do you think that this is not useful?
Hi, @regulartim, Thanks a lot for the review and the feedback! I wanted to explain a bit why I initially went with Python-level aggregation and where I’m currently unsure, so I’d really appreciate your guidance. My first thought was also to aggregate at the database level using values("asn") + annotate(...). However, I ran into a couple of structural issues with the current setup of get_queryset, which is shared by other API: get_queryset applies slicing ([:feed_size]) after ordering. Once a queryset is sliced, Django does not allow further aggregation, which makes DB-level grouping impossible unless we disable slicing using a flag (doSlice=True). However i was afraid to change the fx which is used by other apis without your permission. And If we slice before aggregation, the aggregation sees only the first feed_size raw IOCs. This produces incorrect totals (ioc_count, attack_count, etc.), because other IOCs outside the slice are ignored. For example, if ASN 13335 has 10,000 IOCs, slicing at 5000 will undercount. When aggregating by ASN and joining against general_honeypot, each IOC can appear multiple times at the sql level (one row per honeypot). This leads to inflated sums (e.g., attack_count, interaction_count). For example, a single IOC linked to two honeypots is counted twice in SUM(attack_count) in aggregation . I guees it is many to many joins and duplicated rows issue. I think that get_queryset needs a big refactor here I’d love to hear what you think is the best tradeoff here, especially given the shared nature of get_queryset. I’m happy to adapt the implementation based on your recommendation. |
|
Hey @drona-gyawali ! Thanks for your detailed explanation. I think the best approach would be to aggregate on the DB level. If this makes it necessary to refactor or even split up
This is kind of strange. If a aggregation function is used on the |
Extremely sorry for the late reply! I used ArrayAgg(distinct=True) for honeypots and distinct=True on all other fields to avoid inflated sums, since thing were failing without it. The iocs_qs comes from get_queryset. raw code for reference: I’ll push the new version soon. |
Don't worry, we all have other stuff to do! :)
Cool, looking forward to that! |
|
Hi @regulartim , In this new version, I implemented DB-level aggregation for the ASN feed. While building this, I had to introduce a few things: New serializer (ASNFeedsOrderingSerializer) – I inherited from FeedsRequestSerializer because the base serializer already provides default handling for parameters like max_age, feed_type, and attack_type. The main reason for creating the new serializer was that the base serializer’s ordering validation is strict and only allows model fields. Since aggregation introduces annotated/non-model fields (like ioc_count, total_attack_count), I needed to add custom validation here. resolve_aggregation_ordering utility – This ensures that our aggregation endpoint defaults to ordering by -ioc_count instead of -last_seen, bypassing the default injection from feed_params. I tried to make this dynamic so that any future aggregation API can leverage the same pattern without reinventing the wheel. The overall goal was to make the developer experience better and avoid complexity when building future aggregation endpoints. Anyone adding a new aggregation API only needs to inherit and customize the ordering/validation, without rewriting everything. I hope this aligns with your expectations. If you feel any part of this design needs changes, I’m always open to feedback and happy to adjust. |
regulartim
left a comment
There was a problem hiding this comment.
Good work! 👍 This basically looks good to merge, I only found two minor things.
Since this is a new API endpoint, it would be great to include it into our documentation. Would you please do that? If so, do you need guidance for that?
Raised a pr in our docs repo: intelowlproject/docs#46 |
Description
This change introduces a new authenticated API endpoint that aggregates IOC data by ASN. The endpoint groups all matching IOCs under their respective ASNs and computes summary statistics, including IOC count, total attack count, total interaction count, total login attempts, expected IOC count (derived from recurrence probability), and expected interactions. It also returns the set of unique honeypots associated with each ASN. The implementation reuses the same filtering and authentication logic as the Advanced Feeds API to avoid code duplication, while intentionally returning a JSON-only response tailored for aggregated data use cases.
Related issues
closes #458
Type of change
Please delete options that are not relevant.
Checklist
develop.Black,Flake,Isort) gave 0 errors. If you have correctly installed pre-commit, it does these checks and adjustments on your behalf.Important Rules