
[Feature] /dfd — extract Data Flow Diagram from code (multi-repo), with trust boundaries + classifications, emit Mermaid + Threat Dragon JSON #257

@atlas-apex

User Story

As an adopter doing security review, compliance work, or onboarding a new team to a service, I want a /dfd skill that scans the codebase(s) to build a Data Flow Diagram showing external actors, processes, data stores, trust boundaries, and data classifications — so I get a single canonical DFD for the system that downstream skills (/threat-model, /compliance-check) can consume instead of each rebuilding their own.

Acceptance Criteria

Discovery (read first, ask later — same pattern as /process and /extract-features)

  • Skill scans across six DFD-discovery axes and reports findings before any question:
    1. External actors: user-facing endpoints (HTTP routes with public auth scope), auth providers (Auth0, Cognito, Clerk), admin interfaces, and third-party API callers detected via webhook signatures, SDK keys, or OAuth client registrations
    2. Processes: service handlers that transform data (HTTP route handlers, queue consumers, scheduled jobs, message-broker subscribers, gRPC service methods)
    3. Data stores: RDBMSes (Postgres, MySQL, SQLite; detected via ORM config + connection strings), document stores (Mongo, Dynamo), caches (Redis, Memcached), object storage (S3, GCS), file systems (local disk paths used as persistence), data warehouses (BigQuery, Snowflake, Redshift), search indexes (Elasticsearch, Algolia, Meilisearch)
    4. Data flows: what crosses what, e.g. request payload → handler → DB write, queue payload → consumer → external API call, scheduled job → DB read → S3 export. Traced via the same call-graph reachability used by /process
    5. Trust boundaries: network boundaries (public ↔ internal VPC subnet, detected from env config in v1; IaC is a follow-up, see Out of Scope), authentication transitions (anonymous → authenticated, user → admin, internal user → service account), org boundaries (us → third-party SaaS), data-classification transitions (PII enters a service that doesn't handle PII elsewhere)
    6. Data classifications, detected via field-level annotations (@PII, @Sensitive, // CLASSIFIED: comments), env-var naming heuristics (*_SECRET, *_TOKEN, *_KEY, *_PASSWORD), schema column names matching PII patterns (email, phone, ssn, dob, address, name, ip_address), PCI-relevant patterns (card_number, cvv, exp_month), and explicit data-classification registries if the project maintains one (docs/data-classification.{md,yaml})
  • Output of discovery: structured candidate model — [{actor_id, type, evidence: <file:line>, classification?}] + flows + boundaries — printed for operator review before final DFD is generated
  • Discovery is read-only; no files written until operator approves
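
A minimal sketch of the name-based part of axis 6, assuming exact, case-insensitive matching on column names and glob matching on env-var names. The `Candidate` record mirrors the candidate-model shape listed above; the function names, the `"field"` type value, and the pattern tables are hypothetical and not part of any existing skill. Annotation and registry lookups are left out.

```python
from __future__ import annotations

from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Iterable, Optional

# Hypothetical pattern tables, seeded from the names listed in this issue.
PII_COLUMNS = {"email", "phone", "ssn", "dob", "address", "name", "ip_address"}
PCI_COLUMNS = {"card_number", "cvv", "exp_month"}
SECRET_ENV_GLOBS = ("*_SECRET", "*_TOKEN", "*_KEY", "*_PASSWORD")


@dataclass(frozen=True)
class Candidate:
    """One discovered element, printed for operator review."""
    actor_id: str
    type: str
    evidence: str  # "<file>:<line>"
    classification: Optional[str] = None


def classify_column(name: str) -> Optional[str]:
    """Return "PCI", "PII", or None for a schema column name (exact match)."""
    key = name.lower()
    if key in PCI_COLUMNS:
        return "PCI"
    if key in PII_COLUMNS:
        return "PII"
    return None


def classify_env_var(name: str) -> Optional[str]:
    """Return "secret" when an env-var name matches a secret-like glob."""
    upper = name.upper()
    if any(fnmatchcase(upper, glob) for glob in SECRET_ENV_GLOBS):
        return "secret"
    return None


def candidates_from_columns(columns: Iterable[tuple[str, str]]) -> list[Candidate]:
    """Build candidates from (column_name, "file:line") pairs."""
    return [
        Candidate(actor_id=name, type="field", evidence=evidence,
                  classification=classify_column(name))
        for name, evidence in columns
    ]
```

Exact matching keeps false positives low but misses names such as `user_email`; anything left unclassified would fall through to the interview step rather than being guessed.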

Scoping

  • Skill takes an optional anchor to scope the DFD: --scope <service-name> (single service), --scope-all (every registered project — full system DFD), or default (asks operator: "Scope this DFD to one service, or the whole system?")
  • Single-service scope: reachability-bounded, same as /process. Includes only external actors that talk TO the service, data stores the service touches, and data flows that cross the service's boundary
  • System-wide scope: walks the registry, builds per-service sub-DFDs, then composes one master DFD with trust boundaries between services
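
One way the three scoping modes could be parsed, sketched with `argparse`. The flag names come from this issue; the `parse_scope` function and its `("service" | "all" | "ask", value)` return shape are assumptions made for illustration.

```python
from __future__ import annotations

import argparse
from typing import Optional, Sequence


def parse_scope(argv: Sequence[str]) -> tuple[str, Optional[str]]:
    """Map CLI arguments to a (mode, service_name) pair."""
    parser = argparse.ArgumentParser(prog="/dfd")
    # --scope and --scope-all cannot be combined.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--scope", metavar="SERVICE")
    group.add_argument("--scope-all", action="store_true")
    args = parser.parse_args(argv)

    if args.scope_all:
        return ("all", None)
    if args.scope is not None:
        return ("service", args.scope)
    # Neither flag given: the skill asks the operator.
    return ("ask", None)
```

For example, `parse_scope(["--scope", "billing"])` yields `("service", "billing")`, where `billing` is a made-up service name.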

Cross-repo traversal (microservice architectures — same registry pattern as /process)

  • When the trace lands on another registered project (cross-service data flow), follow into that repo's source for the connected slice
  • Each managed service becomes a trust-boundary box in the DFD (services are isolation units — every cross-service flow crosses a boundary)
  • Unmanaged third parties (Stripe, SendGrid, Salesforce, etc.) render as external entities with their own trust-boundary marker
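
The traversal rule above can be sketched as a breadth-first walk over discovered flows. This is an illustration under stated assumptions, not the /process implementation: `Flow`, `trace`, and the `owner` mapping (node → registered service) are hypothetical, a node missing from `owner` is treated as an unmanaged third party, and only downstream flows are followed.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    source: str
    target: str
    label: str


def trace(anchor: str, flows: list[Flow], owner: dict[str, str]) -> tuple[set[str], list[Flow]]:
    """Return (reached nodes, flows that cross a trust boundary)."""
    outgoing: dict[str, list[Flow]] = {}
    for flow in flows:
        outgoing.setdefault(flow.source, []).append(flow)

    reached = {anchor}
    crossings: list[Flow] = []
    queue = deque([anchor])
    while queue:
        node = queue.popleft()
        for flow in outgoing.get(node, []):
            src_owner = owner.get(flow.source)
            dst_owner = owner.get(flow.target)
            # Every cross-service or third-party flow crosses a boundary.
            if dst_owner is None or src_owner != dst_owner:
                crossings.append(flow)
            if flow.target not in reached:
                reached.add(flow.target)
                # Follow into registered projects only; there is no
                # source to read for unmanaged third parties.
                if dst_owner is not None:
                    queue.append(flow.target)
    return reached, crossings
```

Third parties still appear in `reached`, so they can render as external entities, but the walk stops at them.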

Interview (gap-fill only)

  • Skill asks operator to disambiguate when the code is silent: "this field user.identifier — is this an email, a UUID, or something else? Classification?"
  • Skill asks operator to confirm inferred trust boundaries: "I'm placing a trust boundary between the public API gateway and the internal services — confirm or override?"
  • Skill asks operator about external actors the code doesn't reveal: "any human admin actors who interact via a console / runbook outside this codebase?"

Output formats

  • Default: Mermaid flowchart at projects/<project>/architecture/dfd.md. It renders inline on GitHub and is the same format /threat-model embeds today (#225, "chore(#223): add Data Flow Diagram section to threat-model template"). This becomes the single-source-of-truth DFD: /threat-model consumes from here instead of regenerating its own
  • --format=dragon emits OWASP Threat Dragon v2 JSON for visual editing in Threat Dragon (shares the serialiser added in #255, "[Feature] /threat-model --format=dragon flag for OWASP Threat Dragon JSON export")
  • --format=plantuml-dfd (optional, v2): PlantUML DFD syntax for adopters who use a PlantUML toolchain
  • --format=all writes all configured formats in one go
  • Each DFD element carries provenance comments / metadata: source file:line where it was discovered, classification labels, trust-boundary rationale
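
A rough sketch of the default Mermaid output, with provenance carried as `%%` comments and trust boundaries rendered as subgraphs. The `Node` record, the `kind` values, and `to_mermaid` are hypothetical; ids and boundary names are assumed to be Mermaid-safe (no spaces) and distinct from each other. The Threat Dragon serialiser is not sketched here.

```python
from __future__ import annotations

from dataclasses import dataclass

# Mermaid node shapes: rectangle, circle, cylinder.
SHAPES = {
    "external": '{id}["{label}"]',
    "process": '{id}(("{label}"))',
    "store": '{id}[("{label}")]',
}


@dataclass(frozen=True)
class Node:
    id: str
    label: str
    kind: str      # "external" | "process" | "store"
    evidence: str  # "<file>:<line>"
    boundary: str  # trust boundary the node sits in


def to_mermaid(nodes: list[Node], flows: list[tuple[str, str, str]]) -> str:
    """Render nodes grouped by trust boundary, then labelled flows."""
    lines = ["flowchart LR"]
    by_boundary: dict[str, list[Node]] = {}
    for node in nodes:
        by_boundary.setdefault(node.boundary, []).append(node)

    for boundary, members in by_boundary.items():
        lines.append(f'  subgraph {boundary}["{boundary}"]')
        for node in members:
            lines.append(f"    %% evidence: {node.evidence}")
            lines.append("    " + SHAPES[node.kind].format(id=node.id, label=node.label))
        lines.append("  end")

    for source, target, label in flows:
        lines.append(f'  {source} -->|"{label}"| {target}')
    return "\n".join(lines)
```

Keeping provenance in comments means the rendered diagram stays uncluttered while the evidence trail remains in the file for re-runs and reviewers.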

Re-runs + cache

  • Re-running /dfd on the same scope OFFERS (default-no) to overwrite — same UX as /extract-features and /process
  • Stores discovery output at projects/<project>/architecture/dfd-source.{yaml,md} so re-runs can diff "what changed since last time" — useful for spotting newly-introduced data flows that may warrant security review
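
The "what changed since last time" diff could be as simple as a set comparison over flow tuples. This sketch assumes each stored flow reduces to a `(source, target, classification)` tuple; loading dfd-source.yaml is omitted, and `diff_flows` is a hypothetical name.

```python
from __future__ import annotations

FlowKey = tuple[str, str, str]  # (source, target, classification)


def diff_flows(previous: list[FlowKey], current: list[FlowKey]) -> dict[str, list[FlowKey]]:
    """Report flows added or removed between two discovery runs."""
    prev, cur = set(previous), set(current)
    return {
        "added": sorted(cur - prev),
        "removed": sorted(prev - cur),
    }
```

A flow whose classification changes shows up as one removal plus one addition, which is probably the desired behaviour for security review.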

Refactor: extract /threat-model's DFD into /dfd as the producer

  • PR refactors /threat-model so its DFD section comes from /dfd output (read from projects/<project>/architecture/dfd.md), not regenerated internally. If /dfd hasn't been run, /threat-model OFFERS to run it first
  • /compliance-check (the GDPR/ePrivacy audit skill) gains a discovery step: read the DFD's data classifications and flow-targets to find cross-border data transfers, third-party processors, etc.
  • Coordinate the /threat-model and /compliance-check updates with the existing skill maintainers (expected to be lightweight, since the framework is single-org)
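
To illustrate the /compliance-check discovery step, here is a sketch that picks out third-party processors receiving sensitive data from an already-parsed DFD. `DataFlow`, `third_party_processors`, and the choice of `PII` and `PCI` as the sensitive set are assumptions; how the DFD is parsed from dfd.md is not shown.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DataFlow:
    source: str
    target: str
    classification: str


def third_party_processors(
    flows: list[DataFlow],
    external_ids: set[str],
    sensitive: frozenset[str] = frozenset({"PII", "PCI"}),
) -> dict[str, set[str]]:
    """Map each external entity to the sensitive classes it receives."""
    result: dict[str, set[str]] = {}
    for flow in flows:
        if flow.target in external_ids and flow.classification in sensitive:
            result.setdefault(flow.target, set()).add(flow.classification)
    return result
```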

Docs + AgDR

  • SKILL.md documents the six discovery axes + scoping + one worked example
  • AgDR captures: Mermaid as the primary output (renders on GitHub, no toolchain dep), Threat Dragon JSON as the secondary; classifications as a first-class concept (annotations + heuristics + explicit registry); single-source-of-truth refactor of /threat-model and /compliance-check
  • Index in projects/<project>/architecture/README.md updated to list the DFD alongside the C4 diagrams

Design Notes

Completes the "what we already have" visualization tooling family:

| Skill | Produces | View | Source |
| --- | --- | --- | --- |
| /extract-features | Feature Inventory | What | Exhaustive code scan |
| /c4 | C4 L1+L2 Mermaid | Static topology | System-boundary scan |
| /process (#256) | BPMN 2.0 | Dynamic control flow | Anchor-scoped multi-repo trace |
| /dfd (this ticket) | Mermaid + Threat Dragon JSON | Data movement + trust + classification | Reachability + classification heuristics |
| /threat-model | STRIDE markdown | Security analysis ON the DFD | Consumes /dfd output |
| /compliance-check | GDPR/ePrivacy audit | Compliance analysis on data flows | Consumes /dfd classifications |
| /threat-model --format=dragon (#255) | Threat Dragon JSON | Editable threat model | Uses /dfd + STRIDE findings |

Out of Scope

  • Manual hand-authoring of DFDs in the skill (operator should edit the Mermaid file directly OR re-run /dfd with corrected anchor/answers)
  • Real-time DFD updates (one-shot, not continuous — same as siblings)
  • Replacing dedicated SAST/DAST tools — /dfd informs threat modelling and compliance, doesn't substitute for actual security scans
  • Pretty graphical export (SVG/PNG) — that's what Threat Dragon does after JSON import
  • Reading IaC (Terraform, CloudFormation) for trust boundaries: v1 uses a single signal source (code + env config); IaC integration is a follow-up

Effort Estimate

TBD, likely L to XL. Covers the discovery engine, classification heuristics, cross-service trust-boundary inference, the Mermaid and Threat Dragon serialisers, and the /threat-model and /compliance-check refactors to consume /dfd output.

Glossary

| Term | Definition |
| --- | --- |
| DFD | Data Flow Diagram. Shows external entities, processes, data stores, trust boundaries, and the data that crosses between them; foundation for STRIDE threat modelling |
| Trust boundary | A line in a DFD where data crosses a privilege/network/ownership boundary (anonymous → authenticated, public ↔ internal, us → third-party); threats concentrate at boundaries |
| Data classification | A label applied to data identifying its sensitivity tier (PII, PCI, secrets, internal, public); informs encryption-at-rest, access controls, and regulatory scope (GDPR, HIPAA, PCI-DSS) |
| Reachability-bounded | Discovery scopes to what's connected to an anchor; doesn't blindly scan the whole repo for irrelevant data flows |
| Provenance | The file:line evidence trail for every DFD element: why this actor/process/store was placed in the diagram, so re-runs can refresh accurately |
| Single source of truth | One DFD per scope, consumed by /threat-model and /compliance-check; replaces the current pattern where each skill rebuilds its own DFD slice |

Metadata

Labels: P2 (Medium: plan-worthy, not urgent), enhancement (New feature or request)