Skip to content

refactor: unify duplicate DAG construction (dag.py + ExecutionGraph) #510

@nabinchha

Description

@nabinchha

Summary

The codebase constructs a column-dependency DAG in two separate places using two different graph implementations:

  1. dag.py (topologically_sort_column_configs) — uses networkx (lazy.nx.DiGraph) to build a graph from required_columns, side_effect_columns, and skip.columns, then returns a topologically sorted list of ColumnConfigT. Called once during config compilation in config_compiler.py.

  2. ExecutionGraph (execution_graph.py) — uses a hand-rolled adjacency-list graph with Kahn's algorithm. Builds edges from the same three dependency sources. Created later at generator-init time by _initialize_generators_and_graph. Used by both sequential (skip logic) and async (scheduler) paths.

Both construct a DAG from the same dependency sources, both validate acyclicity, and both compute topological order. The only difference is timing (pre-compilation vs post-compilation) and what metadata they carry.

Problems

  1. Duplicated edge logic_add_dependency_edges in dag.py and the second pass in ExecutionGraph.create implement the same side-effect resolution and skip-column edge logic. A bug fix in one must be mirrored in the other.
  2. Unnecessary networkx dependency — networkx is a heavy library used in the DAG path for exactly one function (topological sort) that ExecutionGraph already implements natively.
  3. Two cycle checks run on the same data at different times.

Proposed Solution

Make ExecutionGraph the single graph abstraction. Add a topologically_sort_column_configs classmethod that accepts raw ColumnConfigT (like dag.py does today) without requiring strategies:

  • Filters DAG vs non-DAG columns using column_type_used_in_execution_dag
  • Builds edges from required_columns, side_effect_columns, skip.columns
  • Runs Kahn's algorithm (already implemented in get_topological_order)
  • Returns the sorted ColumnConfigT list

Then:

  1. Update config_compiler.py to call ExecutionGraph.topologically_sort_column_configs instead of importing from dag.py.
  2. Delete dag.py.
  3. Update test_dag.py to import from ExecutionGraph.

Files Involved

  • packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dag.py — delete
  • packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/execution_graph.py — add classmethod
  • packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/config_compiler.py — update import
  • packages/data-designer-engine/tests/engine/dataset_builders/utils/test_dag.py — update import
  • architecture/dataset-builders.md — update docs to reflect single graph

Alternative

Extract a shared ColumnDependencyGraph base class with pure graph operations (nodes, edges, topo sort, cycle check), then have ExecutionGraph extend it with strategy/skip metadata. Cleaner separation but more refactoring.

Metadata

Metadata

Assignees

No one assigned

    Labels

    refactortriagedIssue reviewed and approved by a maintainer

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions