Summary
The codebase constructs a column-dependency DAG in two separate places using two different graph implementations:
-
dag.py (topologically_sort_column_configs) — uses networkx (lazy.nx.DiGraph) to build a graph from required_columns, side_effect_columns, and skip.columns, then returns a topologically sorted list of ColumnConfigT. Called once during config compilation in config_compiler.py.
-
ExecutionGraph (execution_graph.py) — uses a hand-rolled adjacency-list graph with Kahn's algorithm. Builds edges from the same three dependency sources. Created later at generator-init time by _initialize_generators_and_graph. Used by both sequential (skip logic) and async (scheduler) paths.
Both construct a DAG from the same dependency sources, both validate acyclicity, and both compute topological order. The only difference is timing (pre-compilation vs post-compilation) and what metadata they carry.
Problems
- Duplicated edge logic —
_add_dependency_edges in dag.py and the second pass in ExecutionGraph.create implement the same side-effect resolution and skip-column edge logic. A bug fix in one must be mirrored in the other.
- Unnecessary networkx dependency — networkx is a heavy library used in the DAG path for exactly one function (topological sort) that
ExecutionGraph already implements natively.
- Two cycle checks run on the same data at different times.
Proposed Solution
Make ExecutionGraph the single graph abstraction. Add a topologically_sort_column_configs classmethod that accepts raw ColumnConfigT (like dag.py does today) without requiring strategies:
- Filters DAG vs non-DAG columns using
column_type_used_in_execution_dag
- Builds edges from
required_columns, side_effect_columns, skip.columns
- Runs Kahn's algorithm (already implemented in
get_topological_order)
- Returns the sorted
ColumnConfigT list
Then:
- Update
config_compiler.py to call ExecutionGraph.topologically_sort_column_configs instead of importing from dag.py.
- Delete
dag.py.
- Update
test_dag.py to import from ExecutionGraph.
Files Involved
packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dag.py — delete
packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/execution_graph.py — add classmethod
packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/config_compiler.py — update import
packages/data-designer-engine/tests/engine/dataset_builders/utils/test_dag.py — update import
architecture/dataset-builders.md — update docs to reflect single graph
Alternative
Extract a shared ColumnDependencyGraph base class with pure graph operations (nodes, edges, topo sort, cycle check), then have ExecutionGraph extend it with strategy/skip metadata. Cleaner separation but more refactoring.
Summary
The codebase constructs a column-dependency DAG in two separate places using two different graph implementations:
dag.py(topologically_sort_column_configs) — uses networkx (lazy.nx.DiGraph) to build a graph fromrequired_columns,side_effect_columns, andskip.columns, then returns a topologically sorted list ofColumnConfigT. Called once during config compilation inconfig_compiler.py.ExecutionGraph(execution_graph.py) — uses a hand-rolled adjacency-list graph with Kahn's algorithm. Builds edges from the same three dependency sources. Created later at generator-init time by_initialize_generators_and_graph. Used by both sequential (skip logic) and async (scheduler) paths.Both construct a DAG from the same dependency sources, both validate acyclicity, and both compute topological order. The only difference is timing (pre-compilation vs post-compilation) and what metadata they carry.
Problems
_add_dependency_edgesindag.pyand the second pass inExecutionGraph.createimplement the same side-effect resolution and skip-column edge logic. A bug fix in one must be mirrored in the other.ExecutionGraphalready implements natively.Proposed Solution
Make
ExecutionGraphthe single graph abstraction. Add atopologically_sort_column_configsclassmethod that accepts rawColumnConfigT(likedag.pydoes today) without requiring strategies:column_type_used_in_execution_dagrequired_columns,side_effect_columns,skip.columnsget_topological_order)ColumnConfigTlistThen:
config_compiler.pyto callExecutionGraph.topologically_sort_column_configsinstead of importing fromdag.py.dag.py.test_dag.pyto import fromExecutionGraph.Files Involved
packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dag.py— deletepackages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/execution_graph.py— add classmethodpackages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/config_compiler.py— update importpackages/data-designer-engine/tests/engine/dataset_builders/utils/test_dag.py— update importarchitecture/dataset-builders.md— update docs to reflect single graphAlternative
Extract a shared
ColumnDependencyGraphbase class with pure graph operations (nodes, edges, topo sort, cycle check), then haveExecutionGraphextend it with strategy/skip metadata. Cleaner separation but more refactoring.