Problem
There is currently no canonical way to uniquely identify a workflow configuration. Multiple features need this primitive:
- Determining whether two configurations are the same (without writing ad-hoc field-by-field comparisons that drift as configs evolve)
- Recording exact provenance in dataset metadata / dataset cards
- Diffing configurations between runs
- Any future caching, deduplication, or compatibility-gating logic that needs to ask "is this the same config that produced X?"
Today, anyone wanting to answer that question has to compare ad-hoc subsets of fields, which is brittle and easy to get wrong (especially around nested column / model / sampler / processor configs).
Proposed solution
Provide a deterministic, content-addressable hash of a workflow config, computed from a normalized canonical dump of the identity-relevant subset.
What's hashed
| Include | Why |
| --- | --- |
| Column configs (names, types, generator params, processors, validators) | Changing any of these changes per-row output |
| Model configs (model identity + sampling params that affect generation) | Different temperature/top-p → different distribution |
| Sampler / RNG seed (when set) | Changes which samples come out |
| Seed dataset identity (path + content hash, or just content hash) | Different inputs → different outputs |
| Buffer size | Changes batch/row-group alignment of any checkpointed state |

| Exclude | Why |
| --- | --- |
| dataset_name, output path | Path identity, not data identity |
| Concurrency / threading params | Affect speed, not output |
| API keys, endpoints, logging | Environment, not data |
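One way the include/exclude split could be applied is an allow-list over the config's top-level keys. This is a minimal sketch only: the key names in IDENTITY_KEYS and the flat dict layout are assumptions made for illustration, not the real config schema.

```python
from typing import Any

# Hypothetical top-level keys for the identity-relevant subset.
# The real config model will likely be nested and named differently.
IDENTITY_KEYS = ("columns", "models", "seed", "seed_dataset", "buffer_size")


def normalize_config(config: dict[str, Any]) -> dict[str, Any]:
    """Keep only identity-relevant keys; drop unset (None) values.

    Dropping None means "seed not set" and "seed key absent" normalize
    to the same thing, matching the "when set" wording for the RNG seed.
    """
    return {
        key: config[key]
        for key in IDENTITY_KEYS
        if config.get(key) is not None
    }


example = {
    "dataset_name": "run-42",        # excluded: path identity
    "max_parallel_requests": 8,      # excluded: affects speed only
    "seed": None,                    # unset, so dropped
    "buffer_size": 1000,
    "columns": [{"name": "topic", "type": "category"}],
}
normalized = normalize_config(example)
```

An allow-list keeps new environment-style fields out of the hash by default, but any new output-affecting field has to be added to it explicitly. A deny-list makes the opposite trade.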
Shape
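A reasonable shape: sha256(json.dumps(normalized_config, sort_keys=True, separators=(",", ":"), default=str).encode("utf-8")).hexdigest(). The default=str fallback is only deterministic for values whose str() is stable across processes (for example, no memory addresses in the output).
When persisted alongside an artifact, store enough metadata to interpret the hash later:
{ "config_hash": "sha256:9f3a...d217", "config_hash_algo": "sha256", "config_hash_version": 1 }
config_hash_version lets the normalization scheme evolve over time: when the scheme changes, old-version hashes can be treated as "unknown identity" rather than as a definite mismatch.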
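A minimal sketch of the hashing and version-aware comparison, assuming the input is already normalized to JSON-friendly values. The metadata field names follow the config_hash / config_hash_algo / config_hash_version record in this proposal. The function names and the three-way "match" / "mismatch" / "unknown" result are illustrative assumptions, not an existing API.

```python
import hashlib
import json
from typing import Any

CONFIG_HASH_VERSION = 1  # bump whenever the normalization scheme changes


def config_fingerprint(normalized_config: dict[str, Any]) -> dict[str, Any]:
    """Hash a canonical JSON dump and return the metadata record to persist."""
    canonical = json.dumps(
        normalized_config,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace variation
        default=str,             # fallback for non-JSON values (must be stable)
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {
        "config_hash": f"sha256:{digest}",
        "config_hash_algo": "sha256",
        "config_hash_version": CONFIG_HASH_VERSION,
    }


def compare_fingerprints(stored: dict[str, Any], current: dict[str, Any]) -> str:
    """Return "match", "mismatch", or "unknown" (different scheme or algorithm)."""
    for field in ("config_hash_version", "config_hash_algo"):
        if stored.get(field) != current[field]:
            return "unknown"
    return "match" if stored.get("config_hash") == current["config_hash"] else "mismatch"
```

Returning "unknown" instead of "mismatch" for a version difference lets callers decide whether to warn, recompute, or refuse, rather than treating every scheme bump as a config change.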
Custom columns: how deep does the hash go?
Custom column generators (registered via the plugin / entry-point system) need explicit treatment. The easy parts — registered type name and instance config / parameters — fit naturally into the same model as builtin column configs. The harder question is what to do about the implementation behind the registration: if a user edits their custom column's code without changing its name or params, the output changes but the config doesn't.
Three escalating levels of strictness:
| Level | What's hashed | Catches | Misses |
| --- | --- | --- | --- |
| L1: config-only | Registered type name + config dict | Param changes, type swaps | Impl edits, dependency upgrades |
| L2: + source hash | L1 + inspect.getsource() of the registered class/function | Most impl edits | Whitespace/comment-only changes flip the hash; transitive changes via helpers |
| L3: + bytecode hash | L1 + bytecode hash | Impl edits resilient to cosmetics | Transitive changes; non-Python deps |
Recommendation: L1 as the floor (what most users would expect), L2 as opt-in / best-effort with documented limitations. Bytecode hashing (L3) is more work for marginal gain. No level catches dependency upgrades or external-service drift — the hash defends against config drift, not all forms of behavior drift, and the docs should be explicit about that.
Plugin-specific caveats:
- inspect.getsource() may fail for compiled or zipped plugins
- Bytecode is Python-version-specific; same source yields different bytecode across versions
- Treat unhashable plugins as "unknown identity" with a warning, not a hard error
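The L1 floor with opt-in, best-effort L2 could look roughly like the sketch below. The function names, the include_source flag, and the shape of the returned dict are hypothetical. Only inspect.getsource() and its failure modes (OSError when source is unavailable, TypeError for built-in objects) are standard-library behavior.

```python
import hashlib
import inspect
import warnings
from typing import Any, Optional


def source_hash(obj: Any) -> Optional[str]:
    """Best-effort L2: sha256 of the object's source, or None if unavailable."""
    try:
        source = inspect.getsource(obj)
    except (OSError, TypeError):
        # Compiled, zipped, or built-in implementations: degrade, don't fail.
        warnings.warn(
            f"Cannot read source for {obj!r}; implementation identity is unknown.",
            stacklevel=2,
        )
        return None
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def custom_column_identity(
    type_name: str,
    params: dict[str, Any],
    impl: Any = None,
    include_source: bool = False,
) -> dict[str, Any]:
    """L1 identity (type name + params), optionally extended with an L2 source hash."""
    identity: dict[str, Any] = {"type": type_name, "params": params}
    if include_source and impl is not None:
        # None here means "unknown identity", which callers should not
        # treat as equal to another None.
        identity["source_sha256"] = source_hash(impl)
    return identity
```

A None source hash marks the implementation as unknown rather than unchanged, so comparison logic would need to handle it separately from a real digest.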
Layered defense: schema validation at row-write time catches structural drift (added/removed columns, dtype shifts) regardless of whether the hash caught it. Hash + schema check cover most realistic drift hazards.
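As a rough illustration of the schema-check layer, the sketch below compares an expected column-to-dtype mapping against the mapping observed at write time. The function name and the use of plain dtype strings are assumptions. A real implementation would hook into whatever schema objects the writer already has.

```python
from typing import Mapping


def schema_drift(
    expected: Mapping[str, str], actual: Mapping[str, str]
) -> dict[str, list[str]]:
    """Report added columns, removed columns, and dtype changes."""
    shared = set(expected) & set(actual)
    return {
        "added": sorted(set(actual) - set(expected)),
        "removed": sorted(set(expected) - set(actual)),
        "dtype_changed": sorted(
            name for name in shared if expected[name] != actual[name]
        ),
    }


drift = schema_drift(
    expected={"id": "int64", "text": "string"},
    actual={"id": "int64", "text": "float64", "score": "float64"},
)
```

A non-empty entry in any of the three lists signals structural drift even when the config hash still matches.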
Acceptance criteria
- A public function (shape TBD during implementation, e.g., DataDesignerConfigBuilder.fingerprint() or a freestanding helper) that returns a deterministic hash for a given config
- Identical configs produce identical hashes across processes, Python versions, and module load orders
- Changing any field listed in Include changes the hash
- Changing any field listed in Exclude does not change the hash
- Custom column generators contribute at least their registered type name and instance config to the hash (L1). L2 source-hashing may be opt-in
- Plugins that can't be source-hashed (compiled, zipped, etc.) degrade gracefully with a warning, not a hard error
- config_hash_version is exposed and bumps cleanly when the normalization scheme changes
- Tests cover the deterministic property and the include/exclude boundaries