Add a deterministic hash to uniquely identify a workflow config #584

@nabinchha

Problem

There is currently no canonical way to uniquely identify a workflow configuration. Multiple features need this primitive:

  • Determining whether two configurations are the same (without writing ad-hoc field-by-field comparisons that drift as configs evolve)
  • Recording exact provenance in dataset metadata / dataset cards
  • Diffing configurations between runs
  • Any future caching, deduplication, or compatibility-gating logic that needs to ask "is this the same config that produced X?"

Today, anyone wanting to answer that question has to compare ad-hoc subsets of fields, which is brittle and easy to get wrong (especially around nested column / model / sampler / processor configs).

Proposed solution

Provide a deterministic, content-addressable hash of a workflow config, computed from a normalized canonical dump of the identity-relevant subset.

What's hashed

| Include | Why |
|---|---|
| Column configs (names, types, generator params, processors, validators) | Changing any of these changes per-row output |
| Model configs (model identity + sampling params that affect generation) | Different temperature/top-p → different distribution |
| Sampler / RNG seed (when set) | Changes which samples come out |
| Seed dataset identity (path + content hash, or just content hash) | Different inputs → different outputs |
| Buffer size | Changes batch/row-group alignment of any checkpointed state |

| Exclude | Why |
|---|---|
| dataset_name, output path | Path identity, not data identity |
| Concurrency / threading params | Affect speed, not output |
| API keys, endpoints, logging | Environment, not data |

Shape

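A reasonable shape: hashlib.sha256(json.dumps(normalized_config, sort_keys=True, separators=(",", ":"), default=str).encode("utf-8")).hexdigest(). The explicit encode is needed because sha256 operates on bytes, not str. Note that default=str is only safe for values whose str() is stable across processes (e.g. paths); an object that falls back to the default repr embeds a memory address and would make the hash non-deterministic.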
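As a rough illustration of that shape, here is a minimal sketch that assumes the config has already been dumped to a plain dict. The names `EXCLUDED_KEYS`, `normalize`, and `fingerprint`, and the specific key names, are hypothetical stand-ins and not part of any existing API.

```python
import hashlib
import json

# Hypothetical key names standing in for the "Exclude" rows above; the real
# config schema may name (and nest) these differently.
EXCLUDED_KEYS = {"dataset_name", "output_path", "max_parallel_requests",
                 "api_key", "endpoint", "log_level"}


def normalize(value):
    """Recursively drop excluded keys so only identity-relevant fields remain.

    Caveat of this sketch: keys are dropped at every nesting level, which a
    real implementation would probably want to scope more carefully.
    """
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items() if k not in EXCLUDED_KEYS}
    if isinstance(value, (list, tuple)):
        return [normalize(v) for v in value]
    return value


def fingerprint(config: dict) -> str:
    """Hash the canonical JSON dump of the normalized config."""
    canonical = json.dumps(normalize(config), sort_keys=True,
                           separators=(",", ":"), default=str)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same identity-relevant content, different key order and dataset_name.
a = {"columns": [{"name": "q", "type": "llm-text", "temperature": 0.7}],
     "seed": 42, "dataset_name": "run-a"}
b = {"dataset_name": "run-b", "seed": 42,
     "columns": [{"temperature": 0.7, "type": "llm-text", "name": "q"}]}
assert fingerprint(a) == fingerprint(b)
```

Key order and excluded fields do not affect the result, while list order still does, which seems right for columns whose order may matter.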

When persisted alongside an artifact, store enough metadata to interpret the hash later:

{
  "config_hash": "sha256:9f3a...d217",
  "config_hash_algo": "sha256",
  "config_hash_version": 1
}

config_hash_version lets the normalization scheme evolve over time: old-version hashes can be treated as "unknown identity" rather than as a definite mismatch when the scheme changes.
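One way to read the "unknown identity" rule is as a three-way comparison. The sketch below assumes the persisted record has the fields shown above; `CURRENT_HASH_VERSION` and `compare_identity` are hypothetical names.

```python
from typing import Literal

CURRENT_HASH_VERSION = 1  # assumed version of the current normalization scheme


def compare_identity(stored: dict, current_hash: str) -> Literal["same", "different", "unknown"]:
    """Compare a persisted hash record against a freshly computed hash.

    A record written under another scheme version or algorithm cannot be
    compared meaningfully, so it yields "unknown" rather than "different".
    """
    if stored.get("config_hash_version") != CURRENT_HASH_VERSION:
        return "unknown"
    if stored.get("config_hash_algo") != "sha256":
        return "unknown"
    return "same" if stored.get("config_hash") == current_hash else "different"
```

Callers can then decide how to treat "unknown" (for example, warn and proceed) separately from a definite mismatch.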

Custom columns: how deep does the hash go?

Custom column generators (registered via the plugin / entry-point system) need explicit treatment. The easy parts — registered type name and instance config / parameters — fit naturally into the same model as builtin column configs. The harder question is what to do about the implementation behind the registration: if a user edits their custom column's code without changing its name or params, the output changes but the config doesn't.

Three escalating levels of strictness:

| Level | What's hashed | Catches | Misses |
|---|---|---|---|
| L1: config-only | Registered type name + config dict | Param changes, type swaps | Impl edits, dependency upgrades |
| L2: + source hash | L1 + inspect.getsource() of the registered class/function | Most impl edits | Whitespace/comment-only changes flip the hash; transitive changes via helpers |
| L3: + bytecode hash | L1 + bytecode hash | Impl edits, while ignoring cosmetic (whitespace/comment-only) changes | Transitive changes; non-Python deps |

Recommendation: L1 as the floor (what most users would expect), L2 as opt-in / best-effort with documented limitations. Bytecode hashing (L3) is more work for marginal gain. No level catches dependency upgrades or external-service drift — the hash defends against config drift, not all forms of behavior drift, and the docs should be explicit about that.

Plugin-specific caveats:

  • inspect.getsource() may fail for compiled or zipped plugins
  • Bytecode is Python-version-specific; same source yields different bytecode across versions
  • Treat unhashable plugins as "unknown identity" with a warning, not a hard error
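The L1 floor with opt-in L2 and graceful degradation could look roughly like the sketch below. `source_hash` and `custom_column_identity` are hypothetical helpers, not existing functions; the only library behavior relied on is that `inspect.getsource()` raises `OSError` or `TypeError` when source is unavailable.

```python
import hashlib
import inspect
import warnings
from typing import Callable, Optional


def source_hash(obj: Callable) -> Optional[str]:
    """Return a sha256 of the object's source text, or None if unavailable."""
    try:
        source = inspect.getsource(obj)
    except (OSError, TypeError):
        # Compiled, zipped, or builtin implementations land here.
        name = getattr(obj, "__qualname__", repr(obj))
        warnings.warn(f"cannot read source for {name}; "
                      "treating implementation identity as unknown")
        return None
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def custom_column_identity(type_name: str, params: dict, impl: Callable,
                           include_source: bool = False) -> dict:
    """Build the dict that would be folded into the config hash.

    L1: registered type name + params. L2 (opt-in): also a source hash,
    where None stands for "unknown identity".
    """
    identity = {"type": type_name, "params": params}
    if include_source:
        identity["source_sha256"] = source_hash(impl)
    return identity
```

The returned dict is plain data, so it can be merged into the normalized config before the canonical dump.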

Layered defense: schema validation at row-write time catches structural drift (added/removed columns, dtype shifts) regardless of whether the hash caught it. Hash + schema check cover most realistic drift hazards.
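A row-write-time schema check of this kind might be sketched as follows. `schema_drift` is a hypothetical helper, and representing the expected schema as a mapping from column name to Python type is an assumption made for brevity.

```python
def schema_drift(expected: dict, row: dict) -> list:
    """Return human-readable problems; an empty list means no drift.

    `expected` maps column name -> Python type (an assumption of this sketch).
    """
    problems = []
    for name in expected.keys() - row.keys():
        problems.append(f"missing column: {name}")
    for name in row.keys() - expected.keys():
        problems.append(f"unexpected column: {name}")
    for name in expected.keys() & row.keys():
        if not isinstance(row[name], expected[name]):
            problems.append(
                f"wrong type for {name}: expected {expected[name].__name__}, "
                f"got {type(row[name]).__name__}"
            )
    return sorted(problems)  # sorted so the report order is deterministic
```

This check is independent of the hash, so it still fires when drift comes from a source the hash cannot see, such as an edited custom column implementation.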

Acceptance criteria

  • A public function (shape TBD during implementation — e.g., DataDesignerConfigBuilder.fingerprint() or a freestanding helper) that returns a deterministic hash for a given config
  • Identical configs produce identical hashes across processes, Python versions, and module load orders
  • Changing any field listed in Include changes the hash
  • Changing any field listed in Exclude does not change the hash
  • Custom column generators contribute at least their registered type name and instance config to the hash (L1). L2 source-hashing may be opt-in
  • Plugins that can't be source-hashed (compiled, zipped, etc.) degrade gracefully with a warning, not a hard error
  • config_hash_version is exposed and bumps cleanly when the normalization scheme changes
  • Tests cover the deterministic property and the include/exclude boundaries
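The cross-process criterion could be tested along these lines: compute the hash in fresh interpreters under different `PYTHONHASHSEED` values and require a single result. This is a sketch only; the inline snippet uses a hard-coded toy config and the raw sha256-over-JSON shape, where a real test would call the eventual public function.

```python
import os
import subprocess
import sys

# Toy stand-in for "import the library and hash a fixed config".
SNIPPET = r'''
import hashlib, json
config = {"seed": 42, "columns": [{"name": "q", "temperature": 0.7}], "tags": ["b", "a"]}
canonical = json.dumps(config, sort_keys=True, separators=(",", ":"), default=str)
print(hashlib.sha256(canonical.encode("utf-8")).hexdigest())
'''


def hash_in_subprocess(hash_seed: str) -> str:
    """Compute the hash in a fresh interpreter with the given PYTHONHASHSEED."""
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env={**os.environ, "PYTHONHASHSEED": hash_seed},
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def distinct_hashes_across_processes(seeds=("0", "1", "12345")) -> set:
    """Return the set of hashes observed; determinism means exactly one."""
    return {hash_in_subprocess(seed) for seed in seeds}
```

Varying the hash seed matters because str hashing is randomized per process: a set that leaks into the dump via default=str would serialize in a seed-dependent order, and a check like this would flag it.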

Labels

enhancement (New feature or request)
