Add a deterministic hash to uniquely identify a workflow config #584

## Problem

There is currently no canonical way to **uniquely identify a workflow configuration**. Multiple features need this primitive:

- Determining whether two configurations are the same (without writing ad-hoc field-by-field comparisons that drift as configs evolve)
- Recording exact provenance in dataset metadata / dataset cards
- Diffing configurations between runs
- Any future caching, deduplication, or compatibility-gating logic that needs to ask "is this the same config that produced X?"

Today, anyone wanting to answer that question has to compare ad-hoc subsets of fields, which is brittle and easy to get wrong (especially around nested column / model / sampler / processor configs).

## Proposed solution

Provide a deterministic, content-addressable hash of a workflow config, computed from a normalized canonical dump of the identity-relevant subset.

### What's hashed

| Include | Why |
|---|---|
| Column configs (names, types, generator params, processors, validators) | Changing any of these changes per-row output |
| Model configs (model identity + sampling params that affect generation) | Different temperature/top-p → different distribution |
| Sampler / RNG seed (when set) | Changes which samples come out |
| Seed dataset identity (path + content hash, or just content hash) | Different inputs → different outputs |
| Buffer size | Changes batch/row-group alignment of any checkpointed state |

| Exclude | Why |
|---|---|
| `dataset_name`, output path | Path identity, not data identity |
| Concurrency / threading params | Affect speed, not output |
| API keys, endpoints, logging | Environment, not data |
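The include/exclude split above could be applied as a normalization step that runs before hashing. The following is a minimal sketch, not the project's actual schema: the key names (`columns`, `model_configs`, `seed`, `seed_dataset`, `buffer_size`, `api_key`, `endpoint`) and the `normalize_config` helper are hypothetical, chosen only to mirror the two tables.

```python
from typing import Any

# Hypothetical key names mirroring the Include table; the real schema may differ.
IDENTITY_KEYS = ("columns", "model_configs", "seed", "seed_dataset", "buffer_size")
# Hypothetical environment-only keys to strip from nested model configs.
ENVIRONMENT_KEYS = frozenset({"api_key", "endpoint"})


def normalize_config(config: dict[str, Any]) -> dict[str, Any]:
    """Return only the identity-relevant subset of a config dict."""
    # Allowlist: anything not named in IDENTITY_KEYS is dropped.
    normalized = {key: config[key] for key in IDENTITY_KEYS if key in config}
    if "model_configs" in normalized:
        # Keep model identity and sampling params, drop environment details.
        normalized["model_configs"] = [
            {k: v for k, v in model.items() if k not in ENVIRONMENT_KEYS}
            for model in normalized["model_configs"]
        ]
    return normalized
```

An allowlist is used here so that a newly added field stays out of the hash until someone decides it is identity-relevant. A denylist would make the opposite trade-off: new fields change the hash by default.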

### Shape

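A reasonable shape: ``sha256(json.dumps(normalized_config, sort_keys=True, separators=(",", ":"), default=str).encode("utf-8")).hexdigest()``. Note that `default=str` is only deterministic for values whose `str()` output is stable across processes; objects that fall back to the default `repr` embed a memory address, so normalization should convert such values explicitly before hashing.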
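Spelled out with the standard library, the shape above might look like the sketch below. The function name `config_fingerprint` is an assumption for illustration, and the input is assumed to be an already-normalized dict of JSON-friendly values.

```python
import hashlib
import json
from typing import Any


def config_fingerprint(normalized_config: dict[str, Any]) -> str:
    """Hash an already-normalized config via a canonical JSON dump."""
    canonical = json.dumps(
        normalized_config,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no insignificant whitespace
        default=str,             # fallback from the proposal; needs a stable str()
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"
```

With this shape, two dicts that differ only in key order hash identically, while list order remains significant. If column order should not matter, the normalization step would have to sort those lists first.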

When persisted alongside an artifact, store enough metadata to interpret the hash later:

```jsonc
{
  "config_hash": "sha256:9f3a...d217",
  "config_hash_algo": "sha256",
  "config_hash_version": 1
}
```

`config_hash_version` lets the normalization scheme evolve over time: old-version hashes can be treated as "unknown identity" rather than as a definite mismatch when the scheme changes.
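One way to read the "unknown identity" rule is as a three-valued comparison over the persisted records. This is a sketch under assumptions: the `Identity` enum and `compare_identity` helper are hypothetical names, and the records are assumed to carry the three fields from the example above.

```python
from enum import Enum
from typing import Any, Mapping


class Identity(Enum):
    MATCH = "match"
    MISMATCH = "mismatch"
    UNKNOWN = "unknown"


def compare_identity(stored: Mapping[str, Any], current: Mapping[str, Any]) -> Identity:
    """Compare two persisted hash records without over-claiming a mismatch."""
    for field in ("config_hash", "config_hash_algo", "config_hash_version"):
        if field not in stored or field not in current:
            return Identity.UNKNOWN  # incomplete record: cannot judge
    if stored["config_hash_version"] != current["config_hash_version"]:
        return Identity.UNKNOWN  # different normalization scheme
    if stored["config_hash_algo"] != current["config_hash_algo"]:
        return Identity.UNKNOWN  # digests from different algorithms
    if stored["config_hash"] == current["config_hash"]:
        return Identity.MATCH
    return Identity.MISMATCH
```

Callers can then decide separately how to treat `UNKNOWN`, for example by recomputing the hash or by warning instead of failing.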

## Custom columns: how deep does the hash go?

Custom column generators (registered via the plugin / entry-point system) need explicit treatment. The easy parts — registered type name and instance config / parameters — fit naturally into the same model as builtin column configs. The harder question is what to do about the **implementation behind the registration**: if a user edits their custom column's code without changing its name or params, the output changes but the config doesn't.

Three escalating levels of strictness:

| Level | What's hashed | Catches | Misses |
|---|---|---|---|
| **L1: config-only** | Registered type name + config dict | Param changes, type swaps | Impl edits, dependency upgrades |
| **L2: + source hash** | L1 + `inspect.getsource()` of the registered class/function | Most impl edits | Transitive changes via helpers; also yields false positives, since whitespace/comment-only edits flip the hash |
| **L3: + bytecode hash** | L1 + bytecode hash | Impl edits, while staying stable under cosmetic changes | Transitive changes; non-Python deps |
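The practical difference between L2 and L3 can be illustrated on source strings. This is a sketch only: a real L2 would obtain the text via `inspect.getsource()`, the helper names are hypothetical, and the bytecode digest here is just one possible recipe (instructions, referenced names, and constants of each code object).

```python
import hashlib
import types


def source_hash(source: str) -> str:
    """L2-style: hash the raw source text."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def bytecode_hash(source: str) -> str:
    """L3-style: hash compiled code objects instead of text."""
    digest = hashlib.sha256()
    pending = [compile(source, "<plugin>", "exec")]
    while pending:
        code = pending.pop()
        digest.update(code.co_code)
        digest.update(repr(code.co_names).encode("utf-8"))
        for const in code.co_consts:
            if isinstance(const, types.CodeType):
                pending.append(const)  # nested function or class body
            else:
                digest.update(repr(const).encode("utf-8"))
    return digest.hexdigest()


ORIGINAL = "def generate(x):\n    return x + 1\n"
COMMENTED = "def generate(x):\n    # add one\n    return x + 1\n"
CHANGED = "def generate(x):\n    return x + 2\n"

# A comment-only edit flips the source hash but not the bytecode hash.
print(source_hash(ORIGINAL) == source_hash(COMMENTED))
print(bytecode_hash(ORIGINAL) == bytecode_hash(COMMENTED))
# A behavioral edit flips the bytecode hash.
print(bytecode_hash(ORIGINAL) == bytecode_hash(CHANGED))
```

The bytecode digest is only meaningful when both sides were compiled by the same Python version, and the `repr` of some constants (for example frozensets of strings) may not be stable across processes, which is part of why L3 is more work than it first appears.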

**Recommendation:** L1 as the floor (what most users would expect), L2 as opt-in / best-effort with documented limitations. Bytecode hashing (L3) is more work for marginal gain. No level catches dependency upgrades or external-service drift — the hash defends against config drift, not all forms of behavior drift, and the docs should be explicit about that.

**Plugin-specific caveats:**

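- `inspect.getsource()` raises `OSError` when the source cannot be retrieved (e.g., compiled or bytecode-only plugins, and some zipped distributions) and `TypeError` for built-in objects
- Bytecode is specific to the Python version; the same source can yield different bytecode across versions, so bytecode hashes are only comparable within one interpreter version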
- Treat unhashable plugins as "unknown identity" with a warning, not a hard error
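The caveats above suggest a best-effort helper that degrades to "unknown identity" instead of raising. A minimal sketch, assuming a hypothetical `try_source_hash` helper and that `None` is used to represent unknown identity:

```python
import hashlib
import inspect
import warnings
from typing import Any, Optional


def try_source_hash(obj: Any) -> Optional[str]:
    """Return a sha256 of obj's source, or None if the source is unavailable."""
    try:
        source = inspect.getsource(obj)
    except (OSError, TypeError) as exc:
        # OSError: source cannot be retrieved; TypeError: built-in object.
        warnings.warn(
            f"Cannot source-hash {obj!r} ({exc}); treating identity as unknown.",
            RuntimeWarning,
            stacklevel=2,
        )
        return None
    return hashlib.sha256(source.encode("utf-8")).hexdigest()
```

For example, `try_source_hash(len)` emits a `RuntimeWarning` and returns `None`, because built-in functions have no retrievable Python source.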

**Layered defense:** schema validation at row-write time catches structural drift (added/removed columns, dtype shifts) regardless of whether the hash caught it. Together, the hash and the schema check cover most realistic drift hazards.

## Acceptance criteria

- A public function (shape TBD during implementation — e.g., `DataDesignerConfigBuilder.fingerprint()` or a freestanding helper) that returns a deterministic hash for a given config
- Identical configs produce identical hashes across processes, Python versions, and module load orders
- Changing any field listed in **Include** changes the hash
- Changing any field listed in **Exclude** does not change the hash
- Custom column generators contribute at least their registered type name and instance config to the hash (L1); L2 source hashing may be offered as opt-in
- Plugins that can't be source-hashed (compiled, zipped, etc.) degrade gracefully with a warning, not a hard error
- `config_hash_version` is exposed and bumps cleanly when the normalization scheme changes
- Tests cover the deterministic property and the include/exclude boundaries
