feat: let column configs declare all model aliases for the startup health check#626
Conversation
…artup health check Plugin column configs that depend on more than one model alias (generator + judge, critic, etc.) previously could not opt their secondary aliases into the standard startup health check, and configs without a `model_alias` field crashed the collection loop with AttributeError. Add `SingleColumnConfig.get_model_aliases()` as the single override hook the builder uses to enumerate aliases. The default returns the column's primary `model_alias` (if any), so built-in LLM, embedding, and image columns work unchanged. `CustomColumnConfig` overrides it to surface decorator-declared aliases, replacing the special-case `isinstance` branch in the builder. Plugin configs with multiple model fields override it to opt every endpoint into the health check. Fixes #606 Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
|
Docs preview: https://fdf813dd.dd-docs-preview.pages.dev
|
Greptile SummaryThis PR introduces
|
| Filename | Overview |
|---|---|
| packages/data-designer-config/src/data_designer/config/base.py | Adds get_model_aliases() with a safe getattr default; logic is correct and handles all edge cases (absent field, empty string, non-null alias). |
| packages/data-designer-config/src/data_designer/config/column_configs.py | Adds CustomColumnConfig.get_model_aliases() delegating to self.model_aliases; correctly replaces the isinstance branch that was in the builder. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py | Health-check loop simplified to a single call; removal of column_type_is_model_generated gate is safe because the default get_model_aliases() returns [] for non-model configs. |
| packages/data-designer-config/tests/config/test_columns.py | Four new tests cover: absent field, empty-string forwarding, parametrized built-in configs, and the override pattern for multi-model plugins. |
| packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py | Two new builder tests verify alias collection and the skip-when-no-aliases path; regression coverage for #606 is solid. |
| packages/data-designer-engine/tests/engine/column_generators/generators/test_custom.py | Small assertion added to existing test confirming get_model_aliases() forwards decorator-declared aliases correctly. |
| docs/plugins/models.md | Docs updated to reflect the new get_model_aliases() hook; the PairwiseJudgeColumnConfig example is consistent with the implementation and removes the now-redundant _validate() workaround. |
Sequence Diagram
sequenceDiagram
participant B as DatasetBuilder
participant C as SingleColumnConfig
participant CC as CustomColumnConfig
participant R as ModelRegistry
B->>B: _run_model_health_check_if_needed()
loop for each config in single_column_configs
B->>C: get_model_aliases()
alt default (built-in LLM/embedding/image)
C-->>B: [model_alias] or []
else CustomColumnConfig override
C->>CC: self.model_aliases (decorator metadata)
CC-->>B: ["alias-a", "alias-b", ...]
else plugin override (e.g. PairwiseJudgeColumnConfig)
C-->>B: [model_alias, judge_model_alias]
end
B->>B: model_aliases.update(result)
end
alt model_aliases is empty
B-->>B: return (skip health check)
else
B->>R: run_health_check(list(model_aliases))
R-->>B: ok / raises on failure
end
Reviews (3): Last reviewed commit: "Merge branch 'main' into feat/get-model-..." | Re-trigger Greptile
Review: PR #626 —
|
andreatgretel
left a comment
There was a problem hiding this comment.
Requesting changes for the empty model_alias regression called out inline. The fix should preserve startup health-check fail-fast behavior by treating only a missing attribute as absent, not falsy string values.
SingleColumnConfig.get_model_aliases() used `if alias` to filter, which also dropped empty-string aliases. Empty model_alias values are accepted by the config model and previously reached run_health_check, where they failed fast with "No model config with alias '' found!". Treating them as "no model endpoints" silently delayed that error to first generation. Use `alias is not None` so only a truly missing attribute skips the health check, and add a regression test that exercises an empty-string model_alias on a built-in config. Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
📋 Summary
Plugin column configs that depend on more than one model alias (generator + judge, critic, etc.) could not opt their secondary aliases into the standard startup model health check, and configs without a
model_aliasfield crashed the collection loop withAttributeError. This PR adds a single overridable accessor —SingleColumnConfig.get_model_aliases()— so every column type, built-in or plugin, can declare exactly which model endpoints the startup health check should ping.🔗 Related Issue
Fixes #606
🔄 Changes
SingleColumnConfig.get_model_aliases()(base.py) readsmodel_aliasviagetattrso configs without that field return[]instead of raising. Built-in LLM, embedding, and image configs work unchanged via this default.CustomColumnConfig.get_model_aliases()override (column_configs.py) surfaces decorator-declared aliases through the same hook, replacing theisinstance(config, CustomColumnConfig)branch in the builder._run_model_health_check_if_needed(dataset_builder.py) collapses to a single loop overconfig.get_model_aliases(). TheCustomColumnConfigimport in the engine is removed (no longer needed).PairwiseJudgeColumnConfigexample to overrideget_model_aliases()instead of the previous workaround that registration-validated extra aliases in_validate(). The "Health checks and scheduling" section is updated to reflect accessor-driven collection.test_columns.py,test_dataset_builder.py, andtest_custom.pycover: the default behavior on configs withoutmodel_alias, the primary-alias path on built-in LLM/embedding/image configs (parametrized), the override pattern from the docs example, the multi-alias collection in the builder (regression test for feat(engine): let plugin column configs declare all model aliases for the startup health check #606), and the no-op skip path for sampler-only configs.🧪 Testing
pytest packages/data-designer-config/tests packages/data-designer-engine/tests packages/data-designer/testspasses (3262 tests)ruff format --checkandruff checkclean across the workspace✅ Checklist
feat(engine):prefix matches recent history)AGENTS.mdinvariants unchanged; this is a hook on an existing class, not a layering change)🔍 Attention Areas
packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py— this PR drops thecolumn_type_is_model_generated(config.column_type)gate from the health-check loop, which the issue's proposed snippet kept. The reason:CustomColumnGeneratorextendsColumnGenerator, notColumnGeneratorWithModelRegistry, so that gate excludescustomcolumns and the existingisinstancebranch would not actually be subsumed by the new accessor. Callingget_model_aliases()on every config relies on the default returning[]for non-model configs (samplers, expressions, validators, seed datasets), which is verified by the newtest_run_model_health_check_skips_when_no_model_aliasestest. The newtest_run_model_health_check_collects_aliases_from_get_model_aliasestest asserts that built-in LLM andCustomColumnConfigaliases are both collected through the single loop.docs/plugins/models.md— thePairwiseJudgeColumnGeneratorexample loses its_validate()method. That method previously calledget_model_config(...)for the secondary alias to fail-fast on registration errors. Now thatget_model_aliases()opts that alias intorun_health_check, the registration check happens for free and the endpoint is actually exercised at startup, so the manual workaround is redundant.