Skip to content

feat: let column configs declare all model aliases for the startup health check#626

Merged
nabinchha merged 4 commits into
mainfrom
feat/get-model-aliases-606
May 11, 2026
Merged

feat: let column configs declare all model aliases for the startup health check#626
nabinchha merged 4 commits into
mainfrom
feat/get-model-aliases-606

Conversation

@nabinchha

Copy link
Copy Markdown
Contributor

📋 Summary

Plugin column configs that depend on more than one model alias (generator + judge, critic, etc.) could not opt their secondary aliases into the standard startup model health check, and configs without a model_alias field crashed the collection loop with AttributeError. This PR adds a single overridable accessor — SingleColumnConfig.get_model_aliases() — so every column type, built-in or plugin, can declare exactly which model endpoints the startup health check should ping.

🔗 Related Issue

Fixes #606

🔄 Changes

  • New default accessor SingleColumnConfig.get_model_aliases() (base.py) reads model_alias via getattr so configs without that field return [] instead of raising. Built-in LLM, embedding, and image configs work unchanged via this default.
  • CustomColumnConfig.get_model_aliases() override (column_configs.py) surfaces decorator-declared aliases through the same hook, replacing the isinstance(config, CustomColumnConfig) branch in the builder.
  • _run_model_health_check_if_needed (dataset_builder.py) collapses to a single loop over config.get_model_aliases(). The CustomColumnConfig import in the engine is removed (no longer needed).
  • Plugin docs (docs/plugins/models.md) update the multi-model PairwiseJudgeColumnConfig example to override get_model_aliases() instead of the previous workaround that registration-validated extra aliases in _validate(). The "Health checks and scheduling" section is updated to reflect accessor-driven collection.
  • New tests in test_columns.py, test_dataset_builder.py, and test_custom.py cover: the default behavior on configs without model_alias, the primary-alias path on built-in LLM/embedding/image configs (parametrized), the override pattern from the docs example, the multi-alias collection in the builder (regression test for feat(engine): let plugin column configs declare all model aliases for the startup health check #606), and the no-op skip path for sampler-only configs.

🧪 Testing

  • pytest packages/data-designer-config/tests packages/data-designer-engine/tests packages/data-designer/tests passes (3262 tests)
  • Unit tests added: 6 new tests covering default behavior, built-in configs, the multi-model override pattern, custom column decorator metadata, and the dataset-builder health-check collection loop
  • ruff format --check and ruff check clean across the workspace
  • License headers up to date
  • E2E tests added/updated — N/A (config/builder layer, no new end-to-end surface)

✅ Checklist

  • Follows commit message conventions (feat(engine): prefix matches recent history)
  • Commits are signed off (DCO)
  • Architecture docs updated — N/A (AGENTS.md invariants unchanged; this is a hook on an existing class, not a layering change)

🔍 Attention Areas

⚠️ Reviewers: Please pay special attention to the following:

  • packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py — this PR drops the column_type_is_model_generated(config.column_type) gate from the health-check loop, which the issue's proposed snippet kept. The reason: CustomColumnGenerator extends ColumnGenerator, not ColumnGeneratorWithModelRegistry, so that gate excludes custom columns and the existing isinstance branch would not actually be subsumed by the new accessor. Calling get_model_aliases() on every config relies on the default returning [] for non-model configs (samplers, expressions, validators, seed datasets), which is verified by the new test_run_model_health_check_skips_when_no_model_aliases test. The new test_run_model_health_check_collects_aliases_from_get_model_aliases test asserts that built-in LLM and CustomColumnConfig aliases are both collected through the single loop.
  • docs/plugins/models.md — the PairwiseJudgeColumnGenerator example loses its _validate() method. That method previously called get_model_config(...) for the secondary alias to fail-fast on registration errors. Now that get_model_aliases() opts that alias into run_health_check, the registration check happens for free and the endpoint is actually exercised at startup, so the manual workaround is redundant.

…artup health check

Plugin column configs that depend on more than one model alias (generator + judge,
critic, etc.) previously could not opt their secondary aliases into the standard
startup health check, and configs without a `model_alias` field crashed the
collection loop with AttributeError.

Add `SingleColumnConfig.get_model_aliases()` as the single override hook the
builder uses to enumerate aliases. The default returns the column's primary
`model_alias` (if any), so built-in LLM, embedding, and image columns work
unchanged. `CustomColumnConfig` overrides it to surface decorator-declared
aliases, replacing the special-case `isinstance` branch in the builder. Plugin
configs with multiple model fields override it to opt every endpoint into the
health check.

Fixes #606

Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
@nabinchha nabinchha requested a review from a team as a code owner May 9, 2026 00:24
@github-actions

github-actions Bot commented May 9, 2026

Copy link
Copy Markdown
Contributor

Docs preview: https://fdf813dd.dd-docs-preview.pages.dev

Notebook tutorials are placeholder-only in previews.

@greptile-apps

greptile-apps Bot commented May 9, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR introduces SingleColumnConfig.get_model_aliases() as a single overridable hook that replaces the ad-hoc two-branch logic in _run_model_health_check_if_needed. The refactor also fixes an AttributeError that crashed the health-check loop when iterating over configs that lacked a model_alias field, and enables plugin configs with multiple model endpoints to opt all their aliases into the startup health check.

  • base.py: Adds get_model_aliases() to SingleColumnConfig; the default uses getattr so configs without a model_alias field return [] rather than raising, and an explicit empty string is forwarded so the registry error surfaces eagerly at startup.
  • column_configs.py: CustomColumnConfig overrides get_model_aliases() to return self.model_aliases (decorator-declared aliases), replacing the isinstance branch that was in the builder.
  • dataset_builder.py: _run_model_health_check_if_needed collapses to a single loop; the CustomColumnConfig import is removed.
  • docs/plugins/models.md: Updates the PairwiseJudgeColumnConfig example to override get_model_aliases() instead of the previous _validate() workaround; six new unit tests cover all paths.

Confidence Score: 5/5

Safe to merge — the refactor is well-scoped, all edge cases are covered by new tests, and the behavioral change is backward-compatible for existing built-in configs.

All changed paths are straightforward: the new default uses a safe getattr fallback, the CustomColumnConfig override directly delegates to the already-tested model_aliases property, and the builder loop simplification is equivalent to the old two-branch logic for every config type. The six new tests cover absent-field, empty-string, parametrized built-ins, override pattern, and the builder's skip path, leaving no meaningful gap in coverage.

No files require special attention.

Important Files Changed

Filename Overview
packages/data-designer-config/src/data_designer/config/base.py Adds get_model_aliases() with a safe getattr default; logic is correct and handles all edge cases (absent field, empty string, non-null alias).
packages/data-designer-config/src/data_designer/config/column_configs.py Adds CustomColumnConfig.get_model_aliases() delegating to self.model_aliases; correctly replaces the isinstance branch that was in the builder.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py Health-check loop simplified to a single call; removal of column_type_is_model_generated gate is safe because the default get_model_aliases() returns [] for non-model configs.
packages/data-designer-config/tests/config/test_columns.py Four new tests cover: absent field, empty-string forwarding, parametrized built-in configs, and the override pattern for multi-model plugins.
packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py Two new builder tests verify alias collection and the skip-when-no-aliases path; regression coverage for #606 is solid.
packages/data-designer-engine/tests/engine/column_generators/generators/test_custom.py Small assertion added to existing test confirming get_model_aliases() forwards decorator-declared aliases correctly.
docs/plugins/models.md Docs updated to reflect the new get_model_aliases() hook; the PairwiseJudgeColumnConfig example is consistent with the implementation and removes the now-redundant _validate() workaround.

Sequence Diagram

sequenceDiagram
    participant B as DatasetBuilder
    participant C as SingleColumnConfig
    participant CC as CustomColumnConfig
    participant R as ModelRegistry

    B->>B: _run_model_health_check_if_needed()
    loop for each config in single_column_configs
        B->>C: get_model_aliases()
        alt default (built-in LLM/embedding/image)
            C-->>B: [model_alias] or []
        else CustomColumnConfig override
            C->>CC: self.model_aliases (decorator metadata)
            CC-->>B: ["alias-a", "alias-b", ...]
        else plugin override (e.g. PairwiseJudgeColumnConfig)
            C-->>B: [model_alias, judge_model_alias]
        end
        B->>B: model_aliases.update(result)
    end
    alt model_aliases is empty
        B-->>B: return (skip health check)
    else
        B->>R: run_health_check(list(model_aliases))
        R-->>B: ok / raises on failure
    end
Loading

Reviews (3): Last reviewed commit: "Merge branch 'main' into feat/get-model-..." | Re-trigger Greptile

@github-actions

github-actions Bot commented May 9, 2026

Copy link
Copy Markdown
Contributor

Review: PR #626feat(engine): let column configs declare all model aliases for the startup health check

Summary

Replaces two special cases in _run_model_health_check_if_needed (a column_type_is_model_generated branch and an isinstance(config, CustomColumnConfig) branch) with a single polymorphic accessor SingleColumnConfig.get_model_aliases() on the config base class. The default reads model_alias via getattr (returning [] when absent), and CustomColumnConfig overrides it to surface the decorator-declared aliases. Fixes #606 — plugin configs that declare multiple model aliases can now opt every endpoint into the startup health check, and configs without a model_alias field no longer crash the loop.

Change is small and focused: ~24 lines of source changes, ~100 lines of tests, doc updates to match.

Findings

Design / architecture

  • Good fit for the "declare, don't orchestrate" contract. Moving the alias-collection decision from the engine's isinstance branches to a virtual method on the config is consistent with the registry/polymorphism pattern called out in AGENTS.md. The engine becomes one line and no longer needs to know about CustomColumnConfig.
  • Import layering improves. Dropping from data_designer.config.column_configs import CustomColumnConfig in dataset_builder.py reduces concrete-type coupling between engine and config. Direction is unchanged (engine → config) but the engine now only depends on the base class hook.
  • Blast radius is small. Only one call site (_run_model_health_check_if_needed) consumes the new accessor; the method is new and can't break subclasses that don't override it.

Correctness

  • Dropped column_type_is_model_generated gate is intentional and safe. The PR description explains the reasoning well: CustomColumnGenerator extends ColumnGenerator, not ColumnGeneratorWithModelRegistry, so the old gate excluded custom columns — which is exactly why the prior code needed the isinstance branch. The new code relies on the default get_model_aliases() returning [] for any config without a model_alias attribute, which is covered by test_run_model_health_check_skips_when_no_model_aliases.
  • getattr(self, "model_alias", None) is the right escape hatch. It correctly degrades to [] for configs with no model_alias field, fixing the AttributeError crash called out in the issue. The if alias else [] branch also eats empty strings, which is desirable (an empty alias would not resolve against ModelRegistry).
  • Backward compatibility preserved. Built-in LLMTextColumnConfig / LLMCodeColumnConfig / EmbeddingColumnConfig / ImageColumnConfig reach the same behavior as before through the default implementation. Any out-of-tree plugin that does not override get_model_aliases() also sees no change — only the primary model_alias is pinged, matching the previous contract.
  • Minor semantic shift to be aware of. The old gate restricted health checks to configs whose column_type was registered as model-generated. The new code will collect model_alias from any SingleColumnConfig subclass that happens to define a model_alias attribute, even if its generator is not model-backed. In practice this is only LLM/embedding/image configs today, so no current config is affected — but worth noting as a future footgun if someone adds a non-model column that carries a model_alias field for other reasons. Consider documenting this in the base-class docstring or, more conservatively, keeping the column_type_is_model_generated gate and only falling through to the accessor for CustomColumnConfig. The current approach is cleaner; flagging for awareness rather than as a blocker.

API shape

  • CustomColumnConfig now has both a model_aliases property (existing) and a get_model_aliases() method (new) that trivially returns self.model_aliases. The two names are close enough to cause confusion — model_aliases is the decorator-metadata getter, get_model_aliases() is the health-check hook. The override docstring helps, but a future refactor could unify these (e.g., have the base class's get_model_aliases() default be a property, or rename the existing property). Not a blocker for this PR.
  • get_model_aliases() is a public, non-abstract method on SingleColumnConfig. That's the right call for this change, but note it becomes part of the plugin author API surface — the docs update handles this.

Docs (docs/plugins/models.md)

  • The rewrite is accurate and reads well. The deleted paragraph explaining the _validate() registration-check workaround is genuinely obsolete now that run_health_check covers the same ground and pings the endpoint.
  • The PairwiseJudgeColumnGenerator example no longer defines _validate(). This is consistent with the new guidance, but existing plugins that still implement _validate() for this purpose will continue to work — it just becomes redundant, not broken. A one-line note pointing this out to readers with existing plugins might be worth adding, though not essential.
  • L163's "uncommon but valid" caveat about configs without a model_alias field is a nice touch.

Tests

  • Coverage is proportional to the change. The six new tests hit: default behavior on a field-less config, primary-alias path on four built-ins (parametrized), multi-alias override pattern, CustomColumnConfig override flowing through the property, and both the collection and skip branches of _run_model_health_check_if_needed.
  • test_run_model_health_check_collects_aliases_from_get_model_aliases is a strong regression test for feat(engine): let plugin column configs declare all model aliases for the startup health check #606 — it directly exercises the bug (secondary aliases not being pinged).
  • Minor gap: no explicit test for the model_alias="" empty-string edge case (the if alias branch would filter it). Not worth blocking on.
  • Test placement matches project convention (config tests in the config package, engine tests in the engine package).

Style / conventions

  • from __future__ import annotations present in touched source files (verified in base.py and column_configs.py context).
  • Type annotations are present and use modern syntax (list[str], str | None).
  • No relative imports added.
  • The alias: str | None = getattr(self, "model_alias", None) annotation is a slight over-claim — getattr could technically return anything, and model_alias on some config could be non-str. In practice model_alias is typed as str across built-in configs, so the annotation holds. Fine to leave as-is.

Security / performance

  • No security concerns. The change does not broaden what is health-checked in a way that would expose new endpoints — it simply lets multi-alias configs opt their secondary endpoints in, which is the explicit goal.
  • No performance concerns. The loop was already O(configs); it remains so.

Verdict

Approve with minor suggestions. This is a clean, well-scoped refactor that fixes a real bug (#606), removes a special case from the engine, and improves separation of concerns between engine and config. Tests are adequate and the docs are updated in lockstep. The only thing I'd consider adding is a one-line clarification in the base-class docstring (or in the plugin docs) that the accessor will be consulted for every SingleColumnConfig, so a plugin author who defines a model_alias on a non-model config should be aware the default collects it — but that's a nice-to-have, not a blocker.

Comment thread packages/data-designer-config/src/data_designer/config/base.py Outdated

@andreatgretel andreatgretel left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes for the empty model_alias regression called out inline. The fix should preserve startup health-check fail-fast behavior by treating only a missing attribute as absent, not falsy string values.

nabinchha added 2 commits May 11, 2026 09:33
SingleColumnConfig.get_model_aliases() used `if alias` to filter, which
also dropped empty-string aliases. Empty model_alias values are accepted
by the config model and previously reached run_health_check, where they
failed fast with "No model config with alias '' found!". Treating them
as "no model endpoints" silently delayed that error to first generation.

Use `alias is not None` so only a truly missing attribute skips the
health check, and add a regression test that exercises an empty-string
model_alias on a built-in config.

Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
@nabinchha nabinchha changed the title feat(engine): let column configs declare all model aliases for the startup health check feat: let column configs declare all model aliases for the startup health check May 11, 2026
@nabinchha nabinchha requested a review from andreatgretel May 11, 2026 15:36

@andreatgretel andreatgretel left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@nabinchha nabinchha merged commit 4b93f5b into main May 11, 2026
50 checks passed
@nabinchha nabinchha deleted the feat/get-model-aliases-606 branch May 11, 2026 17:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(engine): let plugin column configs declare all model aliases for the startup health check

2 participants