Priority Level
Medium (Annoying but has workaround)
Describe the bug
Summary
Using a schema with discriminated unions as output_format for LLMStructuredColumnConfig fails because Data Designer’s validation is too strict. The LLM produces valid output (correct discriminator and fields for each item), but we don’t use the discriminator. Instead we validate against every variant and prune the same instance for each; that cross-variant pruning removes fields that are valid for the intended variant (e.g. we prune beta_tags when checking against AlphaItem). We corrupt valid data which then matches zero variants, so validation fails. Retries fail the same way; the record is eventually dropped.
What I'm trying to do
I want to generate data where the output schema uses discriminated unions — e.g. a list of items that can be either "alpha" (with alpha_detail) or "beta" (with beta_tags), distinguished by a kind field.
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field


class AlphaItem(BaseModel):
    kind: Literal["alpha"] = Field(default="alpha")
    name: str
    alpha_detail: dict


class BetaItem(BaseModel):
    kind: Literal["beta"] = Field(default="beta")
    name: str
    beta_tags: list[str]


UnionItem = Annotated[
    Union[AlphaItem, BetaItem],
    Field(discriminator="kind"),
]


class Container(BaseModel):
    items: list[UnionItem]
Using Container (or its JSON schema) as output_format for an LLMStructuredColumnConfig reliably fails due to over-pruning.
What the schema looks like (JSON Schema)
Pydantic emits oneOf plus a discriminator (OpenAPI-style). The discriminator maps the value of kind to the variant that applies; jsonschema ignores it and tries every variant.
{
  "$defs": {
    "AlphaItem": {
      "properties": {...},
      "required": [
        "name",
        "alpha_detail"
      ],
      "title": "AlphaItem",
      "type": "object"
    },
    "BetaItem": {
      "properties": {...},
      "required": [
        "name",
        "beta_tags"
      ],
      "title": "BetaItem",
      "type": "object"
    }
  },
  "properties": {
    "items": {
      "items": {
        "discriminator": {
          "mapping": {
            "alpha": "#/$defs/AlphaItem",
            "beta": "#/$defs/BetaItem"
          },
          "propertyName": "kind"
        },
        "oneOf": [
          {
            "$ref": "#/$defs/AlphaItem"
          },
          {
            "$ref": "#/$defs/BetaItem"
          }
        ]
      },
      "title": "Items",
      "type": "array"
    }
  },
  "required": [
    "items"
  ],
  "title": "Container",
  "type": "object"
}
How we prune additional fields
We set additionalProperties: false on every object in the schema (forbid_additional_properties() in packages/data-designer-engine/src/data_designer/engine/processing/gsonschema/schema_transformers.py), and we extend the validator so that instead of raising on extra keys, we prune them in-place (prune_additional_properties() in packages/data-designer-engine/src/data_designer/engine/processing/gsonschema/validators.py). That works for a single object schema. The problem is when that pruning runs inside oneOf.
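To make the first step concrete, here is a minimal sketch of a schema walk that sets additionalProperties: false on every object schema. It is not the actual forbid_additional_properties() implementation; the function name forbid_extra_keys and the rule for spotting object schemas are assumptions made for illustration.

```python
import copy


def forbid_extra_keys(schema):
    """Return a copy of `schema` with additionalProperties: false on object schemas.

    Hypothetical helper: a node counts as an object schema if it declares
    "type": "object" or has a "properties" key. The walk is naive and would
    misread a property that is itself named "type" or "properties".
    """

    def walk(node):
        if isinstance(node, dict):
            if node.get("type") == "object" or "properties" in node:
                node["additionalProperties"] = False
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    result = copy.deepcopy(schema)  # leave the caller's schema untouched
    walk(result)
    return result


example_schema = {
    "$defs": {
        "AlphaItem": {"type": "object", "properties": {"name": {"type": "string"}}},
    },
    "type": "object",
    "properties": {
        "items": {
            "type": "array",
            "items": {"oneOf": [{"$ref": "#/$defs/AlphaItem"}]},
        },
    },
}
strict_schema = forbid_extra_keys(example_schema)
```

Because every variant under $defs ends up closed this way, each variant treats the other variant's fields as extra keys, which is what sets up the problem described next.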
Why it breaks: we prune across variants
jsonschema’s oneOf (see oneOf in jsonschema’s _keywords.py) validates the instance against each variant in turn. We don’t use the schema’s discriminator; we just use that default behavior. So for a single instance we run validation (and thus our pruning) once per variant, on the same object. The LLM might produce a correct alpha item (kind, name, alpha_detail). Then:
- Checking against AlphaItem: instance matches (maybe we prune some extra keys).
- Checking against BetaItem: we prune the same instance again and remove alpha_detail (not in BetaItem’s properties). The instance no longer has the fields required by AlphaItem and still doesn’t match BetaItem (kind is "alpha").
So we strip valid fields by pruning the instance against every variant. The instance is over-pruned or matches zero variants; either way validation fails. The discriminator already tells us which variant is intended; we never use it, so our validation is too strict.
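The effect can be reproduced with a toy model in plain Python. This is only a sketch: the VARIANTS dicts are simplified stand-ins rather than real JSON Schema, and prune_and_check and one_of_try_all are hypothetical names, not Data Designer or jsonschema functions. It assumes pruning happens whenever a variant is checked, whether or not that variant ends up matching.

```python
# Toy model of cross-variant pruning. These dicts are simplified stand-ins
# for the two variants, not real JSON Schema and not Data Designer code.
VARIANTS = {
    "AlphaItem": {
        "kind": "alpha",
        "properties": {"kind", "name", "alpha_detail"},
        "required": {"name", "alpha_detail"},
    },
    "BetaItem": {
        "kind": "beta",
        "properties": {"kind", "name", "beta_tags"},
        "required": {"name", "beta_tags"},
    },
}


def prune_and_check(instance, variant):
    """Drop keys the variant does not declare (in place), then test for a match."""
    for key in list(instance):
        if key not in variant["properties"]:
            del instance[key]
    return (
        instance.get("kind") == variant["kind"]
        and variant["required"].issubset(instance)
    )


def one_of_try_all(instance):
    """Mimic the default oneOf loop: every variant sees the same object."""
    return [name for name, v in VARIANTS.items() if prune_and_check(instance, v)]


beta = {"kind": "beta", "name": "b1", "beta_tags": ["x", "y"]}
beta_matches = one_of_try_all(beta)

alpha = {"kind": "alpha", "name": "a1", "alpha_detail": {"x": 1}}
alpha_matches = one_of_try_all(alpha)

print(beta_matches, beta)
print(alpha_matches, alpha)
```

In this model the beta item loses beta_tags during the AlphaItem check and then matches nothing. The alpha item is counted as an AlphaItem match, but the later BetaItem check has already removed alpha_detail from the shared object, so the data that survives is no longer a valid alpha item.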
What happens
The LLM produces output that is correct for the intended variant (right discriminator, right fields). Our validation prunes the instance against every variant and ends up removing valid fields or failing the “exactly one match” rule. We retry; the same behavior repeats. After retries are exhausted the record is dropped. You see repeated “Unspecified property removed…” logs and no valid row — an unproductive loop.
Suggested fix
When oneOf has a sibling discriminator, use it: read the discriminator property from the instance (e.g. kind), pick the single variant from the mapping, and validate/prune only against that variant. Non-discriminated oneOf can keep the default “try all variants” behavior.
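A rough sketch of that selection logic, in plain Python and independent of jsonschema's validator machinery. The helper names (resolve_local_ref, select_variant, prune_to_variant) are hypothetical, and the contents of each "properties" map are assumed, since the schema above elides them. Wiring this into the real validator is not shown.

```python
# Trimmed schema; the "properties" contents are assumed for illustration.
SCHEMA = {
    "$defs": {
        "AlphaItem": {
            "properties": {"kind": {"const": "alpha"}, "name": {}, "alpha_detail": {}},
            "required": ["name", "alpha_detail"],
        },
        "BetaItem": {
            "properties": {"kind": {"const": "beta"}, "name": {}, "beta_tags": {}},
            "required": ["name", "beta_tags"],
        },
    },
}
ITEM_NODE = {
    "discriminator": {
        "propertyName": "kind",
        "mapping": {"alpha": "#/$defs/AlphaItem", "beta": "#/$defs/BetaItem"},
    },
    "oneOf": [{"$ref": "#/$defs/AlphaItem"}, {"$ref": "#/$defs/BetaItem"}],
}


def resolve_local_ref(root, ref):
    """Follow a local reference such as '#/$defs/AlphaItem' (no pointer escapes)."""
    if not ref.startswith("#/"):
        raise ValueError(f"only local refs are handled in this sketch: {ref}")
    node = root
    for part in ref[2:].split("/"):
        node = node[part]
    return node


def select_variant(node, instance, root):
    """Return the variant named by the discriminator, or None to fall back."""
    discriminator = node.get("discriminator")
    if not discriminator or not isinstance(instance, dict):
        return None
    tag = instance.get(discriminator["propertyName"])
    if not isinstance(tag, str):
        return None
    ref = discriminator.get("mapping", {}).get(tag)
    return resolve_local_ref(root, ref) if ref else None


def prune_to_variant(instance, variant):
    """Remove only the keys that the selected variant does not declare."""
    for key in list(instance):
        if key not in variant.get("properties", {}):
            del instance[key]


item = {"kind": "alpha", "name": "a1", "alpha_detail": {"x": 1}, "junk": True}
variant = select_variant(ITEM_NODE, item, SCHEMA)
if variant is not None:
    prune_to_variant(item, variant)
print(item)
```

Returning None for a missing discriminator, a missing mapping, or an unknown tag lets the caller fall back to the default "try all variants" path, so non-discriminated oneOf behaves as it does today.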
Steps/Code to reproduce bug
Minimal reproducible example attached.
bug_report_discriminated_union.py
Expected behavior
The LLM is capable of producing data for schemas that include discriminated unions. We should fix the parsing in Data Designer to avoid over-pruning so these records aren't discarded.
Additional context
No response