
Discriminated unions with LLM-structured output cause record-level validation failure #375

@jeremyjordan

Priority Level

Medium (Annoying but has workaround)

Describe the bug

Summary

Using a schema with discriminated unions as output_format for LLMStructuredColumnConfig fails because Data Designer’s validation is too strict. The LLM produces valid output (correct discriminator and fields for each item), but we ignore the discriminator. Instead, we validate the same instance against every variant and prune it each time. This cross-variant pruning removes fields that are valid for the intended variant (e.g. beta_tags is pruned from a beta item when it is checked against AlphaItem). The corrupted instance then matches zero variants, so validation fails. Retries fail the same way, and the record is eventually dropped.

What I'm trying to do

I want to generate data where the output schema uses discriminated unions — e.g. a list of items that can be either "alpha" (with alpha_detail) or "beta" (with beta_tags), distinguished by a kind field.

from typing import Annotated, Literal, Union
from pydantic import BaseModel, Field

class AlphaItem(BaseModel):
    kind: Literal["alpha"] = Field(default="alpha")
    name: str
    alpha_detail: dict

class BetaItem(BaseModel):
    kind: Literal["beta"] = Field(default="beta")
    name: str
    beta_tags: list[str]

UnionItem = Annotated[
    Union[AlphaItem, BetaItem],
    Field(discriminator="kind"),
]

class Container(BaseModel):
    items: list[UnionItem]

Using Container (or its JSON schema) as output_format for an LLMStructuredColumnConfig reliably fails due to over-pruning.

What the schema looks like (JSON Schema)

Pydantic emits oneOf plus a discriminator (OpenAPI-style). The discriminator maps each value of kind to a variant, but it is an OpenAPI keyword rather than a JSON Schema validation keyword, so jsonschema ignores it and tries every variant.

{
  "$defs": {
    "AlphaItem": {
      "properties": {...},
      "required": [
        "name",
        "alpha_detail"
      ],
      "title": "AlphaItem",
      "type": "object"
    },
    "BetaItem": {
      "properties": {...},
      "required": [
        "name",
        "beta_tags"
      ],
      "title": "BetaItem",
      "type": "object"
    }
  },
  "properties": {
    "items": {
      "items": {
        "discriminator": {
          "mapping": {
            "alpha": "#/$defs/AlphaItem",
            "beta": "#/$defs/BetaItem"
          },
          "propertyName": "kind"
        },
        "oneOf": [
          {
            "$ref": "#/$defs/AlphaItem"
          },
          {
            "$ref": "#/$defs/BetaItem"
          }
        ]
      },
      "title": "Items",
      "type": "array"
    }
  },
  "required": [
    "items"
  ],
  "title": "Container",
  "type": "object"
}

How we prune additional fields

We set additionalProperties: false on every object in the schema (forbid_additional_properties() in packages/data-designer-engine/src/data_designer/engine/processing/gsonschema/schema_transformers.py), and we extend the validator so that instead of raising on extra keys, we prune them in-place (prune_additional_properties() in packages/data-designer-engine/src/data_designer/engine/processing/gsonschema/validators.py). That works for a single object schema. The problem is when that pruning runs inside oneOf.
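To make the two steps concrete, here is a minimal stdlib-only sketch of the idea. It is not the actual implementation of forbid_additional_properties() or prune_additional_properties(); the `_sketch` function names and the sample schema are hypothetical, and the real code hooks into the jsonschema validator rather than being called directly.

```python
import copy


def forbid_additional_properties_sketch(schema):
    """Return a copy of `schema` with additionalProperties: false on object schemas."""
    schema = copy.deepcopy(schema)

    def walk(node):
        if isinstance(node, dict):
            # Treat any dict with type "object" and a properties map as an object schema.
            if node.get("type") == "object" and "properties" in node:
                node.setdefault("additionalProperties", False)
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)

    walk(schema)
    return schema


def prune_in_place_sketch(instance, object_schema):
    """Delete keys not declared in `properties` instead of raising; return removed keys."""
    allowed = object_schema.get("properties", {})
    removed = [key for key in instance if key not in allowed]
    for key in removed:
        del instance[key]
    return removed


# Hypothetical single-object schema: pruning is harmless here.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "required": ["name"],
}
strict = forbid_additional_properties_sketch(schema)
record = {"name": "x", "note": "a key the model added on its own"}
removed = prune_in_place_sketch(record, strict)
```

With a single object schema there is only one set of allowed keys, so mutating the instance in place is safe. The mutation becomes a problem once several schemas are applied to the same object.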

Why it breaks: we prune across variants

jsonschema’s oneOf (see oneOf in jsonschema’s _keywords.py) validates the instance against each variant in turn. We don’t use the schema’s discriminator; we just use that default behavior. So for a single instance we run validation (and thus our pruning) once per variant, on the same object. The LLM might produce a correct alpha item (kind, name, alpha_detail). Then:

  • Checking against AlphaItem: instance matches (maybe we prune some extra keys).
  • Checking against BetaItem: we prune the same instance again — and remove alpha_detail (not in BetaItem’s properties). The instance no longer has the fields required by AlphaItem and still doesn’t match BetaItem (kind is "alpha").

So we strip valid fields by pruning the instance against every variant. The instance is over-pruned or matches zero variants; either way validation fails. The discriminator already tells us which variant is intended; we never use it, so our validation is too strict.
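The effect can be reproduced without jsonschema. The sketch below is a simplified simulation of "prune, then check" applied to one shared instance; it is not jsonschema's oneOf code, and the VARIANTS table is a hand-written stand-in for the two variant schemas (the property sets are assumptions based on the Pydantic models above).

```python
# Simplified stand-ins for the variant schemas (assumed, not the real Pydantic output).
VARIANTS = {
    "AlphaItem": {
        "kind": "alpha",
        "properties": {"kind", "name", "alpha_detail"},
        "required": {"name", "alpha_detail"},
    },
    "BetaItem": {
        "kind": "beta",
        "properties": {"kind", "name", "beta_tags"},
        "required": {"name", "beta_tags"},
    },
}


def prune_and_check(instance, variant):
    """Prune undeclared keys in place, then report whether the instance matches."""
    for key in list(instance):
        if key not in variant["properties"]:
            del instance[key]
    has_required = variant["required"].issubset(instance)
    return has_required and instance.get("kind") == variant["kind"]


def one_of_shared_instance(instance):
    """Check every variant against the same object, as the default oneOf path does."""
    return [name for name, variant in VARIANTS.items() if prune_and_check(instance, variant)]


# A valid beta item: the AlphaItem check strips beta_tags, so BetaItem fails too.
beta = {"kind": "beta", "name": "b", "beta_tags": ["x"]}
matches_beta = one_of_shared_instance(beta)

# A valid alpha item: AlphaItem matches first, then the BetaItem check strips alpha_detail.
alpha = {"kind": "alpha", "name": "a", "alpha_detail": {"k": 1}}
matches_alpha = one_of_shared_instance(alpha)
```

In this simulation the beta item ends with zero matching variants, and the alpha item matches once but comes out without alpha_detail. Those are the two failure shapes described above.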

What happens

The LLM produces output that is correct for the intended variant (right discriminator, right fields). Our validation prunes the instance against every variant and ends up removing valid fields or failing the “exactly one match” rule. We retry; the same behavior repeats. After retries are exhausted the record is dropped. You see repeated “Unspecified property removed…” logs and no valid row — an unproductive loop.

Suggested fix

When oneOf has a sibling discriminator, use it: read the discriminator property from the instance (e.g. kind), pick the single variant from the mapping, and validate/prune only against that variant. Non-discriminated oneOf can keep the default “try all variants” behavior.
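One possible shape for this, as a stdlib-only sketch rather than a patch against validators.py: the helper names (resolve_local_ref, select_variant, prune_with_discriminator) are hypothetical, the property definitions in SCHEMA are assumed because the schema above elides them, and only local "#/..." refs are handled.

```python
def resolve_local_ref(root_schema, ref):
    """Resolve a local '#/a/b' reference (no JSON Pointer escaping handled)."""
    if not ref.startswith("#/"):
        raise ValueError(f"only local refs are handled in this sketch: {ref}")
    node = root_schema
    for part in ref[2:].split("/"):
        node = node[part]
    return node


def select_variant(root_schema, union_schema, instance):
    """Return the variant named by the discriminator, or None to fall back to try-all."""
    discriminator = union_schema.get("discriminator")
    if "oneOf" not in union_schema or not discriminator or not isinstance(instance, dict):
        return None
    tag = instance.get(discriminator["propertyName"])
    if not isinstance(tag, str):
        return None
    ref = discriminator.get("mapping", {}).get(tag)
    return resolve_local_ref(root_schema, ref) if ref else None


def prune_with_discriminator(root_schema, union_schema, instance):
    """Prune only against the selected variant; return (removed keys, missing required keys)."""
    variant = select_variant(root_schema, union_schema, instance)
    if variant is None:
        raise LookupError("no discriminator match; use the default oneOf behavior instead")
    allowed = variant.get("properties", {})
    removed = [key for key in instance if key not in allowed]
    for key in removed:
        del instance[key]
    missing = [key for key in variant.get("required", []) if key not in instance]
    return removed, missing


# Assumed property definitions; the discriminator and oneOf blocks mirror the schema above.
SCHEMA = {
    "$defs": {
        "AlphaItem": {
            "type": "object",
            "properties": {"kind": {}, "name": {}, "alpha_detail": {}},
            "required": ["name", "alpha_detail"],
        },
        "BetaItem": {
            "type": "object",
            "properties": {"kind": {}, "name": {}, "beta_tags": {}},
            "required": ["name", "beta_tags"],
        },
    },
    "discriminator": {
        "propertyName": "kind",
        "mapping": {"alpha": "#/$defs/AlphaItem", "beta": "#/$defs/BetaItem"},
    },
    "oneOf": [{"$ref": "#/$defs/AlphaItem"}, {"$ref": "#/$defs/BetaItem"}],
}

item = {"kind": "beta", "name": "b", "beta_tags": ["x"], "extra": 1}
removed, missing = prune_with_discriminator(SCHEMA, SCHEMA, item)
fallback = select_variant(SCHEMA, {"oneOf": []}, {"kind": "beta"})
```

Here only the genuinely extra key is pruned and beta_tags survives, because AlphaItem is never consulted. Returning None from select_variant keeps the non-discriminated path unchanged.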

Steps/Code to reproduce bug

Minimal reproducible example attached: bug_report_discriminated_union.py

Expected behavior

The LLM is capable of producing data for schemas that include discriminated unions. We should fix the parsing in Data Designer to avoid over-pruning so that these records aren't discarded.

Additional context

No response

Metadata

Labels: bug
