Discriminated unions with LLM-structured output cause record-level validation failure

### Priority Level

Medium (Annoying but has workaround)

### Describe the bug

## Summary

Using a schema with discriminated unions as `output_format` for `LLMStructuredColumnConfig` fails because Data Designer’s validation is too strict. The LLM produces valid output (correct discriminator and fields for each item), but we don’t use the discriminator. Instead, we validate each item against **every** variant and prune the same instance in place each time. That cross-variant pruning removes fields that are valid for the intended variant (e.g. we prune `beta_tags` when checking against `AlphaItem`). The corrupted item then matches zero variants, so validation fails. Retries fail the same way, and the record is eventually dropped.

## What I'm trying to do

I want to generate data where the output schema uses discriminated unions — e.g. a list of items that can be either "alpha" (with `alpha_detail`) or "beta" (with `beta_tags`), distinguished by a `kind` field.

```python
from typing import Annotated, Literal, Union
from pydantic import BaseModel, Field

class AlphaItem(BaseModel):
    kind: Literal["alpha"] = Field(default="alpha")
    name: str
    alpha_detail: dict

class BetaItem(BaseModel):
    kind: Literal["beta"] = Field(default="beta")
    name: str
    beta_tags: list[str]

UnionItem = Annotated[
    Union[AlphaItem, BetaItem],
    Field(discriminator="kind"),
]

class Container(BaseModel):
    items: list[UnionItem]
```

Using `Container` (or its JSON schema) as `output_format` for an `LLMStructuredColumnConfig` reliably fails due to over-pruning.

## What the schema looks like (JSON Schema)

Pydantic emits **`oneOf`** plus a **`discriminator`** (OpenAPI-style). The discriminator says which variant to use from `kind`; jsonschema doesn't use it and just tries every variant.

```json
{
  "$defs": {
    "AlphaItem": {
      "properties": {...},
      "required": [
        "name",
        "alpha_detail"
      ],
      "title": "AlphaItem",
      "type": "object"
    },
    "BetaItem": {
      "properties": {...},
      "required": [
        "name",
        "beta_tags"
      ],
      "title": "BetaItem",
      "type": "object"
    }
  },
  "properties": {
    "items": {
      "items": {
        "discriminator": {
          "mapping": {
            "alpha": "#/$defs/AlphaItem",
            "beta": "#/$defs/BetaItem"
          },
          "propertyName": "kind"
        },
        "oneOf": [
          {
            "$ref": "#/$defs/AlphaItem"
          },
          {
            "$ref": "#/$defs/BetaItem"
          }
        ]
      },
      "title": "Items",
      "type": "array"
    }
  },
  "required": [
    "items"
  ],
  "title": "Container",
  "type": "object"
}
```

## How we prune additional fields

We set `additionalProperties: false` on every object in the schema (`forbid_additional_properties()` in `packages/data-designer-engine/src/data_designer/engine/processing/gsonschema/schema_transformers.py`), and we extend the validator so that instead of raising on extra keys, we **prune** them in-place (`prune_additional_properties()` in `packages/data-designer-engine/src/data_designer/engine/processing/gsonschema/validators.py`). That works for a single object schema. The problem is when that pruning runs inside **oneOf**.
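A simplified, stdlib-only model of these two steps may make the mechanics easier to follow. The function names below carry a `_sketch` suffix to signal that they are illustrative stand-ins, not the code in `schema_transformers.py` or `validators.py`; the real implementations hook into jsonschema's validator rather than operating on bare dicts, and the rule used here to detect an object schema (`"type": "object"`) is an assumption.

```python
import copy


def forbid_additional_properties_sketch(schema: dict) -> dict:
    """Return a copy of `schema` with additionalProperties=False on every object schema."""

    def walk(node):
        if isinstance(node, dict):
            # Assumption: any node with type "object" is an object schema.
            if node.get("type") == "object":
                node["additionalProperties"] = False
            for value in list(node.values()):
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    result = copy.deepcopy(schema)  # leave the caller's schema untouched
    walk(result)
    return result


def prune_additional_properties_sketch(instance: dict, schema: dict) -> list:
    """Delete keys not declared in schema["properties"], in place; return the removed keys."""
    allowed = set(schema.get("properties", {}))
    removed = [key for key in instance if key not in allowed]
    for key in removed:
        del instance[key]
    return removed


alpha_schema = forbid_additional_properties_sketch(
    {"type": "object", "properties": {"kind": {}, "name": {}, "alpha_detail": {}}}
)
item = {"kind": "alpha", "name": "a", "alpha_detail": {}, "comment": "extra"}
removed = prune_additional_properties_sketch(item, alpha_schema)
print(removed)  # ['comment']
```

The key property is that pruning mutates the instance. For a single object schema that is harmless: only genuinely unknown keys such as `comment` disappear.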

## Why it breaks: we prune across variants

jsonschema’s **oneOf** (see `oneOf` in jsonschema’s `_keywords.py`) validates the instance against **each** variant in turn. We don’t use the schema’s `discriminator`; we just use that default behavior. So for a single instance we run validation (and thus our pruning) once per variant, **on the same object**. The LLM might produce a correct alpha item (`kind`, `name`, `alpha_detail`). Then:

- Checking against **AlphaItem**: instance matches (maybe we prune some extra keys).
- Checking against **BetaItem**: we prune the same instance again — and remove `alpha_detail` (not in BetaItem’s properties). The instance no longer has the fields required by AlphaItem and still doesn’t match BetaItem (`kind` is `"alpha"`).

So **we** strip valid fields by pruning the instance against every variant. The instance is over-pruned or matches zero variants; either way validation fails. The **discriminator** already tells us which variant is intended; we never use it, so our validation is too strict.
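The failure mode can be reproduced with a toy model that uses no libraries. This is a sketch under stated assumptions, not jsonschema's actual `oneOf` code: each variant is reduced to a hypothetical dict holding its allowed property names, its required keys, and its `kind` value, and pruning is assumed to run before the required-field check.

```python
# Hypothetical, trimmed stand-ins for the AlphaItem and BetaItem schemas.
ALPHA = {"kind": "alpha", "properties": {"kind", "name", "alpha_detail"},
         "required": ["name", "alpha_detail"]}
BETA = {"kind": "beta", "properties": {"kind", "name", "beta_tags"},
        "required": ["name", "beta_tags"]}


def matches_after_pruning(instance: dict, variant: dict) -> bool:
    """Prune keys unknown to `variant` in place, then check whether the instance matches."""
    for key in [k for k in instance if k not in variant["properties"]]:
        del instance[key]
    has_required = all(key in instance for key in variant["required"])
    return has_required and instance.get("kind") == variant["kind"]


def one_of_with_pruning(instance: dict, variants: list) -> int:
    """Count matching variants, checking every variant against the same object."""
    return sum(matches_after_pruning(instance, v) for v in variants)


beta_item = {"kind": "beta", "name": "b", "beta_tags": ["x"]}
beta_matches = one_of_with_pruning(beta_item, [ALPHA, BETA])
print(beta_matches, beta_item)  # 0 {'kind': 'beta', 'name': 'b'}

alpha_item = {"kind": "alpha", "name": "a", "alpha_detail": {}}
alpha_matches = one_of_with_pruning(alpha_item, [ALPHA, BETA])
print(alpha_matches, alpha_item)  # 1 {'kind': 'alpha', 'name': 'a'}
```

In this model the beta item loses `beta_tags` during the `AlphaItem` check and then matches nothing. The alpha item still counts one match, but the later `BetaItem` check has already stripped `alpha_detail`, so the surviving object is no longer a valid alpha item.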

## What happens

The LLM produces output that is correct for the intended variant (right discriminator, right fields). Our validation prunes the instance against every variant and ends up removing valid fields or failing the “exactly one match” rule. We retry; the same behavior repeats. After retries are exhausted the **record is dropped**. You see repeated “Unspecified property removed…” logs and no valid row — an unproductive loop.

## Suggested fix

When `oneOf` has a sibling **`discriminator`**, use it: read the discriminator property from the instance (e.g. `kind`), pick the single variant from the mapping, and validate/prune only against that variant. Non-discriminated `oneOf` can keep the default “try all variants” behavior.
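One possible shape for that lookup, as a minimal stdlib-only sketch rather than a proposed patch: `select_variant` and `resolve_local_ref` are hypothetical helper names, the schema is a trimmed stand-in for the one above, and only local `#/...` references without JSON Pointer escapes are handled.

```python
SCHEMA = {
    "$defs": {
        "AlphaItem": {"properties": {"kind": {}, "name": {}, "alpha_detail": {}},
                      "required": ["name", "alpha_detail"]},
        "BetaItem": {"properties": {"kind": {}, "name": {}, "beta_tags": {}},
                     "required": ["name", "beta_tags"]},
    },
    "items": {
        "discriminator": {
            "mapping": {"alpha": "#/$defs/AlphaItem", "beta": "#/$defs/BetaItem"},
            "propertyName": "kind",
        },
        "oneOf": [{"$ref": "#/$defs/AlphaItem"}, {"$ref": "#/$defs/BetaItem"}],
    },
}


def resolve_local_ref(root: dict, ref: str) -> dict:
    """Follow a local reference such as '#/$defs/AlphaItem' from the schema root."""
    if not ref.startswith("#/"):
        raise ValueError(f"only local refs are supported in this sketch: {ref}")
    node = root
    for part in ref[2:].split("/"):
        node = node[part]
    return node


def select_variant(root: dict, subschema: dict, instance):
    """Return the single variant named by the discriminator, or None if it cannot be chosen."""
    discriminator = subschema.get("discriminator")
    if "oneOf" not in subschema or not discriminator or not isinstance(instance, dict):
        return None
    value = instance.get(discriminator["propertyName"])
    if not isinstance(value, str):
        return None
    ref = discriminator.get("mapping", {}).get(value)
    return resolve_local_ref(root, ref) if ref else None


item = {"kind": "beta", "name": "b", "beta_tags": ["x"], "comment": "extra"}
variant = select_variant(SCHEMA, SCHEMA["items"], item)
# Prune only against the selected variant.
for key in [k for k in item if k not in variant["properties"]]:
    del item[key]
print(item)  # {'kind': 'beta', 'name': 'b', 'beta_tags': ['x']}
```

When `select_variant` returns `None` (no discriminator, or a missing or unknown `kind` value), the caller could fall back to the default try-all behavior or report a validation error; which of the two is preferable is left open here.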

### Steps/Code to reproduce bug

Minimal reproduction script attached:
[bug_report_discriminated_union.py](https://github.com/user-attachments/files/25798615/bug_report_discriminated_union.py)

### Expected behavior

The LLM is capable of producing data for schemas that include discriminated unions. We should fix the parsing in Data Designer to avoid over-pruning so that these records aren't discarded.

### Additional context

_No response_
