docs: add structured outputs SDG dev notes #338
…m/NVIDIA-NeMo/DataDesigner into dhruv/devnotes/structured-outputs
Greptile Summary
Adds comprehensive documentation for the structured outputs SDG pipeline used to generate training data for Nemotron Nano v3's structured output capabilities. The post includes:
The documentation is well-structured, technically accurate, and follows the existing devnotes format established in other posts.
| Filename | Overview |
|---|---|
| docs/devnotes/.authors.yml | Adds new author dnathawani to the documentation authors list with proper formatting |
| docs/devnotes/posts/images/structured-outputs-sample-record.png | Screenshot showing sample Data Designer output with seed columns, schema, conversation, and validation results |
| docs/devnotes/posts/structured-outputs.md | Comprehensive dev note documenting structured output SDG pipeline with benchmarks, architecture diagram, and working demo code |
Last reviewed commit: f3765f3
| │ |
| ▼ |
| ┌─────────────────────────────────────────────────────────────────┐ |
| │ STAGE 2: DIVERSITY CONTROLS │ |
I'd suggest having your AI make the boxes a bit wider to avoid the warping. That would make them look nicer.
Good suggestion, done
| --- |
| ## **Step 1: Seed Data and Schema Generation** |
Many headings use ##. Maybe not all of them need to be headings; some could just be in bold.
| 3. **Diversity at every level.** Diverse topics, diverse schemas (depth/width/rigidity), diverse formats, diverse prompts. Each dimension independently improves robustness. |
| 4. **Rejection sampling is cheap insurance.** 3x rollouts push per-record validity from ~80% to >95%. The marginal token cost is small compared to the quality gain. |
| 5. **Validation must be programmatic.** LLM judges assess *design quality* but cannot reliably detect *schema violations*. `jsonschema` + format parsers are non-negotiable. |
| 6. **The hardest formats need the most data.** TOML and XML lag behind JSON and YAML. The pipeline makes it easy to oversample hard formats. |
Your demo script only does JSON, right? Maybe add a brief note here on what is needed to extend this to TOML/XML, etc.
Good point, made a note
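To make the TOML/XML extension concrete, here is a minimal sketch of a format-aware validity check combined with the 3x-rollout rejection loop from point 4. It is not the pipeline's code: `parses_as`, `first_valid`, and the `generate` callable are hypothetical names, and only standard-library parsers are used (`json`, `xml.etree.ElementTree`, and `tomllib`, which needs Python 3.11+). YAML is left out because the standard library has no YAML parser, and schema conformance (the `jsonschema` step in point 5) would still have to run on the parsed result.

```python
import json
import xml.etree.ElementTree as ET

try:
    import tomllib  # standard library from Python 3.11 onward
except ModuleNotFoundError:
    tomllib = None  # TOML checks are unavailable on older interpreters


def _parse_toml(text):
    if tomllib is None:
        raise RuntimeError("tomllib requires Python 3.11+")
    return tomllib.loads(text)


# Hypothetical registry: one syntax parser per target format.
PARSERS = {
    "json": json.loads,
    "xml": ET.fromstring,
    "toml": _parse_toml,
}


def parses_as(text, fmt):
    """Return True if `text` is syntactically valid in format `fmt`."""
    parser = PARSERS[fmt]  # KeyError signals an unsupported format
    try:
        parser(text)
    except (ValueError, ET.ParseError):
        # JSONDecodeError and TOMLDecodeError both subclass ValueError.
        return False
    return True


def first_valid(generate, fmt, max_rollouts=3):
    """Call `generate()` up to `max_rollouts` times and return the first
    output that parses in `fmt`, or None if every rollout fails."""
    for _ in range(max_rollouts):
        candidate = generate()
        if parses_as(candidate, fmt):
            return candidate
    return None
```

Under the simplifying assumption that rollouts fail independently, an ~80% per-rollout validity gives 1 - 0.2^3 ≈ 99% after three attempts, which is in line with the >95% figure quoted above; correlated failures on hard prompts would pull the real number down.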
| The stakes are high. When an LLM serves as a backend for tool-calling agents, a single malformed JSON response doesn't just produce a bad answer; it crashes the entire agentic pipeline. The function call fails, the agent can't recover, and the user sees an error. OpenAI, Anthropic, and Google have all invested heavily in structured output guarantees for exactly this reason. |
| When we measured our baseline model, roughly 1 in 5 structured outputs was malformed. For an API serving thousands of requests, that's hundreds of failures per hour. Our goal was to reduce this as much as possible through targeted synthetic data. |
On JSONSchemaBench and 35% on StructEval-Text, right?
mvansegbroeck left a comment
Great blog post. A few minor changes, but approving already.
| config.with_seed_dataset( |
| dd.DataFrameSeedSource(df=seed_df), |
| sampling_strategy=SamplingStrategy.SHUFFLE, |
nit: this can just be dd.SamplingStrategy.SHUFFLE? Then you wouldn't need to explicitly import SamplingStrategy up top.
Changed, thanks
| **Key Resources:** |
| - **Dataset (download):** [nvidia/Nemotron-RL-instruction_following-structured_outputs](https://huggingface.co/datasets/nvidia/Nemotron-RL-instruction_following-structured_outputs) (CC BY 4.0) |
This dataset doesn't yet have a datadesigner tag. Can it be added?
Good point, will reach out to add these
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Summary
Add a dev note documenting the structured outputs SDG pipeline used to generate training data for Nemotron Nano v3's structured output capabilities.
What's in the post
- `display_sample_record()` output showing a complete generated record
- `LLMStructuredColumnConfig` vs dynamic per-record schemas

Files changed
- `docs/devnotes/posts/structured-outputs.md` (new)
- `docs/devnotes/posts/images/structured-outputs-sample-record.png` (new)
- `docs/devnotes/.authors.yml` (added dnathawani)