docs: add prompt sensitivity devnote by dhruvnathawani · Pull Request #351 · NVIDIA-NeMo/DataDesigner

dhruvnathawani · 2026-02-23T23:28:10Z

Summary

Add a dev note documenting the prompt sensitivity SDG pipeline used to generate diverse prompt variations for Nemotron training data across both SFT and RL.

What's in the post

- Motivation: Why prompt sensitivity matters for model robustness (up to 15 percentage point accuracy swings from phrasing changes alone)
- Prompt anatomy diagram showing the three variable components: preamble, problem (fixed), format instruction
- Goal: reduce LLM sensitivity to prompt phrasing by generating diverse preambles and format instructions while keeping the core problem unchanged
- Pipeline walkthrough: Seed preambles x format templates (cross-product) -> diversity samplers -> LLM preamble generation -> format instruction paraphrasing -> user prompt composition with placement ordering -> 4 quality judges -> YAML-driven training mixture integration
- ASCII pipeline diagram showing the 5-stage flow (seed examples -> samplers -> LLM generation -> dual judges -> training mixtures)
- Regex-paired format templates: 25+ answer formats (boxed, brackets, XML tags, asterisks, arrows, etc.), each paired with an extraction regex enabling both SFT diversity and RL reward parsing from a single pipeline
- YAML-driven mixture config with majority_percentage control (25% canonical / 75% diverse)
- Collapsible full source script using the DD config API (pip install + run)
- Key takeaways on sampler-driven diversity, format compliance gating, and unified SFT/RL design
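
To make the regex-pairing idea concrete, here is a minimal sketch rather than the post's actual template list. Each entry pairs a format instruction with the regex that extracts the answer, so one record can serve both SFT prompt diversity and RL reward parsing. The names `FORMAT_TEMPLATES`, `extract_answer`, and `reward`, the instruction wording, and both patterns are illustrative assumptions.

```python
import re
from typing import Optional

# Hypothetical templates: each pairs an instruction with its extraction regex.
FORMAT_TEMPLATES = {
    "boxed": {
        "instruction": "Put the letter of your answer inside \\boxed{}.",
        "regex": r"\\boxed\{([A-Z])\}",
    },
    "double_brackets": {
        "instruction": "Write your final answer as [[LETTER]].",
        "regex": r"\[\[([A-Z])\]\]",
    },
}


def extract_answer(template_name: str, response: str) -> Optional[str]:
    """Return the last letter matching the template's regex, or None."""
    matches = re.findall(FORMAT_TEMPLATES[template_name]["regex"], response)
    return matches[-1] if matches else None


def reward(template_name: str, response: str, gold: str) -> float:
    """Binary RL-style reward: 1.0 only when the extracted letter equals gold."""
    return 1.0 if extract_answer(template_name, response) == gold else 0.0
```

Because the instruction and the regex live in the same record, a prompt built from a template can always be scored with the matching pattern.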

Files changed

docs/devnotes/posts/prompt-sensitivity.md (updated from draft to full PR)
docs/devnotes/posts/prompt_anatomy.png (new)

greptile-apps · 2026-02-26T02:17:10Z

Greptile Summary

This PR adds a new devnote documenting the prompt sensitivity SDG pipeline — a Data Designer workflow that generates diverse MCQ preambles and format instructions to reduce LLM brittleness to prompt phrasing, used in Nemotron training mixtures for both SFT and RL.

New post published in both MkDocs (docs/devnotes/posts/prompt-sensitivity.md) and Fern (fern/versions/latest/pages/devnotes/posts/prompt-sensitivity.mdx), with navigation wired up in mkdocs.yml and fern/versions/latest.yml.
The double_asterisks regex template (\\*\\*([A-Za-z])\\*\\*) has a first-match extraction problem: MCQ chain-of-thought responses routinely bold individual option letters in reasoning text, causing re.search() to return the first such match rather than the intended final-answer marker — the same failure mode as the angle_brackets pattern that was already fixed in this PR.

Confidence Score: 4/5

Safe to merge after addressing the double_asterisks regex extraction bug; all other changes are documentation and navigation wiring.

The double_asterisks regex (\*\*([A-Za-z])\*\*) will silently extract the wrong letter from RL rollouts whenever the model's chain-of-thought bolds option labels mid-reasoning — a common pattern in MCQ responses. The extracted reward signal would be wrong, and readers implementing the pipeline from this doc would inherit the bug. The rest of the PR is documentation and config wiring with no functional concerns.

The issue appears in both docs/devnotes/posts/prompt-sensitivity.md and fern/versions/latest/pages/devnotes/posts/prompt-sensitivity.mdx, at the double_asterisks template entry.
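
The first-match problem described above can be reproduced in a few lines. This is a sketch built around the pattern quoted in the review; the helper names and the sample response are assumptions made for illustration, not code from the devnote.

```python
import re
from typing import Optional

# Pattern quoted in the review: a single letter wrapped in double asterisks.
PATTERN = re.compile(r"\*\*([A-Za-z])\*\*")


def extract_first(text: str) -> Optional[str]:
    """First-match extraction, the behavior the review flags."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


def extract_last(text: str) -> Optional[str]:
    """Last-match extraction; returns None instead of raising on no match."""
    matches = PATTERN.findall(text)
    return matches[-1] if matches else None


def extract_from_final_line(text: str) -> Optional[str]:
    """Search only the last non-empty line of the response."""
    lines = [line for line in text.splitlines() if line.strip()]
    return extract_last(lines[-1]) if lines else None


response = (
    "**A** is incorrect because it ignores the constraint.\n"
    "**B** also fails for the same reason.\n"
    "Therefore the answer is **C**"
)
```

For this response, `extract_first` returns `A`, while `extract_last` and `extract_from_final_line` both return `C`. The final-line variant is stricter: it returns `None` when the last line carries no marker, which may be preferable for reward parsing over silently picking an earlier bolded letter.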

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/devnotes/posts/prompt-sensitivity.md | New MkDocs devnote (639 lines) documenting the prompt sensitivity SDG pipeline; the `double_asterisks` regex `\*\*([A-Za-z])\*\*` has a first-match problem that causes wrong answer extraction from chain-of-thought responses in RL contexts. |
| fern/versions/latest/pages/devnotes/posts/prompt-sensitivity.mdx | Fern/MDX mirror of the devnote; carries the same `double_asterisks` regex issue as the `.md` version. |
| fern/versions/latest/pages/devnotes/index.mdx | Adds a BlogCard entry for the new prompt-sensitivity post at the top of the devnotes index; no issues. |
| fern/versions/latest.yml | Navigation update adding the Prompt Sensitivity page before Retriever SDG Toolkit; no issues. |
| mkdocs.yml | Adds prompt-sensitivity.md entry at the top of the Dev Notes nav section (most-recent-first ordering); correct placement. |
| fern/assets/prompt-sensitivity/prompt-sensitivity-hero.png | New hero image asset for the devnote; binary file, no code issues. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["STEP 1: Seed Examples\n10 format templates × 5 preamble anchors = 50 rows\nDataFrameSeedSource + SamplingStrategy.SHUFFLE"] --> B
    B["STEP 2: Diversity Samplers\nsentence_type(3) × tone(5) × strictness(3)\n× verbosity(3) × domain(3) × order(8) = 3,240"] --> C
    C["STEP 3: LLM Generation Columns\npreamble | format_instruction | user_prompt"] --> D
    D["STEP 4: Quality Judges\nformat_compliance + regex_alignment + order_coherence (binary)\npreamble_quality 0–3 rubric\n~15–20% fail at least one gate"] --> E
    E["STEP 5: Training Mixture Integration\nYAML-driven: majority_preamble 25% canonical\n75% diverse variations → 1M packed records"]
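
Step 4 of the flowchart combines three binary judges with a 0–3 rubric. A minimal sketch of how such gating could be applied to judged rows follows. The judge names come from the flowchart; the `min_preamble_quality` threshold and the dictionary row layout are assumptions, since the source does not state the cutoff or the output schema.

```python
# Judge names taken from the flowchart; the row layout is an assumption.
BINARY_JUDGES = ("format_compliance", "regex_alignment", "order_coherence")


def passes_all_gates(scores: dict, min_preamble_quality: int = 2) -> bool:
    """Keep a row only if every binary judge passes and the rubric score
    meets the (hypothetical) minimum."""
    binary_ok = all(scores[name] == 1 for name in BINARY_JUDGES)
    return binary_ok and scores["preamble_quality"] >= min_preamble_quality


rows = [
    {"format_compliance": 1, "regex_alignment": 1, "order_coherence": 1, "preamble_quality": 3},
    {"format_compliance": 1, "regex_alignment": 0, "order_coherence": 1, "preamble_quality": 3},
    {"format_compliance": 1, "regex_alignment": 1, "order_coherence": 1, "preamble_quality": 1},
]
kept = [row for row in rows if passes_all_gates(row)]
```

Only the first row survives: the second fails a binary gate and the third falls below the assumed rubric threshold.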

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
docs/devnotes/posts/prompt-sensitivity.md:481
**`double_asterisks` regex has the same first-match problem as the previously fixed `angle_brackets` pattern**

`re.search(r"\*\*([A-Za-z])\*\*", response)` returns the *first* `**X**` in the response. In a typical MCQ chain-of-thought, the model discusses each option by letter — e.g. "**A** is incorrect because…, **B** also fails…, therefore **C**" — so `re.search` extracts `A` rather than the intended final answer `**C**`. The fix applied to `angle_brackets` (switching from `[A-Za-z]` to `[A-Z]`) reduces false positives from lowercase emphasis but doesn't address the positional issue for uppercase letters; using `re.findall()` and taking the last match, or anchoring the search to the final line, is the reliable fix for RL reward extraction with this format.

### Issue 2 of 2
fern/versions/latest/pages/devnotes/posts/prompt-sensitivity.mdx:168
**`double_asterisks` regex has the same first-match problem as the previously fixed `angle_brackets` pattern**

Same issue as in the `.md` version: `re.search(r"\*\*([A-Za-z])\*\*", response)` returns the first bolded single letter in the response. MCQ chain-of-thought reasoning commonly bolds option labels ("**A** is wrong, **B** is also wrong, **C** is correct"), so the first match will be a discarded option rather than the final answer. Using `re.findall()[-1]` or anchoring to the last line avoids this false extraction.

Reviews (9): Last reviewed commit: "docs(fern): recompress prompt sensitivit..."

github-actions · 2026-05-27T17:02:27Z

MkDocs preview: https://69eed809.dd-docs-preview.pages.dev

Fern preview: https://nvidia-preview-pr-351.docs.buildwithfern.com/nemo/datadesigner

Fern previews include the docs-website version archive with PR changes synced into latest. Notebook tutorials are rendered without execution outputs in previews.

github-actions · 2026-05-27T17:02:41Z

Review: PR #351 — `docs: add prompt sensitivity devnote`

Summary

Docs-only PR. Adds a new MkDocs devnote (docs/devnotes/posts/prompt-sensitivity.md, 594 lines) plus an inline image (prompt_anatomy.png), and registers the post in mkdocs.yml. The post explains the prompt-sensitivity SDG pipeline used for Nemotron training data: motivation, anatomy diagram, six diversity samplers, dual judges, regex-paired format templates, training-mixture YAML, and a collapsible full-source script.

API symbols cited in the code blocks (DataDesignerConfigBuilder, DataFrameSeedSource, LLMTextColumnConfig, LLMJudgeColumnConfig, Score, SamplerColumnConfig, CategorySamplerParams, ModelConfig, SamplingStrategy) all exist in packages/data-designer-config/. Heading style (# **Title**, ## **Section**) matches the other devnotes in docs/devnotes/posts/.

Findings

1. Inconsistency between narrative pipeline and the "full source" script (medium)

The body of the post and the appendix script describe different pipelines, but the post does not flag this. Readers who scroll through Step 2 and then expand "Try For Yourself" will see two non-matching configurations.

Step 2 narrative (around line 196 onward) defines six samplers including answer_format with 3 values: \boxed{}, \boxed{LETTER}, \boxed{<letter>}. The diagram on line 124 advertises the combinatorial space as 3 × 5 × 3 × 3 × 3 × 3 = 1,215.
The full source (line 481 onward) replaces answer_format with preamble_format_order (8 values for component ordering). Format diversity in the full source comes from the cross-product seed data (MCQ_FORMAT_TEMPLATES × SEED_PREAMBLES), not from a sampler. There is no answer_format sampler in the full source.
That means the actual combinatorial space of the full source is 3 × 5 × 3 × 3 × 3 × 8 = 3,240, sampled jointly with 50 seed rows — and the "1,215 combinations" claim in the diagram is for a different pipeline than the one shipped in the appendix.

Recommendation: either (a) align the narrative samplers to the full source, or (b) explicitly note that the narrative is a simplified didactic pipeline and the appendix is the production version, calling out the differences (cross-product seeds, the placement-order sampler).
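
The two combination counts in this finding can be checked with a short calculation. The sampler names follow the flowchart and the review text; mapping each count to a name in that order is an assumption, and the joint figure assumes every seed row can pair with every sampler combination.

```python
from math import prod

# Sampler cardinalities for the narrative pipeline (ends in answer_format).
narrative = {
    "sentence_type": 3,
    "tone": 5,
    "strictness": 3,
    "verbosity": 3,
    "domain": 3,
    "answer_format": 3,
}

# The full source swaps answer_format for preamble_format_order (8 values).
full_source = {k: v for k, v in narrative.items() if k != "answer_format"}
full_source["preamble_format_order"] = 8

narrative_space = prod(narrative.values())      # 1,215
full_source_space = prod(full_source.values())  # 3,240

# Upper bound on distinct (seed row, sampler combination) pairs, 50 seed rows.
joint_space = full_source_space * 50            # 162,000
```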

2. "25+ format templates" claim vs. 10 templates in the source (low–medium)

Multiple places in the post advertise "25+ distinct format templates":

Line 384: "We curated 25+ distinct format templates spanning ..."
PR description: "25+ answer formats (boxed, brackets, XML tags, asterisks, arrows, etc.)"

The full source MCQ_FORMAT_TEMPLATES list (lines 421–452) contains only 10 entries. If the production templates list really is 25+, consider adding the rest (or trimming to a representative subset and softening the claim to e.g. "10 representative templates from a 25+ catalogue").

3. Unrelated nav reordering in `mkdocs.yml` (low)

The diff also moves Deep Research Trajectories from before Design Principles to after RQA Dataset:

-      - Deep Research Trajectories: devnotes/posts/deep-research-trajectories.md
       - Design Principles: devnotes/posts/design-principles.md
       - RQA Dataset: devnotes/posts/rqa.md
+      - Deep Research Trajectories: devnotes/posts/deep-research-trajectories.md
+      - Prompt Sensitivity: devnotes/posts/prompt-sensitivity.md

The PR description doesn't mention this reordering, only the new file. If the intent is purely to keep newest posts at the bottom (Deep Research is dated 2026-02-10 and the new post is 2026-02-18, both newer than Design Principles), that's fine — but worth mentioning in the PR description to avoid looking like an accident. Otherwise, consider reverting the move.

4. Step 1 walkthrough understates what `seed_data` actually is (low)

Step 1 shows a simple pd.DataFrame of 5 preambles loaded with SamplingStrategy.SHUFFLE. The full source instead constructs a cross-product DataFrame of 50 rows (5 preambles × 10 format templates), which is the basis for the regex-pairing trick described later. The walkthrough never reconciles this — readers might miss that the seed dataset itself is doing meaningful structural work in the production pipeline. A one-line bridge ("In production we cross-product these with format templates — see the appendix") would close the gap.
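
The cross-product seed described here can be sketched with the standard library. The names `SEED_PREAMBLES` and `MCQ_FORMAT_TEMPLATES` come from the review; the placeholder contents and the column names are assumptions, and the post itself builds a pandas DataFrame for `DataFrameSeedSource` rather than a list of dictionaries.

```python
from itertools import product

# Placeholder contents; the real preambles and templates live in the post.
SEED_PREAMBLES = [f"placeholder preamble {i}" for i in range(5)]
MCQ_FORMAT_TEMPLATES = [
    {
        "name": f"template_{i}",
        "instruction": f"placeholder instruction {i}",
        "regex": f"placeholder regex {i}",
    }
    for i in range(10)
]

# Every preamble is paired with every template, so each seed row carries
# the regex that matches its own format instruction.
seed_rows = [
    {
        "seed_preamble": preamble,
        "format_name": template["name"],
        "format_instruction": template["instruction"],
        "format_regex": template["regex"],
    }
    for preamble, template in product(SEED_PREAMBLES, MCQ_FORMAT_TEMPLATES)
]
```

The result has 5 × 10 = 50 rows, which is why the seed dataset, not a sampler, carries the format diversity in the full source.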

5. Minor

Line 8 frontmatter date 2026-02-18 is consistent with the previous devnote (2026-02-10); fine.
Line 33 image is referenced as ![Prompt Anatomy](prompt_anatomy.png) — relative path next to the post; matches MkDocs Material conventions used elsewhere in docs/devnotes/posts/.
Line 600 closes with a sentence pointing to the GitHub repo as "documentation," which is technically inaccurate (the docs site is at nvidia-nemo.github.io/DataDesigner/). Worth fixing the link target.
The two judges shown in Step 3 (format_compliance, preamble_quality) are described as "dual" judges, but the appendix script actually defines four judges (format_compliance, regex_alignment, order_coherence, preamble_quality). The pipeline diagram on line 137 also says "DUAL QUALITY JUDGES" but lists only two. Either rename the section ("Quality judges") or update the count to 4 to match the source.
The post mentions add_prompt_variations.py (line 316) but this script isn't in this PR or referenced from anywhere in the repo. If it's an internal/external script, consider adding a footnote so readers don't go searching for it.

Verdict

Docs-only, low-risk merge. No blocking issues, but I'd recommend addressing the narrative-vs-appendix mismatches before merging — particularly (1) the "1,215 combinations" claim that doesn't match the full source, (2) the "25+ templates" claim that doesn't match the 10 in the script, and (5) the "dual judges" framing when the appendix has four. These are the kind of inconsistencies a careful reader will flag, and they're cheap to fix.

- Re-sync narrative with appendix: 6 samplers ending in preamble_format_order (3,240 combos, not 1,215); three LLM generation columns; four judges
- Update Step 1 to show the 10-template x 5-preamble cross-product seed
- Rename "Dual Quality Judges" -> "Quality Judges" and describe all four (format_compliance, regex_alignment, order_coherence, preamble_quality)
- Replace "25+ format templates" with accurate "10"
- Drop internal add_prompt_variations.py reference
- Revert unrelated Deep Research Trajectories nav reorder in mkdocs.yml

johnnygreco · 2026-05-28T14:27:34Z

Thanks @dhruvnathawani! Can you ask your agent to copy the post to the Fern documentation as well? We are getting ready to migrate and for the moment need to write the docs in both places.

johnnygreco · 2026-05-28T14:36:04Z

can we put this in the assets folder following the pattern of the other posts?

johnnygreco · 2026-05-28T14:42:59Z

+
+# **Mitigating Prompt Sensitivity: Manufacturing Robustness Through Diverse Preambles**
+
+Models behave differently based on how a question is phrased --- a "cynical senior dev" and a "curious student" get different answers to the same problem. Using NeMo Data Designer, we built a pipeline that generates hundreds of diverse prompt preambles with controlled variation across tone, strictness, verbosity, and answer format, then validates each one for compliance. These preambles feed into a YAML-driven training mixture pipeline that prepends diverse instructions to existing SFT data at scale. This work directly improved Nemotron model robustness on evaluation benchmarks where prompt format varies.


This work directly improved Nemotron model robustness on evaluation benchmarks where prompt format varies.

Do we have any concrete stats we can quote?

No published benchmark numbers to quote here. Softened the claim to "This approach is now used in Nemotron training mixtures to address the prompt-format brittleness observed in internal testing."

johnnygreco · 2026-05-28T14:46:13Z

+
+A prompt to an LLM typically has three distinct components: the **preamble** (high-level instructions), the **problem** (the actual question or task), and the **format instruction** (how to structure the answer). Prompt sensitivity is the phenomenon where a model's accuracy changes significantly based on how the preamble and format instruction are phrased, even when the underlying problem is identical.
+
+![Prompt Anatomy](prompt_anatomy.png)


This figure seems redundant with the diagram below?

johnnygreco · 2026-05-28T14:53:20Z

+
+---
+
+## **Goal**


I'm wondering if this "Goal" section would fit better some time after the "What is Prompt Sensitivity?" section. Then the flow would be problem -> solution/goal

Moved — Goal now sits between "What Is Prompt Sensitivity?" and "Pipeline Architecture." Flow is now problem → solution → how

johnnygreco · 2026-05-28T14:56:33Z

+
+---
+
+## **Step 1: Curated Seed Examples**


The diagram says "Stage" but the detailed sections call them "Step"

Renamed all ASCII headers from STAGE → STEP

johnnygreco · 2026-05-28T15:03:09Z

+
+## **Step 4: Integration into Training Mixtures**
+
+Generated preambles don't exist in isolation. They feed into a YAML-driven training-mixture script that operates at production scale:


Can we take a step back here and explain what this step is all about in more detail?

Added framing for why the mixture script exists
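
As a rough illustration of what the mixture step does, the sketch below prepends either a canonical preamble or a generated variation to each existing problem, with the canonical share set by a `majority_percentage` value. The real script is YAML-driven and is not shown in this thread, so the function names, signatures, and sampling logic here are all assumptions.

```python
import random
from typing import List


def pick_preamble(rng: random.Random, canonical: str, variations: List[str],
                  majority_percentage: float = 25.0) -> str:
    """Return the canonical preamble with the given probability (in percent),
    otherwise a uniformly chosen variation."""
    if rng.random() * 100 < majority_percentage:
        return canonical
    return rng.choice(variations)


def apply_preambles(problems: List[str], canonical: str, variations: List[str],
                    majority_percentage: float = 25.0, seed: int = 0) -> List[str]:
    """Prepend a sampled preamble to every problem, leaving the problem text
    itself unchanged."""
    rng = random.Random(seed)
    return [
        f"{pick_preamble(rng, canonical, variations, majority_percentage)}\n\n{problem}"
        for problem in problems
    ]
```

With `majority_percentage=25`, roughly a quarter of the records keep the canonical preamble and the rest receive a variation, matching the 25% canonical / 75% diverse split in the PR description.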

johnnygreco · 2026-05-28T15:05:39Z

+
+# --- Model + config ---
+config = dd.DataDesignerConfigBuilder(model_configs=[
+    dd.ModelConfig(alias="gen", model="qwen/qwen3-235b-a22b", provider="nvidia"),


Does this model work with NVBuild?

- Update publication date to 2026-05-28
- Move Prompt Sensitivity to top of mkdocs.yml devnotes nav
- Drop prompt_anatomy.png (redundant with ASCII anatomy diagram)
- Reorder: Goal section now follows What Is Prompt Sensitivity
- Rename ASCII Stage -> Step for consistency with section headers; split LLM generation into its own Step 3; renumber to 5 steps
- Soften 'directly improved' claim to 'used in Nemotron training mixtures' (no published benchmark numbers)
- Add inline note that qwen3-235b-a22b is hosted on build.nvidia.com
- Expand Step 5 (training mixtures): explain diversity-layer model, majority_percentage trade-off, regex MCQ-detection rationale
- Port devnote to Fern docs site (fern/.../prompt-sensitivity.mdx) and register at top of Fern Dev Notes nav

dhruvnathawani · 2026-05-28T18:02:35Z

Fern

Done — ported the post to fern/versions/latest/pages/devnotes/posts/prompt-sensitivity.mdx and registered it at the top of the Dev Notes section in fern/versions/latest.yml. Both MkDocs and Fern previews should pick it up, let me know if this looks okay

johnnygreco

Awesome, thanks @dhruvnathawani

johnnygreco

🛸

dhruvnathawani and others added 2 commits February 23, 2026 15:05

docs: add prompt sensitivity devnote

65a0f4e

Merge branch 'main' into dhruv/devnotes/prompt-sensitivity

bf40d6d

dhruvnathawani marked this pull request as ready for review February 26, 2026 02:14

dhruvnathawani requested a review from a team as a code owner February 26, 2026 02:14

dhruvnathawani requested review from johnnygreco, kirit93, mvansegbroeck and nabinchha February 26, 2026 02:15

dhruvnathawani added 2 commits February 26, 2026 11:54

docs: revamp prompt sensitivity devnote with regex-paired templates a…

5b5551f

…nd code dropdown

Merge branch 'dhruv/devnotes/prompt-sensitivity' of https://github.co…

bfca665

…m/NVIDIA-NeMo/DataDesigner into dhruv/devnotes/prompt-sensitivity

dhruvnathawani marked this pull request as draft March 12, 2026 02:26

github-actions Bot mentioned this pull request Apr 20, 2026

Agentic CI: Issue & PR Triage Tracker #562

Open

Merge branch 'main' into dhruv/devnotes/prompt-sensitivity

401b861

dhruvnathawani marked this pull request as ready for review May 27, 2026 17:00

dhruvnathawani temporarily deployed to agentic-ci May 27, 2026 17:00 — with GitHub Actions Inactive

dhruvnathawani and others added 2 commits May 27, 2026 14:21

Merge branch 'main' into dhruv/devnotes/prompt-sensitivity

46fe814

greptile-apps Bot reviewed May 27, 2026

Comment thread docs/devnotes/posts/prompt-sensitivity.md Outdated

docs: tighten angle_brackets regex to uppercase only

35fecd5

Avoids collisions with common HTML/CoT tags that would cause re.search to extract the wrong letter from model output.
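
The collision this commit addresses can be shown with a small example. The two patterns below are assumptions modeled on the `[A-Za-z]` to `[A-Z]` change mentioned in the review, not the exact `angle_brackets` regex from the devnote, and the sample response is invented for illustration.

```python
import re

# Assumed before/after patterns for an angle-bracket answer format.
LOOSE = re.compile(r"<([A-Za-z])>")   # also matches tags such as <p> or <b>
STRICT = re.compile(r"<([A-Z])>")     # uppercase answer letters only

response = "<p>Option <b>B</b> looks tempting, but the answer is <C></p>"

loose_hit = LOOSE.search(response).group(1)
strict_hit = STRICT.search(response).group(1)
```

Here `loose_hit` is `p`, picked up from the opening HTML tag, while `strict_hit` is `C`. Restricting to uppercase removes collisions with lowercase tags, though it does not help when an earlier uppercase letter appears in the same wrapper.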

johnnygreco reviewed May 28, 2026

Comment thread docs/devnotes/posts/prompt-sensitivity.md Outdated

johnnygreco reviewed May 28, 2026

Comment thread mkdocs.yml Outdated

johnnygreco reviewed May 28, 2026

docs(fern): add BlogCard for prompt sensitivity devnote

dad4239

johnnygreco previously approved these changes May 28, 2026

docs(fern): add hero image for prompt sensitivity devnote

2c82c4e

Image used in both the Dev Notes index BlogCard and at the top of the article body. Compressed to 256-color palette + 1200px width to stay under the 600 KB pre-commit large-file limit.

dhruvnathawani dismissed johnnygreco’s stale review via 2c82c4e May 28, 2026 18:53

docs(fern): recompress prompt sensitivity hero via pngquant

a2d0038

Replace the Pillow MEDIANCUT/dither version with a pngquant 128-color quantization at full 1536x1024 resolution. Better visual quality at 570 KB (still under the 600 KB pre-commit large-file limit).

johnnygreco approved these changes May 28, 2026

dhruvnathawani merged commit 0532f96 into main May 28, 2026
61 checks passed

