docs: add prompt sensitivity devnote #351
Greptile Summary
This PR adds a new devnote documenting the prompt sensitivity SDG pipeline, a Data Designer workflow that generates diverse MCQ preambles and format instructions to reduce LLM brittleness to prompt phrasing. The pipeline is used in Nemotron training mixtures for both SFT and RL.
| Filename | Overview |
|---|---|
| docs/devnotes/posts/prompt-sensitivity.md | New MkDocs devnote (639 lines) documenting the prompt sensitivity SDG pipeline; the `double_asterisks` regex `\*\*([A-Za-z])\*\*` has a first-match problem that causes wrong answer extraction from chain-of-thought responses in RL contexts. |
| fern/versions/latest/pages/devnotes/posts/prompt-sensitivity.mdx | Fern/MDX mirror of the devnote; carries the same `double_asterisks` regex issue as the .md version. |
| fern/versions/latest/pages/devnotes/index.mdx | Adds a BlogCard entry for the new prompt-sensitivity post at the top of the devnotes index; no issues. |
| fern/versions/latest.yml | Navigation update adding the Prompt Sensitivity page before Retriever SDG Toolkit; no issues. |
| mkdocs.yml | Adds prompt-sensitivity.md entry at the top of the Dev Notes nav section (most-recent-first ordering); correct placement. |
| fern/assets/prompt-sensitivity/prompt-sensitivity-hero.png | New hero image asset for the devnote; binary file, no code issues. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["STEP 1: Seed Examples\n10 format templates × 5 preamble anchors = 50 rows\nDataFrameSeedSource + SamplingStrategy.SHUFFLE"] --> B
B["STEP 2: Diversity Samplers\nsentence_type(3) × tone(5) × strictness(3)\n× verbosity(3) × domain(3) × order(8) = 3,240"] --> C
C["STEP 3: LLM Generation Columns\npreamble | format_instruction | user_prompt"] --> D
D["STEP 4: Quality Judges\nformat_compliance + regex_alignment + order_coherence (binary)\npreamble_quality 0–3 rubric\n~15–20% fail at least one gate"] --> E
E["STEP 5: Training Mixture Integration\nYAML-driven: majority_preamble 25% canonical\n75% diverse variations → 1M packed records"]
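The counts quoted in Steps 1 and 2 of the flowchart are plain cross-products, so they can be checked with a few lines of Python. This is only a sanity-check sketch: the axis names follow the diagram, but the dictionary and the placeholder ranges are assumptions, and only the cardinalities come from the flowchart itself.

```python
from itertools import product
from math import prod

# Cardinalities copied from the flowchart; the axis values are placeholders.
sampler_sizes = {
    "sentence_type": 3,
    "tone": 5,
    "strictness": 3,
    "verbosity": 3,
    "domain": 3,
    "order": 8,
}

# Step 1: 10 format templates crossed with 5 preamble anchors.
seed_rows = list(product(range(10), range(5)))

# Step 2: one combination per choice along every sampler axis.
sampler_combos = prod(sampler_sizes.values())

print(len(seed_rows))   # 50
print(sampler_combos)   # 3240
```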
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
docs/devnotes/posts/prompt-sensitivity.md:481
**`double_asterisks` regex has the same first-match problem as the previously fixed `angle_brackets` pattern**
`re.search(r"\*\*([A-Za-z])\*\*", response)` returns the *first* `**X**` in the response. In a typical MCQ chain-of-thought, the model discusses each option by letter — e.g. "**A** is incorrect because…, **B** also fails…, therefore **C**" — so `re.search` extracts `A` rather than the intended final answer `**C**`. The fix applied to `angle_brackets` (switching from `[A-Za-z]` to `[A-Z]`) reduces false positives from lowercase emphasis but doesn't address the positional issue for uppercase letters; using `re.findall()` and taking the last match, or anchoring the search to the final line, is the reliable fix for RL reward extraction with this format.
### Issue 2 of 2
fern/versions/latest/pages/devnotes/posts/prompt-sensitivity.mdx:168
**`double_asterisks` regex has the same first-match problem as the previously fixed `angle_brackets` pattern**
Same issue as in the `.md` version: `re.search(r"\*\*([A-Za-z])\*\*", response)` returns the first bolded single letter in the response. MCQ chain-of-thought reasoning commonly bolds option labels ("**A** is wrong, **B** is also wrong, **C** is correct"), so the first match will be a discarded option rather than the final answer. Using `re.findall()[-1]` or anchoring to the last line avoids this false extraction.
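The first-match behavior described in both issues can be reproduced in isolation. The sketch below uses the regex quoted in the review; the helper names `extract_first` and `extract_last` and the sample response are hypothetical, and taking the last match assumes the final answer is the last bolded letter in the response, which may not hold for every output format.

```python
import re
from typing import Optional

# Pattern quoted in the review for the double_asterisks format.
BOLD_LETTER = re.compile(r"\*\*([A-Za-z])\*\*")

def extract_first(response: str) -> Optional[str]:
    """Behavior flagged in the review: returns the first bolded letter."""
    match = BOLD_LETTER.search(response)
    return match.group(1) if match else None

def extract_last(response: str) -> Optional[str]:
    """Suggested alternative: collect every match and keep the last one."""
    matches = BOLD_LETTER.findall(response)
    return matches[-1] if matches else None

cot = "**A** is incorrect because it ignores friction, **B** also fails, therefore **C**"
print(extract_first(cot))  # A
print(extract_last(cot))   # C
```

Anchoring the search to the final line of the response, the other option the review mentions, would be stricter still, at the cost of missing answers that are followed by trailing text.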
Reviews (9): Last reviewed commit: "docs(fern): recompress prompt sensitivit..."
…m/NVIDIA-NeMo/DataDesigner into dhruv/devnotes/prompt-sensitivity
|
MkDocs preview: https://69eed809.dd-docs-preview.pages.dev Fern preview: https://nvidia-preview-pr-351.docs.buildwithfern.com/nemo/datadesigner
|
Review: PR #351 —
|
- Re-sync narrative with appendix: 6 samplers ending in preamble_format_order (3,240 combos, not 1,215); three LLM generation columns; four judges
- Update Step 1 to show the 10-template x 5-preamble cross-product seed
- Rename "Dual Quality Judges" -> "Quality Judges" and describe all four (format_compliance, regex_alignment, order_coherence, preamble_quality)
- Replace "25+ format templates" with accurate "10"
- Drop internal add_prompt_variations.py reference
- Revert unrelated Deep Research Trajectories nav reorder in mkdocs.yml
Avoids collisions with common HTML/CoT tags (<b>, <i>, <p>, etc.) that would cause re.search to extract the wrong letter from model output.
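A small illustration of the collision this commit describes. The exact `angle_brackets` pattern is not shown in this conversation, so both regexes below are assumptions based on the `[A-Za-z]` to `[A-Z]` change mentioned in the Greptile review, and the sample response is made up.

```python
import re

# Assumed shapes of the pattern before and after the fix.
LOOSE = re.compile(r"<([A-Za-z])>")   # also matches <b>, <i>, <p>
STRICT = re.compile(r"<([A-Z])>")     # uppercase option letters only

response = "<p>Option <b>B</b> looks right.</p> Final answer: <C>"

loose_hit = LOOSE.search(response)
strict_hit = STRICT.search(response)

print(loose_hit.group(1))   # p
print(strict_hit.group(1))  # C
```

Restricting to uppercase skips lowercase HTML tags, but it still returns the first uppercase match, so a response that wraps several option letters in angle brackets can hit the same positional problem as the `double_asterisks` case.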
|
Thanks @dhruvnathawani! Can you ask your agent to copy the post to the Fern documentation as well? We are getting ready to migrate and for the moment need to write the docs in both places. |
can we put this in the assets folder following the pattern of the other posts?
|
|
||
| # **Mitigating Prompt Sensitivity: Manufacturing Robustness Through Diverse Preambles** | ||
|
|
||
| Models behave differently based on how a question is phrased --- a "cynical senior dev" and a "curious student" get different answers to the same problem. Using NeMo Data Designer, we built a pipeline that generates hundreds of diverse prompt preambles with controlled variation across tone, strictness, verbosity, and answer format, then validates each one for compliance. These preambles feed into a YAML-driven training mixture pipeline that prepends diverse instructions to existing SFT data at scale. This work directly improved Nemotron model robustness on evaluation benchmarks where prompt format varies. |
This work directly improved Nemotron model robustness on evaluation benchmarks where prompt format varies.
Do we have any concrete stats we can quote?
No published benchmark numbers to quote here. Softened the claim to "This approach is now used in Nemotron training mixtures to address the prompt-format brittleness observed in internal testing."
|
|
||
| A prompt to an LLM typically has three distinct components: the **preamble** (high-level instructions), the **problem** (the actual question or task), and the **format instruction** (how to structure the answer). Prompt sensitivity is the phenomenon where a model's accuracy changes significantly based on how the preamble and format instruction are phrased, even when the underlying problem is identical. | ||
|
|
||
|  |
This figure seems redundant with the diagram below?
|
|
||
| --- | ||
|
|
||
| ## **Goal** |
I'm wondering if this "Goal" section would fit better some time after the "What is Prompt Sensitivity?" section. Then the flow would be problem -> solution/goal
Moved — Goal now sits between "What Is Prompt Sensitivity?" and "Pipeline Architecture." Flow is now problem → solution → how
|
|
||
| --- | ||
|
|
||
| ## **Step 1: Curated Seed Examples** |
The diagram says "Stage" but the detailed sections call them "Step"
Renamed all ASCII headers from STAGE → STEP
|
|
||
| ## **Step 4: Integration into Training Mixtures** | ||
|
|
||
| Generated preambles don't exist in isolation. They feed into a YAML-driven training-mixture script that operates at production scale: |
Can we take a step back here and explain what this step is all about in more detail?
Added framing for why the mixture script exists
|
|
||
| # --- Model + config --- | ||
| config = dd.DataDesignerConfigBuilder(model_configs=[ | ||
| dd.ModelConfig(alias="gen", model="qwen/qwen3-235b-a22b", provider="nvidia"), |
Does this model work with NVBuild?
- Update publication date to 2026-05-28
- Move Prompt Sensitivity to top of mkdocs.yml devnotes nav
- Drop prompt_anatomy.png (redundant with ASCII anatomy diagram)
- Reorder: Goal section now follows What Is Prompt Sensitivity
- Rename ASCII Stage -> Step for consistency with section headers; split LLM generation into its own Step 3; renumber to 5 steps
- Soften 'directly improved' claim to 'used in Nemotron training mixtures' (no published benchmark numbers)
- Add inline note that qwen3-235b-a22b is hosted on build.nvidia.com
- Expand Step 5 (training mixtures): explain diversity-layer model, majority_percentage trade-off, regex MCQ-detection rationale
- Port devnote to Fern docs site (fern/.../prompt-sensitivity.mdx) and register at top of Fern Dev Notes nav
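To make the majority_percentage trade-off concrete, here is a minimal sketch of how a canonical preamble might be mixed with diverse variations at the 25% / 75% split shown in the flowchart. The function `pick_preamble`, its signature, and the example strings are all hypothetical; the actual mixture script is not shown in this PR conversation and may work differently.

```python
import random

def pick_preamble(canonical, variations, majority_percentage, rng):
    """Return the canonical preamble with probability majority_percentage / 100,
    otherwise a uniformly chosen variation. Hypothetical helper."""
    if rng.random() * 100 < majority_percentage:
        return canonical
    return rng.choice(variations)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
canonical = "Answer the following multiple-choice question."
variations = [
    "Pick the best option.",
    "Choose one answer.",
    "Select the correct letter.",
]

picks = [pick_preamble(canonical, variations, 25, rng) for _ in range(10_000)]
canonical_share = picks.count(canonical) / len(picks)
print(round(canonical_share, 2))
```

A higher percentage keeps more records on the canonical phrasing, while a lower one exposes the model to more variation.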
Done — ported the post to |
johnnygreco
left a comment
Awesome, thanks @dhruvnathawani
Image used in both the Dev Notes index BlogCard and at the top of the article body. Compressed to 256-color palette + 1200px width to stay under the 600 KB pre-commit large-file limit.
Replace the Pillow MEDIANCUT/dither version with a pngquant 128-color quantization at full 1536x1024 resolution. Better visual quality at 570 KB (still under the 600 KB pre-commit large-file limit).
|
Summary
Add a dev note documenting the prompt sensitivity SDG pipeline used to generate diverse prompt variations for Nemotron training data across both SFT and RL.
What's in the post
Files changed