
docs: add prompt sensitivity devnote #351

Merged
dhruvnathawani merged 12 commits into main from dhruv/devnotes/prompt-sensitivity on May 28, 2026


@dhruvnathawani dhruvnathawani commented Feb 23, 2026

Contributor

Summary

Add a dev note documenting the prompt sensitivity SDG pipeline used to generate diverse prompt variations for Nemotron training data across both SFT and RL.

What's in the post

  1. Motivation: Why prompt sensitivity matters for model robustness (up to 15 percentage point accuracy swings from phrasing changes alone)
  2. Prompt anatomy diagram showing the three prompt components: preamble (variable), problem (fixed), format instruction (variable)
  3. Goal: reduce LLM sensitivity to prompt phrasing by generating diverse preambles and format instructions while keeping the core problem unchanged
  4. Pipeline walkthrough: Seed preambles x format templates (cross-product) -> diversity samplers -> LLM preamble generation -> format instruction paraphrasing -> user prompt composition with placement ordering -> 4 quality judges -> YAML-driven training mixture integration
  5. ASCII pipeline diagram showing the 5-stage flow (seed examples -> samplers -> LLM generation -> dual judges -> training mixtures)
  6. Regex-paired format templates: 25+ answer formats (boxed, brackets, XML tags, asterisks, arrows, etc.), each paired with an extraction regex enabling both SFT diversity and RL reward parsing from a single pipeline
  7. YAML-driven mixture config with majority_percentage control (25% canonical / 75% diverse)
  8. Collapsible full source script using the DD config API (pip install + run)
  9. Key takeaways on sampler-driven diversity, format compliance gating, and unified SFT/RL design
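
As a rough illustration of item 7, the 25% canonical / 75% diverse split could be applied per record as sketched below. This is a minimal sketch, not the mixture script from the post: `pick_preamble`, `add_preamble`, `CANONICAL_PREAMBLE`, and `DIVERSE_PREAMBLES` are hypothetical names, and only the `majority_percentage` idea comes from the description above.

```python
import random

# Hypothetical stand-ins; real preambles would come from the generated dataset.
CANONICAL_PREAMBLE = "Answer the following multiple-choice question."
DIVERSE_PREAMBLES = [
    "Pick the single best option for the question below.",
    "Read the problem carefully, then choose one answer.",
    "You are taking a quiz. Select the correct choice.",
]


def pick_preamble(rng: random.Random, majority_percentage: float = 25.0) -> str:
    """Return the canonical preamble with probability majority_percentage / 100,
    otherwise a uniformly chosen diverse variation."""
    if rng.random() * 100.0 < majority_percentage:
        return CANONICAL_PREAMBLE
    return rng.choice(DIVERSE_PREAMBLES)


def add_preamble(problem: str, rng: random.Random, majority_percentage: float = 25.0) -> str:
    """Prepend a sampled preamble; the problem text itself is left unchanged."""
    return f"{pick_preamble(rng, majority_percentage)}\n\n{problem}"


PROBLEM = "What is 2 + 2? (A) 3 (B) 4"
rng = random.Random(0)
prompts = [add_preamble(PROBLEM, rng) for _ in range(10_000)]
canonical_share = sum(p.startswith(CANONICAL_PREAMBLE) for p in prompts) / len(prompts)
```

Raising `majority_percentage` keeps more records on the canonical phrasing, and lowering it exposes the model to more variation. The right value is a trade-off this sketch does not attempt to settle.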

Files changed

  1. docs/devnotes/posts/prompt-sensitivity.md (updated from draft to full PR)
  2. docs/devnotes/posts/prompt_anatomy.png (new)

@dhruvnathawani dhruvnathawani marked this pull request as ready for review February 26, 2026 02:14
@dhruvnathawani dhruvnathawani requested a review from a team as a code owner February 26, 2026 02:14
@greptile-apps

greptile-apps Bot commented Feb 26, 2026

Contributor

Greptile Summary

This PR adds a new devnote documenting the prompt sensitivity SDG pipeline — a Data Designer workflow that generates diverse MCQ preambles and format instructions to reduce LLM brittleness to prompt phrasing, used in Nemotron training mixtures for both SFT and RL.

  • New post published in both MkDocs (docs/devnotes/posts/prompt-sensitivity.md) and Fern (fern/versions/latest/pages/devnotes/posts/prompt-sensitivity.mdx), with navigation wired up in mkdocs.yml and fern/versions/latest.yml.
  • The double_asterisks regex template (`\*\*([A-Za-z])\*\*`) has a first-match extraction problem: MCQ chain-of-thought responses routinely bold individual option letters in reasoning text, causing re.search() to return the first such match rather than the intended final-answer marker. This is the same failure mode as the angle_brackets pattern that was already fixed in this PR.

Confidence Score: 4/5

Safe to merge after addressing the double_asterisks regex extraction bug; all other changes are documentation and navigation wiring.

The double_asterisks regex (\*\*([A-Za-z])\*\*) will silently extract the wrong letter from RL rollouts whenever the model's chain-of-thought bolds option labels mid-reasoning — a common pattern in MCQ responses. The extracted reward signal would be wrong, and readers implementing the pipeline from this doc would inherit the bug. The rest of the PR is documentation and config wiring with no functional concerns.

Both docs/devnotes/posts/prompt-sensitivity.md and fern/versions/latest/pages/devnotes/posts/prompt-sensitivity.mdx at the double_asterisks template entry.

Important Files Changed

| Filename | Overview |
|---|---|
| docs/devnotes/posts/prompt-sensitivity.md | New MkDocs devnote (639 lines) documenting the prompt sensitivity SDG pipeline; the double_asterisks regex `\*\*([A-Za-z])\*\*` has a first-match problem that causes wrong answer extraction from chain-of-thought responses in RL contexts. |
| fern/versions/latest/pages/devnotes/posts/prompt-sensitivity.mdx | Fern/MDX mirror of the devnote; carries the same double_asterisks regex issue as the .md version. |
| fern/versions/latest/pages/devnotes/index.mdx | Adds a BlogCard entry for the new prompt-sensitivity post at the top of the devnotes index; no issues. |
| fern/versions/latest.yml | Navigation update adding the Prompt Sensitivity page before Retriever SDG Toolkit; no issues. |
| mkdocs.yml | Adds prompt-sensitivity.md entry at the top of the Dev Notes nav section (most-recent-first ordering); correct placement. |
| fern/assets/prompt-sensitivity/prompt-sensitivity-hero.png | New hero image asset for the devnote; binary file, no code issues. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["STEP 1: Seed Examples\n10 format templates × 5 preamble anchors = 50 rows\nDataFrameSeedSource + SamplingStrategy.SHUFFLE"] --> B
    B["STEP 2: Diversity Samplers\nsentence_type(3) × tone(5) × strictness(3)\n× verbosity(3) × domain(3) × order(8) = 3,240"] --> C
    C["STEP 3: LLM Generation Columns\npreamble | format_instruction | user_prompt"] --> D
    D["STEP 4: Quality Judges\nformat_compliance + regex_alignment + order_coherence (binary)\npreamble_quality 0–3 rubric\n~15–20% fail at least one gate"] --> E
    E["STEP 5: Training Mixture Integration\nYAML-driven: majority_preamble 25% canonical\n75% diverse variations → 1M packed records"]
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
docs/devnotes/posts/prompt-sensitivity.md:481
**`double_asterisks` regex has the same first-match problem as the previously fixed `angle_brackets` pattern**

`re.search(r"\*\*([A-Za-z])\*\*", response)` returns the *first* `**X**` in the response. In a typical MCQ chain-of-thought, the model discusses each option by letter — e.g. "**A** is incorrect because…, **B** also fails…, therefore **C**" — so `re.search` extracts `A` rather than the intended final answer `**C**`. The fix applied to `angle_brackets` (switching from `[A-Za-z]` to `[A-Z]`) reduces false positives from lowercase emphasis but doesn't address the positional issue for uppercase letters; using `re.findall()` and taking the last match, or anchoring the search to the final line, is the reliable fix for RL reward extraction with this format.

### Issue 2 of 2
fern/versions/latest/pages/devnotes/posts/prompt-sensitivity.mdx:168
**`double_asterisks` regex has the same first-match problem as the previously fixed `angle_brackets` pattern**

Same issue as in the `.md` version: `re.search(r"\*\*([A-Za-z])\*\*", response)` returns the first bolded single letter in the response. MCQ chain-of-thought reasoning commonly bolds option labels ("**A** is wrong, **B** is also wrong, **C** is correct"), so the first match will be a discarded option rather than the final answer. Taking the last element of `re.findall(pattern, response)` (after checking that the list is non-empty) or anchoring to the last line avoids this false extraction.
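
A minimal sketch of the last-match approach described in both issues, using only Python's standard `re` module. The helper names `extract_first_letter` and `extract_last_letter` and the sample response are illustrative assumptions, not code from the devnote.

```python
import re
from typing import Optional

# Pattern quoted in the review for the double_asterisks template.
DOUBLE_ASTERISKS = re.compile(r"\*\*([A-Za-z])\*\*")


def extract_first_letter(response: str) -> Optional[str]:
    """First-match behaviour of re.search (the variant flagged in the review)."""
    match = DOUBLE_ASTERISKS.search(response)
    return match.group(1) if match else None


def extract_last_letter(response: str) -> Optional[str]:
    """Return the last bolded single letter, or None when nothing matches."""
    matches = DOUBLE_ASTERISKS.findall(response)
    return matches[-1] if matches else None


response = "**A** is incorrect because it ignores units, **B** also fails, therefore **C**"
first = extract_first_letter(response)  # picks up the first discussed option
last = extract_last_letter(response)    # picks up the final bolded letter
```

Taking the last match is still a heuristic: a response that bolds another letter after stating its final answer would be misread, which is why anchoring to the final line is the other option the review mentions.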

Reviews (9): Last reviewed commit: "docs(fern): recompress prompt sensitivit..."

@dhruvnathawani dhruvnathawani marked this pull request as draft March 12, 2026 02:26
@dhruvnathawani dhruvnathawani marked this pull request as ready for review May 27, 2026 17:00
@github-actions

github-actions Bot commented May 27, 2026

Contributor

MkDocs preview: https://69eed809.dd-docs-preview.pages.dev

Fern preview: https://nvidia-preview-pr-351.docs.buildwithfern.com/nemo/datadesigner

Fern previews include the docs-website version archive with PR changes synced into latest. Notebook tutorials are rendered without execution outputs in previews.

@github-actions

Contributor

Review: PR #351 docs: add prompt sensitivity devnote

Summary

Docs-only PR. Adds a new MkDocs devnote (docs/devnotes/posts/prompt-sensitivity.md, 594 lines) plus an inline image (prompt_anatomy.png), and registers the post in mkdocs.yml. The post explains the prompt-sensitivity SDG pipeline used for Nemotron training data: motivation, anatomy diagram, six diversity samplers, dual judges, regex-paired format templates, training-mixture YAML, and a collapsible full-source script.

API symbols cited in the code blocks (DataDesignerConfigBuilder, DataFrameSeedSource, LLMTextColumnConfig, LLMJudgeColumnConfig, Score, SamplerColumnConfig, CategorySamplerParams, ModelConfig, SamplingStrategy) all exist in packages/data-designer-config/. Heading style (# **Title**, ## **Section**) matches the other devnotes in docs/devnotes/posts/.

Findings

1. Inconsistency between narrative pipeline and the "full source" script (medium)

The body of the post and the appendix script describe different pipelines, but the post does not flag this. Readers who scroll through Step 2 and then expand "Try For Yourself" will see two non-matching configurations.

  • Step 2 narrative (around line 196 onward) defines six samplers including answer_format with 3 values: \boxed{}, \boxed{LETTER}, \boxed{<letter>}. The diagram on line 124 advertises the combinatorial space as 3 × 5 × 3 × 3 × 3 × 3 = 1,215.
  • The full source (line 481 onward) replaces answer_format with preamble_format_order (8 values for component ordering). Format diversity in the full source comes from the cross-product seed data (MCQ_FORMAT_TEMPLATES × SEED_PREAMBLES), not from a sampler. There is no answer_format sampler in the full source.
  • That means the actual combinatorial space of the full source is 3 × 5 × 3 × 3 × 3 × 8 = 3,240, sampled jointly with 50 seed rows — and the "1,215 combinations" claim in the diagram is for a different pipeline than the one shipped in the appendix.
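
The two counts quoted in these bullets can be checked directly. The sampler names below follow the flowchart earlier in this thread where it gives them. Assuming the narrative pipeline shares the first five samplers is an illustration only, since the review gives just the cardinalities.

```python
from math import prod

# Cardinalities as quoted in the review; names taken from the flowchart.
shared = {"sentence_type": 3, "tone": 5, "strictness": 3, "verbosity": 3, "domain": 3}

# Step 2 narrative: sixth sampler is answer_format with 3 values.
narrative = {**shared, "answer_format": 3}

# Appendix script: sixth sampler is preamble_format_order with 8 values.
full_source = {**shared, "preamble_format_order": 8}

narrative_space = prod(narrative.values())
full_source_space = prod(full_source.values())

# Seed rows in the appendix: 5 seed preambles x 10 format templates.
seed_rows = 5 * 10

print(narrative_space, full_source_space, seed_rows)
```

The only factor that differs is the last one (3 versus 8), which accounts for the gap between 1,215 and 3,240.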

Recommendation: either (a) align the narrative samplers to the full source, or (b) explicitly note that the narrative is a simplified didactic pipeline and the appendix is the production version, calling out the differences (cross-product seeds, the placement-order sampler).

2. "25+ format templates" claim vs. 10 templates in the source (low–medium)

Multiple places in the post advertise "25+ distinct format templates":

  • Line 384: "We curated 25+ distinct format templates spanning ..."
  • PR description: "25+ answer formats (boxed, brackets, XML tags, asterisks, arrows, etc.)"

The full source MCQ_FORMAT_TEMPLATES list (lines 421–452) contains only 10 entries. If the production templates list really is 25+, consider adding the rest (or trimming to a representative subset and softening the claim to e.g. "10 representative templates from a 25+ catalogue").

3. Unrelated nav reordering in mkdocs.yml (low)

The diff also moves Deep Research Trajectories from before Design Principles to after RQA Dataset:

-      - Deep Research Trajectories: devnotes/posts/deep-research-trajectories.md
       - Design Principles: devnotes/posts/design-principles.md
       - RQA Dataset: devnotes/posts/rqa.md
+      - Deep Research Trajectories: devnotes/posts/deep-research-trajectories.md
+      - Prompt Sensitivity: devnotes/posts/prompt-sensitivity.md

The PR description doesn't mention this reordering, only the new file. If the intent is purely to keep newest posts at the bottom (Deep Research is dated 2026-02-10 and the new post is 2026-02-18, both newer than Design Principles), that's fine — but worth mentioning in the PR description to avoid looking like an accident. Otherwise, consider reverting the move.

4. Step 1 walkthrough understates what seed_data actually is (low)

Step 1 shows a simple pd.DataFrame of 5 preambles loaded with SamplingStrategy.SHUFFLE. The full source instead constructs a cross-product DataFrame of 50 rows (5 preambles × 10 format templates), which is the basis for the regex-pairing trick described later. The walkthrough never reconciles this — readers might miss that the seed dataset itself is doing meaningful structural work in the production pipeline. A one-line bridge ("In production we cross-product these with format templates — see the appendix") would close the gap.
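
A minimal sketch of the cross-product seed construction described in this finding, using only the standard library. The preamble and template values are placeholders, and the `name`, `instruction`, and `regex` field names are assumed for illustration. The devnote's appendix builds a pandas DataFrame rather than a list of dictionaries.

```python
from itertools import product

# Placeholder seed preambles (the devnote uses 5 curated ones).
seed_preambles = [f"preamble {i}" for i in range(5)]

# Placeholder format templates, each paired with an extraction regex
# (the devnote's full source lists 10 templates).
format_templates = [
    {"name": f"format_{i}", "instruction": f"instruction {i}", "regex": rf"fmt{i}:([A-Z])"}
    for i in range(10)
]

# One seed row per (preamble, template) pair, so every row carries both the
# format instruction and the regex that can parse answers in that format.
seed_rows = [
    {"seed_preamble": preamble, **template}
    for preamble, template in product(seed_preambles, format_templates)
]
```

Keeping the regex on the same row as the format instruction is what lets one dataset serve both purposes the PR describes: the instruction text for SFT diversity and the regex for RL reward parsing.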

5. Minor

  • Line 8 frontmatter date 2026-02-18 is consistent with the previous devnote (2026-02-10); fine.
  • Line 33 image is referenced as ![Prompt Anatomy](prompt_anatomy.png) — relative path next to the post; matches MkDocs Material conventions used elsewhere in docs/devnotes/posts/.
  • Line 600 closes with a sentence pointing to the GitHub repo as "documentation," which is technically inaccurate (the docs site is at nvidia-nemo.github.io/DataDesigner/). Worth fixing the link target.
  • The two judges shown in Step 3 (format_compliance, preamble_quality) are described as "dual" judges, but the appendix script actually defines four judges (format_compliance, regex_alignment, order_coherence, preamble_quality). The pipeline diagram on line 137 also says "DUAL QUALITY JUDGES" but lists only two. Either rename the section ("Quality judges") or update the count to 4 to match the source.
  • The post mentions add_prompt_variations.py (line 316) but this script isn't in this PR or referenced from anywhere in the repo. If it's an internal/external script, consider adding a footnote so readers don't go searching for it.

Verdict

Docs-only, low-risk merge. No blocking issues, but I'd recommend addressing the narrative-vs-appendix mismatches before merging — particularly (1) the "1,215 combinations" claim that doesn't match the full source, (2) the "25+ templates" claim that doesn't match the 10 in the script, and (5) the "dual judges" framing when the appendix has four. These are the kind of inconsistencies a careful reader will flag, and they're cheap to fix.

dhruvnathawani and others added 2 commits May 27, 2026 14:21
- Re-sync narrative with appendix: 6 samplers ending in preamble_format_order
  (3,240 combos, not 1,215); three LLM generation columns; four judges
- Update Step 1 to show the 10-template x 5-preamble cross-product seed
- Rename "Dual Quality Judges" -> "Quality Judges" and describe all four
  (format_compliance, regex_alignment, order_coherence, preamble_quality)
- Replace "25+ format templates" with accurate "10"
- Drop internal add_prompt_variations.py reference
- Revert unrelated Deep Research Trajectories nav reorder in mkdocs.yml
Comment thread docs/devnotes/posts/prompt-sensitivity.md Outdated
Avoids collisions with common HTML/CoT tags (<b>, <i>, <p>, etc.) that
would cause re.search to extract the wrong letter from model output.
@johnnygreco

Contributor

Thanks @dhruvnathawani! Can you ask your agent to copy the post to the Fern documentation as well? We are getting ready to migrate and for the moment need to write the docs in both places.

Comment thread docs/devnotes/posts/prompt-sensitivity.md Outdated
Comment thread mkdocs.yml Outdated
Comment thread docs/devnotes/posts/prompt_anatomy.png Outdated

Contributor

can we put this in the assets folder following the pattern of the other posts?

Contributor Author

Done


# **Mitigating Prompt Sensitivity: Manufacturing Robustness Through Diverse Preambles**

Models behave differently based on how a question is phrased --- a "cynical senior dev" and a "curious student" get different answers to the same problem. Using NeMo Data Designer, we built a pipeline that generates hundreds of diverse prompt preambles with controlled variation across tone, strictness, verbosity, and answer format, then validates each one for compliance. These preambles feed into a YAML-driven training mixture pipeline that prepends diverse instructions to existing SFT data at scale. This work directly improved Nemotron model robustness on evaluation benchmarks where prompt format varies.

Contributor

This work directly improved Nemotron model robustness on evaluation benchmarks where prompt format varies.

Do we have any concrete stats we can quote?

Contributor Author

No published benchmark numbers to quote here. Softened the claim to "This approach is now used in Nemotron training mixtures to address the prompt-format brittleness observed in internal testing."


A prompt to an LLM typically has three distinct components: the **preamble** (high-level instructions), the **problem** (the actual question or task), and the **format instruction** (how to structure the answer). Prompt sensitivity is the phenomenon where a model's accuracy changes significantly based on how the preamble and format instruction are phrased, even when the underlying problem is identical.

![Prompt Anatomy](prompt_anatomy.png)

Contributor

This figure seems redundant with the diagram below?

Contributor Author

Done


---

## **Goal**

Contributor

I'm wondering if this "Goal" section would fit better some time after the "What is Prompt Sensitivity?" section. Then the flow would be problem -> solution/goal

Contributor Author

Moved — Goal now sits between "What Is Prompt Sensitivity?" and "Pipeline Architecture." Flow is now problem → solution → how


---

## **Step 1: Curated Seed Examples**

Contributor

The diagram says "Stage" but the detailed sections call them "Step"

Contributor Author

Renamed all ASCII headers from STAGE → STEP


## **Step 4: Integration into Training Mixtures**

Generated preambles don't exist in isolation. They feed into a YAML-driven training-mixture script that operates at production scale:

Contributor

Can we take a step back here and explain what this step is all about in more detail?

Contributor Author

Added framing for why the mixture script exists


# --- Model + config ---
config = dd.DataDesignerConfigBuilder(model_configs=[
    dd.ModelConfig(alias="gen", model="qwen/qwen3-235b-a22b", provider="nvidia"),

Contributor

Does this model work with NVBuild?

Contributor Author

yes

- Update publication date to 2026-05-28
- Move Prompt Sensitivity to top of mkdocs.yml devnotes nav
- Drop prompt_anatomy.png (redundant with ASCII anatomy diagram)
- Reorder: Goal section now follows What Is Prompt Sensitivity
- Rename ASCII Stage -> Step for consistency with section headers;
  split LLM generation into its own Step 3; renumber to 5 steps
- Soften 'directly improved' claim to 'used in Nemotron training
  mixtures' (no published benchmark numbers)
- Add inline note that qwen3-235b-a22b is hosted on build.nvidia.com
- Expand Step 5 (training mixtures): explain diversity-layer model,
  majority_percentage trade-off, regex MCQ-detection rationale
- Port devnote to Fern docs site (fern/.../prompt-sensitivity.mdx)
  and register at top of Fern Dev Notes nav
@dhruvnathawani

dhruvnathawani commented May 28, 2026

Contributor Author

Fern

Done: ported the post to fern/versions/latest/pages/devnotes/posts/prompt-sensitivity.mdx and registered it at the top of the Dev Notes section in fern/versions/latest.yml. Both MkDocs and Fern previews should pick it up. Let me know if this looks okay.

johnnygreco
johnnygreco previously approved these changes May 28, 2026

@johnnygreco johnnygreco left a comment

Contributor

Awesome, thanks @dhruvnathawani

Image used in both the Dev Notes index BlogCard and at the top of the
article body. Compressed to 256-color palette + 1200px width to stay
under the 600 KB pre-commit large-file limit.
Replace the Pillow MEDIANCUT/dither version with a pngquant 128-color
quantization at full 1536x1024 resolution. Better visual quality at
570 KB (still under the 600 KB pre-commit large-file limit).

@johnnygreco johnnygreco left a comment

Contributor

🛸

@dhruvnathawani dhruvnathawani merged commit 0532f96 into main May 28, 2026
61 checks passed

2 participants