docs: add retriever SDG toolkit dev note #666

Merged
shan-nvidia merged 8 commits into main from codex-sthan-retrieval-sdg-devnote
May 19, 2026
Conversation

@shan-nvidia

Contributor

📋 Summary

Adds a new dev note for the data-designer-retrieval-sdg toolkit, explaining why retriever synthetic data generation matters and how the toolkit turns documents into retriever training and BEIR evaluation artifacts.

🔗 Related Issue

N/A

🔄 Changes

  • Adds the Retriever SDG Toolkit dev note to MkDocs and Fern.
  • Adds a pipeline SVG showing document chunking, grounded QA generation, deduplication/judging, and conversion outputs.
  • Adds Steve Han to the dev-note author registries and uses that author for the new post.
  • Adds the new post to the MkDocs and Fern Dev Notes navigation/index.

🧪 Testing

  • .venv/bin/mkdocs build passes
  • make check-fern-docs-locally passes
  • Unit tests added/updated: N/A - docs-only change
  • E2E tests added/updated: N/A - docs-only change

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated: N/A - dev note only

Signed-off-by: Steve Han <sthan@nvidia.com>
@shan-nvidia shan-nvidia requested a review from a team as a code owner May 15, 2026 20:24
@github-actions

github-actions Bot commented May 15, 2026

Contributor

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@github-actions

Contributor

Review: PR #666 — docs: add retriever SDG toolkit dev note

Summary

Docs-only PR adding a new "Retriever SDG Toolkit" dev note to both the MkDocs and Fern documentation sites. Changes:

  • New post: docs/devnotes/posts/retrieval-sdg-toolkit.md (MkDocs) and fern/versions/latest/pages/devnotes/posts/retrieval-sdg-toolkit.mdx (Fern), with parallel content adapted to each engine's syntax.
  • New pipeline diagram: pipeline.svg mirrored under docs/devnotes/posts/assets/retrieval-sdg-toolkit/ and fern/assets/retrieval-sdg-toolkit/ (the two files are byte-identical — verified).
  • New author entry sthan (Steve Han) added to all three author registries: docs/devnotes/.authors.yml, fern/components/devnotes/.authors.yml, and fern/components/devnotes/authors-data.ts.
  • Navigation/index updates: top of mkdocs.yml Dev Notes (after index), top of Fern latest.yml Dev Notes section, and a new lead BlogCard in fern/versions/latest/pages/devnotes/index.mdx.

PR is +940 / -0 across 10 files. No runtime code is touched; the only TypeScript change is the new author entry in authors-data.ts.

Findings

Consistency with existing dev note conventions — good

  • MkDocs frontmatter uses date: + authors: and <!-- more --> excerpt marker — matches vlm-long-document-understanding.md.
  • Fern frontmatter uses title: / description: + <Authors ids={[...]} /> and {/* more */} — matches the Fern equivalent of the VLM post.
  • Author registration is mirrored across all three registries (yml × 2 + ts), which is the established pattern.
  • New post is placed first in both MkDocs nav and Fern nav, consistent with the "most recent → oldest" comment in mkdocs.yml. Date 2026-05-14 is one day before today (2026-05-15), so the ordering is correct.
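
The registry mirroring noted above could also be checked mechanically rather than by eye. A minimal sketch, assuming each registry has already been parsed into a set of author IDs; the parsing step, the function name `missing_authors`, the `existing_author` ID, and the sample registry contents are all hypothetical and not taken from this repo:

```python
def missing_authors(post_authors, registries):
    """Return {registry_name: sorted missing ids} for registries lacking a post author."""
    gaps = {}
    for name, known_ids in registries.items():
        missing = sorted(set(post_authors) - set(known_ids))
        if missing:
            gaps[name] = missing
    return gaps


# Hypothetical registry contents, used only to illustrate the check.
registries = {
    "docs/devnotes/.authors.yml": {"sthan", "existing_author"},
    "fern/components/devnotes/.authors.yml": {"sthan", "existing_author"},
    "fern/components/devnotes/authors-data.ts": {"existing_author"},
}

# Reports the one registry in which "sthan" was (hypothetically) forgotten.
print(missing_authors(["sthan"], registries))
```

An empty result means every author ID used by the post resolves in every registry.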

Slug / filename mismatch — worth a note

The Fern file is retrieval-sdg-toolkit.mdx but its frontmatter sets slug: retriever-sdg-toolkit ("retrieval" vs "retriever"). The BlogCard href /dev-notes/retriever-sdg-toolkit matches the slug, so the link is not broken — but readers grepping for the URL by filename will be tripped up. Either:

  • Rename the file to retriever-sdg-toolkit.mdx (and adjust the path in latest.yml) so filename matches the slug, OR
  • Drop the explicit slug: and let it derive from filename (this would change the URL to /dev-notes/retrieval-sdg-toolkit; update the BlogCard href accordingly).

Either is fine; the mismatch itself is the problem. The MkDocs post uses a third, longer slug (retriever-sdg-toolkit-from-documents-to-training-data), so the MkDocs and Fern URLs already diverge. That is probably tolerable since the two sites are separate properties, but a single canonical slug would be tidier.
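
A check along these lines could catch the filename/slug drift automatically. This is a rough sketch, not a description of how Fern resolves slugs: it reads the frontmatter with a naive line split rather than a YAML parser, and the helper names `frontmatter_slug` and `slug_matches_filename` are made up for illustration.

```python
import re
from pathlib import PurePosixPath


def frontmatter_slug(text):
    """Return the slug declared in a leading '---' frontmatter block, or None."""
    match = re.match(r"\A---\s*\n(.*?)\n---\s*(?:\n|\Z)", text, flags=re.DOTALL)
    if not match:
        return None
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "slug":
            return value.strip().strip("\"'")
    return None


def slug_matches_filename(path, text):
    """True when no slug is declared or the slug's last segment equals the file stem."""
    slug = frontmatter_slug(text)
    if slug is None:
        return True
    return slug.rstrip("/").split("/")[-1] == PurePosixPath(path).stem


# Sample frontmatter modeled on the mismatch described above.
sample = "---\ntitle: Retriever SDG Toolkit\nslug: retriever-sdg-toolkit\n---\nBody\n"
print(slug_matches_filename("posts/retrieval-sdg-toolkit.mdx", sample))
```

With the filename and slug described in this finding, the check reports a mismatch; renaming the file or dropping the explicit slug would both make it pass.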

External links — plausible but unverifiable from CI

The post links to several external resources that I cannot reach from this runner:

  • https://github.com/NVIDIA-NeMo/DataDesignerPlugins/tree/main/plugins/data-designer-retrieval-sdg
  • https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/embed
  • https://github.com/NVIDIA-NeMo/Nemotron/tree/preview/rerank-finetune-recipe-v1/src/nemotron/recipes/rerank
  • https://github.com/NVIDIA-NeMo/Automodel
  • https://huggingface.co/datasets/nvidia/Retrieval-Synthetic-NVDocs-v1
  • https://huggingface.co/blog/nvidia/domain-specific-embedding-finetune

The reranking-recipe link points at the preview/rerank-finetune-recipe-v1 branch rather than main, which is fragile: preview branches like this are commonly squashed or deleted after merge. Recommend confirming with @shan-nvidia that the branch will remain valid for at least the lifetime of this post, or pinning the link to a commit SHA. The other URLs follow the usual NVIDIA-NeMo and Hugging Face naming and look plausible.
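
Unpinned branch links of this kind can be flagged without network access. The sketch below is one possible heuristic, with hypothetical names throughout: it treats a GitHub tree URL as stable only if the first path segment after `/tree/` is a full 40-character commit SHA or an allow-listed ref such as `main`. Because branch names may contain slashes, it only inspects that first segment and cannot tell where the ref really ends.

```python
import re

TREE_URL = re.compile(r"https://github\.com/([^/\s]+)/([^/\s]+)/tree/([^\s)>]+)")
FULL_SHA = re.compile(r"[0-9a-f]{40}")


def unpinned_tree_links(text, stable_refs=("main",)):
    """Return GitHub tree URLs whose ref is neither a full commit SHA nor a stable ref."""
    flagged = []
    for match in TREE_URL.finditer(text):
        first_segment = match.group(3).split("/")[0]
        if FULL_SHA.fullmatch(first_segment) or first_segment in stable_refs:
            continue
        flagged.append(match.group(0))
    return flagged


# Two of the links listed above: one on main, one on the preview branch.
links = (
    "https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/embed\n"
    "https://github.com/NVIDIA-NeMo/Nemotron/tree/preview/rerank-finetune-recipe-v1"
    "/src/nemotron/recipes/rerank\n"
)
for url in unpinned_tree_links(links):
    print("unpinned:", url)
```

Only the preview-branch URL is reported here; a link pinned to a commit SHA would pass silently.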

Code-snippet accuracy

The post documents APIs from the external data-designer-retrieval-sdg plugin (not in this repo), so I cannot statically verify them. Worth spot-checking with the plugin author:

  • from data_designer_retrieval_sdg.seed_source import DocumentChunkerSeedSource
  • from data_designer_retrieval_sdg import DocumentChunkerSeedSource, build_qa_generation_pipeline
  • from data_designer_retrieval_sdg.config import EmbeddingDedupColumnConfig
  • CLI flags: --input-dir, --output-dir, --num-files, --num-pairs, --batch-size, --preview, --corpus-id, --quality-threshold, --min-complexity, --min-hops, --similarity-threshold, --multi-doc
  • Constructor kwargs: path, file_pattern, recursive, file_extensions, min_text_length, sentences_per_chunk, num_sections, multi_doc, bundle_size, bundle_strategy, max_docs_per_bundle
  • data-designer plugin install data-designer-retrieval-sdg — verify this is the actual install command exposed by the plugin catalog.

The author has noted mkdocs build and make check-fern-docs-locally pass; that catches structural issues but not API drift against the external package. A quick pip install-and-import smoke test against the linked plugin repo would close that gap.
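
Such a smoke test might look like the following sketch. The module and attribute names are copied from the snippets listed above and are not verified here; whether they exist in the published plugin is exactly what the check is meant to find out. The helper name `check_imports` is hypothetical.

```python
import importlib


def check_imports(specs):
    """Try each (module, attribute) pair and return a list of readable failures."""
    failures = []
    for module_name, attr in specs:
        try:
            module = importlib.import_module(module_name)
        except ImportError as exc:
            failures.append(f"{module_name}: import failed ({exc})")
            continue
        if attr is not None and not hasattr(module, attr):
            failures.append(f"{module_name}: missing attribute {attr}")
    return failures


# Unverified names taken from the dev note's code snippets.
PLUGIN_SPECS = [
    ("data_designer_retrieval_sdg.seed_source", "DocumentChunkerSeedSource"),
    ("data_designer_retrieval_sdg", "DocumentChunkerSeedSource"),
    ("data_designer_retrieval_sdg", "build_qa_generation_pipeline"),
    ("data_designer_retrieval_sdg.config", "EmbeddingDedupColumnConfig"),
]

if __name__ == "__main__":
    for failure in check_imports(PLUGIN_SPECS):
        print(failure)
```

Run in an environment where the plugin has been installed, an empty output would indicate the documented import paths still resolve. It does not cover the CLI flags or constructor kwargs, which would need their own checks.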

Asset duplication

pipeline.svg is committed twice (docs/devnotes/posts/assets/... and fern/assets/...), byte-identical. This matches the project's existing pattern (other dev notes do the same), so not a blocker — just a maintenance tax: any future tweak has to land in both places. Out of scope for this PR.
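
Until the duplication is addressed, a small guard could at least detect when the copies drift apart. A minimal sketch, assuming it is given the paths of the mirrored files; the function names are hypothetical and nothing like this exists in the repo as far as this review can tell:

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file's bytes so copies can be compared without diffing them."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def copies_in_sync(paths):
    """True when every path holds exactly the same bytes."""
    return len({sha256_of(p) for p in paths}) <= 1
```

Calling `copies_in_sync` with the MkDocs and Fern locations of pipeline.svg would return False as soon as a tweak lands in only one of the two trees.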

Minor copy notes

  • The post uses ASCII hyphens for em-dashes ("If your users ask questions that span multiple documents - for example..."). The VLM post uses real em-dashes ("—"). Minor stylistic divergence; pick one for consistency.
  • "Compatibility metadata and installation through the Data Designer plugin catalog." — unclear referent. Worth a one-line mention of which catalog/registry, since this is the first dev note that introduces the plugin-catalog concept end-to-end for users.

Security / sensitive content

Nothing concerning. No secrets, no internal hostnames, no embedded directives that look like injection attempts in the diff.

Verdict

Approve with minor revisions suggested. The PR is a clean, well-structured docs addition that follows existing dev-note conventions for both MkDocs and Fern. The pipeline SVG is well-crafted and accessible (has <title>/<desc>). Two recommended changes before merge:

  1. Pin or verify the preview/rerank-finetune-recipe-v1 branch link — branch URLs rot quickly.
  2. Reconcile the Fern filename / slug (retrieval-sdg-toolkit.mdx vs slug: retriever-sdg-toolkit) so future maintainers don't grep past the file.

Optional: align em-dash style with neighboring posts and consider deduplicating pipeline.svg via a build-time copy (broader project decision, not blocking).

@greptile-apps

greptile-apps Bot commented May 15, 2026

Contributor

Greptile Summary

This PR adds a new dev note documenting the data-designer-retrieval-sdg plugin, published to both MkDocs and Fern. Author registries are updated consistently across all three stores (.authors.yml ×2, authors-data.ts), and navigation entries are inserted at the top of the Dev Notes section in both docs systems.

  • Adds the full dev note in MkDocs (retrieval-sdg-toolkit.md) and Fern (retriever-sdg-toolkit.mdx), covering all four pipeline stages with code examples, a CLI walkthrough, and an SVG pipeline diagram duplicated to each asset tree.
  • Registers sthan and oliverholworthy in all author registries; nmulepati and jgreco were already present, so all four authors credited in the post resolve correctly.

Confidence Score: 5/5

Docs-only change with no runtime code; all author IDs, asset paths, and nav entries are internally consistent and verified by passing MkDocs and Fern builds.

The change touches only documentation, SVG assets, YAML/TypeScript author registries, and navigation config. All four author keys used in the new post exist in every registry, relative and absolute asset paths resolve correctly, and both build systems are reported as passing.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/devnotes/posts/retrieval-sdg-toolkit.md | New MkDocs dev note documenting the retrieval SDG plugin; all code examples, relative links, and author refs resolve correctly. |
| fern/versions/latest/pages/devnotes/posts/retriever-sdg-toolkit.mdx | Fern MDX counterpart of the dev note; slug, author IDs, and absolute asset path all match the registered data. |
| fern/components/devnotes/authors-data.ts | Adds sthan and oliverholworthy; nmulepati and jgreco were already present, so all four authors used in the new post resolve correctly. |
| docs/devnotes/.authors.yml | Adds sthan and oliverholworthy, so all authors referenced in the new post now exist in the registry. |
| fern/components/devnotes/.authors.yml | Fern author registry updated consistently with the MkDocs .authors.yml; all four post authors are present. |
| fern/versions/latest/pages/devnotes/index.mdx | New BlogCard inserted first (most-recent ordering); href, authors array, and image asset path are all correct. |
| fern/versions/latest.yml | Adds Retriever SDG Toolkit navigation entry pointing to the correct MDX path, placed first in dev-notes order. |
| mkdocs.yml | Navigation entry added at the top of Dev Notes (most-recent-first), pointing to the correct file path. |
| docs/devnotes/posts/assets/retrieval-sdg-toolkit/pipeline.svg | New accessible SVG pipeline diagram with proper title/desc elements and ARIA attributes; referenced correctly from the MkDocs post. |
| fern/assets/retrieval-sdg-toolkit/pipeline.svg | Identical SVG placed in the Fern asset tree; referenced by the BlogCard image and the MDX post. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Source Documents] --> B[Stage 1: Bundle Docs\nsingle + multi-doc groups]
    A --> C[Stage 1: Chunk Docs\nstable segment IDs]
    B --> D[Stage 2: Extract Artifacts\nconcepts / entities / links]
    C --> E[Stage 2: Generate QA\ngrounded multi-hop questions]
    D --> F[Stage 3: Deduplicate\nnear-duplicate queries]
    E --> G[Stage 3: Judge Quality\nrelevance / support / clarity]
    F --> H[Stage 4: Convert\ntrain/val · BEIR qrels · AutoModel data]
    G --> H

Reviews (8): Last reviewed commit: "docs: clarify retriever SDG wording"

Signed-off-by: Steve Han <sthan@nvidia.com>
@github-actions

github-actions Bot commented May 15, 2026

Contributor

MkDocs preview: https://31075f5e.dd-docs-preview.pages.dev

Fern preview: https://nvidia-preview-pr-666.docs.buildwithfern.com/nemo/datadesigner

Fern previews include the docs-website version archive with PR changes synced into latest. Notebook tutorials are rendered without execution outputs in previews.

@shan-nvidia

Contributor Author

I have read the DCO document and I hereby sign the DCO.

Comment thread docs/devnotes/posts/retrieval-sdg-toolkit.md Outdated
Signed-off-by: Steve Han <sthan@nvidia.com>
Comment thread docs/devnotes/posts/retrieval-sdg-toolkit.md Outdated
Comment thread docs/devnotes/posts/retrieval-sdg-toolkit.md Outdated
Signed-off-by: Steve Han <sthan@nvidia.com>
Comment thread docs/devnotes/posts/retrieval-sdg-toolkit.md Outdated
Comment thread docs/devnotes/posts/retrieval-sdg-toolkit.md Outdated
Comment thread docs/devnotes/posts/retrieval-sdg-toolkit.md Outdated
Comment thread docs/devnotes/posts/retrieval-sdg-toolkit.md Outdated
Comment thread docs/devnotes/posts/retrieval-sdg-toolkit.md Outdated
Comment thread fern/versions/latest/pages/devnotes/posts/retriever-sdg-toolkit.mdx Outdated
Comment thread docs/devnotes/posts/retrieval-sdg-toolkit.md
Signed-off-by: Steve Han <sthan@nvidia.com>
nabinchha
nabinchha previously approved these changes May 19, 2026
Comment thread docs/devnotes/posts/retrieval-sdg-toolkit.md Outdated
johnnygreco
johnnygreco previously approved these changes May 19, 2026
@shan-nvidia shan-nvidia dismissed stale reviews from nabinchha and johnnygreco via 8afe8bc May 19, 2026 20:09
@shan-nvidia shan-nvidia requested a review from nabinchha May 19, 2026 20:13
@shan-nvidia shan-nvidia merged commit abb4a24 into main May 19, 2026
50 checks passed
@nabinchha nabinchha deleted the codex-sthan-retrieval-sdg-devnote branch May 19, 2026 20:25
3 participants