Comparing changes

base repository: NVIDIA-NeMo/DataDesigner
base: v0.5.1
head repository: NVIDIA-NeMo/DataDesigner
compare: v0.5.2
  • 13 commits
  • 62 files changed
  • 9 contributors

Commits on Feb 23, 2026

  1. fix: repair notebook CI (dead model, missing API key, pyarrow type bug) (#348)
    
    * fix: repair notebook CI by replacing dead vision model and adding missing API key
    
    - Replace `meta/llama-4-scout-17b-16e-instruct` (no longer serving on
      build.nvidia.com) with `nvidia/nemotron-nano-12b-v2-vl` (project default)
      in tutorial notebook 4
    - Add `OPENROUTER_API_KEY` to the `build-notebooks` workflow so notebooks
      5 and 6 (which use OpenRouter for image generation) can authenticate
    - Regenerate colab notebooks to reflect the model change
    
    * fix: handle pyarrow list types in notebook 6 display_image
    
    When image columns are loaded from parquet with pyarrow backend,
    list values are pyarrow ListScalars, not Python lists. The
    isinstance(x, list) check fails, causing the whole ListScalar to be
    treated as a single path string (producing filenames ending in
    `png')]`). Use isinstance(x, str) instead to correctly handle any
    iterable type.
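
The check described above can be sketched as follows. This is a minimal illustration, not the notebook's actual `display_image` code: `normalize_image_paths` and `FakeListValue` are hypothetical names, and the stand-in class only mimics "an iterable that is not a Python list" without importing pyarrow. It assumes the elements are already path strings.

```python
from __future__ import annotations

from collections.abc import Iterable


def normalize_image_paths(value: object) -> list[str]:
    """Return image paths as a plain list.

    Testing for `str` first (rather than `list`) means every other iterable
    is expanded element by element instead of being treated as one path.
    """
    if isinstance(value, str):
        return [value]
    if isinstance(value, Iterable):
        return list(value)
    raise TypeError(f"Unsupported image value: {type(value).__name__}")


class FakeListValue:
    """Hypothetical stand-in for a non-list iterable loaded from parquet."""

    def __init__(self, items: list[str]) -> None:
        self._items = items

    def __iter__(self):
        return iter(self._items)


single = normalize_image_paths("a.png")
many = normalize_image_paths(FakeListValue(["a.png", "b.png"]))
```

An `isinstance(value, list)` check would send `FakeListValue` down the single-path branch, which is the failure mode the commit describes.
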
    andreatgretel authored Feb 23, 2026
    4635846

Commits on Feb 24, 2026

  1. Update top models usage chart for 1/24-2/24/2026 (#353)

    Replace the top-models pie chart with updated telemetry data
    and update the date range in the README.
    
    Co-authored-by: Cursor <cursoragent@cursor.com>
    kirit93 and cursoragent authored Feb 24, 2026
    ec59c52

Commits on Feb 25, 2026

  1. docs: add structured outputs SDG dev notes (#338)

    * devnotes: add structured outputs SDG blog post
    
    * Add author
    
    * Add author
    
    * Add author
    
    * docs: add benchmark links, clean up flowchart, remove em dashes
    
    * docs: add collapsible demo script, use default DD config, clean up formatting
    
    * docs: update baseline error rate, remove specific percentage targets
    
    * docs: widen ASCII pipeline diagram, update baseline error rate
    
    * docs: reduce heading levels per review feedback
    
    * docs: add note on extending demo to YAML/XML formats
    
    * docs: clarify baseline error rate range (20-35% depending on benchmark)
    
    * docs: increase diagram spacing
    
    * Update typo
    
    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
    
    * docs: use dd.SamplingStrategy instead of explicit import
    
    ---------
    
    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
    dhruvnathawani and greptile-apps[bot] authored Feb 25, 2026
    f07624b
  2. feat: add processor plugin support (#299)

    * feat: add processor plugin support
    
    Add PluginType.PROCESSOR to the plugin system, enabling third-party
    processor plugins via entry points. Includes a demo plugin package
    with RegexFilterProcessor (process_before_batch) and
    SemanticDedupProcessor (process_after_generation).
    
    - Add PluginType.PROCESSOR with processor_type discriminator
    - Create processor_types.py for ProcessorConfigT with plugin injection
    - Register plugin processors in engine ProcessorRegistry
    - Use RLock in PluginRegistry to prevent deadlocks during discovery
    - Add demo package: data-designer-demo-processors
    - Update processor and plugin documentation
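
Why a re-entrant lock helps during discovery can be shown with a small sketch. Everything here is hypothetical (`PluginRegistrySketch`, `_FAKE_ENTRY_POINTS`, the loader function); it is not the project's `PluginRegistry`, only an illustration of a loader calling back into the registry on the same thread.

```python
from __future__ import annotations

import threading


class PluginRegistrySketch:
    """Hypothetical registry guarded by a re-entrant lock."""

    def __init__(self) -> None:
        self._lock = threading.RLock()  # the owning thread may acquire it again
        self._plugins: dict[str, object] = {}
        self._discovered = False

    def discover(self) -> None:
        with self._lock:
            if self._discovered:
                return
            self._discovered = True
            # Loading a plugin may call back into the registry on this thread.
            for name, loader in _FAKE_ENTRY_POINTS.items():
                self._plugins[name] = loader(self)

    def get(self, name: str) -> object | None:
        # With a plain threading.Lock, the nested call from a loader would
        # block forever here because the same thread already holds the lock.
        with self._lock:
            self.discover()
            return self._plugins.get(name)


def _load_regex_filter(registry: PluginRegistrySketch) -> object:
    # Simulates plugin import code that queries the registry while loading.
    registry.get("something-else")
    return "regex-filter-plugin"


_FAKE_ENTRY_POINTS = {"regex-filter": _load_regex_filter}

registry = PluginRegistrySketch()
loaded = registry.get("regex-filter")
```
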
    
    * test: add processor plugin registration test
    
    Verify that processor plugins from PluginRegistry are picked up
    by create_default_processor_registry and registered correctly.
    
    * test: simplify processor plugin registration test
    
    * move ProcessorConfig to base and convert demo to e2e test
    
    - Move ProcessorConfig from processors.py to config.base to guard
      against circular deps (alongside SingleColumnConfig)
    - Delete demo/ directory with regex_filter and semantic_dedup plugins
    - Add regex_filter as an e2e processor plugin test in tests_e2e/
    
    * move plan to plans/299/
    andreatgretel authored Feb 25, 2026
    982ce79
  3. chore: plans for async generators and task-queue dataset builder (#347)

    * chore: plans for async generators and task-queue dataset builder
    
    Part of #346
    
    * address review feedback on async generators plan
    
    - Decouple scheduler semaphore (coarse resource guard) from PR #344's
      adaptive throttle manager (per-key API concurrency)
    - Add side-effect output column mapping to dependency resolution
    - Mandate cell-level merge writes, remove unsafe update_record option
    - Add is_stateful generator property for reentrancy control
    - Add retry & salvage policy for transient failures
    - Scope allow_resize out of async v1 (falls back to sync)
    - Fix asyncio.to_thread reference semantics, require defensive copies
    - Add new test cases for all above
    
    * add symmetric generate/agenerate bridge and plugin compatibility notes
    
    - Only one of generate/agenerate needs to be implemented; base class
      bridges in both directions (to_thread for sync→async, asyncio.run
      for async→sync)
    - Document impact on column generator plugins, custom columns, and
      processor plugins (all backward-compatible, no changes required)
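
A rough sketch of the two-way bridge, under stated assumptions: `GeneratorBase`, `SyncOnly`, and `AsyncOnly` are hypothetical names, and the override detection shown here is one possible way to avoid infinite recursion when neither method is implemented. The plan itself may do this differently.

```python
from __future__ import annotations

import asyncio


class GeneratorBase:
    """Hypothetical base class: subclasses override generate() or agenerate()."""

    def generate(self, data: dict) -> dict:
        if type(self).agenerate is GeneratorBase.agenerate:
            raise NotImplementedError("Implement generate() or agenerate()")
        # async -> sync: run the coroutine to completion on a fresh event loop.
        # asyncio.run() raises RuntimeError when a loop is already running.
        return asyncio.run(self.agenerate(data))

    async def agenerate(self, data: dict) -> dict:
        if type(self).generate is GeneratorBase.generate:
            raise NotImplementedError("Implement generate() or agenerate()")
        # sync -> async: run blocking code in a worker thread. The thread sees
        # the same object reference, so a copy is passed to avoid shared mutation.
        return await asyncio.to_thread(self.generate, dict(data))


class SyncOnly(GeneratorBase):
    def generate(self, data: dict) -> dict:
        return {**data, "source": "sync"}


class AsyncOnly(GeneratorBase):
    async def agenerate(self, data: dict) -> dict:
        await asyncio.sleep(0)
        return {**data, "source": "async"}


from_async_call = asyncio.run(SyncOnly().agenerate({"id": 1}))
from_sync_call = AsyncOnly().generate({"id": 2})
```

The `dict(data)` copy mirrors the "defensive copies" point in the earlier review-feedback bullet: `asyncio.to_thread` passes references, not snapshots.
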
    
    * add reference diagrams and clarify statefulness concept
    
    - Add plans/346/diagrams.md with 5 Mermaid diagrams: task lifecycle,
      scheduler main loop, dependency resolution example, concurrency
      layers, and row group pipelining
    - Clarify in plan that statefulness (concurrency safety) and sync/async
      (I/O model) are orthogonal concerns
    
    * add edge case handling and open questions
    
    - Eager row-drop propagation: failed rows are dropped across all columns
      to avoid wasting compute on incomplete rows
    - Out-of-order row group checkpointing via index-based file naming
    - Pre-batch processor failure skips the entire row group
    - Salvage rounds get separate error threshold and config knobs
    - Undersized last row group note
    - Open questions: thread pool sizing, silent task hangs
    
    * refine async scheduler plan safeguards
    
    - add submission budget controls to prevent unbounded parked tasks
    - clarify DAG validation, safe async-to-sync bridging, and row-scoped drop policy
    - align diagrams and unresolved risk wording with latest design decisions
    
    * refine plan with UX considerations and design clarifications
    
    - add UX considerations section: progress display strategy, peak memory
      cap, new config knobs, async custom columns and plugin upgrade path,
      what stays the same
    - replace allow_resize silent fallback with explicit DatasetGenerationError
      at startup; move to Follow-ups section
    - consolidate all deferred work into Out of scope / Follow-ups subsections
    - fix five internal inconsistencies: progress tracking in Step 4, missing
      async_max_concurrent_row_groups in scheduler constructor, annotate
      _ensure_async_engine_loop as existing, SamplerColumnGenerator dual-wrapper
      scope (applies to all FromScratchColumnGenerator subclasses),
      stateful serialization vs row group admission clarification
    - resolve previously open decisions: asyncio.Event over Condition,
      task_model as own module, async_max_concurrent_row_groups default 3,
      async_salvage_max_attempts_per_task dropped in favour of max_rounds+1
      semantics, thread pool keep default for v1
    - fix CustomColumnGenerator FULL_COLUMN async path (needs own agenerate
      branching on strategy); note ValidationColumnGenerator internal threading
    
    * document relation to PR #269 and fix scheduler diagram
    
    - add "Relation to PR #269" section explaining what we adopted
      (dependency source, trait inference, completion tracker design,
      statefulness separation) and what we changed (row-group tasks
      instead of cell-level nodes, ROW_STREAMABLE omitted)
    - fix scheduler main loop diagram: add async_max_concurrent_row_groups
      admission step, pre-batch failure path (skip row group + release
      slot), and loop back to ADMIT after row group completion
    
    * add profiling/tracing section to async scheduler plan
    
    - TaskTrace dataclass spec in Step 3 (opt-in, zero overhead when disabled)
    - trace=True param on AsyncTaskScheduler constructor in Step 4
    - Step 8 benchmark references trace for timing measurements
    - New Profiling section: instrumentation points, example output table, usage snippet
    
    * refine async scheduler plan from review
    
    - Replace dependency map with static ExecutionGraph class (upstream,
      downstream, strategy, topological_order, critical_path, task_count,
      to_mermaid accessors)
    - Use row-group-local indices in CompletionTracker instead of global
    - Clarify from-scratch columns are FULL_COLUMN with empty upstream
      deps, not a separate strategy enum value
    - Remove reference to non-existent ColumnGeneratorCellByCell
    - Expand PR #269 comparison: ExecutionTraits → GenerationStrategy
    
    * fix graph complexity notation and clarify tracker API
    
    - Correct "N columns, N edges" to "O(C) nodes, O(C²) edges worst-case"
    - Add dispatched set param to get_ready_tasks to prevent double-dispatch
    - Clarify is_row_group_complete drop_row interaction
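
The role of the dispatched set can be illustrated with a simplified, column-level sketch. The signature and the example graph are assumptions for illustration only; the plan's tracker works on row groups and its real API may differ.

```python
from __future__ import annotations


def get_ready_tasks(
    upstream: dict[str, set[str]],
    completed: set[str],
    dispatched: set[str],
) -> list[str]:
    """Return columns whose upstream columns are all complete and not yet dispatched."""
    return sorted(
        column
        for column, deps in upstream.items()
        if column not in completed
        and column not in dispatched
        and deps <= completed
    )


# Hypothetical column graph: "summary" needs "topic"; "score" needs both.
upstream = {"topic": set(), "summary": {"topic"}, "score": {"topic", "summary"}}
completed: set[str] = set()
dispatched: set[str] = set()

first = get_ready_tasks(upstream, completed, dispatched)
dispatched.update(first)
# "topic" is in flight but not complete, so a second call returns nothing new.
second = get_ready_tasks(upstream, completed, dispatched)
completed.add("topic")
third = get_ready_tasks(upstream, completed, dispatched)
```

Without the `dispatched` filter, the second call would return `"topic"` again while it is still running.
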
    
    * add PR breakdown, code sketches, and throttle note
    
    - Add PR breakdown section with 4 PRs, dependency graph, and
      "what works after merge" for each
    - Add code-sketches.md with structural sketches of main components
    - Reorganize test plan by PR (unit tests per PR, integration in PR 4)
    - Note that throttle manager (PR #344) is optional; scheduler works
      without it initially
    
    * refine concurrency model and add multi-column handling
    
    - Rename scheduler semaphore to execution semaphore for clarity
    - Split execution semaphore from submission budget as distinct
      concerns with separate semaphores
    - Add reacquire step to dispatch pattern after throttle wait
    - Add multi-column generator handling via instance dedup on the
      scheduler (graph stays column-level)
    
    * add compute-bound generator risk and follow-up
    
    GIL contention with CPU-bound custom generators and event loop
    starvation with native async compute are documented as v1 risks.
    ProcessPoolExecutor routing via is_cpu_bound noted as follow-up.
    
    * clarify compute-bound risk as thread pool starvation
    
    Compute-heavy tasks saturating the thread pool starve I/O-bound
    tasks (LLM calls) from acquiring threads, not just GIL contention.
    
    * add async guidance for plugins and custom columns
    
    Compute-bound plugins should implement generate(), not agenerate(),
    to keep CPU work off the event loop. Same rule for custom columns:
    only use async def for I/O-bound work.
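
The guidance above can be demonstrated with a small experiment. All names are hypothetical; `cpu_work` stands in for compute-bound generator logic, and a heartbeat task counts how often the event loop gets to run other work while the computation is in progress.

```python
import asyncio


def cpu_work(n: int) -> int:
    """Stand-in for compute-bound generator logic."""
    return sum(i * i for i in range(n))


async def compute_on_loop(n: int) -> int:
    # Anti-pattern: no await inside, so the event loop cannot run anything else.
    return cpu_work(n)


async def compute_in_thread(n: int) -> int:
    # Sync function handed to a worker thread; the loop keeps scheduling tasks.
    return await asyncio.to_thread(cpu_work, n)


async def count_ticks_during(awaitable) -> int:
    """Count how often a background heartbeat runs while `awaitable` completes."""
    ticks = 0

    async def heartbeat() -> None:
        nonlocal ticks
        while True:
            await asyncio.sleep(0)
            ticks += 1

    task = asyncio.create_task(heartbeat())
    await awaitable
    task.cancel()
    return ticks


blocked_ticks = asyncio.run(count_ticks_during(compute_on_loop(200_000)))
threaded_ticks = asyncio.run(count_ticks_during(compute_in_thread(200_000)))
```

In the first run the heartbeat never gets a turn; in the second it does, although the worker thread still competes for the GIL, which is the separate risk noted earlier in this commit.
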
    andreatgretel authored Feb 25, 2026
    47e52e5

Commits on Feb 26, 2026

  1. a101760

Commits on Mar 2, 2026

  1. d5b9850

Commits on Mar 3, 2026

  1. chore: bump cryptography 46.0.3 → 46.0.5 and pillow 12.1.0 → 12.1.1 (#364)
    
    Address security vulnerabilities flagged in dependency scan:
    - cryptography: transitive dep updated via lock file
    - pillow: lower bound bumped in pyproject.toml to 12.1.1
    johnnygreco authored Mar 3, 2026
    f251446

Commits on Mar 4, 2026

  1. feat: add Streamable HTTP transport support for remote MCP providers (#358)
    
    * feat: add Streamable HTTP transport support for remote MCP providers (#357)
    
    Add `streamable_http` as a supported transport type for `MCPProvider`,
    enabling connections to MCP servers that use the Streamable HTTP protocol
    (e.g. Tavily remote endpoints). Previously only SSE transport was supported,
    causing silent 5-minute timeouts when connecting to incompatible endpoints.
    
    - Expand `MCPProvider.provider_type` to `Literal["sse", "streamable_http"]`
      (default remains `"sse"` for backwards compatibility)
    - Route `streamable_http` providers through `streamablehttp_client` from
      the MCP SDK in `MCPIOService._get_or_create_session()`
    - Handle variable-length context manager results from MCP transport clients
    - Add `DataDesigner.list_mcp_tool_names()` for discovering available tools
    - Update CLI form builder and controller to support the new transport option
    - Add tests for streamable_http config, session creation, and form builder
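
One way to handle transports whose context managers yield tuples of different lengths is star-unpacking. The sketch below uses fake transports rather than the MCP SDK; `fake_sse_client`, `fake_streamable_http_client`, `TRANSPORTS`, and `open_streams` are hypothetical, and the tuple shapes (two values versus two values plus an extra) are an assumption for illustration.

```python
from __future__ import annotations

import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def fake_sse_client(url: str):
    # Stand-in for a transport that yields (read_stream, write_stream).
    yield (f"read:{url}", f"write:{url}")


@asynccontextmanager
async def fake_streamable_http_client(url: str):
    # Stand-in for a transport that yields an extra element after the streams.
    yield (f"read:{url}", f"write:{url}", lambda: "session-id")


TRANSPORTS = {"sse": fake_sse_client, "streamable_http": fake_streamable_http_client}


async def open_streams(provider_type: str, url: str) -> tuple[str, str]:
    try:
        client = TRANSPORTS[provider_type]
    except KeyError:
        raise ValueError(f"Unsupported provider_type: {provider_type!r}") from None
    async with client(url) as result:
        # Star-unpacking accepts 2-tuples and longer tuples alike.
        read_stream, write_stream, *_extras = result
        # A real session would have to be used inside this block; returning
        # here is only acceptable because the fake streams are plain strings.
        return read_stream, write_stream


sse_streams = asyncio.run(open_streams("sse", "https://example.invalid/sse"))
http_streams = asyncio.run(open_streams("streamable_http", "https://example.invalid/mcp"))
```
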
    
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
    
    * updates
    
    * simplify import
    
    * address greptile comments
    
    ---------
    
    Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
    nabinchha and claude authored Mar 4, 2026
    e4857f6
  2. docs: update README token badge to 150+ billion (#367)

    * docs: update README token badge to 150+ billion
    
    Refresh the README tokens-generated badge so the project claim reflects the current 150+ billion total.
    
    * fix broken link
    johnnygreco authored Mar 4, 2026
    3fa8eb7
  3. docs: rename structured outputs dev note for Nemotron (#368)

    Align the dev note path/nav with the Nemotron-specific title and add a blog excerpt marker for cleaner post previews.
    johnnygreco authored Mar 4, 2026
    be91adc
  4. chore: fix inaccuracies and improve AGENTS.md (#369)

    * Add code organization, design principles, and test guidelines to AGENTS.md
    
    - Code organization: public before private, class method ordering,
      section comments for larger modules
    - Naming: function names must start with action verbs
    - Design principles: DRY, KISS, YAGNI, SOLID guidelines
    - Testing: parametrization, minimal fixtures, mock at boundaries,
      test behavior not implementation
    
    * docs: fix inaccuracies and improve AGENTS.md consistency
    
    - Update ruff version to >=0.14.10 and Python target to 3.10+
    - Add missing linter rules (TID, UP006, UP007, UP045)
    - Document `from __future__ import annotations` as project convention
    - Merge duplicate Naming sections, deduplicate type annotation guidance
    - Clarify DRY vs KISS tension with "third occurrence" rule of thumb
    - Fix test example missing `Any` import
    - Add `make perf-import` to Common Development Tasks
    
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
    
    * docs: clarify class method ordering guideline in AGENTS.md
    
    Update to match actual codebase convention: dunders first, then
    properties, then public methods, then private helpers. Add note
    about grouping related method types together.
    
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
    
    ---------
    
    Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
    nabinchha and claude authored Mar 4, 2026
    ddd6eb4
  5. fix: include plugin column types in display_sample_record() (#365)

    * fix: include plugin column types in display_sample_record()
    
    Replace hardcoded column type list with dynamic iteration over
    get_column_display_order(), which already includes plugin-registered
    types. Column types with dedicated display sections (SEED_DATASET,
    IMAGE, LLM_CODE, VALIDATION, LLM_JUDGE) are excluded from the
    "Generated Columns" table as before.
    
    Also display side_effect_columns for plugin column types, matching
    the existing behavior for CUSTOM columns.
    
    Fixes #345
    
    Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
    Made-with: Cursor
    
    * fix: use is_plugin_column_type() for side-effect column display
    
    Add is_plugin_column_type() helper to column_types.py to avoid
    redundant plugin_manager.get_plugin_column_types() calls. Use it
    in display_sample_record() to show side_effect_columns for plugin
    column types, matching existing CUSTOM behavior.
    
    Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
    Made-with: Cursor
    
    * fix: handle string column_type in is_plugin_column_type()
    
    Column configs store column_type as a Literal string, not a
    DataDesignerColumnType enum. Accept both str and enum to avoid
    AttributeError when calling .value on a plain string.
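
The str-or-enum handling described above might look roughly like this. The enum members, their values, and `PLUGIN_COLUMN_TYPES` are hypothetical placeholders; the project's helper reads plugin types from its plugin manager rather than a module-level set.

```python
from __future__ import annotations

from enum import Enum


class ColumnType(str, Enum):
    """Hypothetical subset of built-in column types."""

    LLM_TEXT = "llm-text"
    SAMPLER = "sampler"


# Hypothetical plugin-registered type names.
PLUGIN_COLUMN_TYPES = {"my-plugin-column"}


def is_plugin_column_type(column_type: str | Enum) -> bool:
    """Accept either an enum member or its plain string value."""
    # Calling .value on a plain string would raise AttributeError, so the
    # enum case is unwrapped explicitly and strings pass through unchanged.
    value = column_type.value if isinstance(column_type, Enum) else column_type
    return value in PLUGIN_COLUMN_TYPES


from_plugin = is_plugin_column_type("my-plugin-column")
from_enum = is_plugin_column_type(ColumnType.SAMPLER)
from_string = is_plugin_column_type("sampler")
```
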
    
    Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
    Made-with: Cursor
    
    * test: improve plugin column display tests
    
    - Move test imports to module level per project convention
    - Replace plumbing-only test with one that renders a fake plugin
      column (with side-effect columns) to HTML and asserts the values
      appear in the output
    - Add parametrized test for is_plugin_column_type() verifying all
      built-in types return False for both enum and string forms
    
    Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
    Made-with: Cursor
    
    * fix: preserve original display order for side-effect columns
    
    Render primary column before side-effect columns, matching the
    existing CUSTOM column behavior. Avoids introducing display
    ordering discrepancies.
    
    Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
    Made-with: Cursor
    
    ---------
    
    Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
    Co-authored-by: Johnny Greco <jogreco@nvidia.com>
    3mei and johnnygreco authored Mar 4, 2026
    e2c94da