Comparing changes
base repository: NVIDIA-NeMo/DataDesigner
base: v0.5.1
head repository: NVIDIA-NeMo/DataDesigner
compare: v0.5.2
- 13 commits
- 62 files changed
- 9 contributors
Commits on Feb 23, 2026
fix: repair notebook CI (dead model, missing API key, pyarrow type bug) (#348)
* fix: repair notebook CI by replacing dead vision model and adding missing API key
  - Replace `meta/llama-4-scout-17b-16e-instruct` (no longer serving on build.nvidia.com) with `nvidia/nemotron-nano-12b-v2-vl` (project default) in tutorial notebook 4
  - Add `OPENROUTER_API_KEY` to the `build-notebooks` workflow so notebooks 5 and 6 (which use OpenRouter for image generation) can authenticate
  - Regenerate colab notebooks to reflect the model change
* fix: handle pyarrow list types in notebook 6 display_image
  When image columns are loaded from parquet with pyarrow backend, list values are pyarrow ListScalars, not Python lists. The isinstance(x, list) check fails, causing the whole ListScalar to be treated as a single path string (producing filenames ending in `png')]`). Use isinstance(x, str) instead to correctly handle any iterable type.
Commit 4635846
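The `isinstance(x, str)` fix described in the commit above can be illustrated with a minimal sketch. The helper names `as_path_list` and `as_path_list_buggy` are hypothetical and are not taken from notebook 6; a tuple stands in for the pyarrow list value so the sketch runs without pyarrow installed.

```python
def as_path_list_buggy(value):
    """Old-style check: anything that is not a Python list is treated as one path."""
    if isinstance(value, list):
        return [str(item) for item in value]
    return [str(value)]  # a non-list iterable is stringified as a whole


def as_path_list(value):
    """New-style check: only a plain string is a single path; iterate everything else."""
    if isinstance(value, str):
        return [value]
    return [str(item) for item in value]


# A tuple stands in for a list-like value that is not a Python list (assumption).
list_like = ("images/a.png", "images/b.png")

print(as_path_list_buggy(list_like))  # one garbled "path" built from the tuple's repr
print(as_path_list(list_like))        # two separate paths
print(as_path_list("images/a.png"))   # a single path stays a single path
```

Checking for the one scalar case (`str`) rather than for one specific container type means any iterable container is handled the same way.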
Commits on Feb 24, 2026
Update top models usage chart for 1/24-2/24/2026 (#353)
Replace the top-models pie chart with updated telemetry data and update the date range in the README. Co-authored-by: Cursor <cursoragent@cursor.com>
Commit ec59c52
Commits on Feb 25, 2026
docs: add structured outputs SDG dev notes (#338)
* devnotes: add structured outputs SDG blog post
* Add author
* Add author
* Add author
* docs: add benchmark links, clean up flowchart, remove em dashes
* docs: add collapsible demo script, use default DD config, clean up formatting
* docs: update baseline error rate, remove specific percentage targets
* docs: widen ASCII pipeline diagram, update baseline error rate
* docs: reduce heading levels per review feedback
* docs: add note on extending demo to YAML/XML formats
* docs: clarify baseline error rate range (20-35% depending on benchmark)
* docs: increase diagram spacing
* Update typo
  Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* docs: use dd.SamplingStrategy instead of explicit import
---------
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Commit f07624b
feat: add processor plugin support (#299)
* feat: add processor plugin support
  Add PluginType.PROCESSOR to the plugin system, enabling third-party processor plugins via entry points. Includes a demo plugin package with RegexFilterProcessor (process_before_batch) and SemanticDedupProcessor (process_after_generation).
  - Add PluginType.PROCESSOR with processor_type discriminator
  - Create processor_types.py for ProcessorConfigT with plugin injection
  - Register plugin processors in engine ProcessorRegistry
  - Use RLock in PluginRegistry to prevent deadlocks during discovery
  - Add demo package: data-designer-demo-processors
  - Update processor and plugin documentation
* test: add processor plugin registration test
  Verify that processor plugins from PluginRegistry are picked up by create_default_processor_registry and registered correctly.
* test: simplify processor plugin registration test
* move ProcessorConfig to base and convert demo to e2e test
  - Move ProcessorConfig from processors.py to config.base to guard against circular deps (alongside SingleColumnConfig)
  - Delete demo/ directory with regex_filter and semantic_dedup plugins
  - Add regex_filter as an e2e processor plugin test in tests_e2e/
* move plan to plans/299/
Commit 982ce79
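The commit above mentions using an RLock in PluginRegistry to prevent deadlocks during discovery. The following is a minimal sketch of why a re-entrant lock helps, assuming discovery can call back into a method that takes the same lock. `PluginRegistrySketch` and its methods are hypothetical and do not reproduce the project's actual class.

```python
import threading


class PluginRegistrySketch:
    """Hypothetical registry: every public method takes the same lock."""

    def __init__(self):
        # RLock may be acquired again by the thread that already holds it.
        # With threading.Lock, get() -> discover() -> register() would block forever.
        self._lock = threading.RLock()
        self._plugins = {}
        self._discovered = False

    def register(self, name, plugin):
        with self._lock:
            self._plugins[name] = plugin

    def discover(self):
        with self._lock:
            if self._discovered:
                return
            self._discovered = True
            # Stand-in for loading entry points; loading registers plugins,
            # which re-enters the lock on the same thread.
            self.register("regex-filter", {"kind": "processor"})

    def get(self, name):
        with self._lock:
            self.discover()  # nested acquisition on the same thread
            return self._plugins[name]


registry = PluginRegistrySketch()
print(registry.get("regex-filter"))
```

The lock still excludes other threads; it only permits nested acquisition by the thread that already owns it.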
chore: plans for async generators and task-queue dataset builder (#347)
* chore: plans for async generators and task-queue dataset builder
  Part of #346
* address review feedback on async generators plan
  - Decouple scheduler semaphore (coarse resource guard) from PR #344's adaptive throttle manager (per-key API concurrency)
  - Add side-effect output column mapping to dependency resolution
  - Mandate cell-level merge writes, remove unsafe update_record option
  - Add is_stateful generator property for reentrancy control
  - Add retry & salvage policy for transient failures
  - Scope allow_resize out of async v1 (falls back to sync)
  - Fix asyncio.to_thread reference semantics, require defensive copies
  - Add new test cases for all above
* add symmetric generate/agenerate bridge and plugin compatibility notes
  - Only one of generate/agenerate needs to be implemented; base class bridges in both directions (to_thread for sync→async, asyncio.run for async→sync)
  - Document impact on column generator plugins, custom columns, and processor plugins (all backward-compatible, no changes required)
* add reference diagrams and clarify statefulness concept
  - Add plans/346/diagrams.md with 5 Mermaid diagrams: task lifecycle, scheduler main loop, dependency resolution example, concurrency layers, and row group pipelining
  - Clarify in plan that statefulness (concurrency safety) and sync/async (I/O model) are orthogonal concerns
* add edge case handling and open questions
  - Eager row-drop propagation: failed rows are dropped across all columns to avoid wasting compute on incomplete rows
  - Out-of-order row group checkpointing via index-based file naming
  - Pre-batch processor failure skips the entire row group
  - Salvage rounds get separate error threshold and config knobs
  - Undersized last row group note
  - Open questions: thread pool sizing, silent task hangs
* refine async scheduler plan safeguards
  - add submission budget controls to prevent unbounded parked tasks
  - clarify DAG validation, safe async-to-sync bridging, and row-scoped drop policy
  - align diagrams and unresolved risk wording with latest design decisions
* refine plan with UX considerations and design clarifications
  - add UX considerations section: progress display strategy, peak memory cap, new config knobs, async custom columns and plugin upgrade path, what stays the same
  - replace allow_resize silent fallback with explicit DatasetGenerationError at startup; move to Follow-ups section
  - consolidate all deferred work into Out of scope / Follow-ups subsections
  - fix five internal inconsistencies: progress tracking in Step 4, missing async_max_concurrent_row_groups in scheduler constructor, annotate _ensure_async_engine_loop as existing, SamplerColumnGenerator dual-wrapper scope (applies to all FromScratchColumnGenerator subclasses), stateful serialization vs row group admission clarification
  - resolve previously open decisions: asyncio.Event over Condition, task_model as own module, async_max_concurrent_row_groups default 3, async_salvage_max_attempts_per_task dropped in favour of max_rounds+1 semantics, thread pool keep default for v1
  - fix CustomColumnGenerator FULL_COLUMN async path (needs own agenerate branching on strategy); note ValidationColumnGenerator internal threading
* document relation to PR #269 and fix scheduler diagram
  - add "Relation to PR #269" section explaining what we adopted (dependency source, trait inference, completion tracker design, statefulness separation) and what we changed (row-group tasks instead of cell-level nodes, ROW_STREAMABLE omitted)
  - fix scheduler main loop diagram: add async_max_concurrent_row_groups admission step, pre-batch failure path (skip row group + release slot), and loop back to ADMIT after row group completion
* add profiling/tracing section to async scheduler plan
  - TaskTrace dataclass spec in Step 3 (opt-in, zero overhead when disabled)
  - trace=True param on AsyncTaskScheduler constructor in Step 4
  - Step 8 benchmark references trace for timing measurements
  - New Profiling section: instrumentation points, example output table, usage snippet
* refine async scheduler plan from review
  - Replace dependency map with static ExecutionGraph class (upstream, downstream, strategy, topological_order, critical_path, task_count, to_mermaid accessors)
  - Use row-group-local indices in CompletionTracker instead of global
  - Clarify from-scratch columns are FULL_COLUMN with empty upstream deps, not a separate strategy enum value
  - Remove reference to non-existent ColumnGeneratorCellByCell
  - Expand PR #269 comparison: ExecutionTraits → GenerationStrategy
* fix graph complexity notation and clarify tracker API
  - Correct "N columns, N edges" to "O(C) nodes, O(C²) edges worst-case"
  - Add dispatched set param to get_ready_tasks to prevent double-dispatch
  - Clarify is_row_group_complete drop_row interaction
* add PR breakdown, code sketches, and throttle note
  - Add PR breakdown section with 4 PRs, dependency graph, and "what works after merge" for each
  - Add code-sketches.md with structural sketches of main components
  - Reorganize test plan by PR (unit tests per PR, integration in PR 4)
  - Note that throttle manager (PR #344) is optional; scheduler works without it initially
* refine concurrency model and add multi-column handling
  - Rename scheduler semaphore to execution semaphore for clarity
  - Split execution semaphore from submission budget as distinct concerns with separate semaphores
  - Add reacquire step to dispatch pattern after throttle wait
  - Add multi-column generator handling via instance dedup on the scheduler (graph stays column-level)
* add compute-bound generator risk and follow-up
  GIL contention with CPU-bound custom generators and event loop starvation with native async compute are documented as v1 risks. ProcessPoolExecutor routing via is_cpu_bound noted as follow-up.
* clarify compute-bound risk as thread pool starvation
  Compute-heavy tasks saturating the thread pool starve I/O-bound tasks (LLM calls) from acquiring threads, not just GIL contention.
* add async guidance for plugins and custom columns
  Compute-bound plugins should implement generate(), not agenerate(), to keep CPU work off the event loop. Same rule for custom columns: only use async def for I/O-bound work.
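The symmetric generate/agenerate bridge described in the plan (to_thread for sync→async, asyncio.run for async→sync) might look roughly like the sketch below. `ColumnGeneratorSketch`, `UpperSync`, and `LowerAsync` are hypothetical names for illustration; the plan's real base class and signatures are not shown in this commit log.

```python
import asyncio


class ColumnGeneratorSketch:
    """Hypothetical base class: a subclass overrides generate() OR agenerate()."""

    def generate(self, data):
        if type(self).agenerate is ColumnGeneratorSketch.agenerate:
            raise NotImplementedError("override generate() or agenerate()")
        # async -> sync bridge; asyncio.run() only works when no event loop
        # is already running in the calling thread.
        return asyncio.run(self.agenerate(data))

    async def agenerate(self, data):
        if type(self).generate is ColumnGeneratorSketch.generate:
            raise NotImplementedError("override generate() or agenerate()")
        # sync -> async bridge; the worker thread receives a shallow copy
        # because to_thread passes references, not copies.
        return await asyncio.to_thread(self.generate, dict(data))


class UpperSync(ColumnGeneratorSketch):
    def generate(self, data):  # blocking or CPU-style work stays synchronous
        return {**data, "out": data["text"].upper()}


class LowerAsync(ColumnGeneratorSketch):
    async def agenerate(self, data):  # I/O-style work would be awaited here
        await asyncio.sleep(0)
        return {**data, "out": data["text"].lower()}


print(asyncio.run(UpperSync().agenerate({"text": "Hi"})))  # sync impl, async call
print(LowerAsync().generate({"text": "Hi"}))               # async impl, sync call
```

Checking which method the subclass actually overrides keeps the two default implementations from calling each other forever when neither is provided.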
Commit 47e52e5
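The plan in this commit refers to a column-level dependency graph with a topological order and DAG validation. As a rough illustration only, the standard-library `graphlib` module can compute such an order and detect cycles. The function name `topological_order` echoes an accessor named in the commit message, but this body and the example column names are assumptions, not the project's implementation.

```python
from graphlib import CycleError, TopologicalSorter


def topological_order(upstream):
    """Order columns so each one comes after all of its upstream columns.

    `upstream` maps a column name to the set of columns it depends on.
    """
    try:
        return list(TopologicalSorter(upstream).static_order())
    except CycleError as exc:
        # args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"column dependencies contain a cycle: {exc.args[1]}") from exc


# Hypothetical columns: a from-scratch column has an empty upstream set.
columns = {
    "topic": set(),
    "question": {"topic"},
    "answer": {"topic", "question"},
    "score": {"answer"},
}
print(topological_order(columns))
```

Validating the graph once, before any task is dispatched, is what lets a scheduler fail fast on a cyclic configuration instead of hanging on tasks that can never become ready.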
Commits on Feb 26, 2026
Commit a101760
Commits on Mar 2, 2026
Commit d5b9850
Commits on Mar 3, 2026
chore: bump cryptography 46.0.3 → 46.0.5 and pillow 12.1.0 → 12.1.1 (#364)
Address security vulnerabilities flagged in dependency scan:
- cryptography: transitive dep updated via lock file
- pillow: lower bound bumped in pyproject.toml to 12.1.1
Commit f251446
Commits on Mar 4, 2026
feat: add Streamable HTTP transport support for remote MCP providers (#358)
* feat: add Streamable HTTP transport support for remote MCP providers (#357)
  Add `streamable_http` as a supported transport type for `MCPProvider`, enabling connections to MCP servers that use the Streamable HTTP protocol (e.g. Tavily remote endpoints). Previously only SSE transport was supported, causing silent 5-minute timeouts when connecting to incompatible endpoints.
  - Expand `MCPProvider.provider_type` to `Literal["sse", "streamable_http"]` (default remains `"sse"` for backwards compatibility)
  - Route `streamable_http` providers through `streamablehttp_client` from the MCP SDK in `MCPIOService._get_or_create_session()`
  - Handle variable-length context manager results from MCP transport clients
  - Add `DataDesigner.list_mcp_tool_names()` for discovering available tools
  - Update CLI form builder and controller to support the new transport option
  - Add tests for streamable_http config, session creation, and form builder
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* updates
* simplify import
* address greptile comments
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Commit e4857f6
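The "variable-length context manager results" item in the commit above can be sketched without the MCP SDK. The two fake clients below are hypothetical stand-ins; the assumption is that one transport yields a (read, write) pair while the other yields an extra third value, and that routing is keyed on the `provider_type` string.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def fake_sse_client(url):
    # Stand-in (assumption): yields two values.
    yield ("read-stream", "write-stream")


@asynccontextmanager
async def fake_streamable_http_client(url):
    # Stand-in (assumption): yields a third value in addition to the streams.
    yield ("read-stream", "write-stream", lambda: "session-id")


TRANSPORTS = {
    "sse": fake_sse_client,
    "streamable_http": fake_streamable_http_client,
}


async def open_streams(provider_type, url):
    """Pick a transport by type and unpack its result regardless of length."""
    try:
        client = TRANSPORTS[provider_type]
    except KeyError:
        raise ValueError(f"unsupported provider_type: {provider_type!r}") from None
    async with client(url) as result:
        # Star-unpacking tolerates both the 2-tuple and the 3-tuple shape.
        read, write, *extras = result
        # Real code would keep the context open for the session's lifetime.
        return read, write, len(extras)


print(asyncio.run(open_streams("sse", "https://example.invalid/sse")))
print(asyncio.run(open_streams("streamable_http", "https://example.invalid/mcp")))
```

Rejecting an unknown transport type immediately avoids the kind of silent timeout the commit message describes for mismatched endpoints.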
docs: update README token badge to 150+ billion (#367)
* docs: update README token badge to 150+ billion
  Refresh the README tokens-generated badge so the project claim reflects the current 150+ billion total.
* fix broken link
Commit 3fa8eb7
docs: rename structured outputs dev note for Nemotron (#368)
Align the dev note path/nav with the Nemotron-specific title and add a blog excerpt marker for cleaner post previews.
Commit be91adc
chore: fix inaccuracies and improve AGENTS.md (#369)
* Add code organization, design principles, and test guidelines to AGENTS.md
  - Code organization: public before private, class method ordering, section comments for larger modules
  - Naming: function names must start with action verbs
  - Design principles: DRY, KISS, YAGNI, SOLID guidelines
  - Testing: parametrization, minimal fixtures, mock at boundaries, test behavior not implementation
* docs: fix inaccuracies and improve AGENTS.md consistency
  - Update ruff version to >=0.14.10 and Python target to 3.10+
  - Add missing linter rules (TID, UP006, UP007, UP045)
  - Document `from __future__ import annotations` as project convention
  - Merge duplicate Naming sections, deduplicate type annotation guidance
  - Clarify DRY vs KISS tension with "third occurrence" rule of thumb
  - Fix test example missing `Any` import
  - Add `make perf-import` to Common Development Tasks
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs: clarify class method ordering guideline in AGENTS.md
  Update to match actual codebase convention: dunders first, then properties, then public methods, then private helpers. Add note about grouping related method types together.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Commit ddd6eb4
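The annotation conventions named in the commit above (UP006, UP007, UP045, and `from __future__ import annotations`) can be illustrated with a short sketch. It assumes the ruff codes carry their usual meaning: built-in generics instead of `typing.List`, `X | Y` instead of `Union`, and `X | None` instead of `Optional`. The functions are invented for illustration and are not from the repository.

```python
from __future__ import annotations  # annotations are stored as strings, not evaluated


def first_token(tokens: list[str], default: str | None = None) -> str | None:
    """Return the first token, or the default when the list is empty.

    `list[str]` is the style UP006 prefers; `str | None` is the UP045 style.
    """
    return tokens[0] if tokens else default


def parse_count(raw: int | str) -> int:
    """Accept an int or a numeric string; `int | str` is the UP007 style."""
    return int(raw)


print(first_token(["alpha", "beta"]))
print(first_token([]))
print(parse_count("3"))
```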
fix: include plugin column types in display_sample_record() (#365)
* fix: include plugin column types in display_sample_record()
  Replace hardcoded column type list with dynamic iteration over get_column_display_order(), which already includes plugin-registered types. Column types with dedicated display sections (SEED_DATASET, IMAGE, LLM_CODE, VALIDATION, LLM_JUDGE) are excluded from the "Generated Columns" table as before.
  Also display side_effect_columns for plugin column types, matching the existing behavior for CUSTOM columns.
  Fixes #345
  Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
  Made-with: Cursor
* fix: use is_plugin_column_type() for side-effect column display
  Add is_plugin_column_type() helper to column_types.py to avoid redundant plugin_manager.get_plugin_column_types() calls. Use it in display_sample_record() to show side_effect_columns for plugin column types, matching existing CUSTOM behavior.
  Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
  Made-with: Cursor
* fix: handle string column_type in is_plugin_column_type()
  Column configs store column_type as a Literal string, not a DataDesignerColumnType enum. Accept both str and enum to avoid AttributeError when calling .value on a plain string.
  Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
  Made-with: Cursor
* test: improve plugin column display tests
  - Move test imports to module level per project convention
  - Replace plumbing-only test with one that renders a fake plugin column (with side-effect columns) to HTML and asserts the values appear in the output
  - Add parametrized test for is_plugin_column_type() verifying all built-in types return False for both enum and string forms
  Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
  Made-with: Cursor
* fix: preserve original display order for side-effect columns
  Render primary column before side-effect columns, matching the existing CUSTOM column behavior. Avoids introducing display ordering discrepancies.
  Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
  Made-with: Cursor
---------
Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
Co-authored-by: Johnny Greco <jogreco@nvidia.com>
Commit e2c94da
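The str-or-enum handling described in the commit above can be sketched as follows. `BuiltinColumnTypeSketch`, `PLUGIN_TYPE_NAMES`, and `is_plugin_type_sketch` are hypothetical names with made-up values; they only illustrate normalizing the input before the lookup, so that `.value` is never called on a plain string.

```python
from __future__ import annotations

from enum import Enum


class BuiltinColumnTypeSketch(Enum):
    """Hypothetical stand-in for a built-in column type enum."""

    SAMPLER = "sampler"
    LLM_TEXT = "llm-text"


# Hypothetical set of type names registered by plugins.
PLUGIN_TYPE_NAMES = {"my-plugin-column"}


def is_plugin_type_sketch(column_type: str | Enum) -> bool:
    """Return True when the type name belongs to a plugin.

    Accepts either an enum member or its string value.
    """
    name = column_type.value if isinstance(column_type, Enum) else column_type
    return name in PLUGIN_TYPE_NAMES


print(is_plugin_type_sketch("my-plugin-column"))
print(is_plugin_type_sketch("sampler"))
print(is_plugin_type_sketch(BuiltinColumnTypeSketch.LLM_TEXT))
```

A parametrized test over every enum member, in both enum and string form, is the kind of check the commit's test description points at.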
You can try running this command locally to see the comparison on your machine:
git diff v0.5.1...v0.5.2