Comparing changes

base repository: NVIDIA-NeMo/DataDesigner
base: v0.5.6
head repository: NVIDIA-NeMo/DataDesigner
compare: v0.5.7
  • 18 commits
  • 101 files changed
  • 11 contributors

Commits on Apr 9, 2026

  1. fix: narrow docs-preview workflow path filter (#515)

    The docs-preview workflow triggered on all source code changes due to
    the broad `packages/*/src/data_designer/**` path glob. This caused
    unnecessary Cloudflare Pages deployments on code-only PRs like #505.
    
    Remove the source code path filter so the workflow only triggers on
    actual docs content changes (docs/**, mkdocs.yml, and the workflow
    file itself).
    andreatgretel authored Apr 9, 2026
    13cd687

Commits on Apr 13, 2026

  1. chore: harden CI supply chain (#517)

    * ci: harden CI supply chain
    
    Pin all GitHub Actions to commit SHAs to prevent tag-based supply chain
    attacks (same class as CVE-2025-30066). Replace softprops/action-gh-release
    (single-maintainer, no security policy) with gh CLI. Add top-level
    permissions: {} to all workflows that lacked it, enforcing least-privilege
    by default. Enable Dependabot for GitHub Actions and pip dependencies.
    
    Closes #471
    
    * fix: add dependabot pip entries for each sub-package
    
    The root directory has no pyproject.toml; the actual packages live under
    packages/data-designer-config, packages/data-designer-engine, and
    packages/data-designer.
    andreatgretel authored Apr 13, 2026
    54d51bd
  2. fix: bump pytest, aiohttp, and cryptography for security CVEs (#535)

    * fix: bump pytest, aiohttp, and cryptography for security CVEs
    
    - pytest 9.0.2 → 9.0.3 (CVE-2025-71176, High — RCE via symlink TOCTOU)
    - aiohttp 3.13.3 → 3.13.5 (10 Medium CVEs — DoS, CRLF injection, credential theft, request smuggling)
    - cryptography 46.0.6 → 46.0.7 (CVE-2026-39892, Medium — buffer overflow on Python >3.11)
    
    Add constraint-dependencies for transitive deps (aiohttp, cryptography) to
    enforce minimum safe versions across both workspace and e2e lockfiles.
    
    * style: fix indentation in tests_e2e/pyproject.toml
    
    Match the 2-space indentation used throughout the file.
    johnnygreco authored Apr 13, 2026
    2528741
  3. fix: tune Dependabot config and fix DCO assistant bugs (#534)

    * fix: restrict Dependabot pip updates to security-only
    
    The Dependabot config added in #517 included weekly version-bump PRs for
    all three pip packages. This would generate noisy PRs for routine dep
    updates we don't need. Set open-pull-requests-limit: 0 on the pip
    ecosystems so only CVE-triggered security updates open PRs.
    
    GitHub Actions weekly bumps are kept as-is to keep SHA pins current.
    
    * fix: group Dependabot Actions PRs and fix DCO allowlist
    
    - Add a Dependabot group to bundle all GitHub Actions updates into a
      single weekly PR instead of one per action
    - Fix DCO allowlist: dependabot -> dependabot[bot] to match the actual
      GitHub username (the old value never matched, but there were no
      Dependabot PRs before #517 to expose the bug)
    
    * fix: align DCO assistant if-condition with custom sign-off text
    
    The step's if-condition checked for the default sign-off text but
    custom-pr-sign-comment uses different wording. This meant the
    issue_comment trigger was always skipped - sign-offs only worked
    by accident when a subsequent push re-triggered the action via
    pull_request_target.
    andreatgretel authored Apr 13, 2026
    47be28c
  4. ci: publish devnotes independently of releases (#536)

    * ci: add workflow to publish devnotes independently of releases
    
    Adds a GitHub Actions workflow that rebuilds the `latest` docs alias
    when devnotes change on main, so blog posts go live without cutting
    a package release.
    
    * ci: pin actions to commit SHAs and restrict default permissions
    
    Address Greptile review findings:
    - Pin checkout, setup-uv, and download-artifact to commit SHAs
      matching the pattern from #517
    - Add top-level permissions: {} to restrict default token scope
    
    * ci: build devnotes from last deployed state, not main
    
    Instead of building the full site from main (which could include
    unreleased docs), checkout the commit that latest was last built
    from (tracked in gh-pages commit messages) and overlay only
    docs/devnotes/ from main. Download notebooks from the last
    successful build-docs run instead of rebuilding them.
    
    * ci: add actions:read permission for notebook download
    
    The gh run list/download calls need actions:read on GITHUB_TOKEN,
    which is denied by the top-level permissions: {} block.
    andreatgretel authored Apr 13, 2026
    aee3d3f
  5. fix: async engine side-effect column propagation and collision resolution (#509)
    
    * fix: async engine side-effect column propagation and collision resolution
    
    ExecutionGraph.set_side_effect() now uses first-writer-wins instead of
    last-writer-wins, matching sync engine semantics where earlier consumers
    see the first producer's value. This prevents false DAGCircularDependencyError
    when multiple generators declare the same side-effect column at different
    pipeline stages.
    
    AsyncTaskScheduler now includes side-effect columns in _instance_to_columns
    so their values are written to the RowGroupBufferManager and available to
    downstream prompt templates.
    
    Fixes #508
    
    * fix: separate side-effect columns from completion tracking in async scheduler
    
    Side-effect columns added to _instance_to_columns caused KeyError in
    CompletionTracker._validate_strategy() because they are not registered
    in the execution graph. Split into _instance_to_write_columns (buffer
    writes, includes side-effects) and _instance_to_columns (completion
    tracking, real columns only).
    
    * fix: warn on side-effect collision and clarify scheduler column maps
    
    Log a warning when multiple producers register the same side-effect
    column (first-writer-wins still applies). Rename _instance_to_columns
    and _instance_to_write_columns per review feedback for clarity.
    
    * fix: raise ConfigCompilationError on duplicate side-effect producers
    
    Replace first-writer-wins collision handling with a hard error.
    Each side-effect column must have exactly one producer; duplicates
    are a configuration issue to be fixed at the source.
    
    * fix: reject duplicate side-effect producers in sync DAG path
    
    Mirror the async path check: raise ConfigCompilationError when two
    custom columns declare the same side-effect column name during
    topological sort.
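
The final rule in this commit (each side-effect column has exactly one producer, and duplicates are a hard error) can be sketched as below. This is a minimal illustration, not the project's code: `register_side_effects` and `DuplicateSideEffectError` are hypothetical names, and the exception merely stands in for the `ConfigCompilationError` mentioned above.

```python
class DuplicateSideEffectError(ValueError):
    """Hypothetical stand-in for the ConfigCompilationError named in the commit."""


def register_side_effects(producers: dict[str, list[str]]) -> dict[str, str]:
    """Map each side-effect column to its single producer, rejecting duplicates."""
    owner: dict[str, str] = {}
    for producer, columns in producers.items():
        for column in columns:
            if column in owner:
                # A second producer for the same column is a configuration error.
                raise DuplicateSideEffectError(
                    f"side-effect column {column!r} is produced by both "
                    f"{owner[column]!r} and {producer!r}"
                )
            owner[column] = producer
    return owner
```

Failing at registration time, rather than picking a winner, keeps the sync and async paths from silently disagreeing about which producer's value downstream columns see.
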
    andreatgretel authored Apr 13, 2026
    533a94b
  6. ci: add PR hygiene automation (linked issue check + stale PR cleanup) (#521)
    
    * ci: add PR hygiene automation (linked issue check + stale PR cleanup)
    
    Add two workflows to enforce contribution quality and clean up abandoned PRs:
    
    - pr-linked-issue.yml: required status check that validates external PRs
      reference a triaged issue. Collaborators bypass. Re-triggers automatically
      when a maintainer adds the `triaged` label to the linked issue.
    
    - pr-stale.yml: daily cron that reminds authors of failing checks after 7/14
      days of inactivity and auto-closes after 14/28 days (external/collaborator).
      Respects `keep-open` label.
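
The stale-PR policy above can be read as a small decision function. The sketch below is one interpretation, not the workflow's actual logic: it assumes the 7/14 pair applies to external authors and the 14/28 pair to collaborators, it ignores check status, and `stale_action` and `THRESHOLDS` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# (remind_after_days, close_after_days). Pairing 7/14 with external authors and
# 14/28 with collaborators is this sketch's reading of the commit message.
THRESHOLDS = {"external": (7, 14), "collaborator": (14, 28)}


def stale_action(last_activity: datetime, now: datetime, author_kind: str, labels: set[str]) -> str:
    """Return "none", "remind", or "close" for a pull request."""
    if "keep-open" in labels:
        # The keep-open label opts a PR out of the policy entirely.
        return "none"
    remind_after, close_after = THRESHOLDS[author_kind]
    idle_days = (now - last_activity).days
    if idle_days >= close_after:
        return "close"
    if idle_days >= remind_after:
        return "remind"
    return "none"
```

Because the decision depends only on the last activity timestamp, any push or comment that updates that timestamp resets the timer.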
    
    New labels created: `triaged`, `task`, `keep-open`.
    
    Closes #518
    Signed-off-by: Andrea Manoel <amanoel@nvidia.com>
    
    * ci: add agentic repository triage workflow
    
    Add a weekly scheduled workflow that uses Claude to triage all open issues
    and PRs, producing a combined dashboard report on a pinned tracking issue.
    
    - New recipe (.agents/recipes/issue-triage/) classifies issues, checks
      staleness, cross-references merged PRs, detects duplicates, and flags
      PR health problems (missing linked issues, failing checks, orphaned PRs)
    - New workflow (.github/workflows/agentic-ci-issue-triage.yml) runs every
      Monday 10:00 UTC on the agentic-ci runner, with manual dispatch support
    - pr-stale.yml now adds needs-attention label to linked issues when a PR
      is auto-closed, bridging the two workflows via labels
    
    * docs: document stale PR policy and auto-retrigger in CONTRIBUTING.md
    
    * fix: address review findings in PR hygiene workflows
    
    - pr-linked-issue: fix comment gate so failure comments are posted
    - pr-stale: upgrade issues permission to write for labeling
    - pr-stale: compare reminder timestamp against last activity so
      push/comment actually resets the stale timer
    
    * fix: use --body-file in retrigger job to avoid shell quoting issues
    
    PR bodies with backticks or unmatched quotes would break the
    gh pr edit --body "$NEW_BODY" call. Write to a temp file and
    use --body-file instead.
    
    * fix: retrigger job drops PRs after the first
    
    jq outputs newline-separated numbers but GITHUB_OUTPUT only
    preserves the first line. Convert to space-separated so the
    for loop processes all matching PRs.
    
    * fix: harden workflows against shell injection
    
    - Move attacker-influenced values (${{ user.login }}, step outputs)
      from expression interpolation in run: blocks to env vars
    - Replace echo "$PR_BODY" | grep with write-to-file + grep-file
      to avoid shell expansion of untrusted PR body content
    - Same treatment for PR body handling in retrigger and stale jobs
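
The hardening pattern in the list above (keep the script text constant and hand untrusted values over through the environment) can be illustrated outside of GitHub Actions. The sketch below is an analogy in Python on a POSIX shell, not the workflow itself; `echo_via_env` is a hypothetical helper and `PR_BODY` is reused only as a variable name.

```python
import os
import subprocess


def echo_via_env(untrusted: str) -> str:
    """Print untrusted text from a shell script without letting the shell parse it."""
    # The script text is a constant; the untrusted value never becomes part of it.
    script = 'printf "%s" "$PR_BODY"'
    env = {**os.environ, "PR_BODY": untrusted}
    result = subprocess.run(
        ["sh", "-c", script],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Had the value been interpolated into the script string instead, backticks or `$(...)` in a PR body would be evaluated by the shell before the command ran.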
    
    * refactor: replace peter-evans actions with gh api calls
    
    Remove peter-evans/find-comment and peter-evans/create-or-update-comment
    third-party action dependencies. Replace with gh api calls for finding,
    creating, updating, and deleting bot comments. Eliminates supply chain
    risk from unpinned third-party actions.
    
    * docs: add pull_request_target security comment
    
    ---------
    
    Signed-off-by: Andrea Manoel <amanoel@nvidia.com>
    andreatgretel authored Apr 13, 2026
    82c1a69
  7. ci: bump the all-actions group with 5 updates (#539)

    * ci: bump the all-actions group with 5 updates
    
    Bumps the all-actions group with 5 updates:
    
    | Package | From | To |
    | --- | --- | --- |
    | [actions/checkout](https://github.com/actions/checkout) | `4.3.1` | `6.0.2` |
    | [astral-sh/setup-uv](https://github.com/astral-sh/setup-uv) | `7.6.0` | `8.0.0` |
    | [actions/download-artifact](https://github.com/actions/download-artifact) | `7.0.0` | `8.0.1` |
    | [actions/upload-artifact](https://github.com/actions/upload-artifact) | `6.0.0` | `7.0.1` |
    | [NVIDIA-NeMo/FW-CI-templates/.github/workflows/_semantic_pull_request.yml](https://github.com/nvidia-nemo/fw-ci-templates) | `0.65.12` | `0.88.1` |
    
    
    Updates `actions/checkout` from 4.3.1 to 6.0.2
    - [Release notes](https://github.com/actions/checkout/releases)
    - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
    - [Commits](actions/checkout@v4.3.1...de0fac2)
    
    Updates `astral-sh/setup-uv` from 7.6.0 to 8.0.0
    - [Release notes](https://github.com/astral-sh/setup-uv/releases)
    - [Commits](astral-sh/setup-uv@37802ad...cec2083)
    
    Updates `actions/download-artifact` from 7.0.0 to 8.0.1
    - [Release notes](https://github.com/actions/download-artifact/releases)
    - [Commits](actions/download-artifact@37930b1...3e5f45b)
    
    Updates `actions/upload-artifact` from 6.0.0 to 7.0.1
    - [Release notes](https://github.com/actions/upload-artifact/releases)
    - [Commits](actions/upload-artifact@b7c566a...043fb46)
    
    Updates `NVIDIA-NeMo/FW-CI-templates/.github/workflows/_semantic_pull_request.yml` from 0.65.12 to 0.88.1
    - [Release notes](https://github.com/nvidia-nemo/fw-ci-templates/releases)
    - [Changelog](https://github.com/NVIDIA-NeMo/FW-CI-templates/blob/main/CHANGELOG.md)
    - [Commits](NVIDIA-NeMo/FW-CI-templates@21f18ae...2a49420)
    
    ---
    updated-dependencies:
    - dependency-name: actions/checkout
      dependency-version: 6.0.2
      dependency-type: direct:production
      update-type: version-update:semver-major
      dependency-group: all-actions
    - dependency-name: astral-sh/setup-uv
      dependency-version: 8.0.0
      dependency-type: direct:production
      update-type: version-update:semver-major
      dependency-group: all-actions
    - dependency-name: actions/download-artifact
      dependency-version: 8.0.1
      dependency-type: direct:production
      update-type: version-update:semver-major
      dependency-group: all-actions
    - dependency-name: actions/upload-artifact
      dependency-version: 7.0.1
      dependency-type: direct:production
      update-type: version-update:semver-major
      dependency-group: all-actions
    - dependency-name: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_semantic_pull_request.yml
      dependency-version: 0.88.1
      dependency-type: direct:production
      update-type: version-update:semver-minor
      dependency-group: all-actions
    ...
    
    Signed-off-by: dependabot[bot] <support@github.com>
    
    * ci: skip docs preview deploy for Dependabot PRs
    
    GitHub does not expose repository secrets to Dependabot PRs, so the
    Cloudflare Pages deploy always fails with a missing API token. Skip the
    entire job when the actor is dependabot[bot].
    
    ---------
    
    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    Co-authored-by: Andre Manoel <amanoel@nvidia.com>
    Co-authored-by: Andre Manoel <165937436+andreatgretel@users.noreply.github.com>
    3 people authored Apr 13, 2026
    abe5c2d

Commits on Apr 14, 2026

  1. 64f31bc
  2. docs: add text-to-sql dev note (#349)

    * docs: add text-to-sql devnote
    
    * add diagram, update content
    
    * correct inconsistencies
    
    * docs: address PR #349 feedback and add BIRD benchmark results
    PR feedback fixes:
    - Fix Window Functions contradiction: Key Takeaway #1 now uses
      "Geospatial SQL" (Advanced) instead of "Window Functions" (Intermediate)
    - Fix score-0 truthiness bug: use `is not none` instead of truthy check
      in Jinja2 expression columns (inline example + production pipeline)
    - Soften Code Sandbox language: "A natural next step would be..." instead
      of "We are actively implementing..."
    - Cut Gretel reference per mvansegbroeck: replaced with NVIDIA/Nemotron
      team description
    - Replace Qwen model references with Nemotron per mvansegbroeck: MODEL_NAME,
      ASCII diagram labels, Pipeline Overview prose
    - Rename sdg_qwen_235b.py -> sdg_ndd_text2sql.py per mvansegbroeck
    - Fix Try It Yourself: use MODEL_ALIAS = "nvidia-text" with default
      provider pattern (matches structured-outputs dev note), remove unused
      explicit ModelConfig
    - Remove placeholder dataset link (#), add "Dataset: Internal" note
    New content:
    - Add BIRD Benchmark Results section with bar chart (JPG), data table,
      BIRD caveat paragraph, and Jocelyn Huang acknowledgement
      (Nemotron Super EX: 26.77% -> 41.80%, +15 pts, beats GPT-OSS-120B)
    - Replace "Looking Ahead: Code Sandbox" with broader "Next Steps":
      Code Sandbox, RL on BIRD via NeMo Gym, schema representation, Spider 2.0
    - Add Project Summary table at end of post
    
    * docs: address second round of PR #349 feedback
    
    - Fix "EHR Systems" -> "Electronic Health Records" in Key Takeaway #1
      to match the exact taxonomy string in the code example (greptile)
    - Add admonition clarifying code snippets are illustrative, not
      runnable, with link to Enterprise Text-to-SQL Recipe (nabinchha)
    - Add context before score extraction snippet referencing the five
      LLMJudgeColumnConfig columns and linking to full recipe (nabinchha)
    - Add companion file note and recipe link to production pipeline
      details block for prompts.py, rubrics.py, text2sql_seed.json (nabinchha)
    
    * docs: address round 2 PR #349 feedback, replace production block with recipe
    - Fix "EHR Systems" -> "Electronic Health Records" in Key Takeaway #1
      to match the exact taxonomy string in the code example (greptile)
    - Add admonition clarifying inline code snippets are illustrative,
      with link to runnable Enterprise Text-to-SQL Recipe (nabinchha)
    - Add context before score extraction snippet referencing the five
      LLMJudgeColumnConfig columns and linking to full recipe (nabinchha)
    - Replace production pipeline <details> block (230 lines with phantom
      imports from prompts.py, rubrics.py, text2sql_seed.json) with
      snippet include of enterprise_text_to_sql.py recipe — self-contained
      and runnable, consistent with other merged dev notes (nabinchha)
    
    * docs: polish Try It Yourself and Summary sections
    - Wrap minimal inline example in collapsible <details> dropdown
    - Rename "A Team Effort" section to "Summary"
    - Remove redundant Scale/Dialects/Dataset line
    
    * docs: add missing sql_dialect sampler to Step 1 code snippet
    
    The Step 3/4 prompt templates reference {{ sql_dialect }} but the
    Step 1 seeding code never defined it, leaving an unresolved Jinja2
    variable for readers following along. Add the sql_dialect sampler
    with a comment explaining the pipeline runs once per dialect.
    
    * fix ascii diagram
    
    * docs: fix BIRD score framing and MySQL dialect wording
    - Remove specific "60-70%" BIRD claim from intro to avoid contradiction
      with the 41.80%/38.25% direct-generation results shown later (those
      higher figures come from specialized systems with schema linking)
    - Reword MySQL "forbids" to "prompts exclude" -- REGEXP_REPLACE and
      CONVERT_TZ are valid MySQL functions; the pipeline excluded them for
      portability, not because the dialect forbids them
    
    * docs: move text-to-sql images to assets/ convention and update refs
    
    * docs: address text-to-sql devnote review comments
    
      - Add devnote to mkdocs nav after Async All the Way Down
      - Swap Recursive CTEs to Advanced, CASE Expressions to Intermediate (matches recipe)
      - Fix score extraction truthy check to use 'is not none' (preserves score-0 values)
      - Drop REPLACE() vs regexp_replace from dialect takeaway (REPLACE is cross-dialect)
      - Tighten prose: remove 'The key insight:', use actual BIRD number, trim X-not-Y
      - Fix knowledge dependency count: 8 -> 9 concepts (3x3 in recipe)
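
The score-0 fix mentioned in this commit concerns Jinja2 expression columns, but the same pitfall exists in plain Python, which is used here only as an analogy. The function names and the `-1` fallback are hypothetical.

```python
from typing import Optional


def pick_score_truthy(score: Optional[int], default: int = -1) -> int:
    # Buggy: 0 is falsy, so a legitimate score of 0 is replaced by the default.
    return score or default


def pick_score_explicit(score: Optional[int], default: int = -1) -> int:
    # Correct: only a missing score (None) falls back to the default.
    return default if score is None else score
```

The explicit check is the Python counterpart of the `is not none` test: it separates "no score" from "score of zero".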
    
    ---------
    
    Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
    Co-authored-by: Yev Meyer <ymeyer@nvidia.com>
    dhruvnathawani and 3mei authored Apr 14, 2026
    1448f9c
  3. fix: text-to-sql devnote date, images, and publish-devnotes nav (#546)

    - Update post date from 2026-03-11 to 2026-04-14 so it appears as the
      newest post on the devnotes page.
    - Replace raw <img> tags with markdown image syntax so mkdocs rewrites
      relative paths correctly for the blog plugin's slug-based URLs.
    - Overlay mkdocs.yml from HEAD in publish-devnotes workflow so new nav
      entries are included in devnotes-only rebuilds.
    andreatgretel authored Apr 14, 2026
    1a237d9
  4. fix(ci): replace yq with Python nav patching in publish-devnotes (#548)

    The yq JSON roundtrip was mangling the entire mkdocs.yml file
    (indentation, quoting, comments), causing mike deploy to fail.
    
    Extract a Python script that surgically replaces only the Dev Notes
    nav block, leaving all other content byte-identical.
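
A line-based replacement of this kind might look like the sketch below. It is an assumption-laden illustration rather than the script from the commit: `replace_nav_block` is a hypothetical name, and the block boundary is inferred purely from indentation.

```python
def replace_nav_block(text: str, header: str, new_children: list[str]) -> str:
    """Replace the lines nested under `header`, leaving all other lines untouched."""
    lines = text.splitlines(keepends=True)
    for start, line in enumerate(lines):
        if line.strip() == header:
            break
    else:
        raise ValueError(f"nav header {header!r} not found")

    indent = len(lines[start]) - len(lines[start].lstrip())
    end = start + 1
    # The block runs until the next non-blank line at the header's indentation or less.
    while end < len(lines):
        body = lines[end]
        if body.strip() and len(body) - len(body.lstrip()) <= indent:
            break
        end += 1
    # Hand back trailing blank lines so spacing after the block is preserved.
    while end > start + 1 and not lines[end - 1].strip():
        end -= 1

    return "".join(lines[: start + 1] + new_children + lines[end:])
```

Working on raw lines instead of a parsed document means comments, quoting, and indentation outside the block never pass through a serializer.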
    andreatgretel authored Apr 14, 2026
    f267e19

Commits on Apr 15, 2026

  1. feat: add skip.when conditional column generation (#502)

    * plan: add skip_when for conditional column generation (#479)
    
    Adds implementation plan for a `skip_when` field on `SingleColumnConfig`
    that enables conditional column generation. When the Jinja2 expression
    evaluates truthy, the cell is set to None and the generator is skipped.
    Skips auto-propagate through the DAG to downstream columns.
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    
    * plan: remove HopChain example from skip_when plan
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    
    * plan: replace HopChain example with generic product review example
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    
    * plan: add open questions on skip sentinel value and row filtering
    
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    
    * plan: major revision — SkipConfig model, sync engine support, decouple propagation
    
    - Introduce SkipConfig(when, value) as nested model on SingleColumnConfig
    - Move propagate_skip to SingleColumnConfig as independent field, fixing
      bug where columns with no SkipConfig couldn't participate in propagation
    - Add full sync engine implementation (Steps 4a-4d) covering both
      _fan_out_with_threads and _run_full_column_generator dispatch paths
    - Add serialization boundary stripping for both DatasetBatchManager (sync)
      and RowGroupBufferManager (async)
    - Simplify architecture diagrams for readability
    - Update all references, design decisions, verification plan
    
    Made-with: Cursor
    
    * updates
    
    * plan: document get_required_columns for skip propagation
    
    - Explain why propagation must not use get_upstream_columns() once
      skip.when adds DAG edges; add _required_columns and
      get_required_columns() to the execution graph plan
    - Point async _run_cell at get_required_columns for parity with sync
    - Clarify DropSkippedRowsProcessorConfig vs stripping __skipped__ for
      DataFrames; tighten resolved-questions wording
    - Extend DAG/graph verification with gating_col regression case
    
    Refs #479
    
    Made-with: Cursor
    
    * plan: centralize __skipped__ handling in skip_provenance
    
    - Document new skip_provenance.py (key constant, read/write/strip API)
    - Point sync builder, async scheduler, and batch buffers at shared helpers
    - Strip metadata before every DataFrame from buffer dicts, including
      FULL_COLUMN active subsets
    - Split §3 into skip_evaluator vs skip_provenance; extend verification
    
    Refs #479
    
    Made-with: Cursor
    
    * plan: align doc title with SkipConfig / skip.when
    
    Drop legacy skip_when naming in headings and #362 cross-reference.
    
    Refs #479
    
    Made-with: Cursor
    
    * plan: address review — delimiter validation, centralized error handling, caller-owns-deserialization
    
    - SkipConfig._validate_when_syntax now checks find_undeclared_variables
      is non-empty, rejecting expressions without {{ }} delimiters that
      would silently skip every row
    - evaluate_skip_when centralizes try/except so both sync and async
      engines get identical fail-safe behavior on eval errors
    - evaluate_skip_when takes a single pre-deserialized record; caller
      runs deserialize_json_values once and passes to both skip eval and
      generator (no double deserialization, no redundant parameter)
    - Update _should_skip_cell, async _run_cell, Files Modified table,
      and verification section accordingly
    
    Refs #479
    
    Made-with: Cursor
    
    * plan: add get_side_effect_columns accessor to execution graph spec
    
    Document _side_effects_by_producer inverse map and
    get_side_effect_columns() accessor on ExecutionGraph, needed by
    _write_skip_to_record / apply_skip_to_record to clear __trace,
    __reasoning_content, etc. on skip. Added to both Step 2b metadata
    section and Files Modified table.
    
    The __skipped__ leak into active_df (greptile's other P1) was already
    fixed in 7046378 via strip_skip_metadata_from_records.
    
    Refs #479
    
    Made-with: Cursor
    
    * add skip.when conditional column generation
    
    Introduce SkipConfig on SingleColumnConfig to gate column generation
    with a Jinja2 expression. Columns can be skipped by expression or by
    upstream propagation (propagate_skip flag).
    
    - SkipConfig: Pydantic model with config-time syntax/delimiter/variable
      validation and cached column extraction from the Jinja2 AST
    - skip_evaluator: runtime expression evaluation via NativeSandboxedEnvironment
      with fail-safe error handling (skip on expected failures)
    - skip_provenance: centralized __skipped__ record tracking shared by
      sync builder, async scheduler, and buffer managers
    - DAG/ExecutionGraph: skip.columns wired as dependency edges in both
      topological sort and static execution graph
    - Validation: validate_skip_references checks reference existence,
      sampler/seed scope, and allow_resize conflicts
    - Sync builder: cell-by-cell and full-column skip with merge-back
    - Async scheduler: cell and batch skip with live-buffer provenance
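
The gating and propagation behaviour listed above can be sketched for a single record. This is a simplified model under stated assumptions: a Python callable stands in for the Jinja2 expression, `generate_cell` and its parameters are hypothetical, and the `__skipped__` key is borrowed from the commit text only as a label for the bookkeeping set.

```python
from typing import Any, Callable, Iterable, Optional

SKIPPED_KEY = "__skipped__"  # label taken from the commit text; the structure is assumed


def generate_cell(
    record: dict[str, Any],
    column: str,
    generator: Callable[[dict[str, Any]], Any],
    skip_when: Optional[Callable[[dict[str, Any]], bool]] = None,
    required: Iterable[str] = (),
    propagate_skip: bool = True,
    skip_value: Any = None,
) -> None:
    """Fill `column` in `record`, or mark it skipped without calling the generator."""
    skipped: set[str] = record.setdefault(SKIPPED_KEY, set())
    # Propagation: a skipped required column skips this one too (unless opted out).
    upstream_skipped = propagate_skip and any(name in skipped for name in required)
    # Expression gate: only evaluated when propagation has not already decided.
    if upstream_skipped or (skip_when is not None and skip_when(record)):
        record[column] = skip_value
        skipped.add(column)
        return
    record[column] = generator(record)
```

In this model a column opts out of propagation by passing `propagate_skip=False`, and the bookkeeping set would be stripped before the record is turned into a DataFrame row.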
    
    Made-with: Cursor
    
    * fix review findings for skip.when implementation
    
    - Add skip evaluation to _fan_out_with_async (was missing, causing
      skipped rows to still be sent to the LLM)
    - Preserve __skipped__ provenance on non-skipped records after
      full-column generation so multi-hop propagation works
    - Use single live-buffer reference in _run_batch skip loop for
      consistency with _run_cell
    - Move Template import to TYPE_CHECKING and reorder import blocks
    - Replace O(n²) sum() with itertools.chain in dag.py
    - Add set_required_columns/set_propagate_skip/set_skip_config
      setters to ExecutionGraph for symmetry with existing API
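
The `sum()` versus `itertools.chain` point in the list above is a general Python idiom. A small illustration, with hypothetical function names and no relation to the actual contents of dag.py:

```python
from itertools import chain


def flatten_with_sum(groups: list[list[str]]) -> list[str]:
    # Each "+" copies the accumulated list, so total work grows quadratically.
    return sum(groups, [])


def flatten_with_chain(groups: list[list[str]]) -> list[str]:
    # chain.from_iterable walks every element once, so total work is linear.
    return list(chain.from_iterable(groups))
```

Both functions return the same list; only the amount of copying differs.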
    
    Made-with: Cursor
    
    * add conditional generation with skip recipe and refactor skip helpers
    
    Add a new recipe demonstrating skip.when patterns (expression gate,
    propagation, opt-out) with a customer support ticket pipeline.
    
    Also extract _should_skip_record in async_scheduler, remove the
    redundant propagate_skip param from should_skip_by_propagation, and
    pass a precomputed all_side_effects set through the DAG sort.
    
    Made-with: Cursor
    
    * updates
    
    * fixes
    
    * remove recipe > inject conditional gen into existing tutorial
    
    * regen colab notebooks
    
    * fix: handle missing execution graph in _column_can_skip
    
    Return False when the graph has not been initialized instead of raising,
    since skip logic cannot apply before generators are set up.
    
    Made-with: Cursor
    
    * parametrize some tests
    
    * public before private
    
    * slight refactor for readability
    
    * parametrize some tests
    
    * minor fixes
    
    * rename internal skip tracker key name
    
    * clarify intent in comment
    
    * when skipped, _run_cell should return the skipped value even though the consumer doesn't currently care about it
    
    * remove inline import
    
    * minor refactor for clarity
    
    * fix: preserve skip metadata across replace_buffer and exclude allow_resize from skip branch
    
    Two bugs in the sequential engine's _run_full_column_generator:
    
    1. replace_buffer(df.to_dict()) erased __internal_skipped_columns in
       three code paths (MultiColumnConfig, non-skip-aware, has_skipped=False
       fallthrough), breaking propagate_skip for downstream columns when an
       independent FULL_COLUMN generator ran between skip-setting and
       propagating columns.
    
    2. _column_can_skip returned True for allow_resize=True columns via
       propagation, causing the skip-aware merge path to raise on the 1:1
       row-count check for 1:N generators.
    
    - Add restore_skip_metadata helper to skip_tracker.py
    - Guard _column_can_skip against allow_resize=True columns
    - Refactor _run_full_column_generator into three focused methods
    - Remove dead allow_resize / _log_resize_if_changed from skip path
    - Remove redundant _require_graph() calls in skip helpers
    - Add single_column_config_by_name cached property
    - Add integration tests for both bugs and unit tests for the helper
    
    Made-with: Cursor
    
    * address review comments on skip.when PR (#502)
    
    - Extract shared skip decision logic (_should_skip_cell / _should_skip_record)
      into should_skip_column_for_record() in skip_evaluator.py so both sync and
      async engines call the same function (andreatgretel review comment)
    - Extend SkipConfig self-reference validation to cover side-effect columns
      (e.g. review__trace on the review column) — previously only checked
      self.name, now checks self.name | self.side_effect_columns
    - Add async engine integration tests for skip paths: cell-by-cell with
      propagation and full-column batch skip (exercises _run_cell / _run_batch)
    - Fix test_allow_resize_column_not_blocked_by_upstream_skip to use default
      propagate_skip=True so it actually exercises the allow_resize guard
    - Move get_skipped_column_names from skip_tracker to skip_evaluator (sole
      production consumer)
    
    Made-with: Cursor
    
    * address cr feedback
    
    * Fix issue with full-column generation messing up the order of skipped rows
    
    * add skip conditional generation edge case tests
    
    - test_skip_evaluator: parametrized should_skip_column_for_record covering
      propagation, expression gates, short-circuiting, and disabled propagation
    - test_execution_graph: skip metadata accessors (get_skip_config,
      should_propagate_skip, get_required_columns, get_side_effect_columns,
      resolve_side_effect, skip.when DAG edges)
    - test_dataset_builder: chained transitive propagation (4 levels),
      two independent skip gates, custom skip.value, row count preservation
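
The cases these tests cover (propagation, expression gates, short-circuiting, disabled propagation) suggest a per-record decision along the lines below. This is a sketch under stated assumptions: `should_skip` and its signature are hypothetical, the expression gate is modelled as a plain Python callable rather than a Jinja expression, and checking propagation before the expression is an assumed ordering, not something the commit message confirms.

```python
from typing import Callable, Optional


def should_skip(
    record: dict,
    required_columns: list[str],
    skipped_columns: set[str],
    propagate_skip: bool = True,
    when: Optional[Callable[[dict], bool]] = None,
) -> bool:
    """Hypothetical skip decision for one column on one record."""
    # Propagation gate: an upstream column this one depends on was skipped.
    if propagate_skip and any(col in skipped_columns for col in required_columns):
        return True  # short-circuit: the expression gate is never evaluated
    # Expression gate: the column's own skip condition, if it has one.
    return bool(when(record)) if when is not None else False


# Upstream "summary" was skipped, so "review" is skipped by propagation.
print(should_skip({"score": 5}, ["summary"], {"summary"}))
# With propagation disabled, only the expression gate decides.
print(should_skip({"score": 5}, ["summary"], {"summary"}, propagate_skip=False))
```

Keeping the decision in one pure function is what lets a sync and an async engine share it, which is the motivation given in the review-comment commit above.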
    
    Made-with: Cursor
    
    * fix: make expression jinja validator private
    
    Rename assert_expression_valid_jinja to _assert_expression_valid_jinja
    to match the private naming convention used by other model validators.
    
    Made-with: Cursor
    
    ---------
    
    Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    nabinchha and claude authored Apr 15, 2026
    a9af365
  2. fix: use pull_request_target for agentic CI on fork PRs (#541)

    * fix: use pull_request_target for agentic CI on fork PRs
    
    * fix: read recipe files from base branch to prevent prompt injection
    
    Recipe files define the agent's prompt. When using pull_request_target,
    the fork's HEAD is checked out, so a malicious fork could craft recipe
    files to exfiltrate API secrets via prompt injection. Fix by adding a
    second sparse checkout from the base branch for .agents/recipes/ and
    reading prompts from there instead of the fork tree.
    
    * fix: align actions/checkout version for base-recipes checkout
    
    Match the base-branch recipe checkout to v6.0.2 (same SHA as the PR
    branch checkout) for consistency.
    
    * fix: move expression interpolations to env vars in gate and review jobs
    
    Replace direct ${{ }} interpolation in run: blocks with env vars.
    Most values are GitHub-controlled, but github.event.label.name can
    contain arbitrary characters and could break shell quoting. Moving
    everything to env: is consistent with the injection-hardening pattern
    applied in the rest of the workflow.
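
The quoting hazard can be reproduced outside of GitHub Actions. The sketch below is an illustration in Python rather than workflow YAML: it assumes a POSIX `sh` is available, and the label value and the `LABEL` variable name are made up. Pasting a value into the script text mimics `${{ }}` expansion inside a `run:` block, which happens before the shell sees the script; passing it through the environment mimics the `env:` pattern.

```python
import os
import subprocess

# Hypothetical label name containing shell metacharacters.
label = 'bug"; echo INJECTED; echo "'

# Unsafe: the value becomes part of the script text, so its quotes close the
# string early and the rest is executed as additional commands.
unsafe_script = f'echo "label={label}"'
unsafe = subprocess.run(["sh", "-c", unsafe_script], capture_output=True, text=True)

# Safer: the script text is fixed; the value is only ever data in $LABEL.
safe_script = 'echo "label=$LABEL"'
safe = subprocess.run(
    ["sh", "-c", safe_script],
    capture_output=True,
    text=True,
    env={**os.environ, "LABEL": label},
)

print("unsafe:", unsafe.stdout.splitlines())
print("safe:  ", safe.stdout.splitlines())
```

In the unsafe run the injected `echo` executes as its own command; in the safe run the whole label is printed verbatim on a single line.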
    andreatgretel authored Apr 15, 2026
    6ef4953

Commits on Apr 16, 2026

  1. docs: Added starter dev notes on push to hugging face hub (#355)

    * Added starter dev notes on push to huggingface hub
    
    * fix: move excerpt marker to intro and remove redundant markers
    
    Move the single <!-- more --> to after the intro paragraph for a shorter
    blog teaser and remove the 6 redundant markers throughout the post.
    
    * Update docs/devnotes/posts/push-datasets-to-hugging-face-hub.md
    
    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
    
    * docs: add HF ecosystem context to push-to-hub dev notes (#474)
    
    * docs: add HF ecosystem context to push-to-hub dev notes
    
    Add section on what datasets get on the Hub (Dataset Viewer, streaming,
    Viewer API), link to Hub search for DataDesigner datasets, and note that
    private datasets can be flipped to public.
    
    * Update docs/devnotes/posts/push-datasets-to-hugging-face-hub.md
    
    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
    
    * fix: remove doubled library: prefix in Hub search URL
    
    ---------
    
    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
    
    * Update date
    
    * fix date for text-to-sql
    
    * update hero images
    
    * updates
    
    ---------
    
    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
    Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
    3 people authored Apr 16, 2026
    cebfb0e

Commits on Apr 17, 2026

  1. fix: bridge model.generate() to agenerate() for custom columns in async engine (#545)
    
    * feat: bridge model.generate() to agenerate() for custom columns in async engine
    
    Custom column generators that call model.generate() fail under the async
    engine because the sync HTTP client is unavailable. Add an
    _AsyncBridgedModelFacade proxy in _build_models_dict() that intercepts the
    sync-client RuntimeError and schedules agenerate() on the engine's persistent
    event loop via run_coroutine_threadsafe. Includes a deadlock guard for async
    custom columns running on the event loop.
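
The bridging idea can be sketched in isolation as follows. This is not the project's `_AsyncBridgedModelFacade`: `AsyncBridge`, `FakeModel`, the 5-second default timeout, and modelling the sync-client failure as a `RuntimeError` subclass are all assumptions made for illustration. The parts taken from the commit message are the fallback from `generate()` to `agenerate()`, the use of `run_coroutine_threadsafe` on a persistent loop, the deadlock guard, and cancelling the future on timeout.

```python
import asyncio
import concurrent.futures
import threading


class SyncClientUnavailableError(RuntimeError):
    """Stand-in for the error raised when no sync HTTP client exists."""


class FakeModel:
    """Hypothetical model facade whose sync path is unavailable."""

    def generate(self, prompt: str, **kwargs) -> str:
        raise SyncClientUnavailableError("sync client unavailable")

    async def agenerate(self, prompt: str, **kwargs) -> str:
        return f"echo:{prompt}"


class AsyncBridge:
    """Proxy that retries a failed generate() as agenerate() on a given loop."""

    def __init__(self, model, loop: asyncio.AbstractEventLoop, timeout: float = 5.0):
        self._model, self._loop, self._timeout = model, loop, timeout

    def generate(self, prompt: str, **kwargs) -> str:
        try:
            return self._model.generate(prompt, **kwargs)
        except SyncClientUnavailableError:
            pass  # fall through to the async path
        # Deadlock guard: blocking on the loop from the loop's own thread never returns.
        try:
            running = asyncio.get_running_loop()
        except RuntimeError:
            running = None
        if running is self._loop:
            raise RuntimeError("already on the event loop; call agenerate() instead")
        future = asyncio.run_coroutine_threadsafe(
            self._model.agenerate(prompt, **kwargs), self._loop
        )
        try:
            return future.result(timeout=self._timeout)
        except concurrent.futures.TimeoutError:
            future.cancel()  # do not leave the coroutine running after giving up
            raise


# A persistent loop in a background thread plays the role of the engine's loop.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()
result = AsyncBridge(FakeModel(), loop).generate("hi")
loop.call_soon_threadsafe(loop.stop)
print(result)
```

The caller keeps using the synchronous `generate()` signature, while the actual request runs on the loop that already owns the async client.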
    
    * refactor: wrap facades at sync call site, not in _build_models_dict
    
    Move _AsyncBridgedModelFacade wrapping from _build_models_dict() into
    _invoke_generator_function() so the async path gets raw facades. The
    bridge proxy is only needed for sync custom columns; async columns
    already have direct access to model.agenerate().
    
    * fix: address review feedback - typed exception, timeout cleanup, kwargs test
    
    - Introduce SyncClientUnavailableError so the facade catches by type
      instead of matching error strings (review comment #1)
    - Add future.cancel() + logger.warning() on timeout to match the
      _run_coroutine_sync pattern in base.py (review comment #2)
    - Assert kwargs forwarding in the async bridge test (review comment #4)
    
    * fix: let SyncClientUnavailableError propagate through @catch_llm_exceptions
    
    The decorator catches all exceptions and wraps them into DataDesignerError,
    which prevented the async bridge proxy from ever seeing the original error.
    Add an early match case that re-raises SyncClientUnavailableError directly.
    
    * refactor: make SYNC_BRIDGE_TIMEOUT a public constant
    
    Drop the underscore prefix since the constant is exported and used
    across modules (base.py and custom.py).
    andreatgretel authored Apr 17, 2026
    a965bc1
  2. ci: add daily audit suites with 5 rotating recipes and scheduled workflow (#543)
    
    * ci: add daily audit suites with 5 recipes and scheduled workflow
    
    Add the daily maintenance infrastructure (Phase 2+3 of the agentic CI
    plan). A new workflow runs one audit suite per weekday via day-of-week
    rotation, with runner memory persisted via actions/cache.
    
    Recipes: docs-and-references (Mon), dependencies (Tue), structure (Wed),
    code-quality (Thu), test-health (Fri). Each targets gaps that CI and ruff
    don't cover: cross-reference validation, transitive dep analysis, lazy
    import compliance, complexity trends, and test-to-source mapping.
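
The weekday rotation described above could be expressed as a small lookup. This is a sketch only: the actual workflow may compute the day in shell or in the workflow file itself, `RECIPES` and `recipe_for` are hypothetical names, and returning `None` on weekends is an assumption based on "one audit suite per weekday".

```python
from datetime import date
from typing import Optional

# Weekday index (Monday = 0) to audit recipe, as listed in the commit message.
RECIPES = {
    0: "docs-and-references",
    1: "dependencies",
    2: "structure",
    3: "code-quality",
    4: "test-health",
}


def recipe_for(day: date) -> Optional[str]:
    """Return the recipe scheduled for `day`, or None on Saturday and Sunday."""
    return RECIPES.get(day.weekday())


print(recipe_for(date.today()))
```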
    
    Reports go to the Actions step summary. Code changes use /create-pr.
    
    * ci: add executable smoke checks and harden runner memory
    
    Add executable smoke checks to test-health and code-quality recipes
    that exercise real code paths (config build, validate, import timing,
    registry completeness, error hierarchy, input rejection) without
    needing an LLM provider. Checks are split into fixed canaries (same
    every run) and creative checks (agent varies inputs each run).
    
    Harden runner memory: define JSON schema in _runner.md with TTL and
    size rules, validate state file after agent runs, only update
    last_run on success, drop unused audit-log.md. Add make install-dev
    workflow step so recipes can run Python against the installed packages.
    
    * ci: fix codex review findings - test paths, provider check, step gating
    
    Fix issues found by Codex review:
    - Fix test paths: tests/ does not exist at repo root, use
      packages/*/tests/ and packages/data-designer/tests/test_import_perf.py
    - Remove DataDesigner(model_providers=[]) from smoke checks - raises
      NoModelProvidersError; keep config-layer checks only
    - Fix audit step gating: remove continue-on-error, use step outcome
      to gate runner memory update (|| true + continue-on-error made the
      step always "succeed", defeating the success() condition)
    
    * ci: fix review findings - heredoc, state validation, lazy import wording
    
    Fix heredoc with indented EOF terminator that never terminates - replace
    with printf. Run state validation on all outcomes (not just success) so
    corrupted state from a failed audit is caught before caching. Only stamp
    last_run when audit succeeds. Align test-health lazy import section with
    its own Constraints (report count only, don't duplicate structure audit).
    
    Also replaces the deprecated datetime.utcnow() and avoids injecting shell
    variables into the Python string by reading them from os.environ instead.
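
The last two fixes can be illustrated together. The sketch below assumes a state dictionary with a `last_run` key, as mentioned in the commit message, but `stamp_last_run` and the `AUDIT_OUTCOME` variable name are hypothetical, and the real state schema is not shown in the source.

```python
import json
import os
from datetime import datetime, timezone


def stamp_last_run(state: dict, outcome: str) -> dict:
    """Set last_run only when the audit step succeeded."""
    if outcome == "success":
        # Timezone-aware replacement for the deprecated datetime.utcnow().
        state["last_run"] = datetime.now(timezone.utc).isoformat()
    return state


# The outcome is read from the environment rather than spliced into the Python
# source by the shell, so unusual characters cannot change the program text.
outcome = os.environ.get("AUDIT_OUTCOME", "failure")
print(json.dumps(stamp_last_run({}, outcome)))
```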
    andreatgretel authored Apr 17, 2026
    b220f36
  3. 8be4ff7