docs: plan audio/video context support by nabinchha · Pull Request #669 · NVIDIA-NeMo/DataDesigner

nabinchha · 2026-05-15T21:24:42Z

📋 Summary

Adds a design plan for supporting audio and video as multimodal context, following the existing image-context pattern. The plan keeps the issue scope limited to design work while clarifying config boundaries, provider translation responsibilities, and test coverage for a future implementation.

🔗 Related Issue

Closes #668

🔄 Changes

Adds plans/668/audio-video-context.md with the proposed AudioContext / VideoContext API shape.
Clarifies that audio/video context values are URL or base64 only, with local path handling and provider file metadata out of scope for v1.
Documents provider translation boundaries, unsupported-provider behavior, implementation phases, and test strategy.

🧪 Testing

make test passes (not run; markdown-only planning change)
Unit tests added/updated (N/A; no testable logic)
E2E tests added/updated (N/A; no testable logic)
Commit hooks passed for markdown-safe checks

✅ Checklist

Follows commit message conventions
Commits are signed off (DCO)
Architecture docs updated (plan added under plans/668/)

Closes #668 Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>

github-actions · 2026-05-15T21:26:41Z

Review: PR #669 — `docs: plan audio/video context support`

Summary

A plan-only PR that adds plans/668/audio-video-context.md (275 lines, no code). It proposes a declarative AudioContext / VideoContext API mirroring the existing ImageContext pattern, a canonical media-block schema as the boundary between config and adapters, provider translation rules, and a 3-PR rollout. The scope is constrained to media-as-input-context — no generation, no heavy deps in config import paths, no local-path handling for audio/video in v1.

Findings

Strengths

Architectural alignment is solid. The plan explicitly preserves the interface → engine → config import direction and the "declare, don't orchestrate" contract from AGENTS.md. The canonical media-block schema lives between config (descriptive) and adapters (provider-specific), which keeps OpenAI/Anthropic shapes from leaking back into config — consistent with the existing principle that "errors normalize at boundaries."
Conservative non-goals. Excluding ffmpeg/moviepy from config imports, deferring frame-extraction fallbacks, deferring the Responses API migration, and deferring provider file-upload lifecycle are all the right calls for v1 and prevent scope creep.
Clear adapter responsibility split. Image stays unchanged on Anthropic; audio/video raise canonical unsupported errors before transport. OpenAI-compatible owns image_url / input_audio translation. This is the right place for capability gating.
Good rollout atomicity. Three PRs (config foundation → engine/adapters → docs) match the package layering and let the config layer ship and be tested independently.
Risks section is honest about video provider variance, base64 size blowup, and the silent-download trap for remote URLs.

Gaps and ambiguities

Discriminated-union migration of ImageContext is under-specified. Verified locally: ImageContext.modality is currently Modality = Modality.IMAGE (not Literal). Step 1 says to "change concrete context modality fields to Literal[...] values." That's a schema change on an existing public type. The plan should call out:
- whether previously serialized configs (which may omit modality and rely on the default) still round-trip — Pydantic discriminated unions typically require the discriminator key to be present;
- whether the column-config field type change from list[ImageContext] | None to list[MultiModalContextT] | None is a breaking type-narrowing for downstream type-checkers or plugin authors who annotated against ImageContext.
A one-line backward-compat note in the Design Decisions table (alongside the existing "legacy image_url" entry) would close this.
data_type=None auto-detection behavior for audio/video is unspecified. ImageContext auto-detects URL vs base64 vs local path when data_type is omitted. The plan says audio/video are URL-or-base64 only, but doesn't say whether data_type is required for the new contexts or whether auto-detection is supported (URL vs base64). Either choice is fine, but the plan should pick one — otherwise implementers will diverge. Test Plan mentions "explicit base64 requires an explicit format" but not the data_type=None case.
Canonical error type isn't named. Step 5 and the Test Plan reference "canonical unsupported-capability provider error" but don't say whether this is an existing class in clients/errors.py or a new one (e.g., UnsupportedModalityError). Naming it in the plan avoids ad-hoc choices in PR 2.
Missing: schema-export impact. DataDesigner exports JSON schema for configs (used by docs/UI). Switching multi_modal_context to a discriminated union changes the emitted schema shape. Worth a one-liner in the Test Plan ("schema export snapshot updated and reviewed") or Risks.
Step 2's "extract shared helpers" lacks a deprecation policy. The plan says to move reusable normalization out of image_helpers.py into media_helpers.py, but doesn't say whether the old image-helper public surface is preserved, deprecated, or re-exported. If anything imports from image_helpers.py outside config (or in plugins), this matters.
MIME / format mapping is only sketched. The Canonical Block Shape gives audio/mpeg for mp3 and video/mp4 for mp4, but Step 2 just says "Add MIME helpers for audio/video formats." A small explicit table (format enum → media_type → OpenAI format value) would prevent inconsistency between Step 2 and Step 5.
Open question worth surfacing: list-valued media columns. The plan supports list/JSON-list values (consistent with ImageContext), but for audio/video this can mean N base64 blobs in one user message. Provider per-message size limits and the order of multi-block flattening should be stated, even if just as "blocks are emitted in list order, no dedup."

Minor

The Mermaid diagram is correct but the engine-side label _build_multi_modal_context lives on ColumnGeneratorWithModel — fine as-is; the rendering will be tight on narrow viewers.
"OpenAI-compatible adapter forwards image_url blocks unchanged" in Current State is accurate; it might be worth noting in Step 5 that NIM and other OpenAI-compatible providers may not all accept input_audio, so the audio path may need a per-endpoint capability flag rather than a flat "OpenAI-compatible supports audio."
Provider links use stable docs URLs — good. Anthropic's vision URL pinned to platform.claude.com is the current canonical host.

Verdict

Approve with revisions. The plan is well-scoped, architecturally aligned, and rolls out cleanly across the three packages. Before PR 1 lands, the plan should be tightened on: (a) the ImageContext modality field migration and config backward compatibility, (b) data_type=None auto-detection policy for the new contexts, (c) the named canonical error class, and (d) explicit MIME / format mapping. None of these block the design itself; they're loose ends that will otherwise be re-litigated in the implementation PRs.

greptile-apps · 2026-05-15T21:28:38Z

Greptile Summary

This PR adds a design plan (plans/668/audio-video-context.md) for first-class AudioContext and VideoContext support in DataDesigner, following the existing ImageContext pattern. All three previously raised review concerns (ImageColumnConfig scope, canonical audio block format field, and ImageContext.get_contexts() migration path) have been resolved in prior commits.

Proposes AudioContext/VideoContext config objects for text-generation columns only, with LLMTextColumnConfig.multi_modal_context broadened to a discriminated union while ImageColumnConfig remains image-only.
Defines a uniform canonical block shape across all three modalities; provider-specific fields (e.g., OpenAI's input_audio.format) are derived by adapters, not stored in config.
Documents the ImageContext.get_contexts() canonical migration, legacy-dict pre-validator strategy, data_type=None auto-detection rules for audio/video, and adapter capability gates for unsupported modality/provider combinations.

Confidence Score: 5/5

Documentation-only planning PR with no executable code; safe to merge.

The change is a single markdown design document. All three findings from the previous review round have been resolved in prior commits: ImageColumnConfig.multi_modal_context stays image-only, canonical audio blocks carry only media_type, and ImageContext.get_contexts() migration to canonical blocks is explicitly specified. No logic errors, internal contradictions, or unresolved design gaps remain.

No files require special attention.

Important Files Changed

Filename	Overview
plans/668/audio-video-context.md	New design document; internally consistent, all three prior review findings are resolved, no logic errors or contradictions found.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    user["User Config\nLLMTextColumnConfig\nmulti_modal_context: list[MultiModalContextT]"]
    image["ImageContext\n(unchanged public API)\nget_contexts() → canonical image block"]
    audio["AudioContext\n(new)\nget_contexts() → canonical audio block"]
    video["VideoContext\n(new)\nget_contexts() → canonical video block"]
    generator["ColumnGeneratorWithModel\n_build_multi_modal_context()\nbase_path passed to ImageContext only"]
    prompt["prompt_to_messages()\ncanonical blocks prepended to text"]
    gate["Adapter Capability Gate\ncheck modality + source type + media_type"]
    openai["OpenAICompatibleClient\nimage → image_url\naudio base64 → input_audio (format derived)\nvideo → provider part or unsupported error"]
    anthropic["AnthropicClient\nimage → Claude image block\naudio/video → unsupported-capability error"]
    provider["Provider API"]

    user --> image
    user --> audio
    user --> video
    image --> generator
    audio --> generator
    video --> generator
    generator --> prompt
    prompt --> gate
    gate --> openai
    gate --> anthropic
    openai --> provider
    anthropic --> provider

_{Reviews (7): Last reviewed commit: "Merge branch 'main' into nmulepati/feat-..." | Re-trigger Greptile}

andreatgretel · 2026-05-18T20:43:09Z

+- OpenAI-compatible:
+  - Translate canonical `image` blocks to `image_url`.
+  - Translate canonical base64 `audio` blocks to `input_audio`.
+  - Translate supported `video` blocks to provider-specific media content parts only when the provider route supports them.


I think this needs one more concrete capability gate before implementation. Today DataDesigner only knows broad adapter capabilities like chat/image/embedding, but this plan needs modality/source/format decisions, like whether this provider/model accepts audio URL or video base64. Without that, unsupported media will likely fall through to raw provider 400s instead of the canonical unsupported error the plan is aiming for.

Addressed in b152eca: added an adapter-level capability gate for modality, source type, and media type before transport, raising ProviderError.unsupported_capability / ProviderErrorKind.UNSUPPORTED_CAPABILITY for unsupported combinations.

Tighten the plan around legacy image-context migration, audio/video auto-detection, config-layer canonical blocks, capability gating, ImageColumnConfig scope, and single-PR implementation rollout. Refs #668

greptile-apps · 2026-05-19T19:13:26Z

+
+### 4. Keep engine prompt plumbing generic
+
+Files:


ImageContext.get_contexts() migration is unresolved

Implementation step 1 says "Preserve ImageContext behavior and imports" and the regression checklist says "Existing image-context tests remain green," but the canonical block architecture requires adapters to translate canonical {"type": "image", "source": {...}} blocks — not the {"type": "image_url", ...} dicts that ImageContext.get_contexts() currently emits. Step 5 then lists "Translate canonical image blocks to image_url" as an explicit adapter responsibility.

If ImageContext.get_contexts() is not updated to emit canonical blocks, the OpenAI adapter's image-block translation path is dead code and the existing image-context flow silently breaks. If it is updated, then "preserve behavior" is misleading and existing adapter/engine tests that inspect the image_url dict format will need updates — contradicting "existing image-context tests remain green."

The plan needs to resolve this explicitly: either commit to migrating ImageContext.get_contexts() to the canonical shape (and update the regression note to say "existing tests updated to expect canonical blocks"), or keep ImageContext emitting image_url blocks and document that adapters will dual-dispatch on both image_url and canonical image block shapes.

Prompt To Fix With AI

This is a comment left during a code review. Path: plans/668/audio-video-context.md Line: 218 Comment: **`ImageContext.get_contexts()` migration is unresolved** Implementation step 1 says "Preserve `ImageContext` behavior and imports" and the regression checklist says "Existing image-context tests remain green," but the canonical block architecture requires adapters to translate canonical `{"type": "image", "source": {...}}` blocks — not the `{"type": "image_url", ...}` dicts that `ImageContext.get_contexts()` currently emits. Step 5 then lists "Translate canonical `image` blocks to `image_url`" as an explicit adapter responsibility. If `ImageContext.get_contexts()` is **not** updated to emit canonical blocks, the OpenAI adapter's `image`-block translation path is dead code and the existing image-context flow silently breaks. If it **is** updated, then "preserve behavior" is misleading and existing adapter/engine tests that inspect the `image_url` dict format will need updates — contradicting "existing image-context tests remain green." The plan needs to resolve this explicitly: either commit to migrating `ImageContext.get_contexts()` to the canonical shape (and update the regression note to say "existing tests updated to expect canonical blocks"), or keep `ImageContext` emitting `image_url` blocks and document that adapters will dual-dispatch on both `image_url` and canonical `image` block shapes. How can I resolve this? If you propose a fix, please make it concise.

Addressed in 57f92eb: the plan now explicitly chooses the canonical migration path for ImageContext.get_contexts(). ImageContext keeps its public API/user-visible behavior, but get_contexts() emits canonical image blocks; adapters keep accepting legacy image_url blocks during the transition, and tests that inspect pre-adapter blocks should be updated to expect the canonical shape.

docs: plan audio/video context support

bffe77f

Closes #668 Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>

nabinchha requested a review from a team as a code owner May 15, 2026 21:24

nabinchha temporarily deployed to agentic-ci May 15, 2026 21:24 — with GitHub Actions Inactive

greptile-apps Bot reviewed May 15, 2026

View reviewed changes

Comment thread plans/668/audio-video-context.md

Comment thread plans/668/audio-video-context.md

nabinchha mentioned this pull request May 15, 2026

Implement audio and video context support #671

Closed

github-actions Bot mentioned this pull request May 18, 2026

Agentic CI: Issue & PR Triage Tracker #562

Open

andreatgretel reviewed May 18, 2026

View reviewed changes

Comment thread plans/668/audio-video-context.md Outdated

andreatgretel reviewed May 18, 2026

View reviewed changes

Comment thread plans/668/audio-video-context.md Outdated

andreatgretel reviewed May 18, 2026

View reviewed changes

Comment thread plans/668/audio-video-context.md

andreatgretel reviewed May 18, 2026

View reviewed changes

Comment thread plans/668/audio-video-context.md Outdated

nabinchha added 3 commits May 18, 2026 15:17

docs: address audio video context plan feedback

b152eca

Tighten the plan around legacy image-context migration, audio/video auto-detection, config-layer canonical blocks, capability gating, ImageColumnConfig scope, and single-PR implementation rollout. Refs #668

docs: clarify legacy image context migration

7fadf4d

Merge branch 'main' into nmulepati/feat-support-audio-video-context

654c945

nabinchha requested a review from andreatgretel May 18, 2026 21:29

Merge branch 'main' into nmulepati/feat-support-audio-video-context

66934a4

greptile-apps Bot reviewed May 19, 2026

View reviewed changes

docs: resolve image context canonical migration

57f92eb

andreatgretel approved these changes May 20, 2026

View reviewed changes

Merge branch 'main' into nmulepati/feat-support-audio-video-context

9036993

nabinchha merged commit e181b4b into main May 20, 2026
50 checks passed

nabinchha deleted the nmulepati/feat-support-audio-video-context branch May 20, 2026 15:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs: plan audio/video context support#669

docs: plan audio/video context support#669
nabinchha merged 7 commits into
mainfrom
nmulepati/feat-support-audio-video-context

nabinchha commented May 15, 2026

Uh oh!

github-actions Bot commented May 15, 2026

Uh oh!

greptile-apps Bot commented May 15, 2026 •

edited

Loading

Confidence Score: 5/5

Flowchart

Uh oh!

Uh oh!

Uh oh!

andreatgretel May 18, 2026

Uh oh!

nabinchha May 18, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot May 19, 2026

Uh oh!

nabinchha May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

nabinchha commented May 15, 2026

📋 Summary

🔗 Related Issue

🔄 Changes

🧪 Testing

✅ Checklist

Uh oh!

github-actions Bot commented May 15, 2026

Review: PR #669 — docs: plan audio/video context support

Summary

Findings

Strengths

Gaps and ambiguities

Minor

Verdict

Uh oh!

greptile-apps Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Uh oh!

andreatgretel May 18, 2026

Choose a reason for hiding this comment

Uh oh!

nabinchha May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

nabinchha May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Review: PR #669 — `docs: plan audio/video context support`

greptile-apps Bot commented May 15, 2026 •

edited

Loading