Skip to content

docs: plan audio/video context support#669

Merged
nabinchha merged 7 commits into
mainfrom
nmulepati/feat-support-audio-video-context
May 20, 2026
Merged

docs: plan audio/video context support#669
nabinchha merged 7 commits into
mainfrom
nmulepati/feat-support-audio-video-context

Conversation

@nabinchha

Copy link
Copy Markdown
Contributor

📋 Summary

Adds a design plan for supporting audio and video as multimodal context, following the existing image-context pattern. The plan keeps the issue scope limited to design work while clarifying config boundaries, provider translation responsibilities, and test coverage for a future implementation.

🔗 Related Issue

Closes #668

🔄 Changes

  • Adds plans/668/audio-video-context.md with the proposed AudioContext / VideoContext API shape.
  • Clarifies that audio/video context values are URL or base64 only, with local path handling and provider file metadata out of scope for v1.
  • Documents provider translation boundaries, unsupported-provider behavior, implementation phases, and test strategy.

🧪 Testing

  • make test passes (not run; markdown-only planning change)
  • Unit tests added/updated (N/A; no testable logic)
  • E2E tests added/updated (N/A; no testable logic)
  • Commit hooks passed for markdown-safe checks

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated (plan added under plans/668/)

Closes #668

Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
@nabinchha nabinchha requested a review from a team as a code owner May 15, 2026 21:24
@github-actions

Copy link
Copy Markdown
Contributor

Review: PR #669docs: plan audio/video context support

Summary

A plan-only PR that adds plans/668/audio-video-context.md (275 lines, no code). It proposes a declarative AudioContext / VideoContext API mirroring the existing ImageContext pattern, a canonical media-block schema as the boundary between config and adapters, provider translation rules, and a 3-PR rollout. The scope is constrained to media-as-input-context — no generation, no heavy deps in config import paths, no local-path handling for audio/video in v1.

Findings

Strengths

  • Architectural alignment is solid. The plan explicitly preserves the interface → engine → config import direction and the "declare, don't orchestrate" contract from AGENTS.md. The canonical media-block schema lives between config (descriptive) and adapters (provider-specific), which keeps OpenAI/Anthropic shapes from leaking back into config — consistent with the existing principle that "errors normalize at boundaries."
  • Conservative non-goals. Excluding ffmpeg/moviepy from config imports, deferring frame-extraction fallbacks, deferring the Responses API migration, and deferring provider file-upload lifecycle are all the right calls for v1 and prevent scope creep.
  • Clear adapter responsibility split. Image stays unchanged on Anthropic; audio/video raise canonical unsupported errors before transport. OpenAI-compatible owns image_url / input_audio translation. This is the right place for capability gating.
  • Good rollout atomicity. Three PRs (config foundation → engine/adapters → docs) match the package layering and let the config layer ship and be tested independently.
  • Risks section is honest about video provider variance, base64 size blowup, and the silent-download trap for remote URLs.

Gaps and ambiguities

  1. Discriminated-union migration of ImageContext is under-specified. Verified locally: ImageContext.modality is currently Modality = Modality.IMAGE (not Literal). Step 1 says to "change concrete context modality fields to Literal[...] values." That's a schema change on an existing public type. The plan should call out:

    • whether previously serialized configs (which may omit modality and rely on the default) still round-trip — Pydantic discriminated unions typically require the discriminator key to be present;
    • whether the column-config field type change from list[ImageContext] | None to list[MultiModalContextT] | None is a breaking type-narrowing for downstream type-checkers or plugin authors who annotated against ImageContext.

    A one-line backward-compat note in the Design Decisions table (alongside the existing "legacy image_url" entry) would close this.

  2. data_type=None auto-detection behavior for audio/video is unspecified. ImageContext auto-detects URL vs base64 vs local path when data_type is omitted. The plan says audio/video are URL-or-base64 only, but doesn't say whether data_type is required for the new contexts or whether auto-detection is supported (URL vs base64). Either choice is fine, but the plan should pick one — otherwise implementers will diverge. Test Plan mentions "explicit base64 requires an explicit format" but not the data_type=None case.

  3. Canonical error type isn't named. Step 5 and the Test Plan reference "canonical unsupported-capability provider error" but don't say whether this is an existing class in clients/errors.py or a new one (e.g., UnsupportedModalityError). Naming it in the plan avoids ad-hoc choices in PR 2.

  4. Missing: schema-export impact. DataDesigner exports JSON schema for configs (used by docs/UI). Switching multi_modal_context to a discriminated union changes the emitted schema shape. Worth a one-liner in the Test Plan ("schema export snapshot updated and reviewed") or Risks.

  5. Step 2's "extract shared helpers" lacks a deprecation policy. The plan says to move reusable normalization out of image_helpers.py into media_helpers.py, but doesn't say whether the old image-helper public surface is preserved, deprecated, or re-exported. If anything imports from image_helpers.py outside config (or in plugins), this matters.

  6. MIME / format mapping is only sketched. The Canonical Block Shape gives audio/mpeg for mp3 and video/mp4 for mp4, but Step 2 just says "Add MIME helpers for audio/video formats." A small explicit table (format enum → media_type → OpenAI format value) would prevent inconsistency between Step 2 and Step 5.

  7. Open question worth surfacing: list-valued media columns. The plan supports list/JSON-list values (consistent with ImageContext), but for audio/video this can mean N base64 blobs in one user message. Provider per-message size limits and the order of multi-block flattening should be stated, even if just as "blocks are emitted in list order, no dedup."

Minor

  • The Mermaid diagram is correct but the engine-side label _build_multi_modal_context lives on ColumnGeneratorWithModel — fine as-is; the rendering will be tight on narrow viewers.
  • "OpenAI-compatible adapter forwards image_url blocks unchanged" in Current State is accurate; it might be worth noting in Step 5 that NIM and other OpenAI-compatible providers may not all accept input_audio, so the audio path may need a per-endpoint capability flag rather than a flat "OpenAI-compatible supports audio."
  • Provider links use stable docs URLs — good. Anthropic's vision URL pinned to platform.claude.com is the current canonical host.

Verdict

Approve with revisions. The plan is well-scoped, architecturally aligned, and rolls out cleanly across the three packages. Before PR 1 lands, the plan should be tightened on: (a) the ImageContext modality field migration and config backward compatibility, (b) data_type=None auto-detection policy for the new contexts, (c) the named canonical error class, and (d) explicit MIME / format mapping. None of these block the design itself; they're loose ends that will otherwise be re-litigated in the implementation PRs.

@greptile-apps

greptile-apps Bot commented May 15, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a design plan (plans/668/audio-video-context.md) for first-class AudioContext and VideoContext support in DataDesigner, following the existing ImageContext pattern. All three previously raised review concerns (ImageColumnConfig scope, canonical audio block format field, and ImageContext.get_contexts() migration path) have been resolved in prior commits.

  • Proposes AudioContext/VideoContext config objects for text-generation columns only, with LLMTextColumnConfig.multi_modal_context broadened to a discriminated union while ImageColumnConfig remains image-only.
  • Defines a uniform canonical block shape across all three modalities; provider-specific fields (e.g., OpenAI's input_audio.format) are derived by adapters, not stored in config.
  • Documents the ImageContext.get_contexts() canonical migration, legacy-dict pre-validator strategy, data_type=None auto-detection rules for audio/video, and adapter capability gates for unsupported modality/provider combinations.

Confidence Score: 5/5

Documentation-only planning PR with no executable code; safe to merge.

The change is a single markdown design document. All three findings from the previous review round have been resolved in prior commits: ImageColumnConfig.multi_modal_context stays image-only, canonical audio blocks carry only media_type, and ImageContext.get_contexts() migration to canonical blocks is explicitly specified. No logic errors, internal contradictions, or unresolved design gaps remain.

No files require special attention.

Important Files Changed

Filename Overview
plans/668/audio-video-context.md New design document; internally consistent, all three prior review findings are resolved, no logic errors or contradictions found.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    user["User Config\nLLMTextColumnConfig\nmulti_modal_context: list[MultiModalContextT]"]
    image["ImageContext\n(unchanged public API)\nget_contexts() → canonical image block"]
    audio["AudioContext\n(new)\nget_contexts() → canonical audio block"]
    video["VideoContext\n(new)\nget_contexts() → canonical video block"]
    generator["ColumnGeneratorWithModel\n_build_multi_modal_context()\nbase_path passed to ImageContext only"]
    prompt["prompt_to_messages()\ncanonical blocks prepended to text"]
    gate["Adapter Capability Gate\ncheck modality + source type + media_type"]
    openai["OpenAICompatibleClient\nimage → image_url\naudio base64 → input_audio (format derived)\nvideo → provider part or unsupported error"]
    anthropic["AnthropicClient\nimage → Claude image block\naudio/video → unsupported-capability error"]
    provider["Provider API"]

    user --> image
    user --> audio
    user --> video
    image --> generator
    audio --> generator
    video --> generator
    generator --> prompt
    prompt --> gate
    gate --> openai
    gate --> anthropic
    openai --> provider
    anthropic --> provider
Loading

Reviews (7): Last reviewed commit: "Merge branch 'main' into nmulepati/feat-..." | Re-trigger Greptile

Comment thread plans/668/audio-video-context.md
Comment thread plans/668/audio-video-context.md
- OpenAI-compatible:
- Translate canonical `image` blocks to `image_url`.
- Translate canonical base64 `audio` blocks to `input_audio`.
- Translate supported `video` blocks to provider-specific media content parts only when the provider route supports them.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs one more concrete capability gate before implementation. Today DataDesigner only knows broad adapter capabilities like chat/image/embedding, but this plan needs modality/source/format decisions, like whether this provider/model accepts audio URL or video base64. Without that, unsupported media will likely fall through to raw provider 400s instead of the canonical unsupported error the plan is aiming for.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in b152eca: added an adapter-level capability gate for modality, source type, and media type before transport, raising ProviderError.unsupported_capability / ProviderErrorKind.UNSUPPORTED_CAPABILITY for unsupported combinations.

Comment thread plans/668/audio-video-context.md Outdated
Comment thread plans/668/audio-video-context.md Outdated
Comment thread plans/668/audio-video-context.md
Comment thread plans/668/audio-video-context.md Outdated
nabinchha added 3 commits May 18, 2026 15:17
Tighten the plan around legacy image-context migration, audio/video auto-detection, config-layer canonical blocks, capability gating, ImageColumnConfig scope, and single-PR implementation rollout.

Refs #668
@nabinchha nabinchha requested a review from andreatgretel May 18, 2026 21:29

### 4. Keep engine prompt plumbing generic

Files:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 ImageContext.get_contexts() migration is unresolved

Implementation step 1 says "Preserve ImageContext behavior and imports" and the regression checklist says "Existing image-context tests remain green," but the canonical block architecture requires adapters to translate canonical {"type": "image", "source": {...}} blocks — not the {"type": "image_url", ...} dicts that ImageContext.get_contexts() currently emits. Step 5 then lists "Translate canonical image blocks to image_url" as an explicit adapter responsibility.

If ImageContext.get_contexts() is not updated to emit canonical blocks, the OpenAI adapter's image-block translation path is dead code and the existing image-context flow silently breaks. If it is updated, then "preserve behavior" is misleading and existing adapter/engine tests that inspect the image_url dict format will need updates — contradicting "existing image-context tests remain green."

The plan needs to resolve this explicitly: either commit to migrating ImageContext.get_contexts() to the canonical shape (and update the regression note to say "existing tests updated to expect canonical blocks"), or keep ImageContext emitting image_url blocks and document that adapters will dual-dispatch on both image_url and canonical image block shapes.

Prompt To Fix With AI
This is a comment left during a code review.
Path: plans/668/audio-video-context.md
Line: 218

Comment:
**`ImageContext.get_contexts()` migration is unresolved**

Implementation step 1 says "Preserve `ImageContext` behavior and imports" and the regression checklist says "Existing image-context tests remain green," but the canonical block architecture requires adapters to translate canonical `{"type": "image", "source": {...}}` blocks — not the `{"type": "image_url", ...}` dicts that `ImageContext.get_contexts()` currently emits. Step 5 then lists "Translate canonical `image` blocks to `image_url`" as an explicit adapter responsibility.

If `ImageContext.get_contexts()` is **not** updated to emit canonical blocks, the OpenAI adapter's `image`-block translation path is dead code and the existing image-context flow silently breaks. If it **is** updated, then "preserve behavior" is misleading and existing adapter/engine tests that inspect the `image_url` dict format will need updates — contradicting "existing image-context tests remain green."

The plan needs to resolve this explicitly: either commit to migrating `ImageContext.get_contexts()` to the canonical shape (and update the regression note to say "existing tests updated to expect canonical blocks"), or keep `ImageContext` emitting `image_url` blocks and document that adapters will dual-dispatch on both `image_url` and canonical `image` block shapes.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 57f92eb: the plan now explicitly chooses the canonical migration path for ImageContext.get_contexts(). ImageContext keeps its public API/user-visible behavior, but get_contexts() emits canonical image blocks; adapters keep accepting legacy image_url blocks during the transition, and tests that inspect pre-adapter blocks should be updated to expect the canonical shape.

@nabinchha nabinchha merged commit e181b4b into main May 20, 2026
50 checks passed
@nabinchha nabinchha deleted the nmulepati/feat-support-audio-video-context branch May 20, 2026 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Plan audio and video context support

2 participants