feat: add image generation support with multi-modal context#317
Conversation
…eInferenceParameters, EmbeddingInferenceParameters
…th BaseInferenceParameters
…lved based on the type of InferenceParameters
|
Curious regarding default models - should we add |
andreatgretel
left a comment
There was a problem hiding this comment.
Left a few more nits but overall everything looks good! Tried it out locally, tutorial runs fine. Excited about generating images on Data Designer 🖼️
Yes, may be but perhaps in a different PR! I don't see many options on build.nvidia.com that work with the standard nvidia endpoint .... |
|
|
||
| # Handle list of images | ||
| if isinstance(image_data, list): | ||
| previews = [] |
There was a problem hiding this comment.
nit: feels like we could use some kind of image previews abstraction, where a lot of the below logic can live. can be in the display_sample_record follow up
There was a problem hiding this comment.
Yes that + things in visualization.py can probably be broken down
| return result | ||
|
|
||
|
|
||
| class ImageInferenceParams(BaseInferenceParams): |
There was a problem hiding this comment.
Is it somewhat problematic that intellisense will show all the other parameters as well? Wondering if need a more striped doen base class.
There was a problem hiding this comment.
Hmm weird. Could you share a screenshot? We do want to inherit everything from BaseInferenceParams though
There was a problem hiding this comment.
Oh, that's correct. We do want timeout and max_parallel_requests. It's just that any image generation params like size, height, width, etc will need to go into extra_body because they vary per model
| # Validate required columns | ||
| missing_columns = list(set(self.config.required_columns) - set(data.keys())) | ||
| if len(missing_columns) > 0: | ||
| error_msg = ( | ||
| f"There was an error preparing the Jinja2 expression template. " | ||
| f"The following columns {missing_columns} are missing!" | ||
| ) | ||
| raise ValueError(error_msg) |
There was a problem hiding this comment.
i'm surprised we haven't centralized this check!
| # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 |
There was a problem hiding this comment.
another note to our future selves that agents should always run make update-license-headers instead of generating these. once they are generated, we treat the existing years as the source of truth.
| except Exception as e: | ||
| raise HuggingFaceHubClientUploadError(f"Failed to upload parquet files: {e}") from e | ||
|
|
||
| def _upload_images_folder(self, repo_id: str, images_folder: Path) -> None: |
There was a problem hiding this comment.
do you know if we can have the images appear in the dataset viewer on HF? i've seen datasets that do, but not sure how they are formatted
There was a problem hiding this comment.
ohhh that's a good idea. I might open up a follow up PR for this.
johnnygreco
left a comment
There was a problem hiding this comment.
Thanks @nabinchha – this is awesome!
Add agenerate_image(), _agenerate_image_chat_completion(), and _agenerate_image_diffusion() async methods mirroring the sync generate_image() added in #317. The chat completion path uses acompletion(), the diffusion path uses router.aimage_generation(). Includes 5 new tests covering both paths, error cases, and usage tracking. Also fixes F821 lint errors for type annotations. Co-Authored-By: Remi <noreply@anthropic.com>

📋 Summary
Adds native image generation capabilities to DataDesigner, enabling synthetic image generation using diffusion and auto-regressive image generation models. Supports both standalone image generation and multi-modal context (using previously generated text/images as input), with robust storage management and comprehensive testing.
🏗️ Architecture
Key Design Decisions:
Auto-detection of API type:
generate_image()automatically routes to the correct LiteLLM API:image_generationAPIcompletionAPIMulti-modal context: Images can reference previously generated columns (text or images) using
multi_modal_contextfor image-to-image generationDual storage modes:
🔄 Changes
✨ Added
New Files - Core Implementation:
image.py(80 lines) -ImageCellGeneratorwith Jinja2 prompt rendering and multi-modal context resolutionmedia_storage.py(137 lines) -MediaStorageclass with DISK/DATAFRAME storage modesimage_helpers.py(238 lines) - Base64/PIL conversion, validation, format detection, diffusion model detectionNew Files - Documentation & Tests:
5-generating-images.py(296 lines) - Complete tutorial with examplestest_image.py(218 lines) - Image generator teststest_media_storage.py(228 lines) - Storage teststest_image_helpers.py(349 lines) - Utility testsConfiguration Classes:
ImageColumnConfig- Column config with prompt, multi_modal_context, and required_columns (column_configs.py)ImageInferenceParams- Parameters: size, format, quality, steps, cfg_scale, seed, n (models.py)ImageUsageStats- Usage tracking for generated images (usage.py)🔧 Changed
Model System:
facade.py- Added methods:generate_image()- Main entry point with automatic API routing_generate_image_diffusion()- Diffusion model path viaimage_generationAPI_generate_image_chat_completion()- Autoregressive model path viacompletionAPI_track_token_usage_from_image_diffusion()- Usage trackingDataset Building:
column_wise_builder.py- IntegratedMediaStoragefor image artifact managementartifact_storage.py- Addedmedia_storageattributeVisualization:
visualization.py- Enhanceddisplay_sample_record()with image handling:_display_image_if_in_notebook()for IPython/Jupyter rendering (~132 lines added)Configuration & Registry:
ImageCellGeneratorin column generator registryColumnType.IMAGEenumerationPILinlazy_heavy_imports.pyDependencies:
pillowfor image processing🗑️ Removed
🔍 Attention Areas
facade.py:307-470- Image generation implementation with auto-detection logic and dual API supportmedia_storage.py- Storage abstraction with dual modes and file organization (UUID + column subfolders)image.py:62-67- Image generator with multi-modal context injectionvisualization.py:289-418- Image display integration indisplay_sample_record()🚀 Extensibility & Future Work
Extensibility to Other Modalities:
This implementation establishes patterns that extend naturally to other media types:
AudioColumnConfig+MediaStorage.save_audio()Key extensibility points:
ModelFacade- Addgenerate_audio(),generate_video()following same patternMediaStorage- Already designed for multiple media types (see comments about future audio/video support)GenerationTypeenum - Easy to addAUDIO,VIDEO, etc.ImageCellGeneratorpattern for new modalitiesPlanned Future Work:
Improve
display_sample_record()method - Enhanced notebook display with better layouts, grid views, and interactive controls for image-containing recordsMove
artifact_storage.pyto storage module - Consolidate all storage logic (MediaStorage,ArtifactStorage) underengine/storage/for better organization (done in chore: move ArtifactStorage to engine/storage/ module #321)Documentation - Feature currently has no docs except a tutorial notebook. (done in docs: add image generation documentation and image-to-image editing tutorial #319)
✅ Testing
Comprehensive test coverage (800+ lines):
ImageUsageStatsintegrationclose #125
🤖 Generated with AI