@felixarntz
Member

Follow up to #155.

I hadn't yet tested things when it was merged, so this PR has the fixes needed to actually make things work.

Note: A workaround is included that returns the image-generation-specific model class when using Gemini multimodal output models that are primarily used for image generation. This is only because we don't yet have model class implementations that are actually multimodal. Let's discuss the best approach for a proper solution in a separate issue. It doesn't have to block this work.
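
For readers who want a concrete picture of the workaround described in the note, here is a minimal sketch of the idea, not the code from this PR. The class names, the `createModelForId()` helper, and the ID-matching rules (an `imagen-` prefix, and `-image` inside a `gemini-` ID) are all hypothetical and chosen only for illustration. The `imagen-4-generate-001` ID is taken from the output file name in the experiment and may not be the exact API model ID.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-ins for the library's model classes (names are assumptions).
interface ModelInterface {}

final class TextGenerationModel implements ModelInterface
{
    public function __construct(public string $modelId) {}
}

final class ImageGenerationModel implements ModelInterface
{
    public function __construct(public string $modelId) {}
}

/**
 * Picks a model class based on the model ID (hypothetical helper).
 */
function createModelForId(string $modelId): ModelInterface
{
    // Dedicated image generation models (assumed to use an "imagen-" prefix).
    if (str_starts_with($modelId, 'imagen-')) {
        return new ImageGenerationModel($modelId);
    }

    // Workaround: Gemini multimodal output models that are mainly used for
    // image generation (assumed to contain "-image" in their ID) are mapped
    // to the image generation class, since no multimodal class exists yet.
    if (str_starts_with($modelId, 'gemini-') && str_contains($modelId, '-image')) {
        return new ImageGenerationModel($modelId);
    }

    // Everything else falls back to text generation.
    return new TextGenerationModel($modelId);
}
```

With these assumed rules, `createModelForId('gemini-2.5-flash-image')` yields the image generation class, while `createModelForId('gemini-2.5-flash')` yields the text generation class. A proper multimodal model class would make the second branch unnecessary.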

@felixarntz added this to the 0.4.0 milestone Dec 30, 2025
@felixarntz added the "[Type] Bug" label (An existing feature does not function as intended) Dec 30, 2025
@github-actions

github-actions bot commented Dec 30, 2025

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.

Co-authored-by: felixarntz <flixos90@git.wordpress.org>
Co-authored-by: JasonTheAdams <jason_the_adams@git.wordpress.org>

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@felixarntz
Member Author

felixarntz commented Dec 30, 2025

Quick experiment I did to test this:

Imagen 4 image generation

php cli.php 'Photo of a tricolor Cavalier King Charles Spaniel on an airfield in the desert of Peru' --outputFormat=image-base64 --providerId=google

Output:
[Image: cavalier-imagen-4-generate-001]

Gemini Nano Banana image generation

php cli.php 'Photo of a tricolor Cavalier King Charles Spaniel on an airfield in the desert of Peru' --outputFormat=image-base64 --providerId=google --modelId=gemini-2.5-flash-image

Output:
[Image: cavalier-gemini-2.5-flash-image]

Summary

This confirms what's widely known: multimodal output models that do image generation understand prompts much better. Both images look solid, but only the Gemini Nano Banana image has a tricolor Cavalier King Charles Spaniel, which is what I asked for in both cases :)

The other thing is that those models create more realistic-looking images, while classic diffusion models create more "artsy" images. The depth of field in the Imagen-generated image is way too extreme: it looks cool, but not realistic.

Member

@JasonTheAdams JasonTheAdams left a comment

Glad you tested! Let me know which issue you open to discuss multimodal models.

@felixarntz merged commit e0acf10 into trunk Dec 30, 2025 (7 checks passed)
@felixarntz deleted the add/proper-google-provider-implementation branch December 30, 2025 19:41