Inspiration
In Kenya, there is a massive informal manufacturing and repair sector known as Jua Kali, alongside incredibly talented artisans crafting everything from Maasai beadwork to Kisii soapstone carvings. These creators produce beautiful, high-quality items, but they face a critical bottleneck: marketing.
Competing on a global scale requires high-end editorial photography, punchy social media copy, and engaging video content. For a small business owner or independent artisan, hiring a creative agency is out of the question. We wanted to bridge this gap by giving every local artisan a world-class, culturally-aware creative director in their pocket.
What it does
Soko Studio is a multimodal AI agent that flips the traditional "chat box" paradigm on its head. Instead of a conversational back-and-forth, we built an agent that takes a single raw, unedited product photo and outputs a massive, interleaved multimedia campaign in one go.
When a user uploads a photo to our luxury editorial dashboard, the agent generates:
- The Concept: A catchy campaign title.
- The Copy: A localized Instagram caption (using a mix of English and Sheng) with targeted hashtags.
- The Visual: A high-end, studio-quality lifestyle photograph featuring the product.
- The Audio: A 15-second voiceover script, synthesized instantly into audio for TikTok/Reels.
How we built it
To make this seamless, we combined Gemini's reasoning with specialized generative models and Google Cloud services. Our backend is a Node.js/Express application hosted securely on Google Cloud Run.
- The "Brain" (Gemini 2.5 Pro): We utilized the newly released `@google/genai` Node SDK to pass the Base64 image and a strict `systemInstruction` to Gemini. We forced Gemini to output its strategy using strict XML-style tags to prevent hallucinations.
- The "Eyes" (Imagen 4): Gemini writes an incredibly detailed `<image_prompt>` for a lifestyle shot. Our backend extracts this prompt and immediately fires it off to Imagen 4 (`imagen-4.0-generate-001`) to synthesize a stunning, high-resolution lifestyle photograph.
- The "Voice" (Google Cloud Text-to-Speech): Gemini writes a `<voiceover_script>` for short-form video content. Our backend uses a regex parser to strip out stage directions (like `(Upbeat music)`) and uses a high-quality Neural voice (`en-GB-Neural2-B`) via the Google Cloud TTS API to synthesize an MP3.
- The "Canvas" (Frontend UI): We built a Vercel-like, luxury editorial dashboard using Vanilla HTML/CSS (with modern fluid typography and OKLCH colors) to render the interleaved JSON payload dynamically.
- The Infrastructure: We automated our deployment to Google Cloud Run using a custom `deploy.sh` bash script, earning bonus points for Infrastructure-as-Code.
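The extraction step in this pipeline (pulling tagged sections such as `<image_prompt>` out of Gemini's text before handing them to the next service) could look roughly like the sketch below. It is an illustration, not our production code: only `image_prompt` and `voiceover_script` are tag names from the project, while `campaign_title`, `caption`, and the helper names `extractTag` and `parseCampaign` are assumptions made for the example.

```javascript
// Sketch: pull XML-style sections out of a model response.
// Only image_prompt and voiceover_script are real tag names from the project;
// campaign_title and caption are hypothetical placeholders.
const CAMPAIGN_TAGS = ['campaign_title', 'caption', 'image_prompt', 'voiceover_script'];

function extractTag(text, tag) {
  // 'i' ignores tag case; 's' lets '.' match newlines so multiline sections work.
  const match = text.match(new RegExp(`<${tag}>(.*?)</${tag}>`, 'is'));
  return match ? match[1].trim() : null; // null signals a missing section
}

function parseCampaign(text) {
  const campaign = {};
  for (const tag of CAMPAIGN_TAGS) {
    campaign[tag] = extractTag(text, tag);
  }
  return campaign;
}

// Made-up sample response, used only to exercise the parser.
const sample = [
  '<campaign_title>Beads of the Savannah</campaign_title>',
  '<image_prompt>A beaded bracelet on warm sandstone,',
  'soft morning light</image_prompt>',
].join('\n');

console.log(parseCampaign(sample));
```

Returning `null` for a missing tag lets the caller decide whether to retry the model call or skip a downstream step, rather than sending an empty prompt to the image or speech service.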
Challenges we ran into
- Formatting Hallucinations: Initially, extracting the interleaved data from Gemini was tricky. We solved this by injecting a strict XML-tagging format directly into the user prompt and using robust regex parsing (`new RegExp('<tag>(.*?)</tag>', 'is')`) on the backend to reliably extract multiline text.
- Voiceover Stage Directions: Our "Creative Director" persona naturally included stage directions like `[0-3s]` or `(Sound of beads)`. The TTS API would read these aloud, ruining the illusion. We wrote a regex filter to dynamically strip text within brackets and parentheses before synthesizing the audio.
- Cloud IAM Permissions: Automating the deployment of a public-facing Cloud Run service using a script required navigating strict Organization Policies (Domain Restricted Sharing). We had to manually inject a specific annotation (`run.googleapis.com/invoker-iam-disabled: 'true'`) to ensure the agent was publicly accessible for the judges.
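A filter for the stage-direction problem described above might be sketched as follows. This is a minimal example under assumptions, not the exact filter we shipped: the function name `stripStageDirections` and the sample script are invented for illustration, and the pattern does not handle nested brackets.

```javascript
// Sketch: remove [bracketed] and (parenthesized) stage directions
// from a voiceover script before sending it to a TTS service.
// stripStageDirections is a hypothetical helper name.
function stripStageDirections(script) {
  return script
    .replace(/\[[^\]]*\]|\([^)]*\)/g, ' ') // drop [..] and (..) spans (non-nested)
    .replace(/[ \t]+/g, ' ')               // collapse the spaces left behind
    .replace(/ ?\n ?/g, '\n')              // tidy spaces around line breaks
    .trim();
}

// Made-up sample script, used only to exercise the filter.
const raw = '[0-3s] (Sound of beads) Handmade with care. (Upbeat music) Wear the story.';
console.log(stripStageDirections(raw));
```

One trade-off of this approach is that it also removes any parenthetical the model meant to be spoken, so it suits scripts where brackets and parentheses are reserved for directions.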
Accomplishments that we're proud of
We are incredibly proud of breaking out of the "turn-based chatbot" mold. We successfully built a true Creative Storyteller that parallelizes multiple AI models (Gemini 2.5 Pro + Imagen 4 + Cloud TTS) to generate a cohesive, multi-sensory output. We are also proud of the frontend UI; it doesn't look like a generic AI tool, but rather a premium editorial suite that respects the artistry of the products being uploaded.
What we learned
We learned just how powerful the new gemini-2.5-pro model is at spatial and cultural reasoning. It didn't just see a "bracelet"; it saw the cultural context of the beadwork and knew exactly how to market it. We also learned how to effectively chain different Google Cloud models together (using the output of one as the explicit input for another) to create complex, autonomous workflows.
What's next for Soko Studio
We want to expand Soko Studio's integration capabilities. The next step is adding OAuth so users can publish the generated campaigns directly to their Instagram or TikTok accounts with a single click. We also want to explore the Gemini Live API for the platform, allowing artisans to talk to the creative director in real time while holding their products up to their phone cameras.