fibostudio - AI-Powered 3D Product Photography Studio

Inspiration

The inspiration for fibostudio came from observing the massive gap between how product photography is traditionally done and how it could be done with modern AI. We noticed:

  • E-commerce businesses spend thousands on professional photographers and studio equipment
  • Content creators struggle to maintain visual consistency across product catalogs
  • Small businesses can't afford professional photography for their products
  • Iteration cycles are painfully slow - one photoshoot can take hours or days

We asked ourselves: What if you could design your product scene visually, then let AI generate perfect product photos that match exactly what you see?

The breakthrough came when we realized that combining 3D scene composition with AI image generation could eliminate the photographer entirely. Users could become their own studio directors - positioning objects, adjusting lighting, and speaking commands to modify their scene in real-time.

What it does

fibostudio is a web-based virtual photography studio that replaces the traditional photoshoot for product images:

Core Functionality

  1. 3D Scene Editor

    • Drag-and-drop object positioning with real-time 3D preview
    • Intuitive transform controls (Move, Rotate, Scale)
    • Real-time lighting adjustments (key light, fill light, rim light)
    • Background and environment customization
    • Multiple mood presets (Clean, Dark, Warm, Cool)
  2. Voice-Controlled Studio Director

    • Speak natural language commands to modify your scene
    • "Make it look cinematic with dramatic lighting"
    • "Add warm golden hour lighting"
    • "Change background to white"
    • Real-time voice feedback using Eleven Labs text-to-speech
  3. AI-Powered Image Generation

    • Detects exact camera angle from 3D preview
    • Generates photorealistic product images matching your scene
    • Two generation modes:
      • Plain: White background, minimal styling, clean product shots
      • Professional: YouTube-ready, product launch quality, dramatic lighting
  4. Project Management

    • Save and organize multiple product photography projects
    • View production gallery of all generated images
    • Download high-quality product photos
    • User authentication with JWT
  5. Exact View Matching

    • AI understands precise camera positioning (front, side, top-down, angles)
    • Generates images that match your 3D preview exactly
    • No AI creativity - follows your specifications precisely
    • Respects object positioning, rotation, and scale

How we built it

Architecture

Frontend Stack:

  • React 18 with TypeScript for type-safe UI
  • Three.js + React Three Fiber for 3D scene rendering
  • Tailwind CSS for responsive styling
  • Web Speech API for voice input
  • Eleven Labs API for voice synthesis

Backend Stack:

  • Node.js + Express for REST API
  • MongoDB Atlas for data persistence
  • JWT authentication for secure user sessions
  • Middleware for CORS and request validation

AI Integration:

  • Google Gemini 3 for natural language understanding of voice commands
  • FAL.ai for reliable image generation
  • Eleven Labs for natural voice synthesis and feedback

Key Implementation Details

  1. 3D Scene Management

    • Real-time camera angle detection from Three.js scene
    • Precise angle calculations (horizontal and vertical degrees)
    • Distance-based framing detection (close-up, medium, full, wide shots)
    • Object rotation and positioning tracking
  2. Voice-to-Scene Pipeline

    • Web Speech API captures user voice
    • Transcribed text sent to Gemini for scene understanding
    • Gemini generates structured scene modifications
    • React state updates 3D scene in real-time
    • Eleven Labs synthesizes confirmation response
  3. Image Generation Pipeline

    • Frontend detects exact camera view from 3D preview
    • Builds precise prompt describing view, style, and specifications
    • Backend proxies request to FAL.ai with strict constraints
    • Generated image returned and displayed in production gallery
  4. Prompt Engineering

    • Single-paragraph prompts for clarity
    • Exact view descriptions (e.g., "front view at eye level")
    • Style-specific instructions (plain vs professional)
    • No JSON parameters - pure natural language
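The voice-to-scene step (Gemini returns a structured modification, then React state updates the scene) can be sketched as a parse, validate, and merge function. This is a minimal sketch that assumes Gemini is asked to reply with a JSON object; the writeup only says "structured". That reply is separate from the image prompts, which stay in plain natural language. `SceneState`, its field names, the 0 to 5 intensity range, and the `parseSceneUpdate`/`applySceneUpdate` helpers are all hypothetical and are not fibostudio's actual schema.

```typescript
// Hypothetical scene schema; field names and ranges are illustrative only.
type Mood = "clean" | "dark" | "warm" | "cool";

interface SceneState {
  background: string; // CSS color, e.g. "#ffffff"
  mood: Mood;
  keyLightIntensity: number; // assumed range 0..5
  fillLightIntensity: number;
  rimLightIntensity: number;
}

type SceneUpdate = Partial<SceneState>;

const MOODS: readonly string[] = ["clean", "dark", "warm", "cool"];
const LIGHT_KEYS = ["keyLightIntensity", "fillLightIntensity", "rimLightIntensity"] as const;

// Accept only finite numbers and clamp them into the assumed 0..5 range.
function toIntensity(value: unknown): number | undefined {
  if (typeof value !== "number" || !Number.isFinite(value)) return undefined;
  return Math.min(5, Math.max(0, value));
}

// Parse the model's reply, keeping only known fields with valid values.
function parseSceneUpdate(raw: string): SceneUpdate {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return {}; // unparseable reply: change nothing rather than guess
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) return {};
  const obj = data as Record<string, unknown>;

  const update: SceneUpdate = {};
  if (typeof obj.background === "string") update.background = obj.background;
  if (typeof obj.mood === "string" && MOODS.includes(obj.mood)) update.mood = obj.mood as Mood;
  for (const key of LIGHT_KEYS) {
    const intensity = toIntensity(obj[key]);
    if (intensity !== undefined) update[key] = intensity;
  }
  return update;
}

// Pure merge: returns a new object so a React state setter triggers a re-render.
function applySceneUpdate(state: SceneState, update: SceneUpdate): SceneState {
  return { ...state, ...update };
}
```

Because `applySceneUpdate` returns a new object, it can feed a `useState` setter or a `useReducer` reducer directly. Dropping unknown or invalid fields means a misheard command changes nothing instead of corrupting the scene.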

Development Process

  1. Phase 1: Built 3D scene editor with Three.js and React Three Fiber
  2. Phase 2: Integrated Gemini for natural language scene modifications
  3. Phase 3: Added voice input with Web Speech API
  4. Phase 4: Implemented Eleven Labs for voice feedback
  5. Phase 5: Integrated FAL.ai for image generation
  6. Phase 6: Refined prompt engineering for exact view matching
  7. Phase 7: Added project management and user authentication

Challenges we ran into

1. Exact View Matching

Challenge: AI was generating images from different camera angles than the 3D preview showed.

Solution:

  • Implemented precise angle detection (horizontal and vertical degrees)
  • Created detailed spatial descriptions in prompts
  • Added distance-based framing detection
  • Built view type classification system (top-down, high-angle, eye-level, low-angle)
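The angle detection, view classification, and distance-based framing described above can be sketched as plain math on the camera and target positions. This sketch assumes the product faces +Z and is roughly one scene unit tall; the degree and distance thresholds and the names `describeCamera` and `CameraView` are illustrative guesses, not the project's actual values.

```typescript
// Minimal vector type so the sketch runs without Three.js; THREE.Vector3 has the same x/y/z fields.
interface Vec3 {
  x: number;
  y: number;
  z: number;
}

type ViewType = "top-down" | "high-angle" | "eye-level" | "low-angle";
type Framing = "close-up" | "medium" | "full" | "wide";

interface CameraView {
  horizontalDeg: number; // azimuth around the product: 0 = camera on +Z, 90 = camera on +X
  verticalDeg: number; // elevation above the product's horizontal plane, -90..90
  distance: number;
  viewType: ViewType;
  framing: Framing;
}

const toDeg = (rad: number): number => (rad * 180) / Math.PI;

function describeCamera(camera: Vec3, target: Vec3): CameraView {
  const dx = camera.x - target.x;
  const dy = camera.y - target.y;
  const dz = camera.z - target.z;
  const horizontalDist = Math.hypot(dx, dz);
  const distance = Math.hypot(dx, dy, dz);

  const horizontalDeg = toDeg(Math.atan2(dx, dz)); // range -180..180
  const verticalDeg = toDeg(Math.atan2(dy, horizontalDist));

  // Elevation thresholds are illustrative, not fibostudio's values.
  let viewType: ViewType;
  if (verticalDeg >= 60) viewType = "top-down";
  else if (verticalDeg >= 15) viewType = "high-angle";
  else if (verticalDeg > -15) viewType = "eye-level";
  else viewType = "low-angle";

  // Distance thresholds assume a product about one scene unit tall.
  let framing: Framing;
  if (distance < 2) framing = "close-up";
  else if (distance < 4) framing = "medium";
  else if (distance < 7) framing = "full";
  else framing = "wide";

  return { horizontalDeg, verticalDeg, distance, viewType, framing };
}
```

In React Three Fiber the inputs would be `camera.position` and the OrbitControls `target`, both `THREE.Vector3` instances. Combined with `horizontalDeg` (near 0 meaning "front"), the result maps onto prompt phrases such as "front view at eye level".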

2. CORS and API Integration

Challenge: Direct frontend-to-external-API calls were blocked by CORS policies.

Solution:

  • Built backend proxy for all external API calls
  • Configured CORS middleware for frontend origin
  • Centralized API key management on backend
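With the backend proxy in place, the browser only needs permission to call the project's own API, and the FAL.ai, Gemini, and Eleven Labs keys never leave the server. The header logic that a CORS middleware applies for a single allowed frontend origin can be sketched by hand. The origin list, allowed methods, and header values below are assumptions for illustration, not the project's configuration.

```typescript
type HeaderMap = Record<string, string>;

// Hypothetical allowlist; in practice this would come from server configuration.
const ALLOWED_ORIGINS: ReadonlySet<string> = new Set(["http://localhost:5173"]); // Vite dev-server default

// Returns the CORS response headers for one request, or none if the origin is not allowed.
function corsHeaders(
  origin: string | undefined,
  method: string,
  allowed: ReadonlySet<string> = ALLOWED_ORIGINS,
): HeaderMap {
  // No Origin header (curl, server-to-server) or a foreign origin: send no CORS headers,
  // so pages on other sites cannot read the response.
  if (!origin || !allowed.has(origin)) return {};

  const headers: HeaderMap = {
    "Access-Control-Allow-Origin": origin, // echo the one matching origin instead of "*"
    Vary: "Origin", // shared caches must not reuse this response for other origins
  };
  if (method.toUpperCase() === "OPTIONS") {
    // Preflight: declare what the real request may use
    // (assuming the JWT is sent in the Authorization header).
    headers["Access-Control-Allow-Methods"] = "GET, POST, PUT, DELETE";
    headers["Access-Control-Allow-Headers"] = "Content-Type, Authorization";
    headers["Access-Control-Max-Age"] = "600"; // seconds the browser may cache the preflight
  }
  return headers;
}
```

In Express this logic is normally delegated to the `cors` middleware package rather than written by hand; the sketch only shows what that middleware has to produce.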

3. Voice Recognition Reliability

Challenge: Web Speech API was inconsistent and sometimes failed silently.

Solution:

  • Implemented proper error handling and user feedback
  • Added silence detection (2-second timeout)
  • Auto-submit on speech completion
  • Clear visual indicators for recording state
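The silence detection and auto-submit behavior amounts to a debounce: every new speech result restarts a 2-second countdown, and the transcript is submitted only when the countdown completes. The class name `SilenceAutoSubmit` and the injectable `Timer` interface are hypothetical. The timer is injected only so the logic can be tested without real waiting.

```typescript
// Injectable timer so the logic can be tested without real waiting.
interface Timer {
  set(fn: () => void, ms: number): unknown;
  clear(handle: unknown): void;
}

const browserTimer: Timer = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: (handle) => clearTimeout(handle as ReturnType<typeof setTimeout>),
};

// Submits the latest transcript once no new speech results arrive for `silenceMs`.
class SilenceAutoSubmit {
  private handle: unknown = null;
  private transcript = "";
  private readonly onSubmit: (text: string) => void;
  private readonly silenceMs: number;
  private readonly timer: Timer;

  constructor(onSubmit: (text: string) => void, silenceMs = 2000, timer: Timer = browserTimer) {
    this.onSubmit = onSubmit;
    this.silenceMs = silenceMs; // 2000 ms matches the 2-second timeout described above
    this.timer = timer;
  }

  // Call on every speech result (interim or final); each call restarts the countdown.
  onTranscript(text: string): void {
    this.transcript = text;
    if (this.handle !== null) this.timer.clear(this.handle);
    this.handle = this.timer.set(() => {
      this.handle = null;
      const finalText = this.transcript.trim();
      if (finalText) this.onSubmit(finalText); // silence with no words submits nothing
    }, this.silenceMs);
  }

  // Call when the user stops recording manually or recognition reports an error.
  cancel(): void {
    if (this.handle !== null) this.timer.clear(this.handle);
    this.handle = null;
  }
}
```

In the browser, each `onresult` event of a Web Speech API `SpeechRecognition` instance (exposed as `webkitSpeechRecognition` in Chrome) would call `onTranscript` with the concatenated transcripts from `event.results`. Calling `cancel()` from the `onerror` handler keeps a failed session from auto-submitting half a command.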

4. Prompt Complexity

Challenge: Complex JSON-based prompts confused the AI and resulted in wrong images.

Solution:

  • Simplified to single-paragraph natural language prompts
  • Removed all JSON parameters from prompts
  • Focused on clear, direct instructions
  • Added style-specific prompt variations
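A single-paragraph prompt builder along these lines illustrates the approach. It combines view and framing phrases from the camera-angle detection, one sentence of style instructions per mode, and an explicit instruction not to deviate. The style wording, the `PromptSpec` fields, and `buildPrompt` are hypothetical. Only the single-paragraph, JSON-free structure and the plain/professional split come from the writeup.

```typescript
type GenerationStyle = "plain" | "professional";

interface PromptSpec {
  product: string; // e.g. "a matte black wireless speaker"
  viewPhrase: string; // e.g. "front view at eye level"
  framing: string; // e.g. "medium shot"
  style: GenerationStyle;
}

// Illustrative style sentences, not fibostudio's actual prompt text.
const STYLE_TEXT: Record<GenerationStyle, string> = {
  plain: "Use a seamless pure white background, soft even lighting, and no props or extra styling.",
  professional:
    "Use dramatic studio lighting with a strong key light and rim light, a dark gradient background, and a polished product-launch look.",
};

function buildPrompt(spec: PromptSpec): string {
  const sentences = [
    `A photorealistic product photograph of ${spec.product}.`,
    `The camera shows a ${spec.viewPhrase}, framed as a ${spec.framing}.`,
    STYLE_TEXT[spec.style],
    // Explicit negative instruction: precision over creativity.
    "Match this camera angle and framing exactly and do not change the product's orientation.",
  ];
  // One paragraph: join sentences and collapse any newlines or extra spaces from user input.
  return sentences.join(" ").replace(/\s+/g, " ").trim();
}
```

For the plain style with the example values in the comments, this yields one paragraph beginning "A photorealistic product photograph of a matte black wireless speaker. The camera shows a front view at eye level, framed as a medium shot."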

5. API Key Management

Challenge: Managing multiple API keys (Gemini, FAL.ai, Eleven Labs) across environments.

Solution:

  • Centralized environment variable configuration
  • Separate .env files for frontend and backend
  • Clear documentation for setup
  • Fallback mechanisms for API failures
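Centralized configuration can be sketched as one startup check that reports every missing variable at once. The variable names are hypothetical because the writeup does not list them. Treating the Eleven Labs key as optional, so that spoken feedback is disabled instead of the server crashing, is one assumed example of a fallback.

```typescript
// Hypothetical variable names; the writeup does not specify them.
const REQUIRED_KEYS = ["GEMINI_API_KEY", "FAL_KEY", "MONGODB_URI", "JWT_SECRET"] as const;
type RequiredKey = (typeof REQUIRED_KEYS)[number];

interface AppConfig {
  required: Record<RequiredKey, string>;
  elevenLabsKey: string | null; // optional: null disables spoken feedback
}

function loadConfig(env: Record<string, string | undefined>): AppConfig {
  // Fail fast at startup, naming every missing variable,
  // instead of failing later on the first API call.
  const missing = REQUIRED_KEYS.filter((key) => !env[key]?.trim());
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }

  const required = {} as Record<RequiredKey, string>;
  for (const key of REQUIRED_KEYS) required[key] = (env[key] ?? "").trim();

  const eleven = env["ELEVENLABS_API_KEY"]?.trim();
  return { required, elevenLabsKey: eleven ? eleven : null };
}
```

On the backend this would run once at startup as `loadConfig(process.env)`, after a loader such as `dotenv` has read the `.env` file.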

6. Real-time 3D Rendering Performance

Challenge: Keeping the 3D preview smooth despite complex lighting calculations.

Solution:

  • Optimized the Three.js scene with level-of-detail (LOD) rendering
  • Efficient lighting setup with key, fill, and rim lights
  • Responsive camera controls with OrbitControls
  • Proper shadow mapping configuration

7. Exact Object Positioning

Challenge: Objects were positioned to show their side in the 3D preview, but the AI still generated front views.

Solution:

  • Implemented precise camera angle detection from 3D scene
  • Added degree-based angle calculations
  • Created view type classification system
  • Included exact positioning in prompts
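Camera angle alone is not enough when the object itself has been rotated. What the prompt needs is which face of the product points at the camera. A minimal sketch, assuming the model's front faces local +Z and the camera azimuth is measured as `atan2(dx, dz)` in degrees (0 when the camera sits on +Z); `visibleFace`, `normalizeDeg`, and the 45° sector boundaries are hypothetical.

```typescript
type VisibleFace = "front" | "left side" | "right side" | "back";

// Wrap an angle in degrees into the range (-180, 180].
function normalizeDeg(deg: number): number {
  let a = deg % 360; // JS % keeps the sign of the dividend, so a is in (-360, 360)
  if (a <= -180) a += 360;
  else if (a > 180) a -= 360;
  return a;
}

// cameraAzimuthDeg: atan2(dx, dz) of camera minus object position, in degrees.
// objectYawRad: the object's Three.js rotation.y in radians; a yaw of θ turns the
// object's front (local +Z) to azimuth θ, so the relative angle is azimuth minus yaw.
function visibleFace(cameraAzimuthDeg: number, objectYawRad: number): VisibleFace {
  const relative = normalizeDeg(cameraAzimuthDeg - (objectYawRad * 180) / Math.PI);
  if (Math.abs(relative) <= 45) return "front";
  if (Math.abs(relative) >= 135) return "back";
  // Positive: camera is on the object's local +X side, which is the product's own left.
  return relative > 0 ? "left side" : "right side";
}
```

With the camera at its default front position and the object's `rotation.y` set to π/2, this returns "right side". The prompt can then state that the product's right side faces the camera, instead of letting the model fall back to a front view.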

Accomplishments that we're proud of

1. Voice-Controlled Studio Director

Successfully integrated Web Speech API, Gemini, and Eleven Labs to create a seamless voice-controlled interface. Users can literally speak their vision and watch the studio update in real-time.

2. Exact View Matching

Solved the critical problem of getting the AI to generate images that match the exact camera angle of the 3D preview. This required precise angle detection and careful prompt engineering.

3. Full-Stack Integration

Built a complete end-to-end system connecting:

  • 3D scene editor (Three.js)
  • Natural language processing (Gemini)
  • Voice I/O (Web Speech API + Eleven Labs)
  • Image generation (FAL.ai)
  • User management (MongoDB + JWT)

4. Intuitive User Experience

Created an interface so intuitive that users can generate professional product photos without any photography knowledge.

5. Plain vs Professional Modes

Implemented two distinct generation modes that produce completely different results based on user intent - from minimal white-background shots to YouTube-ready professional photography.

6. Real-time Feedback Loop

Built a system where users see their 3D scene, hear voice confirmation, and get generated images - all in a seamless workflow.

7. Scalable Architecture

Designed a backend that can handle multiple concurrent image generation requests with proper error handling and fallbacks.

What we learned

1. Prompt Engineering is Critical

  • Simple, direct prompts work better than complex JSON structures
  • Natural language is more effective than structured data for the image model
  • Single-paragraph prompts are clearer than multi-part instructions

2. Camera Angle Detection Requires Precision

  • Exact degree calculations matter for view matching
  • Combining horizontal and vertical angles creates unique view types
  • Distance-based framing significantly impacts composition

3. Voice Interfaces Need Careful UX

  • Users need clear visual feedback during recording
  • Silence detection is essential for auto-submission
  • Error messages should be helpful, not technical

4. API Integration Complexity

  • CORS issues require backend proxying
  • Multiple API keys need centralized management
  • Fallback mechanisms are essential for reliability

5. 3D Rendering Performance Matters

  • Proper scene optimization is crucial for smooth interaction
  • Lighting calculations can be expensive
  • Real-time updates require efficient state management

6. User Authentication is Non-Negotiable

  • JWT tokens provide secure, stateless authentication
  • Demo mode is valuable for user onboarding
  • Project persistence requires robust database design

7. AI Creativity vs Precision

  • Sometimes you want AI to follow instructions exactly, not be creative
  • Negative prompts are as important as positive ones
  • Style consistency requires explicit specification

What's next for fibostudio

Short Term (Next 1-2 months)

  1. Enhanced Voice Commands

    • Support for more complex scene modifications
    • Batch operations ("Generate 5 variations")
    • Undo/redo for voice commands
  2. Improved Image Generation

    • Support for multiple objects in single scene
    • Custom background images
    • Texture and material customization
  3. Performance Optimization

    • Faster image generation with caching
    • Optimized 3D scene rendering
    • Progressive image loading

Medium Term (3-6 months)

  1. Advanced Features

    • 360-degree product views
    • Animation support (rotating products)
    • A/B testing for different product shots
    • Batch generation with variations
  2. Collaboration Tools

    • Team projects and sharing
    • Comments and feedback on generated images
    • Version history and rollback
  3. Integration Ecosystem

    • Shopify integration for direct product uploads
    • WooCommerce plugin
    • API for third-party developers
    • Zapier integration for automation

Long Term (6-12 months)

  1. Enterprise Features

    • White-label solution for agencies
    • Custom branding and watermarks
    • Advanced analytics and reporting
    • Bulk processing for large catalogs
  2. AI Enhancements

    • Fine-tuned models for specific product categories
    • Automatic lighting optimization
    • Smart background suggestions
    • Consistency across product lines
  3. Monetization

    • Freemium model with usage limits
    • Pro tier with unlimited generations
    • Enterprise licensing
    • API access for developers
  4. Mobile Experience

    • Native mobile apps (iOS/Android)
    • Mobile-optimized 3D editor
    • Voice commands on mobile
    • Offline support

Vision

fibostudio aims to become the standard tool for product photography in e-commerce. We envision a future where:

  • Small businesses can compete with large enterprises on product image quality
  • Content creators generate consistent, professional product shots in minutes
  • E-commerce platforms offer built-in product photography tools
  • Photographers focus on creative direction rather than technical execution
  • Product launches happen faster with AI-generated marketing materials

The ultimate goal is to democratize professional product photography and make it accessible to everyone.


Technical Specifications

Frontend:

  • React 18, TypeScript, Vite
  • Three.js, React Three Fiber, @react-three/drei
  • Tailwind CSS, Lucide React icons
  • Web Speech API, Eleven Labs SDK

Backend:

  • Node.js, Express, TypeScript
  • MongoDB Atlas, Mongoose ODM
  • JWT authentication, bcryptjs
  • CORS middleware, error handling

APIs:

  • Google Gemini 3 (natural language)
  • FAL.ai (image generation)
  • Eleven Labs (voice synthesis)
