Inspiration

Every parent knows the feeling: their child picks up a rock, a bug, a leaf, and erupts with questions. Why is it that color? What does it eat? Can I keep it? That energy — raw, relentless curiosity — is one of the most precious things in early childhood. And right now, it has nowhere good to go.

The internet wasn't built for a five-year-old. Typing into a search box isn't how children think. A chat window isn't how children play. And a black-box AI that parents can't see into isn't something families should have to trust blindly.

We built Janus because we believed child-facing AI could be better on both sides of that equation: more alive for the child, more transparent for the parent. Not another chatbot. A companion that shares the child's world — and reports back to the person responsible for it.


What It Does

Janus is a real-time multimodal AI companion for young children. It runs in the browser, takes in live camera and microphone streams, speaks back with native audio, and reacts to what the child is actually doing in the room — not just what they type.

On the child's side:

  • Janus sees through the webcam and listens through the mic, processing live video and audio via the Gemini Live API
  • It speaks back naturally, in real time, with a warm and age-appropriate voice
  • It can identify objects the child holds up, ask follow-up questions, and launch short guided activities
  • It surfaces safe visual content — searching for real images first via Vertex AI Search, falling back to Imagen 3 generation only when needed — because grounded reality teaches better than hallucinated imagery
  • It can place generated AR-style characters onto real surfaces in the child's environment, rendered in screen space and manipulable via MediaPipe hand tracking and pinch gestures
  • It can run scavenger hunts: "find something red," "find something round," "bring a leaf close to the camera"
  • It launches AR teaching overlays — structured visual moments that label parts of an object, prompt a task, and celebrate completion
  • Critically, it knows when to be quiet. A silence/wake analysis layer evaluates recent audio and video context to decide whether Janus should stay out of the way of independent play or gently re-engage
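
The search-first, generate-second policy behind the visual surfacing above can be sketched as a simple fallback chain. This is a hedged sketch with hypothetical function names; in the real system the search and generation steps are the Vertex AI Search and Imagen 3 integrations behind the show_visual tool:

```typescript
// Sketch of the "real images first, generation only as fallback" policy.
// searchImages / generateImage are hypothetical wrappers, injected for clarity.
type Visual = { url: string; source: "search" | "generated" };

async function resolveVisual(
  query: string,
  searchImages: (q: string) => Promise<string[]>, // e.g. a Vertex AI Search wrapper
  generateImage: (q: string) => Promise<string>   // e.g. an Imagen 3 wrapper
): Promise<Visual> {
  try {
    const candidates = await searchImages(query);
    if (candidates.length > 0) {
      // Grounded, real imagery wins whenever search produces anything usable.
      return { url: candidates[0], source: "search" };
    }
  } catch {
    // Search outage or error: fall through to generation rather than fail.
  }
  return { url: await generateImage(query), source: "generated" };
}
```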

On the parent's side:

  • Every session is logged to Firestore: transcripts, activities, summaries, and at-risk events
  • A dedicated parent portal lets parents review session history, read AI-generated summaries, scan flagged moments, and ask natural-language questions about what happened ("Was my child upset today? What did they learn about?")
  • Safety and distress events trigger structured alerts that surface in the parent review UI
  • For significant at-risk moments, the system assembles actual audio+video clips using ffmpeg, so parents have evidence rather than just text flags
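
Clip assembly for at-risk moments reduces to invoking ffmpeg with a seek offset and duration around the event timestamp. A minimal sketch of the argument construction (the helper name and padding scheme are illustrative; the flags themselves are standard ffmpeg: -ss to seek, -t for duration, -c copy to avoid re-encoding):

```typescript
// Build an ffmpeg argument list that cuts a clip centered on an event timestamp.
function clipArgs(input: string, eventSec: number, padSec: number, out: string): string[] {
  const start = Math.max(0, eventSec - padSec); // clamp so we never seek before 0
  return [
    "-ss", start.toString(),          // seek to just before the event
    "-t", (padSec * 2).toString(),    // clip length: padding on both sides
    "-i", input,
    "-c", "copy",                     // stream copy keeps extraction fast
    out,
  ];
}
// Would be spawned as e.g.:
// child_process.execFile("ffmpeg", clipArgs("session.mp4", 95, 10, "event.mp4"), cb);
```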

Janus doesn't just make AI feel alive for a child. It makes that same interaction legible and trustworthy for the parent. That dual-surface architecture is the core of what we built.


How We Built It

Realtime Engine: The heart of the app is GeminiInteractionSystem, an orchestration class that manages the full Gemini Live session lifecycle: audio/video streaming, transcript accumulation, tool registration, tool execution, session persistence, silence analysis, and AR state. We used the Google GenAI SDK throughout, with Gemini 2.5 Flash handling structured helper tasks alongside the Live session.
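
A minimal sketch of the kind of lifecycle bookkeeping this involves. The state names and transition table are illustrative, not the actual GeminiInteractionSystem API:

```typescript
// Illustrative session lifecycle: a small state machine that rejects
// illegal transitions (e.g. resuming a session that was already closed).
type SessionState = "idle" | "connecting" | "live" | "muted" | "closed";

const transitions: Record<SessionState, SessionState[]> = {
  idle: ["connecting"],
  connecting: ["live", "closed"],
  live: ["muted", "closed"],   // "muted" covers the stay-quiet mode
  muted: ["live", "closed"],
  closed: [],
};

function advance(current: SessionState, next: SessionState): SessionState {
  if (!transitions[current].includes(next)) {
    throw new Error(`illegal transition ${current} -> ${next}`);
  }
  return next;
}
```

Keeping the lifecycle explicit like this is what makes interruption, reconnection, and disconnect logic debuggable rather than implicit in scattered flags.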

Tool-Driven Interaction Model: Rather than letting the model only speak, we gave Gemini a tool interface that turns it into an interaction orchestrator. Registered tools include show_visual, start_ar_teaching, update_ar_overlay, start_scavenger_hunt, celebrate_hunt_success, place_generated_ar_object, and clear_generated_ar_object. This means Janus doesn't just tell a child about a flower — it can label the petals, prompt a counting task, and trigger a celebration animation, all driven by model tool calls.
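
To give a feel for the shape of this, here is a hedged sketch of two tool declarations and a dispatch table. The parameter schemas and handler signatures are illustrative assumptions, not the project's real schemas:

```typescript
// Illustrative function-declaration entries (name + description + JSON-schema-style
// parameters) in roughly the shape a Live session expects; real schemas are richer.
const toolDeclarations = [
  {
    name: "start_scavenger_hunt",
    description: "Start a find-the-object game based on a visual attribute.",
    parameters: {
      type: "object",
      properties: { target: { type: "string", description: "e.g. 'something red'" } },
      required: ["target"],
    },
  },
  {
    name: "show_visual",
    description: "Display a safe image for the current topic.",
    parameters: {
      type: "object",
      properties: { query: { type: "string" } },
      required: ["query"],
    },
  },
];

// Dispatch: route an incoming tool call by name to its handler.
const handlers: Record<string, (args: Record<string, unknown>) => string> = {
  start_scavenger_hunt: (a) => `hunt:${String(a.target)}`,
  show_visual: (a) => `visual:${String(a.query)}`,
};

function dispatch(name: string, args: Record<string, unknown>): string {
  const h = handlers[name];
  if (!h) throw new Error(`unknown tool ${name}`);
  return h(args);
}
```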

Visual Retrieval Strategy: We deliberately chose real images over generated ones whenever possible. We integrated Vertex AI Search for image retrieval, then immediately hit a real-world problem: early results were polluted with app-store pages, Google utility shells, and non-image metadata. We built a custom scoring and filtering pipeline — domain blocklists, title keyword filtering, image URL extraction, host quality scoring, and query-overlap ranking — to make retrieval actually usable. Only when search fails does the system fall back to Imagen 3 generation. This wasn't a small decision; it was a full engineering pivot mid-build.
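
The filtering-and-scoring idea can be sketched as below. The blocklist entries, regex, and thresholds are illustrative stand-ins for the real pipeline's lists:

```typescript
// Sketch of candidate filtering + scoring: blocklist, direct-image check,
// then query-overlap ranking. All constants here are illustrative.
const BLOCKED_HOSTS = ["play.google.com", "support.google.com"];

type Candidate = { url: string; title: string; host: string };

function scoreCandidate(c: Candidate, queryTerms: string[]): number {
  if (BLOCKED_HOSTS.includes(c.host)) return -1;              // domain blocklist
  if (!/\.(jpg|jpeg|png|webp)(\?|$)/i.test(c.url)) return -1; // must be a direct image URL
  const title = c.title.toLowerCase();
  const overlap = queryTerms.filter((t) => title.includes(t.toLowerCase())).length;
  return overlap / queryTerms.length;                         // query-overlap ranking
}

function bestCandidate(cands: Candidate[], queryTerms: string[]): Candidate | null {
  const scored = cands
    .map((c) => ({ c, s: scoreCandidate(c, queryTerms) }))
    .filter((x) => x.s >= 0)                                  // drop hard-rejected candidates
    .sort((a, b) => b.s - a.s);
  return scored.length ? scored[0].c : null;
}
```

Returning null here is what triggers the Imagen 3 fallback described above.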

Silence and Social Awareness: One of the more ambitious product decisions was teaching Janus to be quiet. We built a silence/wake analysis layer that evaluates rolling audio and video history to decide whether the child is happily playing independently, addressing Janus, talking to someone else, or showing signs of frustration. This makes the agent feel less like an interruptive notification and more like a socially aware presence.
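
Conceptually, the decision is a function over a rolling window of observations. This sketch uses illustrative signal names (the real layer derives its signals from helper-model evaluation of audio and video, not booleans handed to it):

```typescript
// Illustrative silence/wake decision over a rolling window of observations.
type Observation = {
  childSpeaking: boolean;
  addressedToJanus: boolean; // e.g. name detected or speech directed at the camera
  frustrated: boolean;
};

type Decision = "stay_silent" | "respond" | "gentle_checkin";

function decide(window: Observation[]): Decision {
  // Frustration anywhere in the window warrants a gentle check-in.
  if (window.some((o) => o.frustrated)) return "gentle_checkin";
  const recent = window[window.length - 1];
  // Only respond when the child is actually addressing Janus right now.
  if (recent?.childSpeaking && recent.addressedToJanus) return "respond";
  // Default bias: stay out of the way of independent play.
  return "stay_silent";
}
```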

AR Without Native AR Hardware: We created a grounded AR experience without ARKit or ARCore by using Gemini-based surface detection from the live camera frame, normalized bounding box anchoring, screen-space overlay rendering, and MediaPipe hand tracking for pinch-gesture object manipulation. The system degrades gracefully: if anchor detection fails, a fallback bounding box is used. If GPU hand tracking fails, CPU tracking takes over.
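
The anchoring step is, at its core, a conversion from a normalized [0..1] box into screen pixels, with a fallback box when detection fails. A minimal sketch (the fallback geometry here is an assumption):

```typescript
// Convert a normalized anchor box from surface detection into pixel coordinates;
// fall back to a centered box when detection returns nothing.
type Box = { x: number; y: number; w: number; h: number };

function toScreenSpace(anchor: Box | null, viewW: number, viewH: number): Box {
  // Fallback: a centered box covering half the view in each dimension (illustrative).
  const norm = anchor ?? { x: 0.25, y: 0.25, w: 0.5, h: 0.5 };
  return {
    x: norm.x * viewW,
    y: norm.y * viewH,
    w: norm.w * viewW,
    h: norm.h * viewH,
  };
}
```

Working in normalized coordinates until the last moment is what keeps overlays stable across window resizes and camera resolutions.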

Parent Portal and Safety Architecture: Session data is persisted to Firestore and shaped for parent consumption. Safety logic runs in layers: prompt-level constraints, a binary helper model safety evaluation, transcript pattern scanning, and event storage with media clips. Alerts are normalized and surfaced in the parent UI, and a dedicated parent-side Gemini assistant can answer natural-language questions about any session.
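
Because the safety signals come from heterogeneous layers, they get normalized into one alert shape before storage and display. A hedged sketch (event kinds and fields are illustrative):

```typescript
// Illustrative normalization of heterogeneous safety signals into a single
// alert shape before Firestore persistence and parent-UI rendering.
type RawEvent =
  | { kind: "pattern"; phrase: string; ts: number }                       // transcript pattern scan
  | { kind: "model"; severity: "low" | "high"; reason: string; ts: number }; // helper model evaluation

type Alert = { severity: "low" | "high"; summary: string; ts: number };

function normalizeAlert(e: RawEvent): Alert {
  switch (e.kind) {
    case "pattern":
      // Pattern hits are noisy, so they default to low severity.
      return { severity: "low", summary: `Flagged phrase: "${e.phrase}"`, ts: e.ts };
    case "model":
      return { severity: e.severity, summary: e.reason, ts: e.ts };
  }
}
```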

Deployment: The full backend is containerized and deployed to Cloud Run. A deploy.sh script handles GCP project selection, billing detection, service enablement, service account setup, secret configuration, and Cloud Run deployment — automated infrastructure, not a manual process.


Challenges We Ran Into

Search quality was the first wall we hit. Vertex AI Search did not return usable images out of the box. The early results included Play Store listings, Google support pages, and irrelevant metadata shells. We had to build a real filtering and scoring pipeline before the visual layer worked at all. The repo commit history documents this battle explicitly — from "add Discovery Engine dependency" through "add script to inspect raw Vertex AI Search responses" to "image search rewrite with scoring/filtering."

Image retrieval and image rendering are two different problems. Even after the backend found good results, the frontend still broke on certain images due to CDN quirks and loading edge cases. We patched the last mile with better frontend image handling and graceful fallback visuals.

Multimodal real-time orchestration is mostly state management. Running live audio input, output audio playback, tool calls, silence logic, UI overlay management, and session logging simultaneously is fragile. The hardest parts weren't model calls — they were coordinating interruption handling, preventing double-responses, and knowing when to disconnect or resume.
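
One concrete pattern that addresses the double-response problem: gate every output on a monotonic turn id, so anything produced for a superseded turn is dropped. This is an illustrative sketch, not the exact implementation:

```typescript
// Monotonic turn counter: outputs tagged with a stale turn id are dropped,
// which prevents double-responses after an interruption.
class ResponseGate {
  private turn = 0;

  // Called whenever the child interrupts or a new exchange begins.
  beginTurn(): number {
    return ++this.turn;
  }

  // Called just before emitting speech or a tool result for a given turn.
  shouldEmit(turnId: number): boolean {
    return turnId === this.turn;
  }
}
```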

Making the agent quiet was almost harder than making it talk. Building a bot that always responds is easy. Building one that recognizes independent play and stays out of it required explicit silence modeling, rolling buffer analysis, and helper model evaluation.

Parent trust is a product problem, not just a feature flag. Storing transcripts is not the same as giving parents genuine visibility. We had to design summaries, alert structures, Q&A flows, and event review UX before the parent layer felt trustworthy rather than performative.


Accomplishments We're Proud Of

  • Janus uses genuine multimodality: live microphone, live camera frames, voice output, visual retrieval, image generation, gesture interaction, and parent-side session interpretation. This is not marketing multimodality.
  • We built two products that reinforce each other. The child companion is more accountable because the parent portal exists; the parent portal is more useful because the child experience is productized enough to generate structured data.
  • We solved ugly retrieval reality instead of pretending search would work. The image pipeline is a real engineering accomplishment because it required judgment about when to trust results and when to filter them.
  • Fallbacks are everywhere, and that makes the demo resilient. Search fails → Imagen. Anchor detection fails → fallback box. GPU fails → CPU. Image load fails → placeholder card. The product only works because it learned how to fail gracefully.
  • The silence/wake logic is unusually thoughtful for a prototype. It makes Janus feel less like a feature and more like a social entity.

What We Learned

Retrieval quality matters more than adding more AI. A weak retrieval layer can make an advanced multimodal system look incompetent. Data quality and candidate filtering were decisive in whether the visual experience worked at all.

Real-time UX is mostly state management. The hard part isn't calling a model. It's coordinating speech, timing, silence, tool calls, UI transitions, and context handoff without the experience feeling disjointed.

Parent trust is a product feature, not a legal footnote. Summaries, alerts, and context review materially change how acceptable a child AI system feels to the adults responsible for the child.

Spatial illusion can be enough. You don't need perfect AR hardware to communicate the future of embodied AI interaction. You need enough grounding, consistency, and responsiveness to make the interaction feel real.

Building the parent portal pressure-tested the child experience. Once we had to show parents what happened, we were forced to structure the child-side interaction more clearly. The second interface revealed the truth of the first.


What's Next for Janus

  • Improve alert precision beyond regex and heuristic matching
  • Richer AR anchoring and smoother hand interaction across devices
  • Session playback and timeline review for parents
  • More structured learning modes alongside freeform interaction
  • Multi-child / multi-parent account model
  • Turn Janus from a compelling prototype into a trusted household copilot for early learning and parent awareness

Technologies Used

  • Gemini Live API — real-time audio/video multimodal interaction
  • Gemini 2.5 Flash Audio / Flash Lite — structured helper tasks, silence analysis, at-risk evaluation
  • Google GenAI SDK — all model integration
  • Vertex AI Search — real image retrieval
  • Imagen 3 — generated visuals and AR sprites
  • YouTube Data API — video surfacing
  • Google Cloud Run — backend deployment
  • Firebase / Firestore — session persistence
  • MediaPipe — hand tracking and pinch gesture interaction
  • ffmpeg — audio/video clip assembly for at-risk events
  • Socket.io — real-time frontend/backend communication
  • Express + TypeScript — backend server
  • Vite — frontend build

Built With

  • express.js
  • ffmpeg
  • firebase
  • firestore
  • gemini
  • google-cloud-run
  • google-genai-sdk
  • imagen-3
  • mediapipe
  • silence-analysis
  • socket.io
  • typescript
  • vertex-ai-search
  • vite
  • youtube-data-api