Inspiration

We've all been there -- you ship a feature, and now you need to show it to someone. An investor, a customer, a teammate in another timezone. So you open Loom, stumble through a click-through, restart because you misclicked, restart again because you forgot what to say, spend twenty minutes editing out the dead air, and end up with a mediocre video that's outdated by your next commit. The average SaaS company spends 5-8 hours a week producing demo content. We thought: what if an AI agent could just use your app and narrate what it sees?

What it does

DiffCast turns any URL into a narrated demo video. Paste a link -- localhost, staging, or production -- and an AI agent autonomously navigates your app, clicks buttons, fills forms with realistic data, and records everything. The output is a polished .mp4 and a time-synced AI voiceover explaining each step. Pick from 5 personalities, edit the title, and share via a public link or download.

We also built a GitHub Actions integration that auto-generates a demo video on every pull request -- it deploys the PR branch, reads the diff, generates a walkthrough description based on the changes, generates the demo video with our tool, and embeds the video directly in the PR description. Your demos update themselves.

Usage-based billing via paid.ai tracks every dollar of OpenAI and ElevenLabs spend per generation, with a full analytics dashboard showing cost-per-demo, usage history, and invoice details. Stripe handles checkout.

How we built it

DiffCast is a four-phase autonomous pipeline:

  1. Browser Agent -- An LLM-powered agent receives a URL, reads a pruned DOM snapshot (filtered to semantic/interactive elements), and decides the next action (click, fill, scroll, navigate). Playwright executes each action in headless Chromium with video recording, while a custom SVG cursor overlay and smooth-scroll injection make the recording look human and professional.
  2. Narrated Voiceover -- The action timeline makes a second LLM call that generates a time-stamped narration script. Each segment is synthesised via ElevenLabs (parallel API calls with a concurrency semaphore), then time-aligned with ffmpeg using adelay filters and mixed into a single track. Audio is muxed onto the video with stream-copy to avoid double-encoding.
  3. Metadata & Upload -- A third LLM call generates a title and description. The final .mp4 is uploaded to Google Cloud Storage for per-account persistence and a publicly shareable URL.
  4. Billing -- OpenAI and ElevenLabs costs are tracked as OpenTelemetry spans via paid.ai, with a per-demo billing signal containing full cost metadata.

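The per-step agent loop in step 1 can be sketched as below. This is a minimal illustration, not DiffCast's actual code: the snapshot schema, tag whitelist, and `next_action` heuristic are assumptions, and in the real pipeline the next action is chosen by an LLM rather than a placeholder rule.

```python
# Sketch of one iteration of the browser-agent loop:
# prune the DOM snapshot to semantic/interactive elements, then decide an action.
# Element schema and tag set are illustrative assumptions.

INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea", "form", "label"}

def prune_dom(elements):
    """Filter a flat DOM snapshot down to elements an agent can act on."""
    return [
        el for el in elements
        if el["tag"] in INTERACTIVE_TAGS or el.get("role") in {"button", "link"}
    ]

def next_action(pruned, goal):
    """Stand-in for the LLM call: pick the first unvisited interactive element."""
    for el in pruned:
        if not el.get("visited"):
            return {"action": "click", "selector": el["selector"]}
    return {"action": "done"}
```

In the real loop, the chosen action would then be executed by Playwright (e.g. `page.click(selector)`) inside a recording browser context, and the cycle repeats on a fresh snapshot.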
Stack: React + Vite + Tailwind + shadcn/ui (frontend), FastAPI + Playwright + MoviePy + ffmpeg (backend), Clerk (auth), Stripe (payments), Google Cloud Run (deployment), Secret Manager (secrets), Artifact Registry + GitHub Actions (CI/CD), paid.ai (usage-based billing), ElevenLabs (TTS), OpenAI/Anthropic (agent planning + narration + metadata).
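The stream-copy mux in step 2 boils down to a single ffmpeg invocation: copy the video stream untouched and attach the stitched narration track. A hedged sketch, with file names and the exact flag set as assumptions:

```python
# Sketch of the final mux: attach the mixed narration track to the screen
# recording without re-encoding the video stream (-c:v copy).
import subprocess

def build_mux_cmd(video_path, audio_path, out_path):
    """Assemble the ffmpeg command; kept separate so it can be inspected/tested."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,            # Playwright screen recording
        "-i", audio_path,            # stitched narration track
        "-c:v", "copy",              # copy video stream, no double-encoding
        "-c:a", "aac",               # encode narration to AAC for .mp4
        "-map", "0:v:0", "-map", "1:a:0",
        "-shortest",
        out_path,
    ]

def mux_audio(video_path, audio_path, out_path):
    subprocess.run(build_mux_cmd(video_path, audio_path, out_path), check=True)
```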

Challenges we ran into

Playwright in containers: Chromium requires ~20 system libraries and produces a ~2.7GB image. We hit ARM/AMD64 mismatches deploying from Apple Silicon to Cloud Run, and had to handle 15-30s cold starts when scaling from zero.

Hidden UI and multi-page support: Our initial browser-agent strategy could not record expandable elements or multi-page flows, so we switched to a repeated three-stage loop: DOM scrape, LLM action generation, and Playwright action execution.

LLM getting stuck: The browser agent would occasionally get stuck on certain UI elements, so we added a repair mechanism that verifies each Playwright action executed successfully and retries failures up to a maximum of four attempts.
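A minimal sketch of that repair loop, assuming a generic `execute` callable standing in for the real action executor (the production version would catch Playwright-specific errors such as `TimeoutError` rather than bare `Exception`):

```python
# Sketch of the repair mechanism: retry a failed browser action up to four
# attempts, re-raising the last error if every attempt fails.

MAX_ATTEMPTS = 4

def run_with_repair(execute, action):
    last_exc = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return execute(action)
        except Exception as exc:  # in practice: playwright.TimeoutError etc.
            last_exc = exc
    raise last_exc
```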

ElevenLabs syncing: Narration segments for closely spaced actions would often overlap, because ElevenLabs TTS output length is unpredictable and varies with voice morphology. We fixed this with a custom timeline-stitching mechanism that detects overlaps and shifts segments so clips never collide.
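That stitching can be sketched as a single pass that pushes each segment's start time forward whenever the previous clip would still be playing; the field names and gap value here are assumptions, and each placed start time would then feed an ffmpeg `adelay` filter.

```python
# Sketch of timeline stitching with overlap detection: TTS clip durations are
# only known after synthesis, so shift late-starting segments past the end of
# the previous clip. Times are in seconds.

MIN_GAP = 0.25  # breathing room between consecutive narration clips

def stitch_timeline(segments):
    """segments: list of {'start': float, 'duration': float}, sorted by start."""
    placed = []
    cursor = 0.0  # earliest time the next clip may begin
    for seg in segments:
        start = max(seg["start"], cursor)  # overlap? push forward
        placed.append({**seg, "start": start})
        cursor = start + seg["duration"] + MIN_GAP
    return placed
```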

Accomplishments that we're proud of

  • Early traction: While we were still building at the hackathon, other teams were already asking us for automatic demo video generation because of how easy the tool is to use.
  • End-to-end autonomy: From a single URL to a production-ready narrated video with zero human intervention. The agent handles arbitrary web apps, not just pre-configured flows.
  • The PR demo bot: Every pull request automatically gets an embedded video showing what the code change looks like in the browser. This alone could be a product.
  • Production-grade deployment: Fully containerised on Google Cloud Run with Secret Manager for credentials, CI/CD auto-deploy on push to master, usage-based billing, Stripe checkout, and a polished shareable video page.
  • Transparent unit economics: Every generation tracks exact LLM and TTS costs through paid.ai. We know precisely what each demo costs and can price accordingly.

What we learned

Building an AI agent that interacts with arbitrary web UIs is fundamentally different from building a chatbot. The DOM is messy, selectors break, pages load asynchronously, and the agent needs to reason about visual state from semantic HTML alone. The 4-second minimum action pause -- initially added for video pacing -- turned out to be essential for page stability too.

On the business side, integrating cost tracking from day one made unit economics visible immediately. Most AI products have no idea what a single user action costs them. Having that number ($0.20-0.30 per demo in LLM + TTS) shaped our pricing model and gave us confidence that the product can scale sustainably.

Initially, we were skeptical that a browser agent could generalise across arbitrary websites, but after rigorous testing and feedback-driven optimisation it held up, and we learned a ton in the process.

What's next for DiffCast

  • Scheduled regeneration: Set a cron to regenerate demos nightly or on each deploy, so documentation and sales assets never go stale.
  • GitHub bot: Wrap the PR workflow in an easy-to-deploy GitHub bot that anyone can install.
  • API access: Expose the pipeline as an API so teams can integrate demo generation into their CI/CD, documentation tooling, or customer onboarding flows.
  • Post-processing: A pipeline that trims long pauses and loading screens from the final video to cut length and bandwidth.
