Inspiration
Imagine a living time machine for Troy. Our portal turns historical records—old books, archives, even history podcasts—into short, narrated videos generated by AI. Instead of reading dry timelines, students can “watch” Troy’s past unfold as vivid, story-driven scenes: battles, protests, landmark openings, everyday life. Teachers get a ready-made, curriculum-friendly tool; museums and libraries get a modern way to showcase local history. And because the pipeline is fully AI-driven—from scraping historical text and audio to generating videos—the same platform can be scaled to any city, building an interactive atlas of urban history around the world.
What it does
The portal lets users browse a city’s history decade by decade through short, AI-generated films. On the interface, you simply choose a decade from a timeline or dropdown, and the system plays a video that visually walks through key events from that era—like a mini documentary for each slice of time. Each video is composed of narrated scenes and stylized images that bring historical moments to life, so instead of reading dense timelines, students and visitors can “watch” the story of the city unfold and quickly jump between different periods to explore how it changed over time.
How we built it
We built the system as a modular pipeline, wiring together separate components for data collection, AI generation, and presentation. First, we wrote Python scrapers to pull text from historical books, curated web pages, and podcast descriptions, then parsed that text into sentences, extracted years, and bucketed events by decade before calling the Gemini API to generate clean summaries, which we write out as CSV files. Those CSVs feed into an "AI director" stage, another Gemini prompt that turns each decade's summaries into structured scene descriptions. We then use ElevenLabs' text-to-speech API to turn those scene descriptions into audio narration, giving each decade a professional-sounding voiceover. Next, an "AI artist" stage runs the scene descriptions through a Stable Diffusion pipeline to render one image per scene, and Python tooling (moviepy/ffmpeg) stitches the generated images and narration into a video clip for each decade. Finally, we wrapped everything in a simple GUI that indexes the videos by decade, letting users browse the generated archive and watch the city's history chapter by chapter.
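To make that concrete, here is a minimal sketch of the collection and summarization stage, assuming the google-generativeai Python client; the model name, prompt wording, regex, and file names are illustrative rather than our exact code.

```python
# Minimal sketch of the scrape -> bucket -> summarize stage (illustrative names).
import csv
import re
from collections import defaultdict

import google.generativeai as genai  # assumes the google-generativeai package

genai.configure(api_key="YOUR_GEMINI_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")  # model choice is an assumption

YEAR_RE = re.compile(r"\b(1[6-9]\d{2}|20[0-2]\d)\b")  # rough match for 1600-2029

def bucket_by_decade(sentences):
    """Group scraped sentences by the decade of the first year they mention."""
    buckets = defaultdict(list)
    for sentence in sentences:
        match = YEAR_RE.search(sentence)
        if match:
            decade = (int(match.group()) // 10) * 10
            buckets[decade].append(sentence)
    return buckets

def summarize_decade(decade, sentences):
    """Ask Gemini for a grounded summary of one decade's events."""
    prompt = (
        f"Summarize the key events in Troy, NY during the {decade}s as 3-5 short, "
        "factual bullet points. Use only the source text below.\n\n"
        + "\n".join(sentences)
    )
    return model.generate_content(prompt).text.strip()

def write_decade_csv(buckets, path="decade_summaries.csv"):
    """Write one summary row per decade for the downstream 'AI director' stage."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["decade", "summary"])
        for decade in sorted(buckets):
            writer.writerow([decade, summarize_decade(decade, buckets[decade])])
```

And here is a similarly simplified sketch of the "AI artist" and assembly step, assuming Hugging Face diffusers for Stable Diffusion and MoviePy 1.x for stitching; the scene prompts, narration file, and output paths are placeholders.

```python
# Simplified sketch of scene rendering and video assembly (placeholder paths).
import torch
from diffusers import StableDiffusionPipeline
from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def render_decade_video(decade, scene_prompts, narration_path):
    """Render one image per scene, pace scenes to the narration, and write an MP4."""
    narration = AudioFileClip(narration_path)
    seconds_per_scene = narration.duration / len(scene_prompts)

    clips = []
    for i, prompt in enumerate(scene_prompts):
        image = pipe(prompt).images[0]          # one Stable Diffusion frame per scene
        frame_path = f"{decade}_scene_{i}.png"
        image.save(frame_path)
        clips.append(ImageClip(frame_path).set_duration(seconds_per_scene))

    video = concatenate_videoclips(clips, method="compose").set_audio(narration)
    video.write_videofile(f"{decade}.mp4", fps=24)
```

Giving each scene an equal slice of the narration's duration is the simplest way to keep audio and visuals roughly in sync without per-scene timing data.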
Challenges we ran into
We ran into multiple issues with the Gemini API, from quota limits to Gemini refusing to summarize inputs that were too large. We fixed these by throttling our queries and splitting the CSV generation up by source so each request stayed a manageable size. We also hit a snag with the ElevenLabs API when we ran out of credits; Major League Hacking graciously provided 100,000 additional credits, which got narration generation working again.
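The throttling itself boiled down to something like the sketch below: pacing successive calls and backing off on failures. The interval, retry count, and error handling are illustrative, and `model` is assumed to be a GenerativeModel from the google-generativeai client.

```python
# Illustrative throttle/retry wrapper around Gemini calls.
import time

def generate_with_throttle(model, prompt, min_interval=2.0, max_retries=5):
    """Call Gemini at most once per min_interval seconds, backing off on failures."""
    for attempt in range(max_retries):
        try:
            response = model.generate_content(prompt)
            time.sleep(min_interval)  # pace successive calls to stay under the quota
            return response.text
        except Exception as exc:  # rate-limit / oversized-input errors land here
            wait = min_interval * (2 ** attempt)
            print(f"Gemini call failed ({exc}); retrying in {wait:.0f}s")
            time.sleep(wait)
    raise RuntimeError("Gemini call kept failing after retries")
```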
Accomplishments that we're proud of
We’re proud that we got an end-to-end experience working: you can start from messy, unstructured historical sources and end up with a playable video for each decade in the UI. Along the way, we made Gemini act as both a historian and a director, keeping the summaries grounded while also turning them into coherent, visually interesting scene descriptions. We also managed to plug Stable Diffusion into that workflow in a controllable way, so each scene description reliably produces a meaningful frame instead of random art. Finally, we’re happy with how the GUI ties everything together—turning a fairly complex AI pipeline into something that feels simple and intuitive: pick a decade, hit play, and watch the city’s story unfold.
What we learned
We learned a lot about how hard it is to keep an end-to-end AI pipeline both accurate and coherent when you’re chaining multiple models together. On the data side, we saw that scraping historical sources is messy: dates are inconsistent, events are duplicated, and small parsing mistakes can push an event into the wrong decade, so we had to design careful heuristics and checks. Prompting Gemini to behave like both a factual historian and a creative director taught us how sensitive summaries and scene descriptions are to prompt wording and length. On the generative side, we learned that Stable Diffusion will happily ignore structure unless you give it very clear, constrained prompts, and that small changes in phrasing can dramatically change visual style. Finally, we realized how important a simple UI is: the more complex the backend got, the more we had to simplify the front-end so users just feel like they’re browsing a clean, interactive time machine rather than driving a pile of APIs.
What's next for RetroVision
Next for RetroVision, we want to grow it from a Troy-specific demo into a reusable “time machine” for many cities. On the content side, that means plugging in more curated sources (local archives, museum collections, city datasets) and making the videos livelier and more dynamic. On the AI side, we’d like to refine the director/artist loop so scenes stay consistent across decades, experiment with faster/cheaper image models, and add guardrails to keep everything historically grounded. And on the product side, we’re excited about building a classroom-friendly mode—lesson playlists, teacher notes, and maybe even a Q&A layer where students can ask Gemini follow-up questions about what they just watched.