Inspiration
Families record more moments than ever, yet most of those recordings are never revisited. Videos sit in cloud drives, voice notes live in chat apps, and photos are scattered across devices. When someone is no longer around, those fragments become difficult to navigate and impossible to interact with.
We wanted to change how family history is preserved and experienced. Rather than creating another digital archive, we set out to build something people could talk to - while remaining honest about what is real and what is not. Heirloom is driven by the belief that memory should feel human, but it should also be verifiable. Every response is grounded in the original recordings, with exact citations back to the source material.
How we built it
Heirloom is built around a clear separation of concerns: extraction, grounding, and interaction.
Users upload existing family media - video, audio, images, or text - with a strict 100 MB limit to ensure fast, reliable processing. Once a file exists in storage, our backend uses Google Gemini to analyse the content and extract structured “memory units”. These represent real events or moments, each with a title, summary, places, dates, and precise timestamps for audio and video.
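A memory unit can be modelled as a small typed record. The sketch below uses illustrative field names and sample values - not Heirloom's exact schema - to show the shape of what extraction produces:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MemoryUnit:
    """One extracted event or moment. Field names are illustrative."""
    title: str
    summary: str
    source_file: str                      # storage key of the uploaded media
    places: list[str] = field(default_factory=list)
    dates: list[str] = field(default_factory=list)   # e.g. "1972" or "summer 1987"
    start_seconds: Optional[float] = None            # set for audio/video sources
    end_seconds: Optional[float] = None

# Hypothetical example of one extracted unit:
unit = MemoryUnit(
    title="Grandpa's first car",
    summary="He describes buying a used Beetle in 1972.",
    source_file="uploads/interview_01.mp3",
    places=["Bristol"],
    dates=["1972"],
    start_seconds=312.5,
    end_seconds=367.0,
)
```

Keeping timestamps on every audio/video unit is what later makes exact citations possible.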
For interaction, we use a voice-first interface. When a user asks a question, Gemini generates a response strictly from the extracted memories, and ElevenLabs synthesises the answer in the original person’s voice. Crucially, while the answer plays, the interface displays citations showing the exact clip and timestamp where that information came from. Users can open the source and see or hear the evidence themselves.
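The citation display hinges on rendering those stored timestamps precisely. A small helper - with a hypothetical answer payload, not Heirloom's exact response format - illustrates one cited answer:

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS, or HH:MM:SS for long recordings."""
    m, s = divmod(int(seconds), 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d}" if h else f"{m:02d}:{s:02d}"

# Hypothetical shape of a grounded answer shown alongside the voice playback:
answer = {
    "text": "He bought his first car, a used Beetle, in 1972.",
    "citations": [{
        "source": "uploads/interview_01.mp3",   # illustrative storage key
        "clip": f"{format_timestamp(312.5)}-{format_timestamp(367.0)}",
    }],
}
print(answer["citations"][0]["clip"])   # 05:12-06:07
```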
The system is designed to avoid hallucination. If a detail is not present in the uploaded media, it is not invented. Our frontend is built with Next.js, Tailwind, and Framer Motion to keep the experience calm and intuitive, while the backend uses FastAPI with asynchronous jobs to handle long-running media analysis safely.
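The asynchronous-job pattern behind media analysis - accept the upload, return immediately, let the client poll for status - can be sketched with plain asyncio. The names, the in-memory job table, and the sleep standing in for the Gemini call are all assumptions, not Heirloom's actual implementation:

```python
import asyncio

JOBS: dict[str, str] = {}   # job_id -> status (in-memory stand-in for a real store)

async def analyse_media(job_id: str, file_key: str) -> None:
    """Long-running analysis runs off the request path."""
    JOBS[job_id] = "processing"
    await asyncio.sleep(0.01)   # stand-in for the Gemini extraction call
    JOBS[job_id] = "done"

async def main() -> None:
    # The request handler would schedule this and respond at once;
    # the client then polls JOBS[job_id] until it reads "done".
    task = asyncio.create_task(analyse_media("job-1", "uploads/interview_01.mp3"))
    await task

asyncio.run(main())
```

In FastAPI this slot is typically filled by `BackgroundTasks` or an external task queue; the polling contract is the same either way.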
Challenges
The hardest challenge was balancing emotional impact with technical honesty. It is easy to make an AI sound convincing; it is much harder to make it accountable. Ensuring every answer could be traced back to a real moment in the source material required careful schema design, timestamp handling, and strict prompting.
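Part of that accountability can be enforced mechanically: before an answer plays, every citation the model emits can be checked against the extracted memory units and rejected if it points nowhere real. A minimal sketch with illustrative field names:

```python
def validate_citations(citations: list[dict], memory_units: list[dict]) -> bool:
    """True only if every cited clip matches a real extracted moment."""
    known = {(u["source_file"], u["start_seconds"]) for u in memory_units}
    return all((c["source_file"], c["start_seconds"]) in known for c in citations)

units = [{"source_file": "uploads/interview_01.mp3", "start_seconds": 312.5}]

good = [{"source_file": "uploads/interview_01.mp3", "start_seconds": 312.5}]
bad = [{"source_file": "uploads/interview_01.mp3", "start_seconds": 99.0}]
```

An answer whose citations fail this check would be regenerated or refused rather than shown.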
Processing long audio and video reliably within a hackathon timeframe was also challenging. We had to design an extraction pipeline that could segment content meaningfully, retry safely on failures, and avoid duplicating data when jobs were re-run.
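Two small mechanisms cover the retry and deduplication requirements: deriving a deterministic key for each extracted unit (so a re-run overwrites instead of duplicating) and wrapping flaky calls in exponential backoff. This is a sketch of the pattern, with hypothetical names and an in-memory store:

```python
import hashlib
import time

STORE: dict[str, dict] = {}   # stand-in for the real database

def unit_key(source_file: str, start_seconds: float, title: str) -> str:
    """Deterministic key: re-running a job upserts rather than duplicates."""
    raw = f"{source_file}|{start_seconds}|{title}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

def save_unit(unit: dict) -> None:
    STORE[unit_key(unit["source_file"], unit["start_seconds"], unit["title"])] = unit

def with_retries(fn, attempts: int = 3, base_delay: float = 0.01):
    """Retry fn with exponential backoff; re-raise after the last attempt."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)

unit = {"source_file": "uploads/interview_01.mp3", "start_seconds": 312.5,
        "title": "Grandpa's first car"}
save_unit(unit)
save_unit(unit)   # re-run: same key, no duplicate row
```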
Latency was another concern. For the experience to feel like a conversation, responses needed to arrive quickly. We optimised the flow by pre-processing memories, limiting context size, and streaming voice output as soon as text became available.
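One simple way to stream voice as soon as text is available is to flush complete sentences out of the token stream and hand each to the TTS engine immediately instead of waiting for the full answer. The naive regex splitter below is an assumption for illustration, not ElevenLabs' API:

```python
import re
from typing import Iterable, Iterator

def sentences(text_stream: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the token stream produces them,
    so speech synthesis can begin before generation finishes."""
    buf = ""
    for chunk in text_stream:
        buf += chunk
        while True:
            m = re.search(r"[.!?]\s+", buf)
            if not m:
                break
            yield buf[:m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()   # flush whatever remains at end of stream

chunks = ["He bought ", "the car. ", "It was red."]
out = list(sentences(chunks))
```

Each yielded sentence can be sent to synthesis while later tokens are still being generated, which is what makes the exchange feel conversational.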
What we learned
We learned that trust is the most important part of AI-driven storytelling. Users are far more comfortable engaging emotionally when they can see the proof behind what the system says. Showing citations alongside voice responses changed how people reacted - they listened more closely and questioned less.
We also learned that small, well-defined constraints help creativity. Limiting upload size, keeping event types controlled, and enforcing grounded answers allowed us to move faster and build something more reliable within the hackathon.
Most importantly, we learned that AI can support memory without replacing it. When used carefully, it can help families reconnect with real voices, real moments, and real experiences - without pretending to be something it is not.
Built With
- amazon-web-services
- elevenlabs
- fastapi
- gemini
- nextjs
- postgresql
- python
- react
- s3-compatible-storage
- supabase
- swr
- tailwindcss
- typescript