Inspiration

SurveiLens started from a simple realization: most CCTV systems only record events; they don’t understand them. During late-night hackathons and campus security incidents, our team noticed that even with extensive surveillance coverage, human review always came too late. We wanted to build something that could analyze what cameras see and hear in real time, detect potential danger, and take immediate action rather than reacting after the fact. We were motivated by the growing need for proactive intelligence in public safety, where CCTV cameras act like digital security assistants that can reason, respond, and alert before incidents escalate.

What it does

SurveiLens transforms traditional CCTV networks into an AI-powered security system that can see, hear, and understand its surroundings. Using Gemini for visual analysis and ElevenLabs for audio transcription, the system identifies people, weapons, and key actions or sounds to detect potentially dangerous or abnormal events. Gemini itself returns a danger score (0–1) and a danger level (low, medium, or high) for each event, based on its own assessment of the scene. The system stores all events and metadata as JSON for logging and potential integration with downstream analytics platforms. Real-time alerts, snapshots, and transcripts make each event actionable immediately.
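As a rough illustration of what one stored event could look like, here is a minimal Python sketch that parses and validates a JSON response. The field names (`people_count`, `weapons_detected`, `actions_detected`, `danger_score`, `danger_level`) are assumptions chosen for this example, not the project's confirmed schema; only the 0–1 score range and the three levels come from the description above.

```python
import json
from dataclasses import dataclass

# Levels named in the write-up; the field names below are hypothetical.
DANGER_LEVELS = ("low", "medium", "high")


@dataclass
class Event:
    people_count: int
    weapons_detected: list
    actions_detected: list
    danger_score: float
    danger_level: str


def parse_event(raw: str) -> Event:
    """Parse a JSON string into an Event, rejecting out-of-range values."""
    data = json.loads(raw)

    score = float(data["danger_score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"danger_score out of range: {score}")

    level = str(data["danger_level"]).lower()
    if level not in DANGER_LEVELS:
        raise ValueError(f"unknown danger_level: {level}")

    return Event(
        people_count=int(data.get("people_count", 0)),
        weapons_detected=list(data.get("weapons_detected", [])),
        actions_detected=list(data.get("actions_detected", [])),
        danger_score=score,
        danger_level=level,
    )
```

Validating the score and level before logging means a malformed model response fails loudly instead of silently entering the event timeline.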

How we built it

We built SurveiLens by treating each CCTV camera as an AI node. Each camera captured frames every second, and the frames from the most recent 10-second sliding window were sent to Gemini for analysis. At the same time, the last 10 seconds of audio were captured, saved, and transcribed with ElevenLabs. The frames and the audio transcript were then sent together to Gemini, which returned structured JSON containing the people count, weapons detected, actions detected, and the danger score and level. Events, snapshots, and transcripts were logged locally, creating a complete timeline of analyzed events. Our Python pipeline handled frame sampling, audio chunking, and parallel API calls to Gemini and ElevenLabs. The system was designed to maintain a near-real-time view with low latency, approximating continuous monitoring without overloading resources.
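The sliding window and the parallel step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: `SlidingWindow` and `analyze_tick` are hypothetical names, and the `transcribe`, `analyze`, and `encode` callables are placeholders standing in for the ElevenLabs call, the Gemini call, and frame encoding. No real SDK functions are used here.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

WINDOW_SECONDS = 10  # window length taken from the write-up


class SlidingWindow:
    """Keeps only the payloads captured within the last `seconds` seconds."""

    def __init__(self, seconds=WINDOW_SECONDS):
        self.seconds = seconds
        self._items = deque()  # (timestamp, payload) pairs, oldest first

    def add(self, payload, timestamp):
        self._items.append((timestamp, payload))
        # Drop entries that have fallen out of the window.
        while self._items and timestamp - self._items[0][0] >= self.seconds:
            self._items.popleft()

    def snapshot(self):
        """Return the payloads currently inside the window, oldest first."""
        return [payload for _, payload in self._items]


def analyze_tick(frames, audio_chunk, transcribe, analyze, encode):
    """Transcribe audio while frames are encoded, then analyze both together.

    `transcribe`, `analyze`, and `encode` are caller-supplied placeholders.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        transcript_future = pool.submit(transcribe, audio_chunk)
        encoded_future = pool.submit(lambda: [encode(f) for f in frames])
        return analyze(encoded_future.result(), transcript_future.result())
```

With one frame added per second (an assumption for this example), the window settles at ten frames. Running the transcription in a worker thread keeps the slower network call from blocking frame preparation for the same tick.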

Challenges we ran into

Synchronizing live video and audio streams was one of the toughest challenges: making each frame line up with its corresponding audio required careful tuning of buffer sizes and tick intervals. Latency in the Gemini and ElevenLabs API calls created bottlenecks, which we addressed with parallel processing and efficient sliding-window batching. Managing large JSON outputs also required a clear folder and naming structure for snapshots, audio, transcripts, and Gemini responses. On the frontend and logging side, creating the perception of continuous video from discrete snapshots required careful timing and caching so that users could follow events almost in real time.
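One way to organize such a folder and naming structure is sketched below. The layout (camera, then date, then window start time) and the file names are assumptions made for illustration; the write-up does not specify the actual scheme, and `window_paths` is a hypothetical helper.

```python
from datetime import datetime, timezone
from pathlib import Path


def window_paths(root, camera_id, window_start):
    """Build output paths for one analysis window.

    `window_start` is expected to be a UTC datetime. The directory layout
    and file names here are illustrative, not the project's real ones.
    """
    stamp = window_start.strftime("%Y%m%dT%H%M%SZ")
    base = Path(root) / camera_id / window_start.strftime("%Y-%m-%d") / stamp
    return {
        "snapshots": base / "snapshots",             # directory of frame images
        "audio": base / "audio.wav",                 # raw audio chunk
        "transcript": base / "transcript.txt",       # transcription output
        "response": base / "gemini_response.json",   # structured model output
    }


paths = window_paths(
    "events", "cam01", datetime(2025, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
)
print(paths["response"].as_posix())
# events/cam01/2025-01-02/20250102T030405Z/gemini_response.json
```

Keying every artifact on the same window timestamp makes it straightforward to line up a snapshot, its audio, its transcript, and the model response when reviewing an event later.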

Accomplishments that we're proud of

We’re proud that SurveiLens turns ordinary CCTV cameras into an intelligent, scalable AI network without requiring specialized hardware. We combined multiple technologies (Gemini, ElevenLabs, threading, OpenCV, and structured logging) into a single pipeline that can analyze and reason over real-time multimodal data. The system reliably produces a danger score and level for each 10-second window, along with snapshots and audio transcripts, giving a complete picture of each event.

What we learned

We learned how to design multimodal AI pipelines in which models process different types of data (image, sound, and language) and produce structured outputs. We also learned how to reduce latency in a continuous capture-and-analysis loop, how to manage sliding-window buffers for video and audio, and how to structure JSON and filesystem outputs for traceability and future analytics. Most importantly, we learned how to build AI systems that focus on understanding and preventing harm rather than just recording it.

What's next for SurveiLens

Next, we plan to integrate real-time streaming protocols such as WebRTC and explore LLM-driven behavioral reasoning, so that SurveiLens can anticipate potential risks as well as detect events. We also aim to build dashboards and automated alerting that summarize historical trends, support policy enforcement, and connect with emergency response systems. Our long-term vision is to make SurveiLens a standard for intelligent, multimodal surveillance infrastructure that is accessible, ethical, and designed to protect.
