ClampsAI: Agentic, Actionable Multi-Camera Security System

Inspiration

Security cameras are everywhere, but they're only useful if someone is actively watching them. We saw this problem firsthand in retail stores and public spaces, where security personnel struggle to monitor multiple feeds simultaneously. This inspired us to create ClampsAI, an autonomous system that intelligently monitors multiple security feeds and takes the appropriate action only when necessary, whether that means notifying emergency services with real-time details, alerting security personnel, or contacting a designated family member in relevant situations. ClampsAI aims to shorten response times, enhance situational awareness, and transform security monitoring.

What it does

ClampsAI is a real-time surveillance system that:

Monitors multiple cameras simultaneously to detect threats
Utilizes parallel AI agents powered by Gemini AI to detect potential threats in each feed
Cross-references multiple cameras to reduce false positives
Identifies and highlights threatened areas and malicious activities
Makes automated calls to the appropriate authorities with real-time incident reporting
Stores annotated incident reports for future review and security analysis
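
To illustrate why cross-referencing cameras reduces false positives, here is a minimal sketch. ClampsAI performs this step with a reasoning model; the rule-based vote below is a simplified stand-in for illustration only. The `Detection` fields, the `corroborated_threats` name, and the thresholds are all assumptions, not the project's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    camera_id: str
    threat_type: str   # hypothetical label, e.g. "fire" or "intrusion"
    confidence: float  # 0.0 to 1.0, as reported by a per-camera agent
    timestamp: float   # seconds


def corroborated_threats(detections, min_cameras=2, window_s=10.0, min_confidence=0.5):
    """Return threat types seen by at least `min_cameras` distinct cameras
    within `window_s` seconds of each other."""
    # Drop weak detections, then order the rest by time.
    strong = sorted(
        (d for d in detections if d.confidence >= min_confidence),
        key=lambda d: d.timestamp,
    )
    confirmed = set()
    for i, first in enumerate(strong):
        cameras = {first.camera_id}
        for other in strong[i + 1:]:
            if other.timestamp - first.timestamp > window_s:
                break  # later detections fall outside the time window
            if other.threat_type == first.threat_type:
                cameras.add(other.camera_id)
        if len(cameras) >= min_cameras:
            confirmed.add(first.threat_type)
    return confirmed
```

A single camera reporting an intrusion would not trigger a call under this rule, while two cameras reporting a fire a few seconds apart would.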

How we built it

We built ClampsAI iteratively:

Created a WebRTC-based frontend for multi-camera capture
Created parallel threat detection agents using Gemini 2.0 Flash that together cover multiple security camera streams. Extended the real-time Gemini Multimodal API experience to process 3 security feeds at once while also handling phone calls.
Integrated cross-camera synthesis using reasoning models to determine the threat intensity of each stream and the appropriate next steps, and to generate context-rich call content for emergency service providers or emergency contacts (depending on the identified threat type).
Integrated Twilio and ElevenLabs to initiate and carry out the appropriate calls with very low latency. Calls to emergency service providers are placed within 5-10 seconds of a threat being detected. Once on the line with the emergency service provider, the multimodal streaming system answers natural language queries in under 3 seconds.
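
The routing step, where the identified threat type decides who gets called and what is said, can be sketched as follows. This is a hedged illustration only: the threat labels, recipient names, `choose_recipient`, and `build_call_script` are all hypothetical. The actual system generates call content with a reasoning model and places calls through Twilio and ElevenLabs; those calls are omitted here.

```python
# Hypothetical routing table: the threat labels and recipient names are
# illustrative assumptions, not the project's actual categories.
ROUTES = {
    "fire": "emergency_services",
    "medical_emergency": "emergency_services",
    "theft": "security_personnel",
    "fall_at_home": "family_contact",
}


def choose_recipient(threat_type: str) -> str:
    """Pick who to call; unknown threat types fall back to security personnel."""
    return ROUTES.get(threat_type, "security_personnel")


def build_call_script(threat_type: str, location: str, summary: str) -> str:
    """Compose the opening statement that a text-to-speech voice would read out."""
    readable = threat_type.replace("_", " ")
    return (
        "This is an automated security alert. "
        f"A possible {readable} was detected at {location}. {summary}"
    )
```

Keeping the routing in a plain table makes the fallback explicit: anything the system cannot classify goes to a human on site rather than to emergency services.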

Our tech stack:

Frontend: Next.js
Backend: Google Gemini 2.0 Flash, Flask, Twilio, ElevenLabs
Hardware: HD WebCams

Challenges we ran into

One of the largest challenges we faced was that the Gemini Multimodal Live API endpoint accepts only one stream of video. Our goal was to synchronize multiple video streams and perform natural language understanding tasks over them. To do this, we had to re-engineer the Multimodal Live workflow ourselves: we stream short live segments of video from each individual streaming node into Gemini 2.0 Flash, replicating the experience of using Multimodal Live. Gemini 2.0 Flash's low-latency video understanding made this possible.
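
One way to picture the segmenting approach is the sketch below, which cuts each camera's frames into short segments and yields them round-robin, tagged with the camera they came from. It is an assumption-laden illustration, not the project's pipeline: `segments`, `interleave_segments`, and the segment length are hypothetical, and the model call that would consume each segment is left out.

```python
from itertools import islice


def segments(frames, segment_len):
    """Split an iterable of frames into lists of at most `segment_len` frames."""
    it = iter(frames)
    while True:
        chunk = list(islice(it, segment_len))
        if not chunk:
            return
        yield chunk


def interleave_segments(streams, segment_len):
    """Yield (camera_id, segment) pairs round-robin across cameras, so no
    single feed monopolizes the model while the others wait."""
    generators = {cam: segments(frames, segment_len) for cam, frames in streams.items()}
    while generators:
        for cam in list(generators):
            try:
                yield cam, next(generators[cam])
            except StopIteration:
                del generators[cam]  # this camera has no more frames
```

For example, `list(interleave_segments({"cam1": [1, 2, 3, 4], "cam2": ["a", "b", "c"]}, 2))` alternates between the two cameras until both are exhausted.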

Another major challenge was high latency. We have multiple agentic steps with a lot of data coming in continuously, but we need low latency for our real-time phone calling feature. We fixed this by parallelizing our threat analysis and cross-camera synthesis and by experimenting with our video capture pipeline to reduce latency.
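
The benefit of parallelizing per-camera analysis can be shown with a small sketch using Python's standard `concurrent.futures` module. Here `analyze_segment` is a hypothetical stand-in that only sleeps to simulate a network-bound model call, and `analyze_all` is an assumed name; neither reflects the project's actual code.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def analyze_segment(camera_id, segment):
    """Hypothetical stand-in for one per-camera model call (simulated by a short sleep)."""
    time.sleep(0.2)
    return {"camera_id": camera_id, "frames": len(segment)}


def analyze_all(segments_by_camera):
    """Analyze every camera's segment concurrently and collect results by camera."""
    workers = max(1, len(segments_by_camera))  # ThreadPoolExecutor requires at least 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            cam: pool.submit(analyze_segment, cam, seg)
            for cam, seg in segments_by_camera.items()
        }
        # result() blocks until that camera's analysis has finished.
        return {cam: fut.result() for cam, fut in futures.items()}
```

Because each simulated call spends its time waiting rather than computing, three cameras analyzed this way take roughly as long as one, instead of three times as long when run back to back.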

Accomplishments that we're proud of

Achieved real-time threat detection with minimal latency
Re-engineered the Multimodal Live API workflow by streaming video through the Gemini video understanding API, enabling real-time querying over multiple video streams
Successfully reduced false positives through cross-camera synthesis and smart prompting techniques
Achieved accurate threat detection with context-rich reports delivered to emergency service providers
Built a working multi-camera surveillance system that responds to emergencies in real time by calling the right emergency contact

What we learned

Performance capabilities of real-time language model APIs
Gemini reasoning models, tool calling, Twilio for emergency calls, ElevenLabs
Importance of cross-referencing data to reduce false positives
Multimodal input to enhance security applications, threat detection, and response