Inspiration

Every few weeks, another unfortunate headline appears in the news: "X injured in shooting at Y." Tragedies like these have become commonplace in today's world, and that is what inspired us to create Akawa: an AI-powered, multimodal detection platform capable of detecting guns, knives, fights, and suspicious sounds, giving immediate awareness of what could be happening before anyone gets hurt.

What it does

Akawa is a full-stack multimodal threat detection platform that works with existing CCTV hardware:

  • Weapon detection — YOLOv11 with BoT-SORT tracking identifies guns and knives in real time
  • Violence detection — Two-Stream CNN-LSTM model analyzes optical flow across frame sequences to detect fights and brawls
  • Fall / medical emergency detection — YOLO-pose with keypoint biomechanics (spine angle, head-ankle proximity)
  • Audio threat detection — LSTM classifier detects gunshots and glassbreak; Qwen2-Audio-7B provides contextual narration
  • Incident reports — Auto-generated PDFs with frame snapshots, detection tables, and VLM summaries, uploaded to Cloudflare R2 and emailed to configured contacts
  • Security Copilot — Conversational AI interface over live camera data powered by Gemini 2.5 Flash + Supermemory semantic search
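As a rough sketch of the keypoint biomechanics above: treat the angle of the shoulder-to-hip vector from vertical as the spine angle, and flag a possible fall when the torso is near-horizontal and the head sits near ankle height. The keypoint names, thresholds, and helper below are illustrative assumptions, not Akawa's actual implementation:

```python
import math

def spine_angle_deg(shoulder_mid, hip_mid):
    """Angle of the shoulder->hip vector from vertical, in degrees.
    0 = upright; near 90 suggests the torso is horizontal."""
    dx = hip_mid[0] - shoulder_mid[0]
    dy = hip_mid[1] - shoulder_mid[1]  # image y grows downward
    return abs(math.degrees(math.atan2(dx, dy)))

def looks_fallen(keypoints, angle_thresh=60.0, proximity_ratio=0.3):
    """Hypothetical fall heuristic: near-horizontal spine AND the head's
    vertical position close to ankle height (relative to torso length)."""
    shoulder_mid = keypoints["shoulder_mid"]
    hip_mid = keypoints["hip_mid"]
    head_y = keypoints["head"][1]
    ankle_y = keypoints["ankle_mid"][1]
    torso_len = math.dist(shoulder_mid, hip_mid) or 1.0
    horizontal = spine_angle_deg(shoulder_mid, hip_mid) > angle_thresh
    head_low = abs(ankle_y - head_y) < proximity_ratio * torso_len
    return horizontal and head_low
```

An upright pose yields a spine angle near 0° and a large head-to-ankle gap, so both conditions fail; a lying pose satisfies both.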

How we built it

Frontend: Next.js 14 with a brutalist terminal aesthetic. Features a live multi-camera WebSocket dashboard, video upload analysis page, and incident reports viewer.

Backend: FastAPI server managing stream state, a dedicated WebSocket server on port 9001 for JPEG frame relay, Firebase Realtime Database persistence, and Cloudflare R2 for asset storage.

Vision pipeline (Modal serverless GPUs):

  • FastVisionAPI — YOLOv11 weapon detection + Two-Stream brawl model with Farneback optical flow + YOLO-pose fall detection, batched in sequences of 5 frames
  • Qwen2-VL-7B — Deep video summarization for incident reports

Audio pipeline (Modal serverless GPUs):

  • ReyvazDetector — LSTM/GRU model on mel spectrograms with Bayesian posterior fusion for gunshot/glassbreak classification
  • Qwen2-Audio-7B — Contextual threat narration and officer recommendations
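One way to read "Bayesian posterior fusion" here is combining per-window classifier probabilities under a conditional-independence assumption: convert each window's posterior to a likelihood ratio against a shared prior, multiply in odds space, then renormalize. This is our interpretation with a hypothetical prior, not the ReyvazDetector's exact math:

```python
def fuse_posteriors(window_probs, prior=0.01):
    """Fuse per-window detection probabilities into one posterior,
    assuming windows are conditionally independent given the event.
    (Illustrative sketch; the prior value is an assumption.)"""
    prior_odds = prior / (1.0 - prior)
    odds = prior_odds
    for p in window_probs:
        p = min(max(p, 1e-6), 1 - 1e-6)  # clamp to avoid division by zero
        # likelihood ratio implied by this window's posterior and the prior
        odds *= (p / (1.0 - p)) / prior_odds
    return odds / (1.0 + odds)
```

Two agreeing high-probability windows reinforce each other sharply, while a single window at the prior leaves the posterior unchanged.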

Notifications: Cloudflare Workers email proxy with per-alert-type routing and cooldown management.

Additional Cloudflare services: Cloudflare R2 stores incident reports and the video logs of weapon and violence alerts; Cloudflare Realtime streams live camera feeds to the AI inference models.

Challenges we ran into

  • False positives on violence detection — The brawl model reads optical flow magnitudes, so someone jogging with a bag scored nearly identically to a fight. Fixed with a weapon dominance gate: if weapons appear in ≥25% of sampled frames, the brawl model is skipped entirely.
  • Optical flow continuity across batches — Each 5-frame batch was resetting flow context, causing motion spikes at every seam. Fixed by persisting prev_gray per stream ID in Modal server-side state.
  • Weapon label instability — Gun/knife classification flickered frame-to-frame. Fixed with a sliding-window label smoother that gives 2.5× weight to "Gun" votes and hard-locks the label at 0.45 confidence.
  • Clip format compatibility — Live cameras produced AVI files that browsers couldn't play. Added an ffmpeg transcode step to MP4 on the backend.
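The weapon dominance gate from the first bullet is simple to express. The function name and detection format below are assumptions; the ≥25% threshold and Gun/Knife labels come from the write-up:

```python
def weapon_dominance_gate(frame_detections, threshold=0.25):
    """Skip the brawl model when weapons dominate the sampled frames.
    frame_detections: per-frame lists of labels from the weapon detector.
    Returns True when the brawl model should be bypassed."""
    if not frame_detections:
        return False
    armed = sum(
        1 for labels in frame_detections
        if any(lbl in ("Gun", "Knife") for lbl in labels)
    )
    return armed / len(frame_detections) >= threshold
```

With a weapon visible, the optical-flow signal is redundant anyway, so bypassing the brawl model both cuts latency and removes a false-positive source.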
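The label smoother from the third bullet can be sketched as a weighted majority vote over recent frames. The window size and class structure here are assumptions; the 2.5× "Gun" weight and 0.45 hard lock are from the write-up:

```python
from collections import deque

class LabelSmoother:
    """Weighted majority vote over recent frames. 'Gun' votes are
    up-weighted (a missed gun is costlier than a missed knife), and any
    single high-confidence detection hard-locks the label."""

    def __init__(self, window=10, gun_weight=2.5, lock_conf=0.45):
        self.votes = deque(maxlen=window)
        self.gun_weight = gun_weight
        self.lock_conf = lock_conf
        self.locked = None  # label pinned by a confident detection

    def update(self, label, conf):
        if conf >= self.lock_conf:
            self.locked = label  # hard lock on a confident hit
        self.votes.append(label)
        if self.locked:
            return self.locked
        tally = {}
        for lbl in self.votes:
            w = self.gun_weight if lbl == "Gun" else 1.0
            tally[lbl] = tally.get(lbl, 0.0) + w
        return max(tally, key=tally.get)
```

The asymmetric weighting biases ambiguous windows toward "Gun", and the lock keeps one confident frame from being overturned by later low-confidence flicker.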

Accomplishments that we're proud of

  • End-to-end incident pipeline — Threat detected → clip saved → PDF generated → uploaded to R2 → persisted to Firebase → emailed to contacts, all automatically within seconds
  • Multi-camera grid with real-time WebSocket streaming, detection overlay rendering, and isolated per-stream AI state running simultaneously
  • Security Copilot combining live Firebase event data with Supermemory semantic retrieval — a conversational interface over your own camera history
  • Works entirely on existing CCTV infrastructure — no new hardware required

What we learned

  • Sequence-level and frame-level detections serve different purposes — YOLO gives precise per-frame bounding boxes, but the violence CNN needs temporal context. Trying to merge them in a single inference call caused architectural issues that were only resolved by treating them as parallel pipelines with a priority arbitration layer.
  • Clip quality matters for VLM accuracy — Qwen2-VL's analysis was significantly more useful when we selected the brightest representative frame rather than just the first frame (which was often black during stream startup).
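A minimal form of the priority arbitration layer mentioned above might look like the following; the priority table and event shape are assumptions, not Akawa's actual schema:

```python
# Hypothetical severity order: weapons outrank violence, which outranks falls
PRIORITY = {"weapon": 3, "violence": 2, "fall": 1}

def arbitrate(events):
    """Given concurrent detections from the parallel pipelines, surface the
    highest-priority one as the alert; ties break on confidence."""
    return max(events, key=lambda e: (PRIORITY.get(e["type"], 0), e["conf"]))
```

This keeps each pipeline independent at inference time and defers the "which alert wins" decision to a single cheap post-processing step.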
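Picking the brightest representative frame is a one-liner in spirit: score each frame by mean pixel intensity so near-black startup frames lose. This sketch assumes plain grayscale frames as nested lists; the real pipeline presumably works on decoded image arrays:

```python
def pick_representative_frame(frames):
    """Return the frame with the highest mean pixel intensity, so
    near-black startup frames are never sent to the VLM.
    frames: list of 2D grayscale frames (rows of 0-255 ints)."""
    def mean_brightness(frame):
        total = sum(sum(row) for row in frame)
        count = sum(len(row) for row in frame)
        return total / count if count else 0.0
    return max(frames, key=mean_brightness)
```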

What's next for Akawa

  • Automated 911 / emergency dispatch calling — On a confirmed threat, Akawa would automatically call emergency services with a synthesized briefing: threat type, camera location, confidence level, and a live VLM description of the event. No human needs to pick up a phone first. We had this scoped out during the hackathon, but automating outbound calls requires enterprise-tier subscriptions and prior regulatory approval from providers like Twilio, vetting that can't be completed in a weekend. Once approved, it slots directly into the existing notifications.py pipeline alongside the current email alerts.
