Inspiration

The current landscape of AI is entirely turn-based. You send a prompt, you wait, and it responds. But reality isn't turn-based—it's continuous. We were inspired by the massive gap between human perception, which constantly processes an uninterrupted stream of reality, and AI, which remains frozen in time until summoned. We wanted to build the next evolution: a "Living AI." An intelligence that doesn't just wait for questions, but actively watches, monitors, and reacts to live data streams in real time, whether it's biometric health markers, fast-moving financial markets, or live security camera feeds.

What it does

Our project transforms a traditional Large Language Model into an always-on, autonomous agent. It features three core modes:

  • Health Monitoring: Watches live heartbeat data, instantly deploying emergency services via a call_911 tool if it detects critical anomalies.
  • Algorithmic Trading: Continuously processes live stock prices, autonomously executing buy_stock and sell_stock commands based on its real-time market analysis.
  • Security Surveillance: Processes live webcam feeds at ~3 FPS, using a vision model to detect threats and execute a trigger_alarm tool.

How we built it

We flipped the standard chat architecture upside down. Instead of discrete request-response cycles, we built a continuous generation loop.

To make this possible without devastating latency, we engineered a custom inference engine that utilizes KV Cache Injection. When a live signal arrives (like a new stock price or a camera frame), we don't recalculate the entire context window. Instead, we tokenize the new data and seamlessly inject it directly into the model's active Key-Value (KV) cache. The AI instantly incorporates this new reality into its very next generated token.
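
In control-flow terms, the idea is that live events are spliced into the already-encoded context rather than triggering a full re-encode. Here is a minimal, framework-free sketch of that loop; the class name ContinuousLoop and its methods are illustrative stand-ins, and the "cache" is just a list of token ids rather than real per-layer key/value tensors:

```python
# Hypothetical sketch of continuous generation with cache injection.
# A real implementation holds transformer KV tensors; here the "cache"
# is the list of token ids already encoded, to show the control flow:
# new events are appended to the cache, never re-encoded from scratch.
from collections import deque

class ContinuousLoop:
    def __init__(self, system_tokens):
        self.cache = list(system_tokens)   # tokens the model has already processed
        self.pending = deque()             # live events waiting to be injected

    def inject(self, event_tokens):
        """Queue a live signal (stock tick, camera frame) for injection."""
        self.pending.append(event_tokens)

    def step(self):
        """One generation step: splice in any pending events, then decode."""
        while self.pending:
            # Injection: only the new tokens get encoded; everything
            # already in the cache is reused as-is (the latency win).
            self.cache.extend(self.pending.popleft())
        next_token = self._decode_next()
        self.cache.append(next_token)
        return next_token

    def _decode_next(self):
        # Stand-in for a real forward pass over the cached context.
        return len(self.cache)

loop = ContinuousLoop(system_tokens=[1, 2, 3])
loop.step()              # model generates from the system prompt alone
loop.inject([101, 102])  # a live stock price arrives mid-stream
loop.step()              # the next token already "sees" the new data
```

With Hugging Face transformers this pattern corresponds to reusing past_key_values across calls while feeding only the newly injected tokens as input ids.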

Our stack consists of:

  • A Next.js frontend that captures local webcam data and handles the UI.
  • A FastAPI backend running on Google Colab (exposed via Cloudflare Tunnels) to leverage high-end GPUs.
  • Two 4-bit quantized models loaded simultaneously via bitsandbytes: Qwen3-14B for rapid text/numerical reasoning, and Qwen3-VL-8B for real-time visual processing.
  • Persistent Server-Sent Events (SSE) for streaming the model's continuous stream of consciousness back to the client.
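
The SSE leg is simple at the wire level: each chunk of the model's output is framed as a "data:" line terminated by a blank line. A minimal sketch of that framing as an async generator follows; the token source here is a stub, not our inference engine, and in FastAPI this generator would be wrapped in a StreamingResponse with media_type="text/event-stream":

```python
# Sketch of Server-Sent Events framing for a continuous token stream.
# The framing itself is plain text; any SSE-capable client (such as the
# browser's EventSource API) can consume it.
import asyncio
import json

def sse_event(payload: dict) -> str:
    """Frame one JSON payload as a single SSE event."""
    return f"data: {json.dumps(payload)}\n\n"

async def token_stream(tokens):
    """Stand-in for the model's continuous output stream."""
    for tok in tokens:
        await asyncio.sleep(0)  # yield control, as real decode steps would
        yield sse_event({"token": tok})

async def collect(tokens):
    return [event async for event in token_stream(tokens)]

events = asyncio.run(collect(["The", " market", " dipped"]))
```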

Challenges we ran into

  • KV Cache Memory Leaks: Because the model generates endlessly, the KV cache would eventually consume all available VRAM. We had to build a custom trim_kv_cache() function that surgically retains the vital system prompt tokens while discarding middle context, ensuring the cache never exceeds our 14,336-token limit.
  • Concurrency Bottlenecks: Synchronizing FastAPI's async SSE streams with the heavy, blocking operations of LLM inference required careful threading and queue management to prevent the server from hanging during generation.
  • Vision Latency: Processing live video through a Vision-Language Model in real time is incredibly taxing. We had to optimize the client-side frame capture, downscaling frames to 384x384 JPEG base64 strings and throttling the frame rate to maintain a fluid streaming response.
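
The trimming policy above boils down to: keep the system-prompt head intact, keep the most recent tail, and drop the oldest middle context. A simplified version of trim_kv_cache() operating on a flat list of cached positions (a real implementation slices the per-layer key/value tensors along the sequence dimension, but the retention logic is the same):

```python
# Simplified trim_kv_cache(): keep the system prompt and the most recent
# context, discard the middle. On a real model this slices KV tensors;
# here a plain list stands in for the cached sequence positions.
MAX_CACHE_LEN = 14336  # hard token budget before VRAM is exhausted

def trim_kv_cache(cache, num_system_tokens, max_len=MAX_CACHE_LEN):
    """Return a cache no longer than max_len, system prompt preserved."""
    if len(cache) <= max_len:
        return cache                           # under budget, nothing to do
    head = cache[:num_system_tokens]           # system prompt stays intact
    tail_len = max_len - num_system_tokens
    tail = cache[len(cache) - tail_len:]       # most recent tokens survive
    return head + tail
```

One caveat of this scheme is that positions become discontinuous at the splice point, which is why the system-prompt head has to be chosen carefully so the model's grounding instructions are never evicted.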

What we learned

  • The power of persistent memory: Re-encoding context history is one of the biggest bottlenecks in modern AI. By managing the KV cache directly, an AI can run continuously with near-zero latency for new inputs.
  • Tool usage in a continuous stream: We learned that when an AI is constantly "thinking out loud," enforcing strict schemas for tool calls is critical to prevent hallucinations and ensure actions are triggered only at high confidence.
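
That schema enforcement can be sketched as a gate between the raw text stream and the tool executor: a call runs only if it parses as JSON, names a known tool, and supplies exactly the declared arguments. The tool names below mirror the ones from our three modes, but the JSON wire format and argument schemas are illustrative, not our exact protocol:

```python
# Illustrative tool-call gate: a call executes only if it names a known
# tool and supplies exactly the declared arguments with the right types.
import json

TOOL_SCHEMAS = {
    "call_911": {},
    "buy_stock": {"ticker": str, "quantity": int},
    "sell_stock": {"ticker": str, "quantity": int},
    "trigger_alarm": {},
}

def validate_tool_call(raw: str):
    """Return (tool, args) if raw is a well-formed call, else None."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None                    # ordinary "thinking out loud" text
    if not isinstance(call, dict):
        return None
    schema = TOOL_SCHEMAS.get(call.get("tool"))
    args = call.get("args", {})
    if schema is None or set(args) != set(schema):
        return None                    # unknown tool or wrong argument set
    if any(not isinstance(args[k], t) for k, t in schema.items()):
        return None                    # wrong argument type
    return call["tool"], args
```

Anything that fails the gate simply stays part of the narration stream, so the model can reason about a trade without accidentally executing one.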

Built With

  • bitsandbytes
  • cloudflare-tunnels
  • fastapi
  • google-colab
  • huggingface
  • next.js
  • python
  • qwen
  • react