Inspiration

The rise of IRL streaming giants like Kai Cenat and Adin Ross has transformed Twitch, taking content out of the bedroom and into the real world. But this shift created a massive disconnect: when a streamer is out living life, walking through Tokyo or exploring a haunted house, they cannot be glued to a phone screen reading 100 messages a second.

This makes it incredibly hard to bond with the community. The "chat" becomes a blur of text rather than a group of people. We wanted to solve this by personifying the chaos. Pickle AI gives the "Hivemind" a single voice, a face, and a memory. It aggregates the sentiment of thousands of viewers into one intelligent virtual co-host that travels with you. It allows the streamer to interact with their community naturally, just like talking to a friend walking next to them.

What it does

Pickle AI is a modular, multi-modal companion that acts as the "collective brain" of a Twitch stream. It does not just read donations. It actively participates in the content.

  • It Listens: It monitors Twitch chat for questions and sentiment, and listens to the streamer's voice via a microphone.
  • It Sees: Using real-time computer vision, it analyzes the video feed to understand the physical environment (e.g., "I see you are holding a coffee cup").
  • It Speaks: It responds vocally using low-latency Text-to-Speech, engaging in banter or answering questions.
  • It Remembers: Unlike standard bots, it uses a RAG (Retrieval-Augmented Generation) memory system to recall details about specific viewers and past conversations ("Hey user123, how did that exam go last week?").

How we built it

We architected Pickle AI as a microservices-inspired system to handle the heavy load of real-time multi-modal processing:

  • The Brain (Backend): Built with Python and FastAPI. We use Ollama running locally (Phi-3.5) for privacy-first intelligence, avoiding expensive API costs.
  • The Memory (RAG): We implemented a hybrid memory architecture using ChromaDB. It combines a "Short-Term Memory" deque for immediate conversational flow with "Long-Term Memory" vector embeddings to store and retrieve user facts.
  • The Eyes (Vision Sidecar): The Overshoot AI SDK targets browsers, and Python has no native browser/WebRTC environment, so we built a Node.js + Puppeteer sidecar. It runs a headless browser that loads the SDK, captures and analyzes the video feed, and streams scene descriptions back to the Python core over WebSocket.
  • The Voice: We integrated Deepgram for both Speech-to-Text (STT) and Text-to-Speech (TTS) because of its low latency, routing audio to OBS through virtual audio cables.

Challenges we ran into

  • The "Blind" Python Problem: We wanted to use the Overshoot AI SDK for vision, but it is designed for browsers, not Python backends. We had to hack together a "Sidecar" solution where a hidden Node.js server launches a headless Chrome instance just to give the AI "eyes."
  • Context Window Limits: We could not just feed the LLM every chat message. We had to build a "Context Assembler" that intelligently selects only the most relevant messages and retrieves long-term memories based on semantic similarity. This ensures the AI stays on topic without blowing up the context window.
  • Impulse Control: Initially, the AI was too chatty. It responded to every single chat message. We developed a "Gating System" with probability triggers and cooldowns so it only speaks when it has something valuable to add, rather than talking over the streamer.
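The gating system described above can be sketched like this. The threshold values and class name are illustrative, and the real gate also weighs how relevant a message is before speaking; this sketch shows only the cooldown and probability-trigger mechanics.

```python
import random
import time
from typing import Optional


class SpeechGate:
    """Decides whether the AI should speak: cooldown + probability trigger."""

    def __init__(self, cooldown_s: float = 8.0, base_chance: float = 0.15):
        self.cooldown_s = cooldown_s
        self.base_chance = base_chance
        self.last_spoke = 0.0  # epoch seconds; 0.0 means "never spoke"

    def should_speak(self, directly_addressed: bool, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        if now - self.last_spoke < self.cooldown_s:
            return False           # still cooling down: never talk over the streamer
        if directly_addressed:
            self.last_spoke = now  # questions aimed at the bot always pass the gate
            return True
        if random.random() < self.base_chance:
            self.last_spoke = now  # occasionally chime in on generic chatter
            return True
        return False


gate = SpeechGate(cooldown_s=8.0)
print(gate.should_speak(directly_addressed=True, now=100.0))  # True: direct question
print(gate.should_speak(directly_addressed=True, now=103.0))  # False: inside cooldown
```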

Accomplishments that we're proud of

  • True Local Intelligence: We successfully got the core LLM running entirely locally. This means zero lag from cloud inference and total privacy for the streamer's data.
  • Hybrid Memory System: We built a memory system that feels "human." The fact that our AI can remember a viewer's name from a stream three days ago and bring it up naturally is a huge leap over standard chatbots.
  • Latency Engineering: Balancing Vision, STT, LLM generation, and TTS in real time is computationally heavy. We managed to optimize the pipeline to keep response times low enough for natural banter.

What we learned

  • Prompt Engineering is UI: We learned that the "personality" of the AI is entirely dependent on how dynamic the system prompt is. Hard-coding rules did not work. Feeding the LLM a dynamic "current state" of the stream worked much better.
  • RAG > Context Window: Simply stuffing more text into an LLM makes it confused. Retrieving less but better context via Vector Search (ChromaDB) yielded far smarter responses.
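The "dynamic state" idea above amounts to rebuilding the system prompt on every turn instead of hard-coding behavior. A minimal sketch, with field names and personality text that are illustrative rather than our actual prompt:

```python
def build_system_prompt(scene, recent_chat, memories):
    """Assemble a fresh system prompt from the stream's current state.

    The hard-coded personality rules stay short; everything situational
    (camera scene, retrieved memories, latest chat) is injected per turn
    so the model always reasons over live context.
    """
    lines = [
        "You are Pickle, the witty collective voice of this Twitch chat.",
        "Speak briefly and never talk over the streamer.",
        f"Camera right now: {scene}",
        "Relevant things you remember:",
        *[f"- {m}" for m in memories],
        "Latest chat messages:",
        *[f"- {c}" for c in recent_chat[-5:]],  # cap chat to avoid context bloat
    ]
    return "\n".join(lines)


prompt = build_system_prompt(
    scene="the streamer is holding a coffee cup on a Tokyo street",
    recent_chat=["user123: is that a vending machine?", "mod42: hydrate!"],
    memories=["user123 had an exam last week"],
)
print(prompt)
```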

What's next for Pickle

  • Multi-Persona Support: Allowing streamers to switch the AI's personality mid-stream (e.g., from "Helpful Guide" to "Roast Master") using chat commands.
  • Visual Avatar: Currently, Pickle AI is voice-only. We want to add a Live2D or 3D avatar that lip-syncs to the TTS for a true "VTuber co-host" experience.
  • Cloud Fallback: While we love local, we plan to add support for OpenAI or Anthropic APIs for users with lower-end hardware who cannot run local models efficiently.
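Multi-persona switching could be as simple as a chat-command handler that swaps the personality block of the system prompt. A sketch under stated assumptions: the persona texts, the `!persona` command syntax, and the moderator-only restriction are all hypothetical design choices, not shipped behavior.

```python
from typing import Optional

PERSONAS = {
    "guide": "You are a calm, helpful co-host who answers viewer questions.",
    "roast": "You are a playful roast master; tease the streamer, never the viewers.",
}


class PersonaManager:
    def __init__(self, default: str = "guide"):
        self.active = default

    def handle_chat(self, user: str, message: str, is_mod: bool) -> Optional[str]:
        """Switch persona on '!persona <name>' from a moderator; return a status line."""
        if not (is_mod and message.startswith("!persona ")):
            return None  # ordinary chat: not a persona command, or sender lacks mod rights
        name = message.split(maxsplit=1)[1].strip().lower()
        if name in PERSONAS:
            self.active = name
            return f"Persona switched to '{name}'"
        return f"Unknown persona '{name}'"

    def system_prompt(self) -> str:
        # The active persona text becomes the personality block of the LLM prompt.
        return PERSONAS[self.active]


pm = PersonaManager()
print(pm.handle_chat("mod42", "!persona roast", is_mod=True))  # Persona switched to 'roast'
print(pm.system_prompt())
```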

Built With

  • Python / FastAPI
  • Ollama (Phi-3.5)
  • ChromaDB
  • Node.js + Puppeteer
  • Deepgram
  • WebSockets
