Inspiration

We have all seen how wearable tech like the Apple Watch and Fitbit pushed the boundaries of personal computing, and knowing how fast technology is moving, we wanted to look toward the near future. As compute gets smaller and more ubiquitous, we asked ourselves what capabilities we might be carrying with us every day. We realized that AR glasses offer a unique opportunity to tackle challenges directly from the user's perspective. We focused heavily on accessibility challenges and on health-centric tasks that can feel like a chore. We wanted to build a platform that doesn't just store information for the user, but acts as a proactive accessibility aid and a memory aid for those who need it most.

What it does

Cleo is an always-on, AI-powered assistant that lives in a pair of AR glasses, designed with accessibility and memory-retention challenges in mind. Among its skills are a real-time narrator for visually impaired users that describes the environment around them and a color-blindness correction tool. An object-tracking skill helps users remember where they last placed their belongings by displaying each item's last-seen location, while a facial recognition skill helps users with Alzheimer's or memory loss identify their loved ones. Beyond accessibility, other skills target everyday convenience: note-taking, saving recent events, frictionless food tracking, recording long-running events, and checking the weather. All of these skills are voice-activated and designed to be responsive and easy to use.

How we built it

We designed Cleo as a distributed AR runtime, not a single app: a process orchestrator (services/main.py) boots a graph of Python microservices in dependency order (sensor -> data -> assistant -> frontend -> transcription -> video -> optional tools), health-checks each gRPC endpoint, and handles graceful shutdown/kill semantics so failures in one subsystem don't crash the whole stack. We chose this architecture because real-time wearables need isolation between hardware I/O, AI inference, storage, and UI pipelines to keep latency predictable and recovery simple.

At the contract layer, the system is defined with Protocol Buffers + gRPC (protos/*.proto) across sensor, transcription, assistant, data, frontend, and tool services. Media is transported with explicit chunking semantics (frame_id, chunk_index, is_last, encoding enums), and we use streaming RPCs heavily (StreamCamera, StreamAudio, TranscribeStream, StreamUpdates), so the system behaves like a live data bus rather than request/response polling. We also built chunk assemblers and strict gap detection on both the Python and Rust/Tauri sides, so corrupted or out-of-order frames fail fast rather than silently causing bad UI/AI behavior.

For hardware, we built a Rust + PyO3 bridge (viture-sensors) to expose VITURE device primitives to Python (camera/mic/IMU abstractions). SensorService continuously captures camera and audio, keeps ring buffers, publishes fan-out streams via a custom BroadcastHub, and encodes frames as H.264/JPEG. We implemented persistent FFmpeg encoders/decoders and transport helpers (camera_transport.py) to avoid per-frame process-spawn overhead and support low-latency conversion between RGB/JPEG/H.264/MP4 throughout the pipeline.

Our speech/intent path is a persistent streaming loop: audio from SensorService is pushed into Amazon Transcribe Streaming in TranscriptionService, which emits partial/final transcripts with speaker labeling and utterance IDs.
On top of ASR (automatic speech recognition) we built a custom TriggerRouter that detects wake phrases ("hey cleo"), captures a bounded context window around the trigger, handles speaker-aware follow-up windows, and forwards only the relevant snippet to the assistant. Final transcripts are automatically persisted in DataService, and optional debug transcript overlays can be pushed to the HUD via frontend RPCs.

The assistant itself is a separate gRPC service using Amazon Bedrock Converse (Claude Sonnet) with custom orchestration logic: every command includes live vision context by capturing a current frame from SensorService, conversation state is maintained with expiry/trim logic, and follow-up utterances are classifier-gated before invoking tools. For tool use, we implemented dynamic function-calling-style routing: ToolRegistry queries DataService for enabled apps, converts each app's JSON schema into Bedrock toolSpec definitions, and lets the model choose the tool and arguments at runtime. This gives us a plug-in architecture where adding a tool does not require hardcoding it into the assistant core.

Tooling is implemented as independent gRPC microservices on a shared ToolService protocol (Execute(tool_name, parameters_json)), with a reusable ToolServiceBase that handles JSON parameter parsing, schema registration, and error framing. Each tool self-registers in DataService (RegisterApp) on boot, including its description, type (on_demand vs. active), gRPC address, and schema, which lets orchestration remain declarative and model-visible. Existing tools include weather, navigator, note-taking, food macros, color-blind assist, face detection, save-video clipping, full recording sessions, and item register/locate.

On AWS, we intentionally use multiple specialized services instead of one model for everything: Amazon Transcribe Streaming for low-latency ASR and speaker diarization; Amazon Bedrock (Claude Sonnet) for assistant reasoning, tool selection, and multimodal scene interpretation in the navigator, food, and item flows; Amazon Bedrock Nova multimodal embeddings for vector memory over text, images, and video; and Amazon Rekognition for face detection ahead of our local embedding/dedup workflows.

Memory and recall are handled by DataService, which combines SQLite + FAISS + object storage on disk: SQLite stores structured records (transcripts, clips, apps, preferences, faces, note summaries, food macros, recordings), while FAISS (IndexFlatIP cosine on normalized vectors) indexes multimodal embeddings for semantic retrieval and tracked-item lookup. Video clips are ingested as chunked uploads, embedded, indexed, and linked back to metadata; face data uses grouped matching, ambiguity thresholds, and cooldown logic to avoid duplicate identities. We also expose a website-facing BFF (website_api.py) that translates HTTP/JSON endpoints into DataService gRPC calls for browser clients.

Video memory is continuously built by VideoService: it subscribes to camera streams, assembles frames, rolls timed clip windows, converts H.264 payloads to MP4, downsamples clips for embedding efficiency, and streams clip + embedding payloads to DataService. Separate recording/save-video tools then reuse that memory layer to either clip "what just happened" from ring buffers or compose longer session recordings by concatenating overlapping clips and re-indexing the result.

For UI, we use a React + TypeScript HUD inside Tauri, with a Rust gRPC client (tonic) subscribed to FrontendService.StreamUpdates. Backend services push typed DisplayUpdate messages (text, cards, images, progress, throbber, notifications, audio, HTML overlays, app indicators), Rust translates them into window events, and React components render layered overlays optimized for AR consumption. Audio playback and routing are managed in Rust (CPAL), including device-selection persistence and cancellation tokens, while TTS audio is synthesized server-side and streamed into the HUD pipeline.
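To make the retrieval core of the memory layer concrete, here is a NumPy stand-in for the FAISS setup described above. Because every vector is L2-normalized before indexing, the inner product is exactly cosine similarity, which is why IndexFlatIP works for cosine retrieval. Class and field names here are illustrative, not DataService's real API:

```python
import numpy as np


class VectorIndex:
    """Cosine-similarity retrieval over normalized embeddings (sketch)."""

    def __init__(self, dim: int):
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.ids: list[str] = []

    @staticmethod
    def _normalize(v: np.ndarray) -> np.ndarray:
        # L2-normalize so that inner product == cosine similarity.
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    def add(self, item_id: str, embedding: np.ndarray) -> None:
        vec = self._normalize(embedding.astype(np.float32)).reshape(1, -1)
        self.vectors = np.vstack([self.vectors, vec])
        self.ids.append(item_id)

    def search(self, query: np.ndarray, k: int = 3) -> list[tuple[str, float]]:
        scores = self.vectors @ self._normalize(query.astype(np.float32))
        top = np.argsort(scores)[::-1][:k]
        return [(self.ids[i], float(scores[i])) for i in top]
```

Swapping this for the real thing mostly means replacing the matrix multiply with a faiss.IndexFlatIP search and keeping the id list as the mapping back into SQLite rows.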

Challenges we ran into

Working directly with the AR glasses hardware revealed that our initial design assumptions were too rigid, forcing us to overhaul our architecture multiple times. To achieve the flexibility and low latency we needed, we scrapped out-of-the-box solutions and built a custom API layer from scratch to interface directly with the data streams. Additionally, as we brainstormed more accessibility skills, our monolithic codebase quickly became a bottleneck: the tight coupling made long-standing bugs incredibly difficult to track down. We solved this by aggressively decoupling, splitting the skills into dynamically registered tools, which kept development fluid and let us continuously plug in new features without breaking the core application.

Accomplishments that we're proud of

We shipped a true multi-service runtime: independent Python processes for sensor capture, transcription, assistant orchestration, data/memory, frontend relay, and video ingestion, all coordinated with health checks, startup dependencies, reconnect logic, and graceful shutdown behavior.

We built a strongly typed gRPC/protobuf backbone across the entire system, including streaming RPCs and chunk-safe media transport (frame_id, chunk_index, is_last, encoding metadata), which gave us reliable real-time camera/audio delivery and predictable contracts between services.

We implemented a custom AI orchestrator that combines wake-word routing, follow-up classification, conversation-state windows, and Bedrock tool-use/function-calling with dynamic JSON schemas, so the assistant can reason over available tools at runtime instead of relying on hardcoded intent routing.

We created a plug-in tool platform where each tool is its own gRPC microservice, self-registers into DataService with schema + metadata, and is auto-discoverable by the assistant; that architecture let us add complex capabilities (navigation, recording, note-taking, weather, food macros, color assist, face detection, item tracking) without rewriting the assistant's core logic.

We integrated multiple AWS AI services with clear role separation: Transcribe Streaming for low-latency ASR and diarization, Bedrock models for reasoning and multimodal analysis, Nova multimodal embeddings for semantic memory, and Rekognition for face detection, then stitched them into one low-latency data flow.

We built a multimodal memory stack that joins SQLite (structured state), FAISS cosine indexes (semantic retrieval), and clip storage, enabling text/image/video search, tracked-item recall, face grouping, note summaries, and replayable recording history from the same data layer.
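For the curious, the dynamic tool-use wiring boils down to converting each registered app's JSON schema into a Bedrock Converse toolSpec. The app-record fields below are assumptions about what our DataService returns, but the toolConfig envelope follows the Bedrock Converse API shape:

```python
def to_tool_config(apps: list[dict]) -> dict:
    """Build a Bedrock Converse toolConfig from registered app records.

    Each record is assumed to carry "name", "description", and a JSON
    Schema under "schema" (illustrative field names, not Cleo's exact
    RegisterApp payload).
    """
    return {
        "tools": [
            {
                "toolSpec": {
                    "name": app["name"],
                    "description": app["description"],
                    # Converse expects the JSON Schema wrapped in {"json": ...}
                    "inputSchema": {"json": app["schema"]},
                }
            }
            for app in apps
        ]
    }
```

The resulting dict can be passed as the toolConfig argument to a bedrock-runtime converse call, which is what lets the model pick a tool plus arguments at runtime instead of us hardcoding intent routes.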
We solved difficult media systems problems in production-like conditions: persistent FFmpeg encode/decode paths, H.264/JPEG/RGB conversion, rolling ring buffers, MP4 clip composition/concatenation, chunked upload/download over gRPC, and embedding-friendly downsampling. We delivered a full AR presentation pipeline: a Tauri + React HUD with a Rust gRPC client that subscribes to typed DisplayUpdate streams and renders notifications/cards/progress/images/audio/HTML overlays in real time, creating an end-to-end loop from sensor input to on-glasses feedback.

What we learned

We learned how to develop a large project comprising dozens of services communicating over gRPC. We learned how to leverage AI tools, context, and instructions to rapidly scale and build our system. And we learned how to build a system in which multiple programming languages work together seamlessly.

What's next for Cleo

Currently, Cleo relies on a local laptop to run its backend services and process its compute-intensive video and audio streams. Our immediate next step is to completely untether the user by migrating this compute to a smartphone companion app: acting as a lightweight edge gateway, the phone would manage the hardware connection while shifting the heavy processing to the cloud. Additionally, because we architected our tool-calling framework to be fully decoupled, the platform is inherently ready for open-source contribution. Developers worldwide can build and register nearly any new skill, continuously expanding Cleo's capabilities for our users.
