Inspiration

We are living at the intersection of digital perception and physical reality. Technologies like computer vision and LiDAR are giving machines a superhuman "sixth sense," allowing them to navigate complex, dynamic environments. We've seen this power take the world by storm in autonomous vehicles, but we wanted to get our hands dirty and ask: how can we use this embodied intelligence to directly augment human capability?

For millions of visually impaired individuals, navigating the world is a daily series of high-stakes challenges. The traditional white cane, while essential, is an analog tool in an increasingly complex, high-speed digital world. It can't detect a head-level obstacle, a silent e-scooter, or the specific shape of a crowd. We saw an opportunity to bridge this gap. Our project is an attempt to fuse these advanced sensors, not to pilot a robot, but to act as a real-time haptic co-pilot for a person, translating the unseen world into a language they can feel.

What it does

EchoAI transforms a smartphone into an intelligent "co-pilot" for the visually impaired, offering a multi-sensory understanding of the world that goes far beyond a traditional cane. Our system is built on a high-performance client-server architecture, where an iPhone captures the world, and a powerful AMD-hosted server does the heavy lifting.

This split design allows the user's device to stay light and efficient while our server executes a suite of cutting-edge AI models in real-time. Here's what EchoAI delivers:

  • Real-time 3D Spatial Awareness: Using the iPhone's LiDAR scanner and camera, the system builds a live 3D map of the user's surroundings. Simultaneously, it runs advanced YOLO object detection models to identify critical obstacles like pedestrians, vehicles, and street furniture. This data is fused to provide a rich, spatial understanding of what an object is and where it is.

  • Predictive Motion Intent: EchoAI doesn't just see a car; it predicts where that car is going. By feeding object-tracking data into a Kalman filter, our system forecasts the motion intent of moving objects. This allows us to provide proactive haptic feedback, warning the user of a potential collision before it becomes an immediate danger.

All this complex information is processed and relayed back to the user in milliseconds, translated into a simple, intuitive language of haptic vibrations and spatial audio, granting a new level of autonomy and safety.
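The motion-intent step described above can be sketched as a constant-velocity Kalman filter over each tracked object's image position. The snippet below is a minimal illustration with assumed noise parameters and a 30 fps frame interval; the class name and values are illustrative, not our production tracker.

```python
import numpy as np

# Minimal constant-velocity Kalman filter over 2D image positions.
# State vector: [x, y, vx, vy]. Parameters are illustrative assumptions.
class ConstantVelocityKF:
    def __init__(self, x, y, dt=1 / 30):
        self.state = np.array([x, y, 0.0, 0.0])
        self.P = np.eye(4) * 10.0                 # state covariance
        self.F = np.eye(4)                        # transition model
        self.F[0, 2] = self.F[1, 3] = dt          # x += vx*dt, y += vy*dt
        self.H = np.eye(2, 4)                     # we observe position only
        self.Q = np.eye(4) * 0.01                 # process noise
        self.R = np.eye(2) * 1.0                  # measurement noise

    def predict(self):
        self.state = self.F @ self.state
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.state[:2]                     # predicted position

    def update(self, zx, zy):
        z = np.array([zx, zy])
        y = z - self.H @ self.state               # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)  # Kalman gain
        self.state = self.state + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P

    def forecast(self, steps):
        # Roll the motion model forward to estimate where the object is heading.
        s = self.state.copy()
        for _ in range(steps):
            s = self.F @ s
        return s[:2]
```

Feeding each new detection through `predict()` then `update()` keeps the velocity estimate current, and `forecast()` gives the look-ahead position that drives the proactive haptic warning.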

How we built it

EchoAI is a hybrid system composed of a native iOS client and a powerful Python-based backend, hosted on AMD hardware.

  • The Client (iOS): The user-facing application is built in Swift using ARKit. We use ARKit to access the iPhone's LiDAR scanner, which generates a real-time depth map of the environment. This depth data is processed locally for instantaneous collision detection, triggering haptic feedback for nearby obstacles.

  • The Networking: The client and server communicate over a custom WebSocket protocol. The Swift app continuously streams the camera feed to the server for analysis. The server then sends back a stream of JSON data (containing object coordinates, labels, and motion vectors).

  • The Server (Python & AI): On our AMD-powered server, we built a high-performance inference pipeline.

    1. Detection: Incoming video frames are fed into a YOLOv11 model for real-time object detection.
    2. Tracking & Prediction: The detected object bounding boxes are passed to a DeepSORT tracker, which assigns a unique ID to each object. These tracks are then fed into a Kalman Filter to predict the future trajectory and motion intent of each object.
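The tracker output from step 2 travels back to the client as JSON over the WebSocket. A per-frame message might be serialized like this; the field names (`frame_id`, `track_id`, `bbox`, `velocity`) are assumptions for illustration, not our exact wire format.

```python
import json

def build_frame_message(frame_id, tracks):
    """Serialize one frame's tracker output for the WebSocket reply.

    `tracks` is a list of dicts carrying a DeepSORT track id, a class
    label, a bounding box in normalized image coordinates, and the
    Kalman-estimated velocity. All field names here are illustrative.
    """
    return json.dumps({
        "frame_id": frame_id,
        "objects": [
            {
                "track_id": t["track_id"],
                "label": t["label"],
                "bbox": t["bbox"],          # [x, y, w, h], normalized 0..1
                "velocity": t["velocity"],  # [vx, vy] in image units/sec
            }
            for t in tracks
        ],
    })

# Example: one tracked car in the current frame.
msg = build_frame_message(42, [
    {"track_id": 7, "label": "car", "bbox": [0.4, 0.5, 0.1, 0.08],
     "velocity": [0.12, 0.0]},
])
```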

The final piece of the puzzle happens back on the client, where the app fuses its local LiDAR depth data with the incoming JSON data from the server. This allows EchoAI to correctly place the server's AI-powered labels (like "car") onto the precise 3D-aware obstacles detected by the local LiDAR.
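A simplified sketch of that fusion step, written in Python for readability (the real client logic is in Swift): we assume each LiDAR obstacle has already been projected into normalized image coordinates, and an obstacle inherits the label of the first detection box that contains its center. The matching rule and names are illustrative.

```python
def fuse_labels(obstacles, detections):
    """Attach server-side labels to locally detected LiDAR obstacles.

    `obstacles`: list of (u, v, depth_m) tuples -- obstacle centers
    projected into normalized image coordinates, with LiDAR depth in
    meters. `detections`: list of (label, (x, y, w, h)) with the box in
    normalized coordinates. An obstacle takes the label of the first box
    containing its center, else stays "unknown". Simplified on purpose.
    """
    fused = []
    for (u, v, depth) in obstacles:
        label = "unknown"
        for (det_label, (x, y, w, h)) in detections:
            if x <= u <= x + w and y <= v <= y + h:
                label = det_label
                break
        fused.append({"label": label, "depth_m": depth})
    return fused
```

The key property is that the depth always comes from the local LiDAR (the trustworthy "where"), while the server only contributes the semantic "what".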

Challenges we ran into

A core challenge was bridging the gap between a lightweight iPhone client and our heavy-duty AI models. Running YOLO-based object detection and motion prediction simultaneously isn't feasible on a mobile device at the accuracy and frame rates we needed, so we had to offload this processing to our AMD server. This immediately created our biggest hurdle: latency. We weren't just sending small requests; we needed to engineer a custom networking protocol that streams a high-bandwidth camera feed to the server and returns inference results in milliseconds. Any lag would make the haptic warnings useless for navigating a real-world environment.
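One common way to frame a custom byte stream like ours is to length-prefix each encoded camera frame. The sketch below shows that framing technique with a 4-byte big-endian header; it illustrates the idea, not our exact protocol.

```python
import struct

# Length-prefixed framing: each message is a 4-byte big-endian length
# followed by the payload (e.g. a JPEG-encoded camera frame).
HEADER = struct.Struct(">I")

def encode_frame(payload: bytes) -> bytes:
    """Prefix a payload with its length so the receiver can delimit it."""
    return HEADER.pack(len(payload)) + payload

def decode_frames(buffer: bytearray):
    """Extract complete payloads from a receive buffer.

    Consumes whole frames from the front of `buffer` and leaves any
    partial frame in place until more bytes arrive from the socket.
    """
    frames = []
    while len(buffer) >= HEADER.size:
        (length,) = HEADER.unpack_from(buffer)
        if len(buffer) < HEADER.size + length:
            break  # wait for the rest of this frame
        frames.append(bytes(buffer[HEADER.size:HEADER.size + length]))
        del buffer[:HEADER.size + length]
    return frames
```

Because `decode_frames` tolerates partial data, the receiver can feed it whatever the socket delivers and never block on an incomplete frame.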

Our next major challenge was sensor fusion on the client itself. The iPhone was handling all the LiDAR-based collision detection locally, while our server was analyzing the video feed to identify what those obstacles were. We had to design a system in Swift that could instantly synchronize these two separate data streams: the local LiDAR's "where" (an obstacle at 2 meters) and the server's "what" (it's a car, not a person). Fusing this on-device spatial map with the real-time AI results from the server—and making it fast and reliable enough to trust—was a complex synchronization puzzle.
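That synchronization can be approximated by keeping a short buffer of timestamped depth samples on the client and pairing each (slightly delayed) server result with the sample nearest in time. A minimal Python sketch follows; the real implementation is in Swift, and `max_skew` is an illustrative threshold.

```python
def match_result_to_depth(server_ts, depth_samples, max_skew=0.050):
    """Pair a server inference result with the LiDAR sample closest in time.

    `depth_samples`: list of (timestamp_sec, depth_map) tuples kept in a
    short ring buffer on the client. Returns the closest sample, or None
    if the nearest sample is more than `max_skew` seconds away, in which
    case the stale result should be dropped. Values are illustrative.
    """
    if not depth_samples:
        return None
    best = min(depth_samples, key=lambda s: abs(s[0] - server_ts))
    return best if abs(best[0] - server_ts) <= max_skew else None
```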

Accomplishments that we're proud of

We are incredibly proud of successfully building a real-time, multi-modal assistive device on a client-server architecture. Our biggest accomplishment was engineering the high-throughput networking pipeline. Getting a lightweight Swift app to stream live video to our AMD server, run multiple heavyweight inference models (YOLOv11 and DeepSORT), and receive results back in time to provide meaningful, real-time haptic feedback was a massive win.

Beyond the networking, we successfully integrated two distinct, complex features into one coherent user experience:

  1. Instant, on-device collision avoidance using the iPhone's LiDAR.
  2. Server-side object detection and motion-intent prediction from the video feed.

Successfully synchronizing the local LiDAR data ("obstacle here") with the server's remote AI insights ("it's a car") on the client was a major breakthrough that brings this concept from a demo to a viable tool.

What we learned

The biggest lesson from this hackathon is that for next-generation embodied AI, the network is the new bottleneck. We learned that the theoretical power of a model is useless if you can't solve the data pipeline problem. We spent a significant portion of our time engineering a robust WebSocket protocol, optimizing frame data, and fighting latency at every step. A 500ms lag, acceptable for a web app, is a critical failure for a visually impaired user navigating a street.

We also learned the immense complexity of sensor fusion, even in a hybrid model. Making the client's local LiDAR map "trust" and "talk to" the remote AI insights from the camera feed was a profound architectural challenge. It taught us that system design, data synchronization, and state management are just as critical as the AI models themselves.

What's next for EchoAI

EchoAI is a powerful proof-of-concept, and we're just scratching the surface. Our roadmap is focused on creating a truly unified sensory experience:

  • True Server-Side Sensor Fusion: Our number one priority. Right now, the client fuses local LiDAR with remote CV. The next leap is to stream both the LiDAR point cloud and the video feed to the server. This will allow us to build a single, unified 3D-aware model that doesn't just see a "car" but understands its exact 3D volume, velocity, and trajectory in a single inference, which will revolutionize our motion prediction.

  • Optimized Networking: We will move from our current protocol to a more robust, low-latency standard like WebRTC to shave off critical milliseconds and make the haptic feedback feel truly instantaneous.

  • Expanded Environmental Awareness: With the server pipeline built, we can easily add more models. We plan to incorporate indoor navigation (visual positioning) and an "ambient" mode that can describe a user's surroundings ("you are in a park," "you are approaching a store entrance") to provide richer context beyond just obstacle avoidance.
