Inspiration

Construction and workshop professionals lose an average of 38 hours per year searching for tools — nearly an entire workweek wasted before work even begins. In some environments, workers report spending up to 47% of their time locating tools instead of performing tasks. Across industries, employees spend roughly 25% of their day searching for information or equipment rather than producing value.

That inefficiency scales rapidly.

A 10-person workshop can lose hundreds of hours annually just looking for objects that are physically present in the room.

The ToolFinder is built for frontline workers — construction crews, repair technicians, warehouse operators, lab teams, and hardware engineers — professionals who work in dynamic, cluttered, fast-paced environments where every minute matters. These workers often have the least access to advanced AI systems, despite facing the highest operational friction.

We asked:

What if a worker could simply say,
"Where are my drill bits?"
and the workspace itself responded instantly?

That idea became The ToolFinder — a hands-free, voice-activated AI assistant that identifies, segments, and highlights requested objects in real time, returning pixel-precise masks and physical coordinates.

At its core, The ToolFinder is powered by Modal's serverless GPU infrastructure, enabling us to orchestrate multiple large AI models in parallel without managing any hardware.


What it does

The ToolFinder is a multimodal AI system that:

  1. Accepts natural speech from a user.
  2. Converts speech to structured detection queries.
  3. Captures a live camera frame.
  4. Sends the frame to a Modal-hosted GPU detection pipeline.
  5. Runs parallel GPU-accelerated detection and segmentation.
  6. Returns an annotated image with:
    • Pixel-precise masks
    • Confidence scores
    • Object centroids
  7. Optionally directs a physical laser pointer to the object.

Concrete Example of How It Works

User says:

"I need to find the key and clean up my space."

The system performs the following transformations:

Step 1 – Transcription Output

"I need to find the key and clean up my space"

Step 2 – Semantic Mapping Output

Allen key
Clutter

The semantic router interprets:

  • "key" → Allen key (contextual workshop mapping)
  • "clean up" → Clutter (intent-based mapping)

Each line becomes an independent detection job.

Step 3 – Detection Output (from Modal GPU backend)

{
  "image": "<base64 PNG>",
  "detections": [
    {"label": "Allen key", "score": 0.88, "cx": 310, "cy": 224},
    {"label": "Clutter", "score": 0.76, "cx": 540, "cy": 190}
  ],
  "count": 2
}
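The response above is plain JSON, so the client side needs nothing more than the standard library to consume it. A minimal sketch, using the exact field names from the example payload:

```python
import json

# Parse the detection payload returned by the Modal backend
# (field names match the example response above).
payload = json.loads("""
{
  "image": "<base64 PNG>",
  "detections": [
    {"label": "Allen key", "score": 0.88, "cx": 310, "cy": 224},
    {"label": "Clutter", "score": 0.76, "cx": 540, "cy": 190}
  ],
  "count": 2
}
""")

# Each detection carries a centroid that downstream layers
# (overlay rendering, servo pointing) consume directly.
for det in payload["detections"]:
    print(f'{det["label"]}: {det["score"]:.0%} at ({det["cx"]}, {det["cy"]})')
```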

Step 4 – Visual Result

  • Allen key highlighted in green.
  • Clutter items highlighted in cyan.
  • Mask contours drawn.
  • Confidence percentage rendered near each object.

Instead of keyword matching, the system performs semantic understanding.

Example mappings:

| User Speech | Structured Output |
| --- | --- |
| "Hand me the flathead you're holding" | Screwdriver |
| "Where is the green case?" | Screwdriver Kit |
| "Pass me the motor controller" | Motor Controllers |
| "Where's my hammer?" | Hammer (open-vocabulary) |

If the requested object class was never explicitly trained:

"Where is my hammer?"

Semantic output:

Hammer

That request is routed through Modal to an open-vocabulary segmentation model capable of detecting objects purely from text prompts.

This is not a hardcoded detection. It is contextual semantic routing combined with GPU-scale multimodal inference.


How we built it

The ToolFinder is architected as a five-layer multimodal pipeline, with Modal acting as the compute backbone that orchestrates the GPU detection layer.


1️⃣ Speech Input Layer

Voice is captured in two modes:

Local microphone mode

  • silero-VAD continuously processes 512-sample chunks at 16kHz.
  • When confidence > 0.65, audio is buffered.
  • After ~600ms silence, buffered audio is sent to Whisper (base).
  • Minimum speech duration (~800ms) prevents false triggers.
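The gating logic above can be sketched as a small state machine. This is a simplified illustration, not the production code: `speech_prob` values stand in for the per-chunk silero-VAD output, while the thresholds (0.65 confidence, ~600 ms silence flush, ~800 ms minimum speech, 512-sample chunks at 16 kHz) come from the parameters listed above.

```python
# Sketch of the VAD gating described above. The real system scores each
# 512-sample chunk with silero-VAD; here the probabilities are given.
SAMPLE_RATE = 16_000
CHUNK = 512                              # samples per VAD chunk
CHUNK_MS = CHUNK / SAMPLE_RATE * 1000    # 32 ms per chunk
SPEECH_THRESH = 0.65                     # buffer when confidence > 0.65
SILENCE_MS = 600                         # flush after ~600 ms of silence
MIN_SPEECH_MS = 800                      # drop bursts shorter than ~800 ms

def gate(probs):
    """Given per-chunk speech probabilities, return the lengths (in
    chunks) of buffered utterances long enough to send to Whisper."""
    utterances, buffered, silence = [], 0, 0.0
    for p in probs:
        if p > SPEECH_THRESH:
            buffered += 1
            silence = 0.0
        elif buffered:
            silence += CHUNK_MS
            if silence >= SILENCE_MS:
                # ~800 ms minimum prevents false triggers
                if buffered * CHUNK_MS >= MIN_SPEECH_MS:
                    utterances.append(buffered)
                buffered, silence = 0, 0.0
    return utterances

# 40 speech chunks (~1.3 s) then silence -> one utterance is kept;
# a 5-chunk blip (~160 ms) is filtered out as a false trigger.
probs = [0.9] * 40 + [0.1] * 30 + [0.9] * 5 + [0.1] * 30
```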

Browser mode

  • Web Speech API transcribes in real-time.
  • Transcript is sent to backend:
POST /detect
{
  "transcript": "Where are my drill bits?"
}

Both produce a clean transcript string.


2️⃣ Semantic Mapping Layer

The transcript is sent to Gemini 2.5-Flash, which acts as a semantic router.

It performs:

  • Entity extraction
  • Context disambiguation
  • Intent recognition
  • Class normalization

Example transformation:

Input:

"I want to put my sensors back, where is the case?"

Output:

Sensor Case

Input:

"Where is my soldering iron?"

Output:

Soldering Iron

Each output line becomes a detection job sent to the Modal GPU backend.
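A rough sketch of how router output turns into jobs — the prompt wording and `call_llm` stub are illustrative stand-ins for the actual Gemini 2.5-Flash call, but the one-class-per-line output contract matches the examples above:

```python
# Sketch: transcript -> semantic router -> independent detection jobs.
# `call_llm` is a stand-in for the real Gemini 2.5-Flash request.
def build_jobs(transcript, call_llm):
    prompt = (
        "You map workshop speech to object classes, one per line.\n"
        f"Speech: {transcript!r}"
    )
    raw = call_llm(prompt)
    # Each non-empty output line becomes an independent detection job
    # for the Modal GPU backend.
    return [{"query": line.strip(), "transcript": transcript}
            for line in raw.splitlines() if line.strip()]

# Stubbed router reproducing the worked example from the text.
fake_llm = lambda _prompt: "Allen key\nClutter"
jobs = build_jobs("I need to find the key and clean up my space", fake_llm)
```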


3️⃣ Frame Capture Layer

Frame acquisition priority:

  1. Test image (offline mode).
  2. ESP32 camera stream (primary).
  3. Local webcam fallback.

ESP32 TCP protocol:

[4 bytes: uint32 payload length]
[L bytes: JPEG image data]

Backend reconstructs:

img = Image.open(io.BytesIO(payload_bytes)).convert("RGB")

The reconstructed image is forwarded to Modal's GPU pipeline.
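The length-prefixed framing above can be read with the standard library alone. A minimal sketch — note the big-endian byte order is an assumption, since the source does not state which the ESP32 firmware uses:

```python
import io
import struct

def read_frame(stream):
    """Read one [4-byte uint32 length][JPEG bytes] frame from a stream.
    Big-endian (">I") is an assumption; the protocol spec above does
    not state byte order."""
    header = stream.read(4)
    if len(header) < 4:
        return None  # connection closed mid-header
    (length,) = struct.unpack(">I", header)
    payload = stream.read(length)
    # The backend then decodes the payload with:
    #   Image.open(io.BytesIO(payload)).convert("RGB")
    return payload

# Simulate one frame arriving over an in-memory stream.
jpeg = b"\xff\xd8fake-jpeg\xff\xd9"
wire = struct.pack(">I", len(jpeg)) + jpeg
frame = read_frame(io.BytesIO(wire))
```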


4️⃣ GPU Detection Pipeline (Modal Serverless Infrastructure)

This layer runs entirely on Modal.

Modal enables us to:

  • Provision A10G and H100 GPUs on demand
  • Load large models once at container startup
  • Run parallel detection tasks per request
  • Maintain warm containers for low latency
  • Scale inference without managing servers

| Model | GPU (Modal) | Purpose |
| --- | --- | --- |
| YOLO (custom) | A10G | Structured object detection |
| SAM2 ViT-Large | A10G | Pixel-precise mask refinement |
| SAM3 | H100 | Open-vocabulary segmentation |
| Gemini 2.5-Flash | API | Semantic routing |
| Whisper | CPU | Speech transcription |

Dynamic Routing (executed on Modal)

  • Structured classes → YOLO + SAM2 refinement
  • Unknown objects → SAM3 open-vocabulary segmentation
  • Hybrid queries → parallel execution, overlay compositing

Parallel inference is handled via Modal's containerized GPU execution model.
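The routing rule reduces to a simple set-membership split. In this sketch, `KNOWN_CLASSES` and the dispatch comments are placeholders; on Modal, each branch is a containerized GPU function call:

```python
# Sketch of the dynamic routing above. KNOWN_CLASSES is illustrative;
# the real set is whatever the custom YOLO model was trained on.
KNOWN_CLASSES = {"Allen key", "Screwdriver", "Screwdriver Kit",
                 "Motor Controllers", "Sensor Case", "Clutter"}

def route(labels):
    """Split requested labels into the two inference paths. A hybrid
    query populates both lists, which then execute in parallel."""
    structured = [l for l in labels if l in KNOWN_CLASSES]      # YOLO + SAM2
    open_vocab = [l for l in labels if l not in KNOWN_CLASSES]  # SAM3
    return structured, open_vocab

structured, open_vocab = route(["Allen key", "Hammer"])
```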


Detection Mathematics

Bounding Box Conversion

YOLO outputs:

$$ (x_c, y_c, w, h) $$

Converted to:

$$ x_1 = x_c - \frac{w}{2}, \quad x_2 = x_c + \frac{w}{2} $$

$$ y_1 = y_c - \frac{h}{2}, \quad y_2 = y_c + \frac{h}{2} $$
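The conversion is a two-line helper:

```python
def cxcywh_to_xyxy(xc, yc, w, h):
    """Convert a YOLO center-format box to corner format,
    per the equations above."""
    return (xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)
```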


Mask Blending

$$ \text{canvas}[mask] = 0.55 \cdot \text{canvas}[mask] + 0.45 \cdot \text{color} $$
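A per-pixel version of that blend, for illustration — the real pipeline applies it vectorized over the whole mask (e.g. with NumPy) rather than looping:

```python
# Per-pixel sketch of canvas[mask] = 0.55*canvas[mask] + 0.45*color.
ALPHA = 0.55  # fraction of the original canvas kept under the mask

def blend(canvas, mask, color):
    """canvas: HxW grid of (r, g, b); mask: HxW booleans; color: (r, g, b)."""
    for y, row in enumerate(mask):
        for x, inside in enumerate(row):
            if inside:
                canvas[y][x] = tuple(
                    ALPHA * c + (1 - ALPHA) * k
                    for c, k in zip(canvas[y][x], color)
                )
    return canvas

canvas = [[(100, 100, 100)]]
blend(canvas, [[True]], (0, 255, 255))  # cyan highlight, as for Clutter
```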


Centroid Calculation

$$ ys, xs = \text{where}(mask) $$

$$ c_x = \text{mean}(xs) $$

$$ c_y = \text{mean}(ys) $$

Computing the centroid from the mask's own pixels, rather than the bounding-box center, keeps it anchored to the object even for irregular shapes — crucial for physical servo pointing.
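The same computation in plain Python (the backend uses the NumPy `where`/`mean` form shown above):

```python
def mask_centroid(mask):
    """Mean of the mask's pixel coordinates -- the plain-Python
    equivalent of c_x = mean(xs), c_y = mean(ys) above."""
    pts = [(x, y)
           for y, row in enumerate(mask)
           for x, v in enumerate(row) if v]
    cx = sum(x for x, _ in pts) / len(pts)
    cy = sum(y for _, y in pts) / len(pts)
    return cx, cy

# An L-shaped mask: the centroid tracks the object's pixels,
# not the center of its bounding box.
mask = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
]
```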


5️⃣ Hardware Extension – Laser Pointer Rig

The centroid coordinates are fed into a dual-servo laser system.

Yaw rotation matrix:

$$ R_y = \begin{bmatrix} \cos \theta & -\sin \theta & 0 \\ \sin \theta & \cos \theta & 0 \\ 0 & 0 & 1 \end{bmatrix} $$

Ray-plane intersection:

$$ t = -\frac{z_0}{d_z} $$

$$ x = x_0 + t \cdot d_x $$

$$ y = y_0 + t \cdot d_y $$

Inverse kinematics are solved via numerical optimization.
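The forward model — rotate the laser's direction by the yaw angle, then intersect the ray with the work surface — can be sketched directly from the equations above. The function names are illustrative, and the numerical inverse-kinematics solve is omitted:

```python
import math

def yaw(theta, v):
    """Rotate direction v = (x, y, z) by theta about the z axis,
    i.e. apply the R_y matrix above."""
    x, y, z = v
    return (math.cos(theta) * x - math.sin(theta) * y,
            math.sin(theta) * x + math.cos(theta) * y,
            z)

def hit_plane(origin, d):
    """Intersect the ray origin + t*d with the z = 0 work surface,
    using t = -z0 / dz from the equations above."""
    x0, y0, z0 = origin
    dx, dy, dz = d
    t = -z0 / dz
    return (x0 + t * dx, y0 + t * dy)

# Laser 1 m above the bench, tilted slightly, yawed 90 degrees:
# the spot lands 0.1 m along +y instead of +x.
spot = hit_plane((0.0, 0.0, 1.0), yaw(math.pi / 2, (0.1, 0.0, -1.0)))
```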


Challenges we ran into

  • Coordinating multimodal AI components across Modal GPU containers
  • Managing parallel A10G + H100 inference
  • Reducing cold start latency
  • Handling cluttered, overlapping objects
  • Designing robust semantic routing
  • Synchronizing frontend and backend devices

Accomplishments that we're proud of

  • Fully hands-free AI interaction
  • Parallel serverless GPU orchestration using Modal
  • Real-time multimodal inference
  • Pixel-level segmentation
  • Open-vocabulary fallback
  • Physical servo pointing integration

We built a spatially-aware AI system operating in real physical space — powered by serverless GPU infrastructure.


What we learned

  • Multimodal AI meaningfully reduces frontline friction.
  • Semantic routing is more powerful than keyword detection.
  • Mask centroids are physically meaningful.
  • Serverless GPU infrastructure (Modal) makes advanced real-time AI deployable in hours, not weeks.

What's next for The ToolFinder

  • Mobile deployment
  • On-device inference
  • AR overlays
  • Inventory tracking integration
  • Multi-camera fusion
  • Predictive workspace optimization

The ToolFinder is a step toward AI that works alongside frontline professionals — on the job, in motion, and in the flow of work — powered by scalable serverless GPUs.
