Inspiration

Construction and workshop professionals lose an average of 38 hours per year searching for tools — nearly an entire workweek wasted before work even begins. In some environments, workers report spending up to 47% of their time locating tools instead of performing tasks. Across industries, employees spend roughly 25% of their day searching for information or equipment rather than producing value.

That inefficiency scales rapidly.

A 10-person workshop can lose hundreds of hours annually just looking for objects that are physically present in the room.

The ToolFinder is built for frontline workers — construction crews, repair technicians, warehouse operators, lab teams, and hardware engineers — professionals who work in dynamic, cluttered, fast-paced environments where every minute matters. These workers often have the least access to advanced AI systems, despite facing the highest operational friction.

We asked:

What if a worker could simply say,
"Where are my drill bits?"
and the workspace itself responded instantly?

That idea became The ToolFinder — a hands-free, voice-activated AI assistant that identifies, segments, and highlights requested objects in real time, returning pixel-precise masks and physical coordinates.

At its core, The ToolFinder is powered by Modal's serverless GPU infrastructure, enabling us to orchestrate multiple large AI models in parallel without managing any hardware.


What it does

The ToolFinder is a multimodal AI system that:

  1. Accepts natural speech from a user.
  2. Converts speech to structured detection queries.
  3. Captures a live camera frame.
  4. Sends the frame to a Modal-hosted GPU detection pipeline.
  5. Runs parallel GPU-accelerated detection and segmentation.
  6. Returns an annotated image with:
    • Pixel-precise masks
    • Confidence scores
    • Object centroids
  7. Optionally directs a physical laser pointer to the object.

Concrete Example of How It Works

User says:

"I need to find the key and clean up my space."

The system performs the following transformations:

Step 1 – Transcription Output

"I need to find the key and clean up my space"

Step 2 – Semantic Mapping Output

Allen key
Clutter

The semantic router interprets:

  • "key" → Allen key (contextual workshop mapping)
  • "clean up" → Clutter (intent-based mapping)

Each line becomes an independent detection job.

Step 3 – Detection Output (from Modal GPU backend)

{
  "image": "<base64 PNG>",
  "detections": [
    {"label": "Allen key", "score": 0.88, "cx": 310, "cy": 224},
    {"label": "Clutter", "score": 0.76, "cx": 540, "cy": 190}
  ],
  "count": 2
}
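The response above is plain JSON, so the client side needs nothing more than the standard library to consume it. A minimal sketch, using the exact field names from the example payload:

```python
import json

# Parse the detection payload returned by the Modal backend
# (field names match the example response above).
payload = json.loads("""
{
  "image": "<base64 PNG>",
  "detections": [
    {"label": "Allen key", "score": 0.88, "cx": 310, "cy": 224},
    {"label": "Clutter", "score": 0.76, "cx": 540, "cy": 190}
  ],
  "count": 2
}
""")

# Each detection carries a centroid that downstream layers
# (overlay rendering, servo pointing) consume directly.
for det in payload["detections"]:
    print(f'{det["label"]}: {det["score"]:.0%} at ({det["cx"]}, {det["cy"]})')
```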

Step 4 – Visual Result

  • Allen key highlighted in green.
  • Clutter items highlighted in cyan.
  • Mask contours drawn.
  • Confidence percentage rendered near each object.

Instead of keyword matching, the system performs semantic understanding.

Example mappings:

| User Speech | Structured Output |
| --- | --- |
| "Hand me the flathead you're holding" | Screwdriver |
| "Where is the green case?" | Screwdriver Kit |
| "Pass me the motor controller" | Motor Controllers |
| "Where's my hammer?" | Hammer (open-vocabulary) |

If the requested object class was never explicitly trained:

"Where is my hammer?"

Semantic output:

Hammer

That request is routed through Modal to an open-vocabulary segmentation model capable of detecting objects purely from text prompts.

This is not a hardcoded detection. It is contextual semantic routing combined with GPU-scale multimodal inference.


How we built it

The ToolFinder is architected as a five-layer multimodal pipeline, with Modal acting as the compute backbone that orchestrates the GPU detection layer.


1️⃣ Speech Input Layer

Voice is captured in two modes:

Local microphone mode

  • silero-VAD continuously processes 512-sample chunks at 16kHz.
  • When confidence > 0.65, audio is buffered.
  • After ~600ms silence, buffered audio is sent to Whisper (base).
  • Minimum speech duration (~800ms) prevents false triggers.
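The gating logic above can be sketched as a small state machine. This is a simplified illustration, not the production code: `speech_prob` values stand in for the per-chunk silero-VAD output, while the thresholds (0.65 confidence, ~600 ms silence flush, ~800 ms minimum speech, 512-sample chunks at 16 kHz) come from the parameters listed above.

```python
# Sketch of the VAD gating described above. The real system scores each
# 512-sample chunk with silero-VAD; here the probabilities are given.
SAMPLE_RATE = 16_000
CHUNK = 512                              # samples per VAD chunk
CHUNK_MS = CHUNK / SAMPLE_RATE * 1000    # 32 ms per chunk
SPEECH_THRESH = 0.65                     # buffer when confidence > 0.65
SILENCE_MS = 600                         # flush after ~600 ms of silence
MIN_SPEECH_MS = 800                      # drop bursts shorter than ~800 ms

def gate(probs):
    """Given per-chunk speech probabilities, return the lengths (in
    chunks) of buffered utterances long enough to send to Whisper."""
    utterances, buffered, silence = [], 0, 0.0
    for p in probs:
        if p > SPEECH_THRESH:
            buffered += 1
            silence = 0.0
        elif buffered:
            silence += CHUNK_MS
            if silence >= SILENCE_MS:
                # ~800 ms minimum prevents false triggers
                if buffered * CHUNK_MS >= MIN_SPEECH_MS:
                    utterances.append(buffered)
                buffered, silence = 0, 0.0
    return utterances

# 40 speech chunks (~1.3 s) then silence -> one utterance is kept;
# a 5-chunk blip (~160 ms) is filtered out as a false trigger.
probs = [0.9] * 40 + [0.1] * 30 + [0.9] * 5 + [0.1] * 30
```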

Browser mode

  • Web Speech API transcribes in real-time.
  • Transcript is sent to backend:
POST /detect
{
  "transcript": "Where are my drill bits?"
}

Both produce a clean transcript string.


2️⃣ Semantic Mapping Layer

The transcript is sent to Gemini 2.5-Flash, which acts as a semantic router.

It performs:

  • Entity extraction
  • Context disambiguation
  • Intent recognition
  • Class normalization

Example transformation:

Input:

"I want to put my sensors back, where is the case?"

Output:

Sensor Case

Input:

"Where is my soldering iron?"

Output:

Soldering Iron

Each output line becomes a detection job sent to the Modal GPU backend.
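A rough sketch of how router output turns into jobs — the prompt wording and `call_llm` stub are illustrative stand-ins for the actual Gemini 2.5-Flash call, but the one-class-per-line output contract matches the examples above:

```python
# Sketch: transcript -> semantic router -> independent detection jobs.
# `call_llm` is a stand-in for the real Gemini 2.5-Flash request.
def build_jobs(transcript, call_llm):
    prompt = (
        "You map workshop speech to object classes, one per line.\n"
        f"Speech: {transcript!r}"
    )
    raw = call_llm(prompt)
    # Each non-empty output line becomes an independent detection job
    # for the Modal GPU backend.
    return [{"query": line.strip(), "transcript": transcript}
            for line in raw.splitlines() if line.strip()]

# Stubbed router reproducing the worked example from the text.
fake_llm = lambda _prompt: "Allen key\nClutter"
jobs = build_jobs("I need to find the key and clean up my space", fake_llm)
```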


3️⃣ Frame Capture Layer

Frame acquisition priority:

  1. Test image (offline mode).
  2. ESP32 camera stream (primary).
  3. Local webcam fallback.

ESP32 TCP protocol:

[4 bytes: uint32 payload length]
[L bytes: JPEG image data]

Backend reconstructs:

img = Image.open(io.BytesIO(payload_bytes)).convert("RGB")

The reconstructed image is forwarded to Modal's GPU pipeline.
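The length-prefixed framing above can be read with the standard library alone. A minimal sketch — note the big-endian byte order is an assumption, since the source does not state which the ESP32 firmware uses:

```python
import io
import struct

def read_frame(stream):
    """Read one [4-byte uint32 length][JPEG bytes] frame from a stream.
    Big-endian (">I") is an assumption; the protocol spec above does
    not state byte order."""
    header = stream.read(4)
    if len(header) < 4:
        return None  # connection closed mid-header
    (length,) = struct.unpack(">I", header)
    payload = stream.read(length)
    # The backend then decodes the payload with:
    #   Image.open(io.BytesIO(payload)).convert("RGB")
    return payload

# Simulate one frame arriving over an in-memory stream.
jpeg = b"\xff\xd8fake-jpeg\xff\xd9"
wire = struct.pack(">I", len(jpeg)) + jpeg
frame = read_frame(io.BytesIO(wire))
```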


4️⃣ GPU Detection Pipeline (Modal Serverless Infrastructure)

This layer runs entirely on Modal.

Modal enables us to:

  • Provision A10G and H100 GPUs on demand
  • Load large models once at container startup
  • Run parallel detection tasks per request
  • Maintain warm containers for low latency
  • Scale inference without managing servers

| Model | GPU (Modal) | Purpose |
| --- | --- | --- |
| YOLO (custom) | A10G | Structured object detection |
| SAM2 ViT-Large | A10G | Pixel-precise mask refinement |
| SAM3 | H100 | Open-vocabulary segmentation |
| Gemini 2.5-Flash | API | Semantic routing |
| Whisper | CPU | Speech transcription |

Dynamic Routing (executed on Modal)

  • Structured classes → YOLO + SAM2 refinement
  • Unknown objects → SAM3 open-vocabulary segmentation
  • Hybrid queries → parallel execution, overlay compositing

Parallel inference is handled via Modal's containerized GPU execution model.
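The routing rule reduces to a simple set-membership split. In this sketch, `KNOWN_CLASSES` and the dispatch comments are placeholders; on Modal, each branch is a containerized GPU function call:

```python
# Sketch of the dynamic routing above. KNOWN_CLASSES is illustrative;
# the real set is whatever the custom YOLO model was trained on.
KNOWN_CLASSES = {"Allen key", "Screwdriver", "Screwdriver Kit",
                 "Motor Controllers", "Sensor Case", "Clutter"}

def route(labels):
    """Split requested labels into the two inference paths. A hybrid
    query populates both lists, which then execute in parallel."""
    structured = [l for l in labels if l in KNOWN_CLASSES]      # YOLO + SAM2
    open_vocab = [l for l in labels if l not in KNOWN_CLASSES]  # SAM3
    return structured, open_vocab

structured, open_vocab = route(["Allen key", "Hammer"])
```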


Detection Mathematics

Bounding Box Conversion

YOLO outputs:

$$ (x_c, y_c, w, h) $$

Converted to:

$$ x_1 = x_c - \frac{w}{2}, \quad x_2 = x_c + \frac{w}{2} $$

$$ y_1 = y_c - \frac{h}{2}, \quad y_2 = y_c + \frac{h}{2} $$
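The conversion is a two-line helper:

```python
def cxcywh_to_xyxy(xc, yc, w, h):
    """Convert a YOLO center-format box to corner format,
    per the equations above."""
    return (xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)
```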


Mask Blending

$$ \text{canvas}[mask] = 0.55 \cdot \text{canvas}[mask] + 0.45 \cdot \text{color} $$
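A per-pixel version of that blend, for illustration — the real pipeline applies it vectorized over the whole mask (e.g. with NumPy) rather than looping:

```python
# Per-pixel sketch of canvas[mask] = 0.55*canvas[mask] + 0.45*color.
ALPHA = 0.55  # fraction of the original canvas kept under the mask

def blend(canvas, mask, color):
    """canvas: HxW grid of (r, g, b); mask: HxW booleans; color: (r, g, b)."""
    for y, row in enumerate(mask):
        for x, inside in enumerate(row):
            if inside:
                canvas[y][x] = tuple(
                    ALPHA * c + (1 - ALPHA) * k
                    for c, k in zip(canvas[y][x], color)
                )
    return canvas

canvas = [[(100, 100, 100)]]
blend(canvas, [[True]], (0, 255, 255))  # cyan highlight, as for Clutter
```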


Centroid Calculation

$$ ys, xs = \text{where}(mask) $$

$$ c_x = \text{mean}(xs) $$

$$ c_y = \text{mean}(ys) $$

Computing the centroid from the mask's own pixels, rather than the bounding-box center, keeps it anchored to the object even for irregular shapes — crucial for physical servo pointing.
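The same computation in plain Python (the backend uses the NumPy `where`/`mean` form shown above):

```python
def mask_centroid(mask):
    """Mean of the mask's pixel coordinates -- the plain-Python
    equivalent of c_x = mean(xs), c_y = mean(ys) above."""
    pts = [(x, y)
           for y, row in enumerate(mask)
           for x, v in enumerate(row) if v]
    cx = sum(x for x, _ in pts) / len(pts)
    cy = sum(y for _, y in pts) / len(pts)
    return cx, cy

# An L-shaped mask: the centroid tracks the object's pixels,
# not the center of its bounding box.
mask = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
]
```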


5️⃣ Hardware Extension – Laser Pointer Rig

The centroid coordinates are fed into a dual-servo laser system.

Yaw rotation matrix:

$$ R_y = \begin{bmatrix} \cos \theta & -\sin \theta & 0 \\ \sin \theta & \cos \theta & 0 \\ 0 & 0 & 1 \end{bmatrix} $$

Ray-plane intersection:

$$ t = -\frac{z_0}{d_z} $$

$$ x = x_0 + t \cdot d_x $$

$$ y = y_0 + t \cdot d_y $$

Inverse kinematics are solved via numerical optimization.
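The forward model — rotate the laser's direction by the yaw angle, then intersect the ray with the work surface — can be sketched directly from the equations above. The function names are illustrative, and the numerical inverse-kinematics solve is omitted:

```python
import math

def yaw(theta, v):
    """Rotate direction v = (x, y, z) by theta about the z axis,
    i.e. apply the R_y matrix above."""
    x, y, z = v
    return (math.cos(theta) * x - math.sin(theta) * y,
            math.sin(theta) * x + math.cos(theta) * y,
            z)

def hit_plane(origin, d):
    """Intersect the ray origin + t*d with the z = 0 work surface,
    using t = -z0 / dz from the equations above."""
    x0, y0, z0 = origin
    dx, dy, dz = d
    t = -z0 / dz
    return (x0 + t * dx, y0 + t * dy)

# Laser 1 m above the bench, tilted slightly, yawed 90 degrees:
# the spot lands 0.1 m along +y instead of +x.
spot = hit_plane((0.0, 0.0, 1.0), yaw(math.pi / 2, (0.1, 0.0, -1.0)))
```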


Challenges we ran into

  • Coordinating multimodal AI components across Modal GPU containers
  • Managing parallel A10G + H100 inference
  • Reducing cold start latency
  • Handling cluttered, overlapping objects
  • Designing robust semantic routing
  • Synchronizing frontend and backend devices

Accomplishments that we're proud of

  • Fully hands-free AI interaction
  • Parallel serverless GPU orchestration using Modal
  • Real-time multimodal inference
  • Pixel-level segmentation
  • Open-vocabulary fallback
  • Physical servo pointing integration

We built a spatially-aware AI system operating in real physical space — powered by serverless GPU infrastructure.


What we learned

  • Multimodal AI meaningfully reduces frontline friction.
  • Semantic routing is more powerful than keyword detection.
  • Mask centroids are physically meaningful.
  • Serverless GPU infrastructure (Modal) makes advanced real-time AI deployable in hours, not weeks.

What's next for The ToolFinder

  • Mobile deployment
  • On-device inference
  • AR overlays
  • Inventory tracking integration
  • Multi-camera fusion
  • Predictive workspace optimization

The ToolFinder is a step toward AI that works alongside frontline professionals — on the job, in motion, and in the flow of work — powered by scalable serverless GPUs.
