CADence

Inspiration

We kept coming back to the same frustration: 3D CAD software is absurdly hard to learn. Fusion 360, SolidWorks, OnShape — they're incredible tools built for engineers with years of training. But what about the hobbyist who just wants to 3D-print a case for their Raspberry Pi? They have the idea in their head. They could probably describe it to you in 30 seconds. But turning that description into actual geometry means learning a complex UI, mastering constraint sketches, and wrestling with feature trees.

Meanwhile, AI can now understand natural language at a level that would've sounded like science fiction three years ago. And MediaPipe can track your hand in real-time from a laptop webcam. We thought: what if we just... connected these things? What if you could talk to your CAD software and wave at your 3D viewport like it was a hologram?

The pitch that got us started: "Upload a Raspberry Pi spec sheet, say 'build me a case for this,' and watch the AI agent read mounting hole positions, board dimensions, and clearances from the datasheet to build a correctly-dimensioned enclosure — then inspect it with your hands."

That's the product we set out to build.

What It Does

CADence is a fully browser-based 3D CAD editor with two radical inputs: your hands and your voice.

Navigate with gestures:

  • Finger gun (thumb + index extended) → orbit the camera around your model
  • Pinch + three open fingers → pan the viewport
  • Closed fist held for 1 second → reset camera to default view
  • Raise your index finger → push-to-talk, start recording a voice command

Design with voice:

  • "Add a box, 85 by 56 by 3 millimeters"
  • "Put four cylinders at the mounting hole positions from the spec"
  • "Subtract them to make through-holes"
  • "Make the case hollow with 2mm wall thickness"

Leverage real engineering data:

  • Upload a PDF datasheet and the system extracts dimensional constraints, mounting positions, clearances, and interface specs
  • The AI agent actively queries these constraints while designing — it doesn't guess dimensions from training data, it looks them up

Everything runs in the browser. The geometry engine, the 3D renderer, the hand tracking — all client-side. The AI agent runs server-side and streams its thought process and tool calls back in real time via SSE. You literally watch it think and build.

How We Built It

The Architecture Problem

The first real decision was architectural, and it shaped everything. We needed:

  • A geometry engine (JSCAD) that produces actual solid models
  • A 3D renderer (Three.js) for visualization
  • An AI agent (Claude) for understanding commands
  • Hand tracking (MediaPipe) for viewport control
  • All of it in a browser

The problem: Claude can't run in a browser, and you can't ship JSCAD's geometry engine to a Python backend without enormous complexity. So we designed a hybrid split-state architecture:

  • Frontend owns geometry truth — JSCAD runs in a Web Worker, Three.js renders meshes
  • Backend owns metadata truth — FastAPI holds an abstract scene model (positions, labels, types, bounding boxes)
  • Backend orchestrates the agent — Claude runs server-side, emits tool calls that stream to the frontend for execution

The data flow for a single voice command like "add a cylinder":

$$ \text{Voice} \xrightarrow{\text{POST}} \text{Whisper} \xrightarrow{\text{transcript}} \text{Claude} \xrightarrow{\text{SSE tool_call}} \text{JSCAD Worker} \xrightarrow{\text{POST result}} \text{Claude} \xrightarrow{\text{SSE done}} \text{Ready} $$

One HTTP request, one SSE stream, one tool-result POST per operation. The user sees each step live in the chat panel.
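The frontend side of that loop can be sketched as a pure event dispatcher. This is a hedged sketch: the event names ("thinking", "tool_call", "done") and payload shapes are illustrative assumptions, not the actual wire format.

```typescript
// Hedged sketch of the frontend side of the SSE loop. Event names and
// payload shapes are illustrative assumptions, not the real wire format.

type AgentEvent =
  | { type: "thinking"; text: string }
  | { type: "tool_call"; id: string; tool: string; params: Record<string, unknown> }
  | { type: "done" };

interface AgentSink {
  appendThought(text: string): void;
  runTool(tool: string, params: Record<string, unknown>): unknown; // executed by the JSCAD worker
  postToolResult(id: string, result: unknown): void;               // POSTed back to the backend
  finish(): void;
}

// Pure dispatcher: one streamed event in, one side effect out.
export function dispatchAgentEvent(ev: AgentEvent, sink: AgentSink): void {
  switch (ev.type) {
    case "thinking":
      sink.appendThought(ev.text);
      break;
    case "tool_call": {
      const result = sink.runTool(ev.tool, ev.params);
      sink.postToolResult(ev.id, result);
      break;
    }
    case "done":
      sink.finish();
      break;
  }
}
```

Keeping the dispatcher pure (all side effects behind the sink interface) makes the streamed sequence easy to unit-test without a live SSE connection.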

Tech Stack

Frontend: React 18 + Vite, TypeScript, Tailwind + shadcn/ui, Three.js (direct — not React Three Fiber), JSCAD in a Web Worker, MediaPipe HandLandmarker, earcut for triangulation

Backend: FastAPI + uvicorn, Claude Opus 4.6 (design agent), Claude Sonnet 4.6 (design review), Whisper API (speech-to-text), OpenDataLoader (PDF parsing), SSE-Starlette for streaming

The Geometry Pipeline

JSCAD produces solid geometry as polygon soups. Three.js needs indexed triangle buffers. Bridging them required:

  1. JSCAD geom3.toPolygons() → polygon list
  2. For each polygon: compute normal via Newell's method, project vertices to 2D
  3. Earcut triangulation (handles concave polygons from boolean ops)
  4. Pack into Float32Array positions + normals
  5. Transfer as a Transferable to the main thread → BufferGeometry → mesh
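Steps 2 and 4 can be sketched as follows. Newell's method and the per-vertex packing match the description above; for brevity this sketch fan-triangulates (which skips the 2D projection and only handles convex polygons), whereas the real pipeline uses earcut to cover the concave polygons produced by boolean ops.

```typescript
// Sketch of steps 2 and 4: Newell's method for the polygon normal, then
// packing triangulated positions and per-vertex normals into flat arrays.
// Fan triangulation stands in for earcut here, so concave polygons are
// not handled; everything else follows the pipeline described above.

type Vec3 = [number, number, number];

// Newell's method: robust normal for a (possibly slightly non-planar) polygon.
function newellNormal(poly: Vec3[]): Vec3 {
  let nx = 0, ny = 0, nz = 0;
  for (let i = 0; i < poly.length; i++) {
    const [ax, ay, az] = poly[i];
    const [bx, by, bz] = poly[(i + 1) % poly.length];
    nx += (ay - by) * (az + bz);
    ny += (az - bz) * (ax + bx);
    nz += (ax - bx) * (ay + by);
  }
  const len = Math.hypot(nx, ny, nz) || 1;
  return [nx / len, ny / len, nz / len];
}

// Flat position + normal arrays, ready to copy into Float32Arrays
// and transfer to the main thread.
function packPolygon(poly: Vec3[]): { positions: number[]; normals: number[] } {
  const n = newellNormal(poly);
  const positions: number[] = [];
  const normals: number[] = [];
  for (let i = 1; i + 1 < poly.length; i++) {        // fan: (0, i, i+1)
    for (const v of [poly[0], poly[i], poly[i + 1]]) {
      positions.push(...v);
      normals.push(...n);
    }
  }
  return { positions, normals };
}
```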

And the coordinate system mismatch: JSCAD is Z-up, Three.js is Y-up. Every vector crossing the boundary gets remapped:

$$ \text{JSCAD} \rightarrow \text{Three.js}: \quad (x, y, z) \mapsto (x, z, -y) $$

We enforced a single transform boundary in engine.ts to prevent double-conversion bugs.
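That boundary can be as small as a pair of inverse functions; a minimal sketch of the idea, assuming the frontend keeps vectors as plain tuples:

```typescript
// Sketch of the single transform boundary: every vector crossing from
// JSCAD space (Z-up) to Three.js space (Y-up) goes through exactly one
// function, so a double conversion is impossible by construction.

type Vec3 = [number, number, number];

export function jscadToThree([x, y, z]: Vec3): Vec3 {
  return [x, z, -y];
}

// Inverse, for sending Three.js-side picks or edits back to JSCAD space.
export function threeToJscad([x, y, z]: Vec3): Vec3 {
  return [x, -z, y];
}
```

The two functions compose to the identity, which is the property a reconciliation test can pin down.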

The Gesture System

MediaPipe gives us 21 hand landmarks at ~30fps. We classify gestures using geometric heuristics on those landmarks:

  • Thumb extension: $\frac{d(\text{tip}_4, \text{wrist}_0)}{d(\text{MCP}_2, \text{wrist}_0)} > 1.3$ (filters out resting thumb)
  • Finger extended: $\text{tip}_y < \text{MCP}_y$ (tip above knuckle)
  • Pinching: $d(\text{thumb}_4, \text{index}_8) < 0.06$ (normalized distance)
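The three heuristics above translate almost directly into code. A sketch, assuming MediaPipe's landmark indexing (0 = wrist, 2 = thumb MCP, 4 = thumb tip, 5 = index MCP, 8 = index tip) and the thresholds from the text:

```typescript
// Geometric gesture heuristics on MediaPipe's 21 normalized landmarks.
// Thresholds (1.3, 0.06) are the ones given in the text.

type Landmark = { x: number; y: number; z: number };

function dist(a: Landmark, b: Landmark): number {
  return Math.hypot(a.x - b.x, a.y - b.y, a.z - b.z);
}

// Thumb extended: tip-to-wrist distance relative to MCP-to-wrist distance,
// which filters out a thumb resting against the palm.
export function thumbExtended(lm: Landmark[]): boolean {
  return dist(lm[4], lm[0]) / dist(lm[2], lm[0]) > 1.3;
}

// Finger extended: tip above knuckle. MediaPipe image coordinates grow
// downward, so "above" means a smaller y value.
export function fingerExtended(lm: Landmark[], tip: number, mcp: number): boolean {
  return lm[tip].y < lm[mcp].y;
}

// Pinch: thumb tip close to index tip in normalized image space.
export function isPinching(lm: Landmark[]): boolean {
  return dist(lm[4], lm[8]) < 0.06;
}
```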

Camera control uses exponential smoothing on palm position:

$$ \mathbf{p}_{t} = \alpha \cdot \mathbf{p}_{\text{raw}} + (1 - \alpha) \cdot \mathbf{p}_{t-1}, \quad \alpha = 0.2 $$

Deltas below a dead zone of $\epsilon = 0.004$ are zeroed to kill jitter.
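Smoothing and dead zone fit in one small stateful filter. A sketch with the constants from the text:

```typescript
// Camera-input filter: exponential smoothing on palm position plus a
// dead zone on the resulting delta, using the constants from the text.

const ALPHA = 0.2;
const DEAD_ZONE = 0.004;

export class PalmFilter {
  private prev: [number, number] | null = null;

  // Returns the smoothed delta to apply to the camera, or [0, 0]
  // when the motion is below the jitter threshold.
  update(rawX: number, rawY: number): [number, number] {
    if (this.prev === null) {
      this.prev = [rawX, rawY];
      return [0, 0];
    }
    const sx = ALPHA * rawX + (1 - ALPHA) * this.prev[0];
    const sy = ALPHA * rawY + (1 - ALPHA) * this.prev[1];
    const dx = sx - this.prev[0];
    const dy = sy - this.prev[1];
    this.prev = [sx, sy];
    if (Math.hypot(dx, dy) < DEAD_ZONE) return [0, 0];
    return [dx, dy];
  }
}
```

A large jump produces a damped delta, while sub-threshold jitter produces exactly zero, so the camera never creeps.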

The Agent Loop

Claude operates as "Cadence," a design agent with 13 CAD tools (add_primitive, subtract, union, intersect, move, rotate, scale, delete, set_color, rename, clone, linear_pattern, design_review) plus 4 constraint query tools.

The agent loop is a while True until stop_reason == "end_turn". Multiple tool calls per turn are supported — all results are batched into a single user message with multiple tool_result blocks. Object IDs are server-assigned (obj_1, obj_2, ...) and injected into parameters before streaming to the frontend.
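The loop's shape can be sketched as below. This is a hedged sketch, not the actual Anthropic SDK surface: callModel and the Turn shape are placeholders standing in for the real Messages API call, but the control flow (loop until end_turn, batch all tool results into one user message) matches the description.

```typescript
// Hedged sketch of the agent loop: loop until the model stops asking for
// tools, execute every tool call in the turn, and batch all results into
// a single user message. callModel abstracts the real model API.

type ToolCall = { id: string; name: string; input: unknown };
type Turn = { stopReason: "end_turn" | "tool_use"; toolCalls: ToolCall[] };
type Message = { role: "user" | "assistant"; content: unknown };

export async function runAgentLoop(
  callModel: (history: Message[]) => Promise<Turn>,
  runTool: (call: ToolCall) => Promise<unknown>,
  history: Message[],
): Promise<number> {
  let turns = 0;
  while (true) {
    const turn = await callModel(history);
    turns++;
    if (turn.stopReason === "end_turn") return turns;
    // Execute every tool call in the turn, then batch all results
    // into one user message with multiple tool_result blocks.
    const results: { type: string; tool_use_id: string; content: unknown }[] = [];
    for (const call of turn.toolCalls) {
      results.push({ type: "tool_result", tool_use_id: call.id, content: await runTool(call) });
    }
    history.push({ role: "user", content: results });
  }
}
```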

The Constraint Pipeline

When you upload a PDF:

  1. OpenDataLoader converts it to structured JSON + Markdown (with embedded images for diagram pages)
  2. Claude extracts typed constraints: dimensional, mounting, clearance, interface, material, electrical, thermal
  3. Constraints are stored with feature tags for semantic search
  4. During design, the agent queries constraints by category or keyword before choosing dimensions

This is the key differentiator — the agent doesn't hallucinate "standard Raspberry Pi dimensions." It looks them up from your actual datasheet.

Challenges We Faced

The Fundamental Tension: CAD Precision vs. LLM Unpredictability

CAD design is one of the most unforgiving domains you can point an LLM at. Engineering drawings deal in absolutes — a mounting hole is 3.5mm in diameter at exactly coordinates $(58, 49)$, or the screw doesn't fit. Tolerances are measured in fractions of a millimeter. A constraint violation doesn't produce a "slightly wrong" design — it produces a physically non-functional part.

LLMs, by their nature, are the opposite. They hallucinate regularly. They approximate. And critically, their performance degrades as context increases — the more information you feed them, the less reliably they use it. This creates a direct conflict with CAD workflows where:

  • Spec sheets can be dozens of pages with hundreds of dimensional constraints
  • Complex designs accumulate a long history of operations, object IDs, and spatial relationships
  • Every single value matters — you can't afford the model "forgetting" that a clearance requirement exists because it's buried 40 messages deep in the conversation

Naively dumping a full PDF spec sheet into Claude's context and saying "design something that fits this" is a recipe for hallucinated dimensions, violated constraints, and context bloat that makes every subsequent operation less reliable.

Our Solution: Tool Harness + Memory Management

Instead of treating the LLM as an omniscient designer that holds the entire problem in its head, we built a system that keeps the model's context lean and makes every edit deterministic.

Constraint extraction over context stuffing. When a user uploads a spec sheet, we don't pass the raw PDF to the agent. We run a separate extraction pass that pulls out structured constraints — typed, categorized, and tagged. During design, the agent queries specific constraints on demand via tool calls (get_constraints_by_category, search_constraints). The agent's context never contains the full spec sheet — just the 2-3 constraints relevant to the current operation.

Deterministic tool calls over freeform generation. We tried having Claude output raw geometry code. It hallucinated API calls, invented parameters, and produced dimensionally wrong shapes. The fix was constraining all edits to a closed set of deterministic tools — add_primitive, subtract, union, move, etc. Each tool has a strict schema. The LLM selects the tool and provides parameters, but the actual geometry operation is executed by a deterministic engine (JSCAD). The model can't produce an invalid boolean operation or a malformed mesh — the tool either succeeds or returns an error.

Conversation trimming with summarization. The agent's conversation history is capped at 40 entries. At the boundary, older messages are replaced with a compressed summary of what's been built so far — object IDs, positions, relationships. This prevents the context from growing unboundedly as designs get more complex, while preserving the metadata the agent needs for subsequent operations.
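A minimal sketch of that trimming policy, where summarize stands in for the real compression step (which preserves object IDs, positions, and relationships):

```typescript
// Sketch of history trimming: once the history passes the cap, the
// oldest messages collapse into one summary entry and only the most
// recent messages are kept verbatim. summarize is a placeholder for
// the real compression step.

type Msg = { role: string; content: string };

export function trimHistory(
  history: Msg[],
  cap: number,
  summarize: (older: Msg[]) => string,
): Msg[] {
  if (history.length <= cap) return history;
  const keep = cap - 1; // one slot reserved for the summary message
  const older = history.slice(0, history.length - keep);
  const summary: Msg = { role: "user", content: summarize(older) };
  return [summary, ...history.slice(history.length - keep)];
}
```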

Server-assigned IDs eliminate referencing errors. Object IDs (obj_1, obj_2, ...) are assigned server-side and injected into tool call parameters before they reach the frontend. The agent never invents an ID — it receives them in tool results. This removes an entire class of hallucination where the model fabricates or confuses object references.
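The allocation itself is tiny; a sketch of the idea (the `id` parameter name is an assumption):

```typescript
// Sketch of server-side ID assignment: before a create-style tool call
// streams to the frontend, the server injects the next obj_N id into its
// parameters, so the model never has to invent an object reference.

export class IdAllocator {
  private next = 1;

  // Mutating injection keeps the call object's identity stable for streaming.
  inject(params: Record<string, unknown>): string {
    const id = `obj_${this.next++}`;
    params.id = id;
    return id;
  }
}
```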

The result: the LLM operates with a lean, curated view of its working environment. It sees the current scene state, can look up constraints as needed, and expresses design intent through validated tool calls. The system around it handles everything that requires precision.

Why This Matters Beyond Our Project

This is the core design pattern for putting LLMs into precision-critical workflows: don't ask the model to be precise — build a harness that makes imprecision impossible. Extract structured data so the model doesn't carry raw documents. Constrain outputs to deterministic tool calls so the model can't produce invalid results. Manage context aggressively so performance doesn't degrade as the task grows.

CAD was our proving ground, but the same tension — stochastic AI meets deterministic requirements — exists in circuit design, structural engineering, pharmaceutical formulation, and any domain where constraints are non-negotiable.

What We Learned

Split-state architectures are powerful but fragile. When your geometry truth lives in a browser Web Worker and your metadata truth lives on a Python server, anything can drift. We had to build reconciliation: every request sends the current object ID manifest, and the backend diffs it against its model.
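The diff at the heart of that reconciliation is a simple set comparison; a sketch, with the field names as illustrative assumptions:

```typescript
// Sketch of manifest reconciliation: the frontend's object ID manifest
// is diffed against the backend's scene model. Anything the client has
// but the server doesn't is missing on the server; anything the server
// tracks but the client no longer has is stale.

export function diffManifest(
  clientIds: string[],
  serverIds: string[],
): { missingOnServer: string[]; staleOnServer: string[] } {
  const server = new Set(serverIds);
  const client = new Set(clientIds);
  return {
    missingOnServer: clientIds.filter((id) => !server.has(id)),
    staleOnServer: serverIds.filter((id) => !client.has(id)),
  };
}
```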

Agentic tool use beats prompt engineering for structured tasks. We tried having Claude output JSCAD code directly. It hallucinated API calls that don't exist. Switching to a constrained tool set with server-assigned IDs eliminated almost all hallucination. The agent can only do what the tools allow.

Gesture recognition is a signal processing problem, not a classification problem. Getting the right gesture 95% of frames is useless if the other 5% causes false state transitions. Smoothing, dead zones, and debouncing are where the actual usability lives.

Real-time bidirectional streaming creates a magical UX. Watching Claude reason ("I'll create the base plate first, then add mounting posts..."), then seeing each shape appear in the viewport as it emits tool calls — that loop is what makes the product feel alive. It's worth the engineering complexity of SSE + tool result POST.

PDF constraint extraction is undersold. Everyone talks about chatting with PDFs. Using extracted data as live constraints during a design task is a fundamentally different and more useful pattern. The agent doesn't summarize the datasheet — it uses it as a lookup table while building.

What's Next

  • Two-hand zoom gesture — pinch with both hands, change distance to dolly
  • Real-time collaborative sessions — multiple users gesture-navigating and voice-commanding the same scene
  • Constraint visualization — overlay extracted dimensions and clearances directly on the 3D model
  • Undo/redo with voice — "undo the last three steps"
  • Export to more formats — STEP, OBJ, 3MF alongside STL
  • On-device voice processing — eliminate the Whisper API round-trip with browser-local transcription

The vision: anyone with a webcam and an idea should be able to design a 3D-printable part in under five minutes, with engineering-grade dimensional accuracy from real spec sheets.
