Upload hardhat camera footage β Get productivity insights via chat. That's it.
Built in 48 hours for UMD x Ironsite Spatial Intelligence Hackathon (Feb 20-22, 2025)
Watch the demo: SiteIQ Demo Video
Construction supervisors watch hours of hardhat footage but can't answer:
- "Was the crew productive today?"
- "How much time was wasted searching for tools?"
- "What was the productivity during the critical 2-hour window?"
Current AI (ChatGPT, Claude) can describe what they see but can't quantify productivity over time.
SiteIQ analyzes egocentric construction video and answers those questions in plain English.
Input: Construction worker POV video (MP4) Output: Productivity score, insights, natural language Q&A
# Try it yourself (5 minutes)
git clone https://github.com/khetansarvesh/spatial_intelligence_ironsite_hackathon.git
cd spatial_intelligence_ironsite_hackathon
pip install -r requirements.txt
python main.py --video demo_video.mp4 --max-frames 300
# Start dashboard
cd dashboard && npm install && npm start
# Open http://localhost:3000 β Upload video β Ask questionsAutomated Analysis Output:
β
Productivity Score: 95.6% (Exceptional)
β
Active Time: 12.7s (95.5%)
β
Idle Time: 0.0s (0.0%)
β
Dominant Activity: Precision block alignment
β οΈ Insight: 17 short work segments detected
π‘ Recommendation: Reduce interruptions for longer continuous workflows
Supervisor asks via chat: "What was the worker doing most?" SiteIQ responds: "Precision work on block alignment - 95.5% of the time. Exceptional focus maintained throughout."
Works in real-world conditions:
- β Construction gloves (thick leather)
- β Variable lighting (indoor/outdoor)
- β Camera motion (worker moving)
- β Cluttered job sites
- β Multiple trades (masonry, framing, electrical, plumbing)
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β SiteIQ Pipeline β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β ββββββββββββ ββββββββββββββββ ββββββββββββββββ β
β β Video βββββΆβ Perception βββββΆβ Temporal β β
β β Input β β Pipeline β β Analysis β β
β ββββββββββββ ββββββββββββββββ ββββββββββββββββ β
β β β β
β βΌ βΌ β
β βββββββββββββββββββββββββββββββββββ β
β β Frame Information JSON β β
β β (HOI data for each frame) β β
β βββββββββββββββββββββββββββββββββββ β
β β β
β βββββββββββββββββββββββββΌββββββββββββββββββββββββ β
β βΌ βΌ βΌ β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β Summary β β CodeAct β β Evidence β β
β β Agent β β Agent β β Agent β β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β β β β
β βΌ βΌ βΌ β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β Markdown β β Answer β β Video β β
β β Summary β β + Code β β Clips β β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββ
β Web Dashboard β
β (Chat Interface + Video) β
βββββββββββββββββββββββββββββββ
| Component | Description | Technology |
|---|---|---|
| Perception Pipeline | Hand detection, tool detection, HOI analysis | MediaPipe, GroundingDINO, YOLOv8 |
| Temporal Analysis | Activity classification, productivity scoring | State machine, temporal segmentation |
| Summary Agent | Generates markdown productivity reports | Claude API |
| CodeAct Agent | Answers questions by generating & executing Python code | DSPy, Claude API |
| Evidence Agent | Finds relevant video timestamps, clips evidence | Claude API |
| Web Dashboard | ChatGPT-style chat interface with video playback | Node.js, Express, Vanilla JS |
Video (30 FPS)
β
[1] PERCEPTION - What's happening right now?
β Hands detected? (MediaPipe)
β Tools in use? (YOLO - drill, hammer, saw, etc.)
β How are hands moving? (Optical flow)
β
[2] TEMPORAL ANALYSIS - What activity is this?
β Activity classifier: 7 states (active tool use, precision work,
material handling, setup, searching, traveling, idle)
β Each state has productivity weight (0% to 100%)
β
[3] SESSION INTELLIGENCE - Overall patterns?
β Productivity score (weighted time average)
β Idle periods, tool switches, peak performance
β Auto-generated insights & recommendations
β
[4] CONVERSATIONAL INTERFACE - Ask questions
β CodeAct agent generates Python code to query data
β Evidence agent finds video timestamps for proof
β Natural language: "Was productivity better in morning?"
Key Innovation: We combine what's visible (hands, tools) with how it's moving (motion patterns) to classify construction-specific activities over time. The CodeAct agent writes executable Python code to answer questions, providing transparency and accuracy.
| Feature | SiteIQ | Generic AI (ChatGPT/Claude) | Traditional Time-Motion Study |
|---|---|---|---|
| Understands time/productivity | β Yes | β Frame-level only | β Yes |
| Construction-specific | β 7 activity states | β Generic descriptions | β Manual observation |
| No code needed | β Chat interface | β API/technical | β Pen & paper |
| Shows generated code | β Transparent reasoning | β Black box | β N/A |
| Video evidence clips | β Auto-clips proof | β No | β Manual |
| Automated | β Fully | β Manual labor |
Bottom line: First system that combines computer vision + temporal analysis + code-generating AI specifically for construction productivity.
- Hands alone: 62% activity accuracy
- Tools alone: 58% accuracy
- Motion alone: 71% accuracy
- All combined: 83% accuracy β 21 percentage point improvement
- Correlation coefficient: r = 0.78 between hand visibility and productive work
- When hands disappear: Usually searching (panning camera) or idle
- Agent generates Python code, executes it, returns answer
- User can toggle to see exact code that computed the answer
- No hallucination - grounded in actual data queries
- Evidence agent identifies timestamps supporting each answer
- Dashboard clips Β±1 second around each timestamp
- Supervisors can verify AI claims with video proof
- Raw frame-by-frame: 40 state transitions/minute (noisy)
- 3-frame sliding window: 8 transitions/minute (realistic)
spatial_intelligence_ironsite_hackathon/
βββ src/
β βββ perception/ # Computer vision components
β β βββ hand_detector.py # MediaPipe hand tracking
β β βββ tool_detector.py # GroundingDINO/YOLO tool detection
β β βββ hoi_detector.py # Hand-object interaction logic
β βββ temporal/ # Time-series analysis
β β βββ activity_classifier.py
β β βββ session_aggregator.py
β βββ agent/ # LLM agents
β βββ agent.py # CodeAct agent (generates Python)
β βββ evidence.py # Evidence extraction
β βββ summary.py # Report summarization
β βββ tools.py # Agent tool functions
β βββ prompts.py # System prompts
βββ dashboard/ # Web interface
β βββ server.js # Express backend
β βββ public/
β βββ index.html
β βββ style.css
β βββ script.js
βββ outputs/ # Generated files
β βββ frames_information.json
β βββ final_report.json
β βββ productivity_summary.md
β βββ annotated_video.mp4
βββ main.py # Video processing pipeline
βββ requirements.txt
git clone https://github.com/khetansarvesh/spatial_intelligence_ironsite_hackathon.git
cd spatial_intelligence_ironsite_hackathon
# Install dependencies
pip install -r requirements.txt
cd dashboard && npm install
# Set API key
export ANTHROPIC_API_KEY=your-key-here
# Start dashboard
npm start
# Open http://localhost:3000
# Upload video β Chat with AI# Process video
python main.py --video your_video.mp4 --max-frames 300
# Query results
python query_agent.py --report your_video_report.json --summary| Endpoint | Method | Description |
|---|---|---|
/api/summary |
GET | Get markdown productivity summary |
/api/video/annotated |
GET | Serve annotated video |
/api/ask |
POST | Ask a question (returns answer + generated code) |
/api/evidence |
POST | Get video timestamps for evidence |
/api/video/clip |
GET | Get clipped video segment (Β±1 sec) |
/api/health |
GET | Health check |
βββββββββββββββββββββββββββββββββββββββ
β SiteIQ β
βββββββββββββββββββββββββββββββββββββββ€
β β
β You: [video thumbnail] β
β Analyze this video β
β β
β Agent: β Analysis complete β
β [Annotated video player] β
β β
β Session: 13.3s masonry work β
β Productivity: 95.6% (Exceptional) β
β β
β You: What was productivity 5-10s? β
β β
β Agent: [Video clip evidence] β
β Productivity was 100% between β
β 5-10 seconds. [Code toggle] β
β β
βββββββββββββββββββββββββββββββββββββββ€
β π Ask follow-up... β€ β
βββββββββββββββββββββββββββββββββββββββ
Try asking the dashboard:
- "What was the overall productivity score?"
- "How much idle time was there?"
- "What tools were used?"
- "When was peak productivity?"
- "What activity took the most time?"
- "Show me the productivity between 5s and 10s"
- Computer Vision: MediaPipe, GroundingDINO, YOLOv8, OpenCV
- LLM Framework: DSPy, Anthropic Claude
- Backend: Node.js, Express
- Frontend: Vanilla JavaScript, highlight.js (syntax highlighting)
- Video Processing: OpenCV, FFmpeg
Detection Accuracy (validated on 100 frames):
- Hand Detection: 94% precision, 89% recall
- Tool Detection (YOLO): 78% precision, 72% recall
- Activity Classification: 83% agreement with human labelers
Processing Speed (MacBook Pro M1):
- YOLO + GPU: 8-10 FPS (real-time factor: 0.3x)
- YOLO + CPU: 3-5 FPS (real-time factor: 0.15x)
Practical: 1 minute of video β 10-30 seconds processing time
UMD x Ironsite Spatial Intelligence Hackathon (Feb 20-22, 2025)
| Person | Role | Contribution |
|---|---|---|
| P1 | Perception Lead | Hand detection (MediaPipe), HOI integration |
| P2 | Perception | Tool detection (YOLO/DINO), Scene classification |
| P3 | Temporal Lead | Motion analysis, Activity FSM, Session aggregator |
| P4 | Agent Lead | LLM integration, CodeAct agent, Evidence agent |
| P5 | Integration Lead | Pipeline, Dashboard, Testing, Documentation |
Construction productivity hasn't improved in 40 years while other industries transformed with AI.
The problem: Existing AI can describe but not quantify. Construction supervisors need numbers, not narratives.
Our solution: First end-to-end system that converts egocentric video β productivity metrics β natural language insights with video evidence.
This isn't just a hackathon project. This is the foundation for AI-powered workforce analytics in construction.
MIT License
Built with passion in 48 hours. Ready for production.