Inspiration

At the surface level, construction happens when a team of workers alternates between repeatable tasks. Analyzing video footage of this should be easy, right?

After thinking a bit deeper and looking at the example footage provided by Ironsite, we quickly realized that applying any Machine Learning is... pretty darn hard. The footage is shaky and blurry, and it's hard to make out what's going on.

It was hard even for us humans to decipher what was happening, so we had a burning desire to first reconstruct the scene.

The VGGT paper (https://arxiv.org/abs/2503.11651) came to mind, but that technique was built for clean photos and videos… not for the type of footage we were tasked to work with.

While VGGT failed to produce data that was directly interpretable by an LLM, we felt that it was a solid basis for applying more standard ML techniques to analyze the data.

Our VR application is inspired by the fact that on a construction site there are tools everywhere, and it is hard for anyone to remember where everything is. So what if, in the future, we all wear smart glasses that can record and locate objects for us by analyzing the video stream? Better still, since a worker's hands are almost always busy, the interaction should be hands-free. You just ask the device — "where's the screwdriver?" — and it shows you, drawing a path on the floor from where you stand to where it last saw the tool.

Inspired by the ideas behind RAG, we take that a step further with IronWorld, our framework that makes 3D reconstructions traversable by agentic LLMs such as Gemma 4, letting general intelligence make better use of spatial data.

What it does

Iron World Backend

The way it works is: you upload any video, and it feeds into a 3D environment built by three models. Pi 3X reconstructs the scene in 3D, OWLv2 detects common objects in each frame, and SigLIP embeds each cluster in a shared vision-language space so abstract queries like "the dirtiest area" still land somewhere meaningful.
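
To make this concrete, here is a minimal sketch of the per-frame detection and embedding step using off-the-shelf Hugging Face checkpoints; the checkpoint names, the score threshold, and the `analyze_frame` helper are illustrative assumptions rather than the literal backend code.

```python
# Illustrative per-frame step: open-vocabulary detection (OWLv2) + embedding (SigLIP).
import torch
from PIL import Image
from transformers import (Owlv2ForObjectDetection, Owlv2Processor,
                          SiglipModel, SiglipProcessor)

det_proc = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
detector = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")
sig_proc = SiglipProcessor.from_pretrained("google/siglip-base-patch16-224")
siglip = SiglipModel.from_pretrained("google/siglip-base-patch16-224")

def analyze_frame(image: Image.Image, queries: list[str]):
    # OWLv2: boxes + scores for each text query ("hammer", "ladder", ...).
    inputs = det_proc(text=[queries], images=image, return_tensors="pt")
    with torch.no_grad():
        out = detector(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    dets = det_proc.post_process_object_detection(
        out, threshold=0.3, target_sizes=target_sizes)[0]

    # SigLIP: embed each detected crop into the shared vision-language space,
    # so later free-form queries can be matched by cosine similarity.
    crops = [image.crop(tuple(int(v) for v in box)) for box in dets["boxes"].tolist()]
    embeddings = None
    if crops:
        pixel = sig_proc(images=crops, return_tensors="pt")
        with torch.no_grad():
            embeddings = siglip.get_image_features(**pixel)
    return dets, embeddings
```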

\begin{equation} P(\ell) = 1 - \prod_{j} (1 - c_j) \end{equation}

When OWLv2's detections come back, each 3D point gets a vote from every detection it falls inside. A point only becomes confident in a label when multiple frames agree.
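
A minimal sketch of that vote, assuming we simply accumulate (label, confidence) pairs from every detection whose projected box contains the point, exactly as in the noisy-OR formula above:

```python
# Noisy-OR label voting for a single 3D point.
# `detections` holds (label, confidence) pairs from every 2D detection whose
# projected box contains this point, accumulated across all frames.
from collections import defaultdict

def label_confidence(detections: list[tuple[str, float]]) -> dict[str, float]:
    remaining_doubt = defaultdict(lambda: 1.0)   # running product of (1 - c_j) per label
    for label, conf in detections:
        remaining_doubt[label] *= (1.0 - conf)
    # P(label) = 1 - prod_j (1 - c_j): one confident frame helps,
    # but the score only approaches 1 when several frames agree.
    return {label: 1.0 - doubt for label, doubt in remaining_doubt.items()}

# e.g. three frames seeing "screwdriver" at 0.4 each -> 1 - 0.6**3 ≈ 0.78
```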

\begin{equation} s(f) = C^{\alpha} \cdot A^{\beta} \cdot V^{\gamma} \cdot e^{-\lambda_o O} \cdot Q \end{equation}

When the agent commits to a candidate region, it scores every source frame for how clearly that region shows up --- coverage C, projected framing A, view angle V, occlusion O, and image sharpness Q. The winning frame is what the user sees in the chat with the bounding box drawn back onto the original footage.
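
A sketch of that scoring function; the exponents and the occlusion weight are tunable, and the defaults below are placeholders:

```python
import math

def frame_score(C, A, V, O, Q, alpha=1.0, beta=1.0, gamma=1.0, lambda_o=2.0):
    """s(f) = C^alpha * A^beta * V^gamma * exp(-lambda_o * O) * Q

    C: coverage, the fraction of the region's points visible in the frame
    A: framing, how well the projected region fills and centers the image
    V: view-angle quality (e.g. cosine between view ray and region normal)
    O: estimated occlusion fraction, penalized exponentially
    Q: image sharpness, e.g. normalized variance of the Laplacian
    """
    return (C ** alpha) * (A ** beta) * (V ** gamma) * math.exp(-lambda_o * O) * Q

# The winning frame is simply the argmax over source frames:
# best = max(frames, key=lambda f: frame_score(*features(f)))  # features() = your extractor
```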

\begin{equation} J(v) = \sum_{i} b_i \, U_i \cdot \text{gain}(v, h_i) - \lambda \cdot \text{cost}(v) \end{equation}

If the agent is still unsure, it picks the next viewpoint that should resolve the most uncertainty about the user's query --- balancing what it'd learn against how far the figurine has to move.
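
A sketch of that selection rule; the hypothesis objects with `belief` and `uncertainty` fields and the `gain` and `cost` callables are stand-ins for how the agent's state is tracked:

```python
def viewpoint_utility(v, hypotheses, gain, cost, lam=0.5):
    # J(v) = sum_i b_i * U_i * gain(v, h_i) - lambda * cost(v)
    info = sum(h.belief * h.uncertainty * gain(v, h) for h in hypotheses)
    return info - lam * cost(v)

def next_best_view(candidate_views, hypotheses, gain, cost, lam=0.5):
    # Pick the viewpoint expected to resolve the most query-relevant uncertainty
    # per unit of movement cost.
    return max(candidate_views,
               key=lambda v: viewpoint_utility(v, hypotheses, gain, cost, lam))
```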

The VR Application

  1. Mapping the space. Using Meta's Scene API and the headset's depth sensor + cameras, the app builds a 3D model of the room — automatically labeling the floor, walls, ceiling, and large furniture, and producing a fine-grained mesh of all real-world surfaces.

  2. Detecting and locating objects. When the worker presses the controller trigger, the headset captures a frame from the passthrough camera and sends it to Gemini, which identifies and labels the objects in view (tools, materials, equipment). For each detection, the app casts a ray from the camera through the object's bounding box onto the room mesh, finds the exact surface the object is sitting on, and pins a labeled marker at that 3D location.

  3. Wayfinding. Once an object is registered, the worker can locate it later by voice query ("where's the screwdriver?"). The headset computes a walkable path across the floor (avoiding walls and furniture) and renders it as a glowing line in mixed reality, leading the worker directly to the tool; a minimal path-search sketch follows this list.
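
A minimal path-search sketch, assuming the scanned floor has been rasterized into a walkable/blocked grid; on the headset this runs against the room mesh rather than a toy Python grid:

```python
from collections import deque

def find_path(grid, start, goal):
    """Breadth-first search over a 2D floor grid.
    grid[r][c] is True where the floor is walkable (no wall or furniture);
    start and goal are (row, col) cells under the worker and the tool marker."""
    rows, cols = len(grid), len(grid[0])
    prev = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] and nxt not in prev:
                prev[nxt] = cell
                queue.append(nxt)
    if goal not in prev:
        return None                       # no walkable route
    path, cell = [], goal
    while cell is not None:               # walk back from goal to start
        path.append(cell)
        cell = prev[cell]
    return path[::-1]                     # cells to render as the glowing line
```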

How we built it

Iron World Backend

We sort of ended up doing a 36-hour "research"-athon instead of a hackathon. We experimented with the ol' faithful models such as SAM3.1, but they generally failed to provide good and reliable outputs. We needed a better backbone for our data.

For example, we experimented with unsupervised clustering on video footage, something we fell in love with and used at Georgia Tech's Hacklytics 2026. We felt that we could apply it here to detect repetitive worker movements, but it was nearly impossible to pin down what _exactly_ we wanted to cluster. At Hacklytics, we simply clustered in the space defined by YOLO poses of people's movements, and we got to choose the data we worked with (clean workout videos), but there is no such clean invariance in the data that Ironsite has to work with.

We then pivoted to scene reconstruction for the purpose of detecting safety hazards, but it struggled to maintain object permanence on the shaky helmet camera footage.

We finally ended up exploring statistical methods to improve LLMs' ability to navigate a reconstructed 3D scene, which led to IronWorld.

The VR Application

  1. The first step is to make a room scan. This is handled by Meta's Scene API, which creates a 3D model of the room by combining the infrared depth sensor and the cameras of the Quest 3 headset.

  2. Pressing the controller trigger captures a frame from the Meta Quest's camera and sends it to Gemini, which returns 2D bounding boxes and labels for each detected object. For each detection, a ray is cast from the camera through the center of the bounding box and intersected with the GlobalMesh at the surface the object sits on (e.g. a table). A labeled marker is placed at that 3D world point (see the raycast sketch after this list).

  3. Voice queries are handled in a single Gemini call. When the worker holds the left grip button on the controller, the headset starts recording from its built-in microphone at 16 kHz. On release, the recording is encoded as a WAV file and sent to Gemini 2.5 Flash along with a text prompt containing the list of object labels currently pinned in the room. Gemini — which accepts audio inputs natively — performs speech-to-text, intent recognition, and label matching in a single step, returning the closest matching label (e.g. "phone" → iPhone, "computer" → laptop). The matched label is then handed to the navigation system, which finds the corresponding marker and draws the path to it.
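
Offline, the marker-placement raycast in step 2 amounts to the following sketch; trimesh stands in for the on-device physics raycast against the GlobalMesh, and the intrinsics, camera pose, and mesh are assumed inputs:

```python
import numpy as np
import trimesh

def pin_marker(mesh: trimesh.Trimesh, K: np.ndarray, cam_to_world: np.ndarray, bbox):
    """K: 3x3 camera intrinsics, cam_to_world: 4x4 camera pose,
    bbox: (x0, y0, x1, y1) pixel bounding box returned by Gemini."""
    # Ray through the bounding-box center, in camera coordinates.
    u = (bbox[0] + bbox[2]) / 2.0
    v = (bbox[1] + bbox[3]) / 2.0
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Rotate into world coordinates and start the ray at the camera position.
    direction = cam_to_world[:3, :3] @ ray_cam
    origin = cam_to_world[:3, 3]
    hits, _, _ = mesh.ray.intersects_location(
        ray_origins=[origin], ray_directions=[direction])
    if len(hits) == 0:
        return None
    # Nearest intersection = the surface the object is sitting on.
    distances = np.linalg.norm(hits - origin, axis=1)
    return hits[np.argmin(distances)]     # 3D world point for the labeled marker
```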
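
Step 3 boils down to a single call along these lines, sketched here with the google-genai Python SDK; the headset makes the equivalent request on-device, and the prompt wording below is a paraphrase:

```python
from google import genai
from google.genai import types

client = genai.Client()   # reads the API key from the environment

def match_query_to_label(wav_bytes: bytes, pinned_labels: list[str]) -> str:
    # One call does speech-to-text, intent recognition, and label matching.
    prompt = (
        "The audio is a worker asking where an object is. "
        f"Object labels currently pinned in the room: {', '.join(pinned_labels)}. "
        "Reply with the single closest matching label and nothing else."
    )
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[
            types.Part.from_bytes(data=wav_bytes, mime_type="audio/wav"),
            prompt,
        ],
    )
    return response.text.strip()   # e.g. "iPhone" for "where's my phone?"

# The matched label is handed to the navigation system, which looks up the
# corresponding marker and draws the path to it.
```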

Challenges we ran into

When analyzing anything in the helmet camera videos, finding the right primitives to anchor ourselves in space was definitely the toughest part. Whether we were trying to count the number of bricks to place, identify risks to workers, or spot the repetitive things a worker is doing, we found it difficult to understand exactly what we should be looking at and what we should expect. There is a combinatorial explosion of environments and conditions the cameras can be subjected to at construction sites.

A lot of the time, something blocks the camera, or the workers only look at objects of interest for a split second before swiftly moving away. The fisheye lens, lighting variability, and motion blur don't help either. We found that many open source segmentation models fall apart here.

Accomplishments that we're proud of

We are proud of sticking to the goal of improving spatial AI. We had some experience working with Computer Vision, but we had little idea of how difficult it would be to analyze footage outside of the lab. There were many moments we wanted to quit and pivot to a different challenge, but instead we pivoted to other ideas within IronSite’s challenge.

What we learned

Computer Vision can get REALLY tough! We were too used to having open source YOLO and SAM models that do everything for free. Working on this project helped us tune our understanding of what can be done, what can’t be done, and what we don’t know.

What's next for Iron World

Iron World Backend

We definitely want to explore more applications of what we built and improve the reliability of the system in tough situations. Jack bikes with a GoPro on his head every day, and it would be interesting to see whether we can reconstruct the entire campus of the University of Florida and whether the system holds up at a much larger scale.

VR Application

  1. Continuous, hands-free observation: Drop the manual trigger — let the headset detect objects passively as the worker moves through the site, deduplicating across frames.
  2. Tool history, not just current state: "When did I last use the wrench?" "Who moved the toolbox?" — turning the device from a spatial index into a temporal one.
  3. Multi-worker shared map: Cloud-shared spatial anchors so the whole crew sees the same map of where every tool is.
  4. Domain-specific fine-tuning: Generic Gemini misses construction-specific objects (rebar gauges, specific drill bits, brand-stamped equipment). A fine-tuned tool-recognition layer would close this gap — this is where the hackathon's "fine-tuning" prize bucket matches naturally.

GitHub Link https://github.com/jiekaitao/ironworld
