Inspiration💡
Our goal is to build generative models that generate rich 4D worlds, much like the way we perceive the world (3D that moves with time). We also wanted to give the user control, so the project can serve a variety of applications: robotics, digital twins, interactive media, animation, and more.
Thus, we wanted the user to provide a 3D scene, from which our project generates a consistent 4D world.
What it does🔎
GOKU first takes a set of multi-view images (usually with camera poses) and reconstructs a 3D Gaussian Splat from them. The user selects a point and draws a 3D curve in this reconstructed 3D scene. We then generate many more multi-view images through projection, warping, and calls to Gemini, which are finally used to reconstruct a 4D Gaussian Splat.
Our approach allows generating 4D worlds (3D that moves with time) from 3D scenes.
How we built it🔨
Our project is divided into multiple components:
- The user captures some 2D images
We first estimate camera parameters for each image using COLMAP's Structure-from-Motion (SfM) pipeline, then train a 3D Gaussian Splat on the set of 2D images and the estimated parameters. COLMAP performs the following steps:
- Feature extraction using SIFT descriptors
- Feature matching and geometric verification
- Incremental reconstruction to recover camera poses and intrinsics
- Bundle adjustment to refine the estimated parameters
This process generates camera extrinsic matrices (position and orientation) and intrinsic parameters (focal length, principal point) for each viewpoint.
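As a rough illustration of how those parameters get used downstream (all numbers below are made up, not from our scenes), the extrinsics (rotation R, translation t) and intrinsics (focal lengths fx, fy and principal point cx, cy) map any 3D world point into a pixel:

```python
def project_point(X_world, R, t, fx, fy, cx, cy):
    """Project a 3D world point to 2D pixel coordinates."""
    # World-to-camera transform: X_cam = R @ X_world + t
    X_cam = [sum(R[i][j] * X_world[j] for j in range(3)) + t[i]
             for i in range(3)]
    x, y, z = X_cam
    # Perspective divide, then apply the intrinsic parameters
    return (fx * x / z + cx, fy * y / z + cy)

# Camera 5 units in front of the world origin, identity rotation
R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
t = [0.0, 0.0, 5.0]
u, v = project_point([0.0, 0.0, 0.0], R, t,
                     fx=500.0, fy=500.0, cx=320.0, cy=240.0)
# The world origin projects to the principal point (320, 240)
```

The same projection is what we rely on later when pushing 3D control curves into each 2D view.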
With calibrated camera parameters, we employ 3D Gaussian Splatting to create a high-quality initial 3D representation of the scene. Unlike traditional mesh or voxel-based approaches, Gaussian Splatting represents the scene as a collection of 3D Gaussian primitives that:
- Adaptively scale to represent scene details at appropriate resolutions
- Provide high rendering quality with significantly lower memory requirements
- Enable efficient differentiable rendering for optimization
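A minimal sketch of the per-primitive state (the real 3DGS implementation also stores a rotation quaternion and spherical-harmonic color coefficients; the names here are our own simplification):

```python
from dataclasses import dataclass

@dataclass
class Gaussian3D:
    mean: tuple      # 3D center position
    scale: tuple     # per-axis standard deviation
    opacity: float   # blending weight in [0, 1]
    color: tuple     # RGB (the full method uses spherical harmonics)

def covariance(g):
    """Covariance of an axis-aligned Gaussian: Sigma = S @ S^T.
    (Rotation is omitted in this sketch.)"""
    return [[g.scale[i] ** 2 if i == j else 0.0 for j in range(3)]
            for i in range(3)]

g = Gaussian3D(mean=(0.0, 0.0, 0.0), scale=(0.1, 0.2, 0.3),
               opacity=0.9, color=(1.0, 0.5, 0.2))
cov = covariance(g)
```

During training, all of these parameters are optimized through the differentiable rasterizer.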
- The user specifies some control
To enable object-level control, we implement automatic scene segmentation using the Segment Anything Model (SAM). The process works as follows:
- For each input image, we generate SAM embeddings
- We apply automated segmentation to identify distinct objects
- Objects are tracked across multiple views using feature matching and geometric consistency
- The segmented objects are projected onto the 3D Gaussian representation
This multi-view consistent segmentation lets users select specific objects for manipulation while preserving the remaining scene elements. The user then draws a 3D curve on the scene. This curve is the main control input: it moves a particular part of the 3D scene in a chosen direction.
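The projection of segmentation onto the 3D representation can be sketched as a simple voting rule (a hypothetical simplification of our pipeline): a Gaussian is assigned to the selected object if its projected center lands inside the object's 2D SAM mask in enough views.

```python
def gaussian_in_object(uv_per_view, masks, min_views=2):
    """uv_per_view: projected (u, v) of one Gaussian center per view.
    masks: per-view boolean 2D masks (mask[row][col]) for the object."""
    hits = 0
    for (u, v), mask in zip(uv_per_view, masks):
        col, row = int(round(u)), int(round(v))
        if 0 <= row < len(mask) and 0 <= col < len(mask[0]) and mask[row][col]:
            hits += 1
    return hits >= min_views

# Tiny toy example: 3x3 masks where the object occupies the center pixel
mask = [[False, False, False],
        [False, True,  False],
        [False, False, False]]
inside = gaussian_in_object([(1.0, 1.0), (1.2, 0.9)], [mask, mask])
outside = gaussian_in_object([(0.0, 0.0), (1.0, 1.0)], [mask, mask])
```

Requiring agreement across multiple views is what keeps the assignment robust to a single bad mask.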
- Preparing data for building the 4D scene
To create the necessary training data for the 4D reconstruction, we project the 3D control curves into each input viewpoint:
- For each viewpoint and timestep, the 3D curve position is projected into image space
- The projection uses the camera parameters estimated during the SfM stage
- The resulting 2D displacement vectors indicate how pixels should move in each image
However, these images are merely warped along those displacement vectors; on their own they contain artifacts and are not suitable for 4D reconstruction.
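The displacement computation above can be sketched as follows (with a toy stand-in projection; the real pipeline uses the calibrated SfM cameras): project each curve sample at timestep 0 and at timestep t, then take the difference.

```python
def project(point3d):
    # Stand-in scaled-orthographic projection, for illustration only
    x, y, z = point3d
    return (x * 100.0, y * 100.0)

def displacement_field(curve_t0, curve_t):
    """2D displacement of each curve sample between two timesteps."""
    return [(u1 - u0, v1 - v0)
            for (u0, v0), (u1, v1)
            in zip(map(project, curve_t0), map(project, curve_t))]

# A single control point moving 0.1 world units along +X
d = displacement_field([(0.0, 0.0, 1.0)], [(0.1, 0.0, 1.0)])
# d[0] is (10.0, 0.0): the sample moves 10 pixels along +u in this view
```

These per-sample displacements are then spread to nearby pixels to warp each image.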
Next up, we find that vision-language models are excellent zero-shot image correctors. To address artifacts and inconsistencies introduced by the warping process, we employ the Gemini vision-language model:
- Each warped image is provided to Gemini along with the original image
- A prompt instructs Gemini to correct physical inconsistencies while preserving the intended motion
- This prompt needs to be highly detailed for Gemini to retain consistency, so we also prompt Gemini to write that detailed prompt for us
- Gemini generates refined images with natural lighting, shadows, and object interactions
We repeat this process for every timestep.
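The shape of this two-stage prompting can be sketched as below (prompt text abridged and file names illustrative; the actual call goes through the Gemini API with both images attached):

```python
# Stage 1: ask Gemini to author the detailed correction prompt itself
META_PROMPT = (
    "Write a highly detailed prompt for an image model that is given an "
    "original photo and a warped copy, and must fix warping artifacts "
    "while preserving the intended object motion, lighting, and shadows."
)

def build_refinement_request(original_image, warped_image, detailed_prompt):
    """Stage 2: one refinement request per (view, timestep) pair."""
    return {
        "images": [original_image, warped_image],
        "prompt": detailed_prompt,
    }

req = build_refinement_request(
    "view03_t0.png", "view03_t4_warped.png",
    detailed_prompt="<text generated by Gemini from META_PROMPT>")
```

Letting the model write its own instructions proved more reliable than hand-tuning a single prompt across all scenes.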
- Construct a 4D scene
With multi-view images now available for each timestep, we extend the 3D Gaussian Splatting approach to incorporate the temporal dimension:
- Gaussians are initialized from the static 3D reconstruction
- Each Gaussian receives additional parameters for its temporal trajectory
- Temporal trajectories are represented as continuous functions (B-splines)
- Optimization occurs across both spatial and temporal dimensions simultaneously
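A per-Gaussian trajectory can be evaluated with a minimal uniform cubic B-spline over its temporal control positions (our own simplified evaluator, not the full training code):

```python
def bspline_position(ctrl, t):
    """Evaluate a uniform cubic B-spline trajectory at t in [0, 1].
    ctrl: at least four 3D control positions."""
    n = len(ctrl) - 3              # number of spline segments
    s = min(t * n, n - 1e-9)       # map t into segment space
    i = int(s)
    u = s - i
    # Uniform cubic B-spline basis weights (they sum to 1)
    b = ((1 - u) ** 3 / 6,
         (3 * u**3 - 6 * u**2 + 4) / 6,
         (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6,
         u ** 3 / 6)
    return tuple(sum(b[k] * ctrl[i + k][d] for k in range(4))
                 for d in range(3))

# Sanity check: identical control points give a stationary Gaussian
pos = bspline_position([(1.0, 2.0, 3.0)] * 4, 0.5)
```

Because the spline is smooth and differentiable in its control points, trajectory optimization fits naturally into the same gradient-based training loop as the spatial parameters.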
Challenges we ran into⚠️
- Object Segmentation: Achieving high precision in segmenting and tracking objects across views and time was a complex task.
- Trajectory Propagation: Converting the user's trajectory into camera coordinates for each of the hundreds of images required careful planning and a deep understanding of computer vision techniques.
- User Interface Complexity: Designing an intuitive UI that could handle both 3D and 4D outputs while ensuring real-time responsiveness was a significant hurdle.
Accomplishments that we're proud of🥇
- Innovative 4D Reconstruction: Successfully leveraged the knowledge inside existing 2D vision-language models to generate dynamic 4D scenes, a very challenging problem.
- Seamless Object Manipulation: Enabled precise, user-guided object movement with consistent multi-view editing via the Gemini API.
- Robust Integration: Combined state-of-the-art reconstruction techniques with an interactive UI, delivering a tool that is both powerful and user-friendly.
What we learned🧠
- Complexity of Temporal Consistency: Maintaining consistency across multiple views and time steps required innovative strategies in both model training and image editing.
- Importance of User-Centric Design: Balancing cutting-edge AI capabilities with a user-friendly interface was critical in delivering an impactful product.
What's next for our project💭
- Scalability & Real-Time Performance: Optimize the system to handle larger datasets and more complex scenes for real-world applications such as city planning and environmental monitoring.
- Multi-Object Dynamics: Extend the framework to simultaneously control and animate multiple objects with interdependent trajectories and physical interactions.
- Physics-Based Simulation Integration: Incorporate physical simulation constraints to ensure that object movements adhere to realistic physics, including gravity, momentum, and collision responses.
- Temporal Super-Resolution: Develop methods to synthesize intermediate frames at arbitrary temporal resolution from more sparsely defined keyframes.
- Dynamic Lighting and Shadows: Support changing illumination conditions throughout the temporal sequence, allowing for time-of-day changes and dynamic lighting effects.
Awards & Recognitions
We want our project to be considered for the following awards:
- Best Generative AI Technology Hack
- Best Use of Gemini
- We use the Gemini API for a core aspect of our project.
- United Nations & One Degree Cooler Best Climate Change & Sustainability AI Hack
- Our project enables high-fidelity virtual simulations of environmental changes over time, reducing the need for resource-intensive physical prototyping while providing powerful visualization tools for climate modeling, urban planning, and ecosystem management.
- Best Education AI Hack
- A big application of our project is towards fully-immersive education. While things like 3D are becoming very prominent in education, our approach can go a step further and generate very large 4D worlds which can be used for education.