Inspiration
Snap! is like your iPhone's Live Photos (where the picture shows a bit of movement from just before and after the shot) but 100x more memorable. If you’ve ever taken a photo of a fun event like a concert, kickback, or house party but felt like it couldn’t quite capture the “vibe”, this is probably for you. By turning 2D photos into 3D experiences with pretty low latency (~10 seconds, fully automated), it feels like you’re in the room again. Good times == good memories, so why not start preserving the vibes in between :)
What it does
Snap! applies depth mapping with computer vision to your photos to artificially create an immersive 3D scene. That photo is then rendered with ThreeJS in a 3D environment, so you can drag around through 180 degrees and see what’s going on as if you were reliving that moment. This prototype is completely functional.
How we built it
1) Deciding on the photo input
This prototype presented a ton of unexpected, interesting technical challenges around cameras, optics, depth estimation, and 3D software like ThreeJS. Simulating a 3D camera with computer vision (depth estimation), and feeding ThreeJS a completely different data type than it typically uses for displacement maps, turned out to be hacky and required several workarounds.
I first had to figure out how to use an iPhone to mimic a 3D camera (such as the Reto 3D, which takes pictures at various angles and then combines them into a “moving GIF”) as well as stereo cameras, which are known for mimicking how a human perceives the world (two cameras instead of two eyes). Traditional stereo methods take two or more images and estimate a 3D model of the scene by finding matching pixels across the images and converting their 2D positions into 3D depths. But this traditional approach requires special lenses and expensive equipment.
Getting the same 3D effect from an iPhone camera lens is challenging, because it produces a purely 2D image. The data format I decided on was a GIF-like .mov clip created using the “Bounce” effect for Live Photos, which captures roughly 1–2 seconds before and after the picture was taken. It’s also the most practical option for most people, since handling stereo and 3D cameras (and the development process that would follow) could get complex.
Afterwards, I used the CLI tool ffmpeg to extract every individual frame (around 10–20 frames) from the .mov file.
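The extraction step can be sketched with a couple of ffmpeg commands. The filenames here are hypothetical, and the first command synthesizes a short stand-in clip with ffmpeg’s built-in testsrc so the sketch is self-contained; in the real pipeline the input is the exported Live Photo .mov:

```shell
# Stand-in for the exported Live Photo "Bounce" clip (~1.5 s at 10 fps)
ffmpeg -y -f lavfi -i testsrc=duration=1.5:rate=10 -pix_fmt yuv420p bounce.mov

# Extract every decoded frame as a numbered PNG (-vsync 0 keeps them all)
mkdir -p frames
ffmpeg -y -i bounce.mov -vsync 0 frames/frame_%03d.png
```

A 1–2 second clip comes out to roughly the 10–20 frames mentioned above.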
2) Depth Estimation on the photos
Depth estimation is the technical meat of this hack, as it provides the immersive effect. It quantifies the spatial relationships among pixels, enabling an image to be reconstructed as a detailed 3D model by interpreting and mapping depth cues.
I ended up applying off-the-shelf model weights from the paper Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data, a foundation model for monocular depth estimation. I decided on this model because it achieves high benchmark results and was trained on 6M+ unlabelled and labeled images. PyTorch and OpenCV were used to apply the model’s weights to my images folder (pure inference, no fine-tuning). The output was a concatenated image: the raw image side-by-side with its depth map rendered as an image. The depth map itself is a 564x564 2D array with values ranging from 0–255 indicating how close each pixel is to the camera.
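To make the data format concrete, here’s a tiny illustrative helper (not project code) showing how that 0–255 depth convention maps onto displacement heights in scene units:

```javascript
// Illustrative helper: map 8-bit depth values (0-255, larger = closer
// to the camera) to displacement heights in scene units, which is what
// a displacement map ultimately encodes for the renderer.
function depthToHeights(depthValues, displacementScale = 1.0) {
  return depthValues.map(v => (v / 255) * displacementScale);
}

// A few sample pixels from one row of a depth map:
depthToHeights([0, 64, 128, 255], 2.0); // the 255 pixel displaces by the full 2.0 units
```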
3) Layering the Depth Map onto the 2D Image in ThreeJS
This was the tricky part where I had to come up with a workaround, since ThreeJS doesn’t typically accept depth maps as displacement maps, so it required a bit of learning about 3D in ThreeJS. In Scene.js, I loaded the 2D image (the first half of the concatenated input) as a CanvasTexture, and the depth map (the second half) as a CanvasTexture as well. Then I wired each corresponding depth map in as a displacement map using ThreeJS’s MeshStandardMaterial. As for the plane that the whole 3D environment lives on, I created a 10 x 10 plane with 512 x 512 segments, with the mesh built from that plane geometry and the material (MeshStandardMaterial).
OrbitControls from ThreeJS were also loaded to support camera movement (that way, you can explore the image with your mouse as if you were in the scene). For aesthetics, I added ambient lighting.
4) Animating the frames
By this time, I had successfully rendered a single 3D photo in ThreeJS. However, I wanted to see if a simple animation loop could apply each depth map to its corresponding photo and re-render quickly enough to look like a moving scene. I did this with a custom animation function that tracks elapsed time and decides when to render the next frame. Eventually, I landed on frame settings that mimicked a GIF, with the photos advancing quickly enough to simulate movement.
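The timing logic boils down to mapping elapsed time to a frame index. A minimal sketch (the function and field names are mine, not the project’s):

```javascript
// Map elapsed time to the frame that should currently be on screen,
// wrapping around so the clip loops like a GIF.
function frameIndexAt(elapsedMs, frameDurationMs, frameCount) {
  return Math.floor(elapsedMs / frameDurationMs) % frameCount;
}

// Inside the render loop, one would swap the textures whenever the
// index changes (sketch, assuming `frames[i]` holds per-frame textures):
//
//   const i = frameIndexAt(clock.getElapsedTime() * 1000, 100, frames.length);
//   material.map = frames[i].photoTexture;
//   material.displacementMap = frames[i].depthTexture;
//   material.needsUpdate = true;
```

At ~100 ms per frame, 10–20 frames replay in one to two seconds, which reads as motion rather than a slideshow.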
Challenges we ran into
Originally, I applied the depth map as a separate layer from the image canvas texture. This was tricky to troubleshoot because the rendered ThreeJS environment produced essentially zero output. Closely reading the ThreeJS docs helped, and eventually I found that THREE.MeshStandardMaterial accepts a displacement map directly, which is what finally got it displaying.
Figuring out whether the moving-GIF portion would need some sort of calibration was also a challenge. I originally imagined a more complex solution: individually aligning matching points along the same edges across multiple frames, to handle images with more movement. I was worried that the depth maps would be misaligned from frame to frame. However, aligning points turned out to be unnecessary, since the frames were captured at a fine enough granularity that there didn’t seem to be much drift between them.
Accomplishments that we're proud of
Rendering the GIF in ThreeJS! I originally thought it was only feasible to render a single image in the environment, but through a workaround using a custom animation function, I was able to render 10–20 frames fast enough to achieve the “movement” effect. This prototype is full of workarounds (like the depth-map-as-displacement-map trick mentioned earlier).
What we learned
Cameras are difficult to simulate in software! ThreeJS geometry, meshes, and materials were a bit abstract to grasp at first, as it was hard to figure out the equivalent formats for this purpose. Eventually, I decided to make the 2D image a texture on a material, and then layer the depth map on top as that material’s displacement map.
What's next for Snap!
NeRFs instead of depth estimation for turning 2D photos into accurate 3D scenes! It’d be cool to compare this classical CV depth-estimation technique with a more recent approach. However, NeRFs take a lot of compute, since they require inference at every (x, y, z) point. It’d be great to have an NVIDIA RTX 4080 GPU for that ;)