Vision Pro Stereo Video Capture

By Griffin Hurt <griffhurt@pitt.edu> with resources from the Surreality Lab

Overview

This project provides an easy way to capture rectified stereo videos with intrinsic and extrinsic camera data using the Apple Vision Pro. While the Vision Pro supports stereo video capture via spatial video, camera parameters are not saved, which makes projection into a point cloud difficult. This application exports camera parameters as a .json file to make reconstruction more straightforward (assuming a high-fidelity stereo matching pipeline is available).

Usage

1. Make sure you have the newest Xcode 26 beta installed on your Mac and the visionOS 26 beta installed on your Vision Pro.
2. Replace the Enterprise.license dummy file with your own license.
3. Build the application and run it on the Apple Vision Pro.
4. Start the camera preview using the button at the bottom of the screen. This opens the immersive space, which is required for main camera access.
5. Start capturing video by pressing the red button.
6. End your capture by pressing the same button (now showing a stop icon).
7. Captures are saved to the application's folder in the "Files" application. Video files are saved in side-by-side format in the "Video Captures" folder, and data files are saved as JSON in the "Video Data" folder.
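
Because the video is stored side by side, each decoded frame has to be split into two per-eye images before stereo matching. The sketch below shows one way to do that on a frame represented as a sequence of pixel rows. It assumes the left camera occupies the left half of the frame and that both halves have equal width; neither is confirmed here, so check a capture before relying on it. The function name split_side_by_side is hypothetical.

```python
def split_side_by_side(frame):
    """Split one side-by-side frame into (left, right) halves.

    `frame` is a sequence of pixel rows, each row a sliceable sequence.
    Assumption: the left half holds the left camera image.
    """
    width = len(frame[0])
    if width % 2 != 0:
        raise ValueError("side-by-side frame width must be even")
    half = width // 2
    left = [row[:half] for row in frame]
    right = [row[half:] for row in frame]
    return left, right
```

If frames are decoded into NumPy arrays instead, the same split is a pair of column slices at half the frame width.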

Example Camera Data File

Below is an example of the format for video data files:

{
    "right_extrinsics": {
        "m00": 1.0000001,
        "m01": -4.034555e-10,
        ...
        "m33": 1
    },
    "left_intrinsics": {
        "m00": 496.95917,
        ...
        "m22": 1
    },
    "right_intrinsics": {
        ...
    },
    "left_extrinsics": {
        ...
    }
}
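
For downstream processing it is often easier to work with matrices than with flat keys. The following is a minimal sketch of converting a data file into row-major nested lists. It assumes that keys follow the pattern m{row}{col}, that intrinsics are 3x3 (m00 to m22) and extrinsics are 4x4 (m00 to m33), and that every entry is present in the file. The helper names dict_to_matrix and parse_camera_data are hypothetical.

```python
import json


def dict_to_matrix(entries, size):
    """Build a size x size row-major matrix from keys of the form m{row}{col}."""
    return [[float(entries[f"m{r}{c}"]) for c in range(size)] for r in range(size)]


def parse_camera_data(text):
    """Parse the JSON text of one video data file into four matrices."""
    data = json.loads(text)
    return {
        "left_intrinsics": dict_to_matrix(data["left_intrinsics"], 3),
        "right_intrinsics": dict_to_matrix(data["right_intrinsics"], 3),
        "left_extrinsics": dict_to_matrix(data["left_extrinsics"], 4),
        "right_extrinsics": dict_to_matrix(data["right_extrinsics"], 4),
    }
```

With this layout, the element written as m03 in the file is matrix[0][3], the x component of the translation column.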

The Vision Pro cameras are displaced along the $x$ axis, so the baseline can be computed as abs(data["left_extrinsics"]["m03"] - data["right_extrinsics"]["m03"]). For a more robust estimate, compute the Euclidean distance between the cameras' translation vectors (the fourth column of each extrinsics matrix).
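
Both baseline estimates can be sketched in a few lines of Python. This assumes `data` is the dictionary obtained by loading a video data file with json.load, and that the translation components are stored under the keys m03, m13, and m23 (only m03 is named above; the other two follow from the m{row}{col} pattern and are an assumption). The function names are hypothetical.

```python
import math


def translation(extrinsics):
    """Return the translation vector (fourth column) of a 4x4 extrinsics dict."""
    return (extrinsics["m03"], extrinsics["m13"], extrinsics["m23"])


def baseline_x(data):
    """Baseline as the absolute difference of the x translations."""
    return abs(data["left_extrinsics"]["m03"] - data["right_extrinsics"]["m03"])


def baseline_euclidean(data):
    """Baseline as the Euclidean distance between the two translation vectors."""
    return math.dist(
        translation(data["left_extrinsics"]),
        translation(data["right_extrinsics"]),
    )
```

The two values should agree closely when the cameras differ only in $x$; a noticeable gap between them suggests an offset in $y$ or $z$ that the simpler formula ignores.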

Advice on Stereo Matching

We have created our most accurate point clouds from Vision Pro captures using NVIDIA's FoundationStereo model. With an inference batch size of 4, frames take ~1 second to process on an RTX A5000. Once the script that generates point clouds from these captures is cleaned up, I can add it to the repository.
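
Until that script is available, the sketch below illustrates the standard pinhole relation for turning a disparity value into a 3D point, $Z = f_x B / d$. It is not the author's pipeline. It assumes the stereo matcher outputs disparity in pixels for the left image, that the intrinsics follow the usual layout (m00 = $f_x$, m11 = $f_y$, m02 = $c_x$, m12 = $c_y$), and that the intrinsics match the resolution of the image the disparity was computed on. The resulting point is in the same units as the baseline. The function name backproject is hypothetical.

```python
def backproject(u, v, disparity_px, intrinsics, baseline):
    """Back-project pixel (u, v) with a given disparity into camera coordinates.

    Returns (X, Y, Z) in the units of `baseline`, or None when the
    disparity is not positive (no valid depth).
    Assumption: intrinsics use the standard pinhole layout described above.
    """
    fx, fy = intrinsics["m00"], intrinsics["m11"]
    cx, cy = intrinsics["m02"], intrinsics["m12"]
    if disparity_px <= 0:
        return None
    z = fx * baseline / disparity_px
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)
```

Applying this to every pixel with a valid disparity yields a point cloud in the left camera's coordinate frame.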
