Official implementation of OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams.
Yibin Yan*, Jilan Xu*, Shangzhe Di, Haoning Wu, Weidi Xie
(*: equal contribution)
- Release pre-training code.
- Release our VLM&VLA code.
```bash
git clone https://github.com/Go2Heart/OmniStream.git
cd OmniStream
conda create -n omnistream python=3.10 -y
conda activate omnistream
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install transformers==4.56.1
```

We have uploaded our pre-trained model to 🤗 Hugging Face.
```python
import torch
import numpy as np

from model import OmnistreamMultiFrameTransformer
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("StreamFormer/OmniStream")
model = OmnistreamMultiFrameTransformer.from_pretrained("StreamFormer/OmniStream").to("cuda")
model.eval()

fake_pixel = np.random.randn(16, 512, 512, 3)  # BxT, H, W, C
fake_input = processor(images=fake_pixel, return_tensors="pt").to("cuda")  # BxT, C, H, W
fake_input["pixel_values"] = fake_input["pixel_values"].unsqueeze(0).float()  # B, T, C, H, W

with torch.no_grad():
    output = model(**fake_input, return_dict=True)

print(output.keys())
print(output["last_hidden_state"].shape)  # last layer's hidden states
print(output["hidden_states"][-1].shape)  # last layer's hidden states
print(output["pooler_output"].shape)      # [cls] token
print(output["patch_start_idx"])          # index of the first patch of each frame (1x[cls] + 4x[reg])
```

If you find our work useful, please cite:
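Since each frame's token sequence begins with special tokens (1x[cls] + 4x[reg]) before the patch tokens, `patch_start_idx` lets you recover per-frame patch features from the flattened hidden states. Below is a minimal, model-free sketch of that slicing; the sizes (`P=256` patches per frame, `D=768` hidden dim) and the per-frame token layout are illustrative assumptions, not the official API:

```python
import numpy as np

# Hypothetical sketch: split flattened hidden states into per-frame patch features.
# Assumed layout: each frame contributes 5 special tokens (1 [cls] + 4 [reg])
# followed by P patch tokens; all sizes below are stand-ins for illustration.
B, T, P, D = 1, 16, 256, 768
tokens_per_frame = 5 + P                              # specials + patches per frame
hidden = np.random.randn(B, T * tokens_per_frame, D)  # stand-in for output["last_hidden_state"]
patch_start_idx = 5                                   # stand-in for output["patch_start_idx"]

frames = hidden.reshape(B, T, tokens_per_frame, D)    # regroup tokens by frame
patch_feats = frames[:, :, patch_start_idx:, :]       # drop specials -> per-frame patches
print(patch_feats.shape)  # (1, 16, 256, 768)
```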
```bibtex
@article{yan2026omnistream,
  title={OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams},
  author={Yibin Yan and Jilan Xu and Shangzhe Di and Haoning Wu and Weidi Xie},
  journal={arXiv preprint arXiv:2603.12265},
  year={2026},
  url={https://arxiv.org/abs/2603.12265}
}
```
