A real-time multimodal streaming system powered by our AURA model, supporting continuous video understanding with speech interaction.
- Real-Time Streaming: Continuously processes live video at 2 FPS with sub-second response latency
- Full Pipeline: Integrated ASR, Vision-Language Model, and Streaming TTS, all running locally
- Context Management: Sliding-window history with automatic pruning and prefix KV cache reuse for bounded latency
- One-Click Launch: Single script (`start_all.sh`) to start all services with automatic GPU allocation
| Demo | Video |
|---|---|
| The pens on the desk | The pens on the desk.mp4 |
| Watch the kettle for me | Watch the kettle for me.mp4 |
| Where did I put Stitch? | 史迪仔放在哪?.mp4 |
| The greedy kitten sneaking freeze-dried treats | 小馋猫偷吃冻干.mp4 |
| No slacking off at work | 工作时不准摸鱼.mp4 |
| Help me find the mouse | 帮我找找鼠标.mp4 |
| Watch the kettle for me | 帮我盯着烧水壶.mp4 |
| Did I turn off the light just now? | 我刚才关灯了吗.mp4 |
| The pens on the desk | 桌面上的笔.mp4 |
| Only cross the street on a green light | 绿灯才能过马路哦.mp4 |
Click a link above to download and watch the demo video. All demos are located in the `demos/` folder. For more demo videos, please visit our Home Page.
| Category | Requirement |
|---|---|
| Python | 3.12 |
| PyTorch | 2.10+ with CUDA 12.8 |
| vLLM | >= 0.17.1 (V1 engine with Automatic Prefix Caching) |
| GPU | 2+ (minimum: 1 for ASR+TTS, 1 for AURA-8B inference) |
| System | ffmpeg, numactl |
| OS | Linux (tested on Ubuntu 22.04) |
| Browser | Google Chrome (desktop or mobile) |
We use uv for fast, reproducible environment management.
If you do not have uv installed yet:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Clone the repository:

```bash
git clone https://github.com/aurateam2026/AURA.git && cd AURA
```
```bash
# Install system dependencies
sudo apt install -y ffmpeg numactl

# Create a Python 3.12 virtual environment and install all packages
uv venv --python 3.12
source .venv/bin/activate
uv pip install -r requirements.txt

# Install flash-attn (requires a platform-specific .whl matching your CUDA/PyTorch/arch)
# Download the correct wheel from https://github.com/Dao-AILab/flash-attention/releases
# Example for CUDA 12 + PyTorch 2.10 + x86_64:
uv pip install flash_attn-2.8.3+cu12torch2.10cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
```

Note: `flash-attn` is not included in `requirements.txt` because it requires a platform-specific `.whl` file. You must download the correct wheel that matches your CUDA version, PyTorch version, and CPU architecture, then install it manually.
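To pick a matching wheel, it helps to first check the exact PyTorch, CUDA, and C++ ABI settings of your environment. The snippet below is a minimal sketch for that; it only prints version information and does not install anything:

```python
# Sketch: print the values needed to choose a matching flash-attn wheel.
# The wheel filename encodes the CUDA version, PyTorch version, C++11 ABI flag, and Python tag.
import sys
import torch

print(f"Python tag:   cp{sys.version_info.major}{sys.version_info.minor}")
print(f"PyTorch:      {torch.__version__}")
print(f"CUDA (torch): {torch.version.cuda}")
print(f"CXX11 ABI:    {torch.compiled_with_cxx11_abi()}")  # maps to cxx11abiTRUE/FALSE in the wheel name
```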
Note: `qwen-asr` is installed from a patched fork instead of PyPI. The upstream `qwen-asr==0.0.6` targets vLLM 0.14.0, while AURA requires vLLM 0.17.1+. Our fork applies a minimal patch (1 file, ~10 lines) to fix three deprecated API calls. Once the upstream package releases a vLLM 0.17+ compatible version, we will switch back to PyPI.
Note: The `Qwen3-TTS-streaming/` subdirectory is a local library loaded at runtime via `sys.path`. It does not need a separate install.
Verify the installation:

```bash
source .venv/bin/activate
python -c "
import torch, vllm, flask, qwen_omni_utils
print(f'PyTorch: {torch.__version__}')
print(f'CUDA: {torch.version.cuda}')
print(f'vLLM: {vllm.__version__}')
print(f'GPUs: {torch.cuda.device_count()}')
"
```

Expected output:

```
PyTorch: 2.10.0
CUDA: 12.8
vLLM: 0.17.1
GPUs: 2 (or more)
```
Download the following models from Hugging Face:
| Model | Purpose | Size |
|---|---|---|
| AURA-8B | Main vision-language model | ~16 GB |
| Qwen3-ASR-1.7B | Automatic Speech Recognition | ~3 GB |
| Qwen3-TTS-12Hz-1.7B-Base | Text-to-Speech synthesis | ~4 GB |
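As one option, the models can be fetched with `huggingface_hub`. The snippet below is a sketch only; the repository IDs are assumptions based on the table above (the ASR ID matches the one used by `Qwen3_asr_serve.py` later in this README), so substitute the actual repo names from the model cards if they differ:

```python
# Sketch: download the required checkpoints with huggingface_hub.
# NOTE: the repo IDs below are illustrative; verify them against the model cards.
from huggingface_hub import snapshot_download

for repo_id in [
    "aurateam2026/AURA-8B",           # placeholder ID -- main vision-language model
    "Qwen/Qwen3-ASR-1.7B",            # ASR model (ID as passed to Qwen3_asr_serve.py)
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",  # TTS model (assumed org prefix)
]:
    local_dir = snapshot_download(repo_id=repo_id)
    print(f"{repo_id} -> {local_dir}")
```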
```bash
# Default: GPU 0 for ASR+TTS, GPU 1 for AURA inference
bash start_all.sh
```

The script automatically:
- Cleans up any leftover processes on ports 8001, 8002, 12345
- Starts ASR, TTS, and vLLM inference server in order
- Waits for each service to be healthy before proceeding
- Logs to `logs/asr.log`, `logs/tts.log`, `logs/vllm.log`

`Ctrl+C` cleanly shuts down all services.

Custom GPU allocation:

```bash
GPU_ASR=0 GPU_TTS=0 GPU_INFERENCE=1 bash start_all.sh

# Multi-GPU inference (tensor parallel)
GPU_ASR=0 GPU_TTS=0 GPU_INFERENCE=2,3 bash start_all.sh
```

The web frontend connects to the backend inference server via a TCP socket. By default, the backend hostname is configured in `realtime_capture_video_audio_streaming.py`:
```python
SERVER_HOST = 'localhost'  # Change this to match your setup
SERVER_PORT = 12345
```

- If the frontend and backend run on the same machine, set `SERVER_HOST` to `'localhost'`.
- If they run on different machines, set `SERVER_HOST` to the hostname or IP address of the machine running the backend services.
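Before starting the frontend, you can optionally confirm that the backend inference server is reachable from the frontend machine. This is a minimal sketch, assuming the default `SERVER_HOST`/`SERVER_PORT` values above:

```python
# Sketch: check that the backend TCP socket (main inference server) accepts connections.
# Adjust SERVER_HOST/SERVER_PORT to match realtime_capture_video_audio_streaming.py.
import socket

SERVER_HOST = "localhost"
SERVER_PORT = 12345

try:
    with socket.create_connection((SERVER_HOST, SERVER_PORT), timeout=3):
        print(f"Backend reachable at {SERVER_HOST}:{SERVER_PORT}")
except OSError as e:
    print(f"Cannot reach {SERVER_HOST}:{SERVER_PORT}: {e}")
```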
Then start the frontend in a separate terminal:
```bash
source .venv/bin/activate
python realtime_capture_video_audio_streaming.py
```

| Mode | Command |
|---|---|
| HTTP (default) | `python realtime_capture_video_audio_streaming.py` |
| HTTPS | `python realtime_capture_video_audio_streaming.py --https` |
| Cloudflare Tunnel | `python realtime_capture_video_audio_streaming.py --tunnel` |
Required: You must use Google Chrome to access the demo. Chrome is the only browser that fully supports the camera, microphone, MediaRecorder, and Web Audio APIs used by AURA. Safari and Firefox are not supported and may fail silently.
Local access (desktop):
Open http://localhost:5003 in Chrome.
Remote access from a phone:
Open the demo in Chrome for Android or Chrome for iOS. The phone must be able to reach the frontend server. There are several ways:
- Same LAN: If the phone and the server are on the same network, open `http://<server-ip>:5003` in Chrome on your phone. Note that Chrome requires HTTPS to access the camera and microphone from a non-localhost address.

- HTTPS mode (recommended for LAN access from phone):

  ```bash
  # Generate a self-signed certificate (one-time setup)
  openssl req -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem -days 365 -nodes

  # Start the frontend with HTTPS
  python realtime_capture_video_audio_streaming.py --https
  ```

  Then open `https://<server-ip>:5003` in Chrome on your phone. You will need to accept the self-signed certificate warning.

- Cloudflare Tunnel (recommended for public/cross-network access):

  ```bash
  python realtime_capture_video_audio_streaming.py --tunnel
  ```

  This creates a public HTTPS URL that you can open in Chrome on any device without network restrictions.
The interface has three buttons at the bottom of the screen:
| Button | Icon | Action |
|---|---|---|
| Start | Camera | Tap to start/stop the video stream. The camera feed is sent to the backend for real-time understanding. |
| Record | Microphone | Press and hold to record your voice. Release to stop recording and send the audio to the server for ASR. Do not tap -- you must hold the button down while speaking. |
| Flip | Camera Rotate | Tap to switch between the front and rear cameras (useful on phones). |
Typical workflow:
- Open the URL in Google Chrome (desktop or mobile).
- Tap Start to activate the camera. Grant camera permission when prompted by Chrome.
- Point the camera at something you want AURA to understand.
- Press and hold the Record button while asking your question out loud. Release when done speaking. Grant microphone permission when prompted.
- Watch the streaming text response appear on screen in real-time.
- The TTS audio response will play automatically through your speaker.
- Tap Flip to switch cameras if needed (front/rear).
- Tap Start again to stop the video stream.
Tip: On mobile devices, make sure to grant both camera and microphone permissions when Chrome prompts you. If permissions were previously denied, reset them in Chrome Settings > Site Settings.
If you prefer to start services individually:
Step 1: ASR Service (Port 8001)
```bash
CUDA_VISIBLE_DEVICES=0 python Qwen3_asr_serve.py \
    --host 0.0.0.0 --port 8001 \
    --model Qwen/Qwen3-ASR-1.7B \
    --forced-aligner Qwen/Qwen3-ForcedAligner-0.6B \
    --gpu-memory-utilization 0.3
```

Step 2: TTS Service (Port 8002)

```bash
CUDA_VISIBLE_DEVICES=0 bash tts_service.sh
```

Verify: `curl http://localhost:8002/v1/tts/health`

Step 3: Main Inference Server (Port 12345)

```bash
CUDA_VISIBLE_DEVICES=1 bash Qwen3_VL_online_streaming_v2_CM.sh
```

Wait for: `Server listening on port 12345`

Step 4: Web Frontend (Port 5003)

```bash
python realtime_capture_video_audio_streaming.py
```

Open: http://localhost:5003
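When starting services by hand, it can help to wait for each port to come up before moving to the next step, similar to what `start_all.sh` does. The snippet below is a rough sketch of such a wait loop, using the ports from the steps above; it only checks that the TCP ports accept connections, not that the models have finished loading:

```python
# Sketch: wait for the manually started services to accept TCP connections.
# Ports follow the steps above: 8001 (ASR), 8002 (TTS), 12345 (inference).
import socket
import time

SERVICES = {"ASR": 8001, "TTS": 8002, "Inference": 12345}

def wait_for_port(name: str, port: int, timeout_s: int = 300) -> None:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with socket.create_connection(("localhost", port), timeout=2):
                print(f"{name} is listening on port {port}")
                return
        except OSError:
            time.sleep(2)  # port not open yet; retry
    raise TimeoutError(f"{name} did not open port {port} within {timeout_s}s")

for name, port in SERVICES.items():
    wait_for_port(name, port)
```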
| GPU | Service | VRAM |
|---|---|---|
| GPU 0 | ASR (Qwen3-ASR-1.7B) + TTS (Qwen3-TTS-1.7B) | ~7 GB |
| GPU 1 | AURA-8B inference (vLLM, TP=1) | ~16 GB |
Main inference parameters in `Qwen3_VL_online_streaming_v2_CM.sh`:

| Parameter | Default | Description |
|---|---|---|
| `--max-model-len` | 262144 | Maximum context length (256K tokens) |
| `--temperature` | 0.5 | Sampling temperature |
| `--max-tokens` | 128 | Max tokens per response |
| `--cross-turn-penalty` | 1 | Cross-turn repetition penalty strength |
| `--cross-turn-lookback` | 10 | Number of recent turns to penalize |
| `--enable-pruning` | — | Enable sliding-window context pruning |
| `--max-rounds` | 45 | Trigger pruning when rounds exceed this |
| `--num-rounds-keep` | 30 | Rounds to keep after pruning |
| `--kv-offloading-size` | 10 | KV cache CPU offload size (GB) |
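To make the pruning parameters concrete, here is a minimal sketch of the sliding-window policy they describe; this is an illustration only, not the actual `context_manage.py` implementation. Once the number of conversation rounds exceeds `--max-rounds`, only the most recent `--num-rounds-keep` rounds are kept, which bounds context growth while the prefix KV cache is reused for the surviving rounds.

```python
# Sketch: sliding-window context pruning as implied by --max-rounds / --num-rounds-keep.
# Illustration only; the real logic lives in context_manage.py.
from typing import List

MAX_ROUNDS = 45       # --max-rounds: prune once history exceeds this
NUM_ROUNDS_KEEP = 30  # --num-rounds-keep: rounds retained after pruning

def prune_history(rounds: List[dict]) -> List[dict]:
    """Keep only the most recent rounds once the window grows too large."""
    if len(rounds) <= MAX_ROUNDS:
        return rounds                 # under the limit: keep everything
    return rounds[-NUM_ROUNDS_KEEP:]  # over the limit: keep the newest N rounds

# Example: 46 rounds triggers pruning down to the latest 30 (rounds 16..45).
history = [{"round": i} for i in range(46)]
history = prune_history(history)
print(len(history), history[0]["round"])  # -> 30 16
```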
```
├── start_all.sh                                    # One-click launch script
├── Qwen3_VL_online_streaming_v2_CM.sh              # Main inference launch script
├── Qwen3_VL_online_streaming_v2_ContextManaged.py  # Core: vLLM engine + context management + TCP server
├── Qwen3_asr_serve.py                              # ASR service (FastAPI + Qwen3-ASR)
├── tts_service.py / tts_service.sh                 # TTS service (streaming synthesis)
├── context_manage.py                               # Context management utilities
├── realtime_capture_video_audio_streaming.py       # Web frontend middleware (Flask)
├── templates/index_streaming.html                  # Browser UI (main interface)
├── templates/video-call.html                       # Browser UI (video call style)
├── requirements.txt                                # Python dependencies
├── shuhan.mp3                                      # TTS reference audio for voice cloning
└── Qwen3-TTS-streaming/                            # TTS model inference library
```
| Issue | Solution |
|---|---|
| `sched_setaffinity: Invalid argument` | Remove `numactl` from the launch script |
| ASR returns empty text | Ensure the ASR service is running on port 8001 before starting the main server |
| TTS voice clone fails | Verify the reference audio file exists in the working directory |
| OOM on main GPU | Reduce `--gpu-memory-utilization` or `--max-model-len` |
| vLLM version error | Requires vLLM >= 0.17.1 with V1 engine support |
| Phone cannot access camera/mic | Use HTTPS mode or Cloudflare Tunnel (browsers require HTTPS for media on non-localhost) |
| `SERVER_HOST` connection refused | Verify `SERVER_HOST` in `realtime_capture_video_audio_streaming.py` matches your backend host |
This project is released under the Apache-2.0 License.
```bibtex
@article{aura2026,
  title={AURA: Always-On Understanding and Real-Time Assistance via Video Streams},
  author={Lu, Xudong and Bo, Yang and Chen, Jinpeng and Li, Shuhan and Guo, Xintong and Guan, Huankang and Liu, Fang and Xu, Dunyuan and Sun, Peiwen and Sun, Heyang and Liu, Rui and Li, Hongsheng},
  journal={arXiv preprint arXiv:2604.04184},
  year={2026}
}
```