This is the repository for the SenSys 2026 paper: "MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding".
This repository provides the public release of MMEdge, a real-time on-device multimodal inference framework based on pipelined sensing and encoding. Unlike traditional multimodal systems that wait for complete sensor inputs before inference, MMEdge decomposes data collection and computation into fine-grained sensing and encoding units, enabling fully pipelined and parallel execution across modalities. To maintain accuracy under this fine-grained design, MMEdge introduces a lightweight temporal aggregation module that preserves temporal continuity across units. It further incorporates an adaptive multimodal configuration optimizer that dynamically selects optimal sensing and model configurations under latency constraints, and a cross-modal speculative skipping mechanism that bypasses redundant computations when early predictions reach high confidence. MMEdge achieves up to 75.8% reduction in end-to-end latency while maintaining comparable accuracy on multiple public datasets (LRW, NuScenes-QA) and a real-world UAV multimodal testbed, demonstrating its effectiveness for low-latency on-device multimodal perception and reasoning.
✅ Paper: https://arxiv.org/abs/2510.25327v5
✅ Demo Video: https://www.youtube.com/watch?v=n36M9ho2z9o
MMEdge introduces three core modules to accelerate on-device multimodal inference:

- **Pipelined Sensing and Encoding**: Decomposes sensing and inference into fine-grained units (e.g., frames or chunks), allowing parallel data collection and feature encoding to reduce idle time (see the sketch after this list).
- **Adaptive Multimodal Configuration**: Dynamically selects the optimal sensing and model configurations under latency constraints using a lightweight accuracy predictor and a pre-profiled latency table.
- **Cross-Modal Speculative Skipping**: Enables early inference termination by leveraging the faster modalities' features, skipping redundant processing of slower modalities when confidence is sufficient.
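Conceptually, the pipelined design lets each fine-grained unit be encoded while later units are still being sensed. The snippet below is a minimal, self-contained sketch of this idea; the unit size, `unit_encoder`, and `aggregate` are hypothetical placeholders, not the modules shipped in this repository.

```python
# Minimal sketch of pipelined sensing and encoding (illustrative only).
# `unit_encoder` and `aggregate` are hypothetical stand-ins for the per-unit
# encoders and the lightweight temporal aggregation described in the paper.
import queue
import threading
import time

import numpy as np

UNIT_LEN = 5    # frames per fine-grained unit (assumed)
NUM_UNITS = 6   # units per inference window (assumed)


def sense_units(out_q: queue.Queue) -> None:
    """Sensing thread: emit each fine-grained unit as soon as it is captured."""
    for _ in range(NUM_UNITS):
        time.sleep(0.05)                      # pretend per-unit capture time
        unit = np.random.rand(UNIT_LEN, 96)   # e.g., a short chunk of frames
        out_q.put(unit)
    out_q.put(None)                           # sentinel: sensing finished


def unit_encoder(unit: np.ndarray) -> np.ndarray:
    """Hypothetical lightweight per-unit encoder (placeholder computation)."""
    return unit.mean(axis=0)


def aggregate(features: list) -> np.ndarray:
    """Hypothetical temporal aggregation preserving continuity across units."""
    return np.stack(features).mean(axis=0)


def pipelined_inference() -> np.ndarray:
    """Encode each unit while later units are still being sensed."""
    q: queue.Queue = queue.Queue()
    threading.Thread(target=sense_units, args=(q,), daemon=True).start()

    per_unit_features = []
    while True:
        unit = q.get()
        if unit is None:                      # all units sensed and encoded
            break
        per_unit_features.append(unit_encoder(unit))

    return aggregate(per_unit_features)       # fused feature for the classifier


if __name__ == "__main__":
    print(pipelined_inference().shape)
```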
To run the offline stage, first install the dependencies:

```bash
cd Offline
pip install -r requirements.txt
```

- We take the audio-visual speech recognition task as an example in this repository. Please download the Lip Reading in the Wild (LRW) dataset here.
- Move the dataset to `./Offline/data` and `./Online/data`.
- Train the video models:

```bash
bash scripts/train_video.sh
```

- Train the audio models:

```bash
bash scripts/train_audio.sh
```

- Train the fusion models:

```bash
bash scripts/train_fusion.sh
```
- Generate the accuracy table:

```bash
cd Offline
python make_accuracy_table.py
```

- Train the accuracy predictor and gating network (a sketch of how they are used at runtime follows below):

```bash
python train_accuracy_predictor.py
python train_gating.py
```
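At runtime, the accuracy predictor is combined with the pre-profiled latency table to pick a sensing and model configuration under a latency budget. The snippet below is a simplified, hypothetical sketch of that selection step; the configuration names, numbers, and `predict_accuracy` are illustrative placeholders, not values produced by the scripts above.

```python
# Simplified sketch of latency-constrained configuration selection.
# The latency table, candidate configurations, and `predict_accuracy` are
# hypothetical placeholders; the repository scripts implement the real logic.

# Pre-profiled per-configuration latency (ms), e.g. (video setting, audio setting).
LATENCY_TABLE = {
    ("video_112px", "audio_4units"): 180.0,
    ("video_96px",  "audio_4units"): 150.0,
    ("video_96px",  "audio_2units"): 120.0,
    ("video_64px",  "audio_2units"): 90.0,
}


def predict_accuracy(config) -> float:
    """Stand-in for the lightweight accuracy predictor (illustrative numbers)."""
    scores = {
        ("video_112px", "audio_4units"): 0.92,
        ("video_96px",  "audio_4units"): 0.90,
        ("video_96px",  "audio_2units"): 0.87,
        ("video_64px",  "audio_2units"): 0.81,
    }
    return scores[config]


def select_configuration(latency_budget_ms: float):
    """Pick the configuration with the highest predicted accuracy within the budget."""
    feasible = [c for c, lat in LATENCY_TABLE.items() if lat <= latency_budget_ms]
    if not feasible:
        return min(LATENCY_TABLE, key=LATENCY_TABLE.get)   # fall back to the fastest
    return max(feasible, key=predict_accuracy)


if __name__ == "__main__":
    print(select_configuration(latency_budget_ms=130.0))
```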
For the online stage, install the dependencies on the device:

```bash
cd Online
pip install -r requirements.txt
```

Download the checkpoints and dataset to the device and save them at `./Online/checkpoints` and `./Online/data`, respectively.
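During online inference, the pipeline applies cross-modal speculative skipping: when the faster modality's prediction is already confident, the remaining computation of the slower modality is bypassed. The sketch below is a hypothetical illustration of this confidence-based early exit; the threshold, `run_video_branch`, and the fusion step are placeholders, not the repository's actual code.

```python
# Hypothetical sketch of cross-modal speculative skipping (illustrative only).
# If the faster modality (e.g., audio) is confident enough, the slower
# modality's branch (e.g., video) is skipped entirely.
import numpy as np

CONFIDENCE_THRESHOLD = 0.9   # assumed early-exit threshold


def softmax(logits: np.ndarray) -> np.ndarray:
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def speculative_inference(audio_logits: np.ndarray, run_video_branch) -> int:
    """Return a class id, skipping the video branch when audio is confident."""
    audio_probs = softmax(audio_logits)
    if audio_probs.max() >= CONFIDENCE_THRESHOLD:
        return int(audio_probs.argmax())          # early exit: skip video encoding
    video_logits = run_video_branch()             # otherwise finish the slow modality
    fused = softmax(audio_logits + video_logits)  # placeholder late fusion
    return int(fused.argmax())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=10)
    print(speculative_inference(audio, run_video_branch=lambda: rng.normal(size=10)))
```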
Then run the online inference pipeline:

```bash
python data_collection_simulation.py
python main.py
```

Please consider citing our paper if you use the code or data in your research project.
```bibtex
@article{huang2025mmedge,
  title={MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding},
  author={Huang, Runxi and Yu, Mingxuan and Tsoi, Mingyu and Ouyang, Xiaomin},
  journal={arXiv preprint arXiv:2510.25327},
  year={2025}
}
```