StarVLA Training Efficiency Report
Motivation
We have noticed large discrepancies between the training efficiency of StarVLA reported in community feedback and our own empirical observations. This report documents training efficiency measurements for StarVLA and provides actionable scaling guidance, focusing on the common distributed-training bottlenecks of compute and communication.
Experimental setup
Unless otherwise specified:
- Model: StarVLA-GR00T
- Backbone: Qwen3-VL-4B
- Dataset: RoboCasa-GR1
- Hardware: NVIDIA A100 80GB
- Reported unit: wall-clock time per 100K optimization steps (hh:mm:ss), including distributed communication and system overhead
Metrics
We distinguish two throughput notions:
- Step throughput: how fast individual optimization steps complete, reported as seconds/step (lower is better)
- Sample throughput: how many training samples are processed per unit time, reported as samples/s (higher is better)
Sample throughput is computed as:
$$
\text{samples/s} = \frac{\text{global batch}}{\text{seconds per step}}
$$
This distinction matters because distributed scaling often decreases step throughput (more synchronization) while increasing sample throughput (larger global batch).
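For concreteness, the conversion from a measured wall-clock figure to both metrics can be sketched as follows. The helper below is illustrative only (it is not part of the StarVLA codebase) and uses one of the single-node rows reported below as its example.

```python
from datetime import timedelta

def throughput_from_run(time_per_steps: str, global_batch: int, steps: int = 100_000):
    """Convert a measured 'HH:MM:SS per `steps` optimization steps' figure
    into seconds/step and samples/s. Illustrative helper, not StarVLA code."""
    h, m, s = (int(x) for x in time_per_steps.split(":"))
    total_seconds = timedelta(hours=h, minutes=m, seconds=s).total_seconds()
    seconds_per_step = total_seconds / steps
    samples_per_second = global_batch / seconds_per_step
    return seconds_per_step, samples_per_second

# Example: single-node row with per-GPU batch 8 (global batch 64)
sec_per_step, samples_per_s = throughput_from_run("31:25:38", global_batch=64)
print(f"{sec_per_step:.3f} s/step, {samples_per_s:.1f} samples/s")  # 1.131 s/step, 56.6 samples/s
```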
Single-node scaling: batch size sweep (8× A100)
The table below summarizes a single-node sweep over the per-GPU batch size. We omit derived “24-hour” projections and focus on directly measured quantities and the implied sample throughput.
| Per-GPU batch | Global batch | Time / 100K steps (hh:mm:ss) | Seconds / step | Samples / s | GPU util |
|---|---|---|---|---|---|
| 2 | 16 | 19:32:17 | 0.703 | 22.7 | 74% |
| 4 | 32 | 24:35:59 | 0.886 | 36.1 | 89% |
| 8 | 64 | 31:25:38 | 1.131 | 56.6 | 92% |
| 16 | 128 | 49:15:53 | 1.774 | 72.2 | 91% |
| 24 | 192 | 66:47:02 | 2.404 | 79.9 | 96% |
Figure: step latency vs. per-GPU batch (left), and sample throughput with GPU utilization (right).
Interpretation: Smaller per-GPU batches yield the fastest steps, while larger per-GPU batches keep improving sample throughput, but with diminishing returns: step latency rises sharply because each step processes more samples, and the throughput gained per additional increase in batch size shrinks.
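To make the diminishing-returns pattern explicit, the sketch below recomputes the relative change between adjacent rows of the table (values copied from the table above; the snippet is illustrative only):

```python
# (per-GPU batch, seconds/step, samples/s) taken from the single-node table
rows = [
    (2, 0.703, 22.7),
    (4, 0.886, 36.1),
    (8, 1.131, 56.6),
    (16, 1.774, 72.2),
    (24, 2.404, 79.9),
]
# e.g. going from batch 8 to 16 buys ~28% more samples/s for ~57% higher step latency
for (b0, t0, s0), (b1, t1, s1) in zip(rows, rows[1:]):
    print(f"batch {b0:>2} -> {b1:>2}: "
          f"+{(s1 / s0 - 1) * 100:.0f}% samples/s, "
          f"+{(t1 / t0 - 1) * 100:.0f}% s/step")
```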
Multi-node scaling: step latency vs. throughput scaling
Here we fix the per-GPU batch size at 8 and scale the number of GPUs, so the global batch grows in proportion to the GPU count.
| # GPUs | Global batch | Time / 100K steps (hh:mm:ss) | Seconds / step | Samples / s | Scaling eff. |
|---|---|---|---|---|---|
| 8 | 64 | 20:25:48 | 0.735 | 87.0 | 100% |
| 16 | 128 | 23:36:00 | 0.850 | 150.7 | 86.7% |
| 32 | 256 | 24:58:45 | 0.899 | 284.7 | 81.9% |
| 64 | 512 | 25:40:59 | 0.925 | 553.8 | 79.6% |
| 128 | 1024 | 25:35:26 | 0.921 | 1111.5 | 79.9% |
| 256 | 2048 | 25:51:41 | 0.931 | 2200.0 | 79.1% |
Figure: step latency (left) and sample throughput (right), with an ideal linear reference from the 8-GPU baseline.
Interpretation: Time per step increases mildly with more GPUs (non-trivial synchronization and scheduling overhead). However, sample throughput scales strongly with global batch, which is the relevant metric when the objective is to process a fixed amount of data quickly.
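The scaling-efficiency column can be reproduced directly from the measured sample throughputs, as in the following sketch (values copied from the table above; illustrative only):

```python
# Scaling efficiency relative to the 8-GPU baseline:
#   eff(N) = samples/s at N GPUs / (samples/s at 8 GPUs * N / 8)
baseline_gpus, baseline_sps = 8, 87.0
measured = {8: 87.0, 16: 150.7, 32: 284.7, 64: 553.8, 128: 1111.5, 256: 2200.0}
for gpus, sps in measured.items():
    ideal = baseline_sps * gpus / baseline_gpus  # linear extrapolation of the baseline
    print(f"{gpus:>3} GPUs: {sps:7.1f} samples/s, efficiency {sps / ideal:.1%}")
```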
Note on cross-report throughput comparisons
Throughput numbers reported across different projects/papers can differ substantially due to hardware and training configuration, not only due to algorithmic or implementation differences.
Common sources of variation include:
- GPU type and system scale (e.g., A100 vs. H100, GPU count, interconnect)
- Batching (per-GPU batch and global batch), which directly changes samples/s and can also impact seconds/step via kernel efficiency
- Other settings (precision, parallelism/ZeRO stage, sequence length, image resolution, gradient checkpointing)
For fair comparisons, we recommend reporting (or, where possible, matching) both the hardware configuration and the global batch size, and interpreting cross-report throughput numbers with care.
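One rough first-order normalization when comparing reports that used different scales is samples/s per GPU. The sketch below is illustrative and deliberately does not attempt to correct for GPU generation, precision, sequence length, or image resolution:

```python
def per_gpu_throughput(samples_per_s: float, num_gpus: int) -> float:
    """Crude normalization for cross-report comparisons (illustrative only)."""
    return samples_per_s / num_gpus

print(per_gpu_throughput(87.0, 8))      # 8-GPU baseline: ~10.9 samples/s per GPU
print(per_gpu_throughput(2200.0, 256))  # 256-GPU run:    ~8.6 samples/s per GPU
```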
Key takeaways
- Increasing GPU count tends to reduce step throughput due to synchronization overhead, but sample throughput can still scale strongly via larger global batch.
- On a single node, very large per-GPU batches may maximize utilization yet degrade step latency sharply; a moderate per-GPU batch often provides a better balance.
- For large-scale training, reporting both step-level and sample-level throughput is necessary to avoid misleading conclusions.