StarVLA Training Efficiency Report
Motivation
We have noticed large discrepancies between the training efficiency of StarVLA reported in community feedback and our own empirical observations. This report documents training efficiency measurements for StarVLA and provides actionable scaling guidance, focusing on the common distributed-training bottlenecks of compute and communication.
Experimental setup
Unless otherwise specified:
- Model: StarVLA-GR00T
- Backbone: Qwen3-VL-4B
- Dataset: RoboCasa-GR1
- Hardware: NVIDIA A100 80GB
- Reported unit: wall-clock time per 100K optimization steps (hh:mm:ss), including distributed communication and system overhead
Metrics
We distinguish two throughput notions:
- Step throughput: how fast individual optimization steps complete, reported as seconds/step (lower is better)
- Sample throughput: how many training samples are processed per unit time, reported as samples/s (higher is better)
Sample throughput is computed as:
$$
\text{samples/s} = \frac{\text{global batch}}{\text{seconds per step}}
$$
This distinction matters because distributed scaling often decreases step throughput (more synchronization) while increasing sample throughput (larger global batch).
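For concreteness, the conversion from a measured wall-clock figure to both metrics can be sketched as follows. The helper below is illustrative only (it is not part of the StarVLA codebase) and uses one of the single-node rows reported below as its example.

```python
from datetime import timedelta

def throughput_from_run(time_per_steps: str, global_batch: int, steps: int = 100_000):
    """Convert a measured 'HH:MM:SS per `steps` optimization steps' figure
    into seconds/step and samples/s. Illustrative helper, not StarVLA code."""
    h, m, s = (int(x) for x in time_per_steps.split(":"))
    total_seconds = timedelta(hours=h, minutes=m, seconds=s).total_seconds()
    seconds_per_step = total_seconds / steps
    samples_per_second = global_batch / seconds_per_step
    return seconds_per_step, samples_per_second

# Example: single-node row with per-GPU batch 8 (global batch 64)
sec_per_step, samples_per_s = throughput_from_run("31:25:38", global_batch=64)
print(f"{sec_per_step:.3f} s/step, {samples_per_s:.1f} samples/s")  # 1.131 s/step, 56.6 samples/s
```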
Single-node scaling: batch size sweep (8× A100)
The table below summarizes a single-node sweep over the per-GPU batch size. We omit derived “24-hour” projections and focus on directly measured quantities and the implied sample throughput.
| Per-GPU batch | Global batch | Time / 100K steps (hh:mm:ss) | Seconds / step | Samples / s | GPU util |
|---|---|---|---|---|---|
| 2 | 16 | 19:32:17 | 0.703 | 22.7 | 74% |
| 4 | 32 | 24:35:59 | 0.886 | 36.1 | 89% |
| 8 | 64 | 31:25:38 | 1.131 | 56.6 | 92% |
| 16 | 128 | 49:15:53 | 1.774 | 72.2 | 91% |
| 24 | 192 | 66:47:02 | 2.404 | 79.9 | 96% |
Figure: step latency vs. per-GPU batch (left), and sample throughput with GPU utilization (right).
Interpretation: Smaller per-GPU batches yield the fastest steps, while larger per-GPU batches keep improving sample throughput, but with diminishing returns: step latency rises sharply because each step processes more samples, and the throughput gained per additional increase in batch size shrinks.
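To make the diminishing-returns pattern explicit, the sketch below recomputes the relative change between adjacent rows of the table (values copied from the table above; the snippet is illustrative only):

```python
# (per-GPU batch, seconds/step, samples/s) taken from the single-node table
rows = [
    (2, 0.703, 22.7),
    (4, 0.886, 36.1),
    (8, 1.131, 56.6),
    (16, 1.774, 72.2),
    (24, 2.404, 79.9),
]
# e.g. going from batch 8 to 16 buys ~28% more samples/s for ~57% higher step latency
for (b0, t0, s0), (b1, t1, s1) in zip(rows, rows[1:]):
    print(f"batch {b0:>2} -> {b1:>2}: "
          f"+{(s1 / s0 - 1) * 100:.0f}% samples/s, "
          f"+{(t1 / t0 - 1) * 100:.0f}% s/step")
```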
Multi-node scaling: step latency vs. throughput scaling
Here we fix the per-GPU batch size at 8 and scale the number of GPUs, so the global batch grows in proportion to the GPU count.
| # GPUs | Global batch | Time / 100K steps (hh:mm:ss) | Seconds / step | Samples / s | Scaling eff. |
|---|---|---|---|---|---|
| 8 | 64 | 20:25:48 | 0.735 | 87.0 | 100% |
| 16 | 128 | 23:36:00 | 0.850 | 150.7 | 86.7% |
| 32 | 256 | 24:58:45 | 0.899 | 284.7 | 81.9% |
| 64 | 512 | 25:40:59 | 0.925 | 553.8 | 79.6% |
| 128 | 1024 | 25:35:26 | 0.921 | 1111.5 | 79.9% |
| 256 | 2048 | 25:51:41 | 0.931 | 2200.0 | 79.1% |
Figure: step latency (left) and sample throughput (right), with an ideal linear reference from the 8-GPU baseline.
Interpretation: Time per step increases mildly with more GPUs (non-trivial synchronization and scheduling overhead). However, sample throughput scales strongly with global batch, which is the relevant metric when the objective is to process a fixed amount of data quickly.
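The scaling-efficiency column can be reproduced directly from the measured sample throughputs, as in the following sketch (values copied from the table above; illustrative only):

```python
# Scaling efficiency relative to the 8-GPU baseline:
#   eff(N) = samples/s at N GPUs / (samples/s at 8 GPUs * N / 8)
baseline_gpus, baseline_sps = 8, 87.0
measured = {8: 87.0, 16: 150.7, 32: 284.7, 64: 553.8, 128: 1111.5, 256: 2200.0}
for gpus, sps in measured.items():
    ideal = baseline_sps * gpus / baseline_gpus  # linear extrapolation of the baseline
    print(f"{gpus:>3} GPUs: {sps:7.1f} samples/s, efficiency {sps / ideal:.1%}")
```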
Note on cross-report throughput comparisons
Throughput numbers reported across different projects/papers can differ substantially due to hardware and training configuration, not only due to algorithmic or implementation differences.
Common sources of variation include:
- GPU type and system scale (e.g., A100 vs. H100, GPU count, interconnect)
- Batching (per-GPU batch and global batch), which directly changes samples/s and can also impact seconds/step via kernel efficiency
- Other settings (precision, parallelism/ZeRO stage, sequence length, image resolution, gradient checkpointing)
For fair comparisons, we recommend reporting (or, where possible, matching) both the hardware configuration and the global batch size, and interpreting cross-report throughput numbers with care.
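One rough first-order normalization when comparing reports that used different scales is samples/s per GPU. The sketch below is illustrative and deliberately does not attempt to correct for GPU generation, precision, sequence length, or image resolution:

```python
def per_gpu_throughput(samples_per_s: float, num_gpus: int) -> float:
    """Crude normalization for cross-report comparisons (illustrative only)."""
    return samples_per_s / num_gpus

print(per_gpu_throughput(87.0, 8))      # 8-GPU baseline: ~10.9 samples/s per GPU
print(per_gpu_throughput(2200.0, 256))  # 256-GPU run:    ~8.6 samples/s per GPU
```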
Key takeaways
- Increasing GPU count tends to reduce step throughput due to synchronization overhead, but sample throughput can still scale strongly via larger global batch.
- On a single node, very large per-GPU batches may maximize utilization yet degrade step latency sharply; a moderate per-GPU batch often provides a better balance.
- For large-scale training, reporting both step-level and sample-level throughput is necessary to avoid misleading conclusions.