
StarVLA Training Efficiency Report #158

@JinhuiYE

Description


Motivation

We have noticed large discrepancies between the training efficiency of StarVLA reported in community feedback and our own measurements. This report presents training-efficiency measurements for StarVLA and derives actionable scaling guidance, focusing on common distributed-training bottlenecks (compute and communication).

Experimental setup

Unless otherwise specified:

  • Model: StarVLA-GR00T
  • Backbone: Qwen3-VL-4B
  • Dataset: RoboCasa-GR1
  • Hardware: NVIDIA A100 80GB
  • Reported unit: wall-clock time per 100K optimization steps, including distributed communication and system overhead

Metrics

We distinguish two metrics:

  1. Step latency: seconds/step (lower is better)
  2. Sample throughput: samples/s (higher is better)

Sample throughput is computed as:

$$ \text{samples/s} = \frac{\text{global batch}}{\text{seconds per step}} $$

This distinction matters because distributed scaling often increases step latency (more synchronization) while still increasing sample throughput (larger global batch).
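As a concrete illustration, both metrics can be derived from the reported wall-clock time per 100K steps. A minimal sketch (helper names are illustrative, not from the StarVLA codebase):

```python
# Derive seconds/step and samples/s from a wall-clock "H:MM:SS per 100K steps"
# measurement. Helper names are illustrative, not from the StarVLA codebase.

def seconds_per_step(hms: str, steps: int = 100_000) -> float:
    """Parse an 'H:MM:SS' wall-clock time for `steps` optimization steps."""
    h, m, s = (int(x) for x in hms.split(":"))
    return (h * 3600 + m * 60 + s) / steps

def samples_per_second(global_batch: int, sec_per_step: float) -> float:
    """samples/s = global batch / seconds per step."""
    return global_batch / sec_per_step

# First row of the single-node sweep: global batch 16, 19:32:17 per 100K steps.
sps = seconds_per_step("19:32:17")
print(round(sps, 3), round(samples_per_second(16, sps), 1))  # 0.703 22.7
```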


Single-node scaling: batch size sweep (8× A100)

The table below summarizes a single-node sweep varying the per-GPU batch size. We omit derived “24-hour” projections and focus on directly measured quantities and the implied sample throughput.

| Per-GPU batch | Global batch | Time / 100K steps | Seconds / step | Samples / s | GPU util. |
|--------------:|-------------:|------------------:|---------------:|------------:|----------:|
| 2             | 16           | 19:32:17          | 0.703          | 22.7        | 74%       |
| 4             | 32           | 24:35:59          | 0.886          | 36.1        | 89%       |
| 8             | 64           | 31:25:38          | 1.131          | 56.6        | 92%       |
| 16            | 128          | 49:15:53          | 1.774          | 72.2        | 91%       |
| 24            | 192          | 66:47:02          | 2.404          | 79.9        | 96%       |

Figure: step latency vs. per-GPU batch (left), and sample throughput with GPU utilization (right).


Interpretation: Smaller per-GPU batches yield faster steps, while larger per-GPU batches continue to improve sample throughput, though with sharply diminishing returns as step latency grows (reflecting increased per-step computation and reduced kernel efficiency).
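The diminishing returns can be made explicit by comparing throughput ratios across adjacent rows of the sweep (values copied from the table above; a rough sketch):

```python
# Sample-throughput gain per batch-size increase in the single-node sweep.
# Tuples are (per-GPU batch, samples/s), copied from the table above.
sweep = [(2, 22.7), (4, 36.1), (8, 56.6), (16, 72.2), (24, 79.9)]

for (b0, t0), (b1, t1) in zip(sweep, sweep[1:]):
    print(f"batch {b0} -> {b1}: batch x{b1 / b0:.2f}, throughput x{t1 / t0:.2f}")
```

Each 2x batch increase yields well under 2x throughput (roughly 1.6x early in the sweep, 1.28x from batch 8 to 16), consistent with the interpretation above.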


Multi-node scaling: strong scaling vs. throughput scaling

Here we fix per-GPU batch size to 8 and scale the number of GPUs.

| # GPUs | Global batch | Time / 100K steps | Seconds / step | Samples / s | Scaling eff. |
|-------:|-------------:|------------------:|---------------:|------------:|-------------:|
| 8      | 64           | 20:25:48          | 0.735          | 87.0        | 100%         |
| 16     | 128          | 23:36:00          | 0.850          | 150.7       | 86.7%        |
| 32     | 256          | 24:58:45          | 0.899          | 284.7       | 81.9%        |
| 64     | 512          | 25:40:59          | 0.925          | 553.8       | 79.6%        |
| 128    | 1024         | 25:35:26          | 0.921          | 1111.5      | 79.9%        |
| 256    | 2048         | 25:51:41          | 0.931          | 2200.0      | 79.1%        |

Figure: step latency (left) and sample throughput (right), with an ideal linear reference from the 8-GPU baseline.


Interpretation: Time per step increases mildly with more GPUs (non-trivial synchronization and scheduling overhead). However, sample throughput scales strongly with global batch, which is the relevant metric when the objective is to process a fixed amount of data quickly.
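The last column of the table can be reproduced by dividing measured samples/s by a linear extrapolation of the 8-GPU baseline (tiny deviations from the table are expected, since the table is computed from unrounded timings):

```python
# Scaling efficiency = measured samples/s divided by the linear extrapolation
# of the 8-GPU baseline. Values copied from the multi-node table above.
baseline_gpus, baseline_sps = 8, 87.0
runs = [(16, 150.7), (32, 284.7), (64, 553.8), (128, 1111.5), (256, 2200.0)]

for gpus, sps in runs:
    ideal = baseline_sps * gpus / baseline_gpus  # perfect linear scaling
    print(f"{gpus:>3} GPUs: {100 * sps / ideal:.1f}% of linear")
```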


Note on cross-report throughput comparisons

Throughput numbers reported across different projects/papers can differ substantially due to hardware and training configuration, not only due to algorithmic or implementation differences.

Common sources of variation include:

  • GPU type and system scale (e.g., A100 vs. H100, GPU count, interconnect)
  • Batching (per-GPU batch and global batch), which directly changes samples/s and can also impact seconds/step via kernel efficiency
  • Other settings (precision, parallelism/ZeRO stage, sequence length, image resolution, gradient checkpointing)

For fair comparisons, we recommend reporting (or matching) hardware + global batch, and interpreting cross-report throughput numbers with care.
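As a minimal sanity check before comparing two reports, throughput can first be normalized to per-GPU samples/s. This is a sketch that removes only the trivial effect of GPU count, not hardware generation, precision, or the other settings listed above:

```python
# Normalize reported throughput to per-GPU samples/s before comparing across
# reports. NOTE: this does NOT correct for GPU type, precision, sequence
# length, etc. -- only for system scale.
def per_gpu_samples_per_s(samples_per_s: float, n_gpus: int) -> float:
    return samples_per_s / n_gpus

# Example from the multi-node table: 256 GPUs at 2200.0 samples/s.
print(round(per_gpu_samples_per_s(2200.0, 256), 2))  # 8.59
```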

Key takeaways

  • Increasing the GPU count tends to increase step latency due to synchronization overhead, but sample throughput can still scale strongly via a larger global batch.
  • On a single node, very large per-GPU batches may maximize utilization yet sharply degrade step latency; a moderate per-GPU batch often provides a better balance.
  • For large-scale training, reporting both step latency and sample throughput is necessary to avoid misleading conclusions.
